DOI: 10.1145/1102351.1102458
Article

Large scale genomic sequence SVM classifiers

Published: 07 August 2005

ABSTRACT

In genomic sequence analysis tasks such as splice site recognition or promoter identification, large numbers of training sequences are available, and are indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels: the Spectrum kernel and the Weighted Degree (WD) kernel. In particular, we suggest several extensions using suffix trees and modifications of an SMO-like SVM training algorithm in order to accelerate both the training of the SVMs and their evaluation on test sequences. Our simulations show that large scale SVM training can be accelerated by a factor of 20 for the Spectrum kernel and a factor of 4 for the WD kernel, while using much less memory (e.g., no kernel caching). Evaluation on new sequences is often several thousand times faster with the new techniques (depending on the number of support vectors). Our method allows us to train on sets as large as one million sequences.
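For readers unfamiliar with the two kernels named in the abstract, the following is a minimal, naive sketch of how each is commonly defined. It is not the suffix-tree-accelerated method the abstract describes. The function names, the default parameters, and the weighting beta_d = 2(D - d + 1) / (D(D + 1)) for the WD kernel are assumptions chosen for illustration, following the usual textbook definitions.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Naive Spectrum kernel: dot product of k-mer count vectors.

    Position-independent: counts every length-k substring in each
    sequence and sums the products of matching counts.
    """
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(c * counts_y[u] for u, c in counts_x.items())


def weighted_degree_kernel(x, y, degree=3):
    """Naive Weighted Degree (WD) kernel for equal-length sequences.

    Position-dependent: for each substring length d = 1..degree, counts
    positions where both sequences carry the same d-mer, weighted by
    beta_d (assumed weighting, see lead-in).
    """
    if len(x) != len(y):
        raise ValueError("WD kernel requires sequences of equal length")
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "ACGTTTTT", k=3))
    print(weighted_degree_kernel("ACGTACGT", "ACGTTTTT", degree=3))
```

Both functions scan the sequences directly, so evaluating a trained SVM this way costs one kernel call per support vector; that cost is what the speedups reported in the abstract address.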

        • Published in

          ICML '05: Proceedings of the 22nd international conference on Machine learning
          August 2005
          1113 pages
          ISBN:1595931805
          DOI:10.1145/1102351

          Copyright © 2005 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States




          Acceptance Rates

          Overall Acceptance Rate: 140 of 548 submissions, 26%
