Large scale genomic sequence SVM classifiers

Authors:
Sören Sonnenburg, Fraunhofer Institute FIRST, Berlin, Germany
Gunnar Rätsch, Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
Bernhard Schölkopf, Max Planck Institute for Biological Cybernetics, Tübingen, Germany

ICML '05: Proceedings of the 22nd international conference on Machine learning, August 2005, Pages 848–855
https://doi.org/10.1145/1102351.1102458

Published: 07 August 2005

ABSTRACT

In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree (WD) kernel. In particular, we suggest several extensions using Suffix Trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the Spectrum kernel and the WD kernel, large scale SVM training can be accelerated by factors of 20 and 4, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of Support Vectors). Our method allows us to train on sets as large as one million sequences.
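To make the two kernels named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the count-based form of the Spectrum kernel and a commonly used WD weighting, beta_k = 2(d - k + 1) / (d(d + 1)); the abstract states neither. The functions `spectrum_weights` and `score` are hypothetical helpers that use a plain dictionary as a stand-in for the suffix-tree structures mentioned above. They only illustrate why scoring a new sequence can become independent of the number of Support Vectors once the per-k-mer weights are accumulated.

```python
from collections import Counter


def spectrum_kernel(x, y, k):
    """Spectrum kernel (count form): inner product of k-mer count vectors."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(n * cy[kmer] for kmer, n in cx.items())


def wd_kernel(x, y, d):
    """Weighted Degree kernel: position-wise k-mer matches for k = 1..d.

    Assumes equal-length sequences and the weighting
    beta_k = 2 * (d - k + 1) / (d * (d + 1)), which is an assumption here.
    """
    if len(x) != len(y):
        raise ValueError("WD kernel needs sequences of equal length")
    total = 0.0
    for k in range(1, d + 1):
        beta = 2.0 * (d - k + 1) / (d * (d + 1))
        # Count positions where both sequences carry the same k-mer.
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))
        total += beta * matches
    return total


def spectrum_weights(support_vectors, coeffs, k):
    """Fold all support vectors into one k-mer -> weight table.

    coeffs[i] plays the role of alpha_i * y_i; the dict is a simple
    substitute for a suffix tree or trie (illustrative assumption).
    """
    w = {}
    for s, a in zip(support_vectors, coeffs):
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            w[kmer] = w.get(kmer, 0.0) + a
    return w


def score(w, x, k):
    """SVM output without bias: one table lookup per position of x."""
    return sum(w.get(x[i:i + k], 0.0) for i in range(len(x) - k + 1))
```

With the weight table in place, `score` costs one lookup per position of the test sequence, whereas summing `coeffs[i] * spectrum_kernel(sv[i], x, k)` directly grows linearly with the number of Support Vectors. Both routes give the same value for the Spectrum kernel.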

Large scale genomic sequence SVM classifiers
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches

Published in
ICML '05: Proceedings of the 22nd international conference on Machine learning
August 2005
1113 pages
ISBN: 1595931805
DOI: 10.1145/1102351
General Chair: Saso Dzeroski, Jozef Stefan Institute, Slovenia
Program Chairs: Luc De Raedt, Stefan Wrobel
Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 140 of 548 submissions, 26%