ABSTRACT
In genomic sequence analysis tasks such as splice site recognition or promoter identification, large numbers of training sequences are available, and indeed needed, to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the spectrum kernel and the Weighted Degree (WD) kernel. In particular, we suggest several extensions using suffix trees and modifications of an SMO-like SVM training algorithm in order to accelerate both the training of the SVMs and their evaluation on test sequences. Our simulations show that large-scale SVM training can be accelerated by a factor of 20 for the spectrum kernel and by a factor of 4 for the WD kernel, while using much less memory (e.g. no kernel caching). Evaluation on new sequences is often several thousand times faster with the new techniques, depending on the number of support vectors. Our method allows us to train on sets as large as one million sequences.
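For readers unfamiliar with the two kernels named in the abstract, the following is a minimal sketch of their standard definitions from the string-kernel literature. It is a naive reference computation, not the suffix-tree or SMO-based implementation the abstract describes; the function names and the weighting beta_k = 2(d - k + 1) / (d(d + 1)) for the WD kernel are assumptions taken from the commonly used formulation.

```python
from collections import Counter


def spectrum_kernel(x, y, k):
    """Spectrum kernel: inner product of k-mer count vectors (naive version)."""
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Only k-mers present in both sequences contribute to the sum.
    return sum(c * counts_y[u] for u, c in counts_x.items())


def wd_kernel(x, y, d):
    """Weighted Degree kernel for two equal-length sequences (naive version).

    Counts substrings of length 1..d that match at the same position,
    weighted by beta_k = 2 * (d - k + 1) / (d * (d + 1)) (assumed weighting).
    """
    if len(x) != len(y):
        raise ValueError("WD kernel requires sequences of equal length")
    length = len(x)
    total = 0.0
    for k in range(1, d + 1):
        beta = 2.0 * (d - k + 1) / (d * (d + 1))
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(length - k + 1))
        total += beta * matches
    return total


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "ACGTTTGT", 3))
    print(wd_kernel("ACGTACGT", "ACGTTTGT", 3))
```

Both functions recompute substring comparisons for every pair of sequences, which is the cost that faster data structures and training schemes are meant to avoid on large data sets.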
- Large scale genomic sequence SVM classifiers