ABSTRACT
One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose functions are already known. Currently, one of the most powerful such homology detection methods is the SVM-Fisher method of Jaakkola, Diekhans and Haussler (ISMB 2000). This method combines a generative, profile hidden Markov model (HMM) with a discriminative classification algorithm known as a support vector machine (SVM). The current work presents an alternative method for SVM-based protein classification. The method, SVM-pairwise, uses a pairwise sequence similarity algorithm such as Smith-Waterman in place of the HMM in the SVM-Fisher method. The resulting algorithm, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better remote protein homology detection than SVM-Fisher, profile HMMs and PSI-BLAST.
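The core idea of SVM-pairwise, as described above, is to represent each protein by its pairwise similarity scores against a set of training sequences. The following is a minimal sketch of that vectorization step, not the authors' implementation. The function names `smith_waterman` and `pairwise_features`, the match/mismatch scores and the linear gap penalty are all assumptions made for illustration; the abstract does not specify the scoring scheme, and a real system would more likely use an amino acid substitution matrix with affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Toy scoring (assumed, not from the paper): fixed match/mismatch
    scores and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def pairwise_features(seq, train_seqs):
    """Represent seq as a vector of similarity scores against train_seqs."""
    return [smith_waterman(seq, t) for t in train_seqs]


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    train = ["ACDEFG", "WWYYWW", "ACDKFG"]
    print(pairwise_features("ACDEFG", train))
```

Each resulting fixed-length vector could then be passed to any standard SVM trainer, with one binary classifier per protein family.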
- S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, A basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.
- S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman, Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389--3402, 1997.
- T. L. Bailey and W. N. Grundy, Classifying proteins by family using the product of
- P. Baldi, Y. Chauvin, T. Hunkapiller and M. A. McClure, Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences of the United States of America, 91(3):1059--1063, 1994.
- C. Bishop, Neural Networks for Pattern Recognition. Oxford UP, Oxford, UK, 1995.
- S. E. Brenner, P. Koehl and M. Levitt, The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:254--256, 2000.
- M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, Jr. and D. Haussler, Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Sciences of the United States of America, 97(1):262--267, 2000.
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge UP, 2000.
- S. R. Eddy, Multiple Alignment Using Hidden Markov Models. In C. Rawlings, editor, Proceedings of the Third Annual International Conference on Intelligent Systems for Molecular Biology, pages 114--120. AAAI Press, 1995.
- M. Gribskov, R. Lüthy and D. Eisenberg, Profile Analysis. Methods in Enzymology, 183:146--159, 1990.
- M. Gribskov and N. L. Robinson, Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching. Computers and Chemistry, 20(1):25--33, 1996.
- W. N. Grundy, Family-based Homology Detection via Pairwise Sequence Comparison. In S. Istrail, P. Pevzner and M. Waterman, editors, Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 94--100. ACM, April 1999.
- D. Haussler, Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, CA, July 1999.
- S. Henikoff and J. G. Henikoff, Embedding strategies for effective use of information from multiple sequence alignments. Protein Science, 6(3):698--705, 1997.
- T. Jaakkola, M. Diekhans and D. Haussler, Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Seventh Annual International Conference on Intelligent Systems for Molecular Biology, pages 149--158, Menlo Park, CA, 1999. AAAI Press.
- T. Jaakkola, M. Diekhans and D. Haussler, A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1--2):95--114, 2000.
- K. Karplus, C. Barrett and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846--856, 1998.
- A. Krogh, M. Brown, I. Mian, K. Sjolander and D. Haussler, Hidden Markov Models in Computational Biology: Applications to Protein Modeling. Journal of Molecular Biology, 235:1501--1531, 1994.
- C. Leslie, E. Eskin and W. S. Noble, The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. To appear.
- A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536--540, 1995.
- J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard and C. Chothia, Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology, 284:1201--1210, 1998.
- W. R. Pearson, Rapid and sensitive sequence comparisons with FASTP and FASTA. Methods in Enzymology, 183:63--98, 1985.
- S. L. Salzberg, On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317--328, 1997.
- T. Smith and M. Waterman, Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
- J. D. Thompson, D. G. Higgins and T. J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673--4680, 1994.
- V. N. Vapnik, Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998.
- C. Watkins, Dynamic alignment kernels. In A. J. Smola, P. Bartlett, B. Schölkopf and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
Index Terms
- Combining pairwise sequence similarity and support vector machines for remote protein homology detection