DOI: 10.1145/565196.565225
Article

Combining pairwise sequence similarity and support vector machines for remote protein homology detection

Published: 18 April 2002

ABSTRACT

One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose functions are already known. Currently, one of the most powerful such homology detection methods is the SVM-Fisher method of Jaakkola, Diekhans and Haussler (ISMB 2000). This method combines a generative, profile hidden Markov model (HMM) with a discriminative classification algorithm known as a support vector machine (SVM). The current work presents an alternative method for SVM-based protein classification. The method, SVM-pairwise, uses a pairwise sequence similarity algorithm such as Smith-Waterman in place of the HMM in the SVM-Fisher method. The resulting algorithm, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better remote protein homology detection than SVM-Fisher, profile HMMs and PSI-BLAST.
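
As a rough illustration of the idea in the abstract, the sketch below represents a sequence by its vector of Smith-Waterman local alignment scores against a set of training sequences; vectors of this kind could then be passed to any SVM implementation. The function names, the match/mismatch/gap values and the linear gap penalty are assumptions chosen for brevity, not the scoring scheme used in the paper, which this page does not specify.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Simplified scoring (an assumption for this sketch): fixed match and
    mismatch values and a linear gap penalty, instead of a substitution
    matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def pairwise_features(seq, train_seqs, **scoring):
    """Map seq to a fixed-length vector of scores against train_seqs."""
    return [smith_waterman_score(seq, t, **scoring) for t in train_seqs]


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    training = ["HEAGAWGHEE", "PAWHEAE", "MKVLAAGIV"]
    print(pairwise_features("HEAGAWHEE", training))
```

Every sequence is mapped to a vector whose length equals the number of training sequences, so sequences of different lengths become comparable inputs for a discriminative classifier.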

References

  1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, A basic local alignment search tool, Journal of Molecular Biology, 215:403--410, 1990.
  2. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman, Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389--3402, 1997.
  3. T. L. Bailey and W. N. Grundy, Classifying proteins by family using the product of
  4. P. Baldi, Y. Chauvin, T. Hunkapiller and M. A. McClure, Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences of the United States of America, 91(3):1059--1063, 1994.
  5. C. Bishop, Neural Networks for Pattern Recognition, Oxford UP, Oxford, UK, 1995.
  6. S. E. Brenner, P. Koehl and M. Levitt, The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:254--256, 2000.
  7. M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, Jr. and D. Haussler, Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Sciences of the United States of America, 97(1):262--267, 2000.
  8. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge UP, 2000.
  9. S. R. Eddy, Multiple Alignment Using Hidden Markov Models, In C. Rawlings, editor, Proceedings of the Third Annual International Conference on Intelligent Systems for Molecular Biology, pages 114--120. AAAI Press, 1995.
  10. M. Gribskov, R. Lüthy and D. Eisenberg, Profile Analysis. Methods in Enzymology, 183:146--159, 1990.
  11. M. Gribskov and N. L. Robinson, Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching. Computers and Chemistry, 20(1):25--33, 1996.
  12. W. N. Grundy, Family-based Homology Detection via Pairwise Sequence Comparison. In S. Istrail, P. Pevzner and M. Waterman, editors, Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 94--100. ACM, April 1999.
  13. D. Haussler, Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, CA, July 1999.
  14. S. Henikoff and J. G. Henikoff, Embedding strategies for effective use of information from multiple sequence alignments. Protein Science, 6(3):698--705, 1997.
  15. T. Jaakkola, M. Diekhans and D. Haussler, Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Seventh Annual International Conference on Intelligent Systems for Molecular Biology, pages 149--158, Menlo Park, CA, 1999. AAAI Press.
  16. T. Jaakkola, M. Diekhans and D. Haussler, A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1--2):95--114, 2000.
  17. K. Karplus, C. Barrett and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846--56, 1998.
  18. A. Krogh, M. Brown, I. Mian, K. Sjolander and D. Haussler, Hidden Markov Models in Computational Biology: Applications to Protein Modeling. Journal of Molecular Biology, 235:1501--1531, 1994.
  19. C. Leslie, E. Eskin and W. S. Noble, The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. To Appear.
  20. A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536--540, 1995.
  21. J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard and C. Chothia, Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology, 284:1201--1210, 1998.
  22. W. R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183:63--98, 1990.
  23. S. L. Salzberg, On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317--328, 1997.
  24. T. Smith and M. Waterman, Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
  25. J. D. Thompson, D. G. Higgins and T. J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673--4680, 1994.
  26. V. N. Vapnik, Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998.
  27. C. Watkins, Dynamic alignment kernels. In A. J. Smola, P. Bartlett, B. Schölkopf and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.

Published in

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
April 2002
341 pages
ISBN: 1581134983
DOI: 10.1145/565196

Copyright © 2002 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

RECOMB '02 Paper Acceptance Rate: 35 of 118 submissions, 30%
Overall Acceptance Rate: 148 of 538 submissions, 28%
