Article

Combining pairwise sequence similarity and support vector machines for remote protein homology detection

Authors:
Li Liao, E. I. du Pont de Nemours Company
William Stafford Noble, Columbia University

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology, April 2002, Pages 225–232
https://doi.org/10.1145/565196.565225

Published: 18 April 2002

ABSTRACT

One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose functions are already known. Currently, one of the most powerful such homology detection methods is the SVM-Fisher method of Jaakkola, Diekhans and Haussler (ISMB 2000). This method combines a generative, profile hidden Markov model (HMM) with a discriminative classification algorithm known as a support vector machine (SVM). The current work presents an alternative method for SVM-based protein classification. The method, SVM-pairwise, uses a pairwise sequence similarity algorithm such as Smith-Waterman in place of the HMM in the SVM-Fisher method. The resulting algorithm, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better remote protein homology detection than SVM-Fisher, profile HMMs and PSI-BLAST.
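
As a rough illustration of the idea in the abstract, the sketch below scores a query sequence against a set of reference sequences with a simplified Smith-Waterman local alignment and collects the scores into a vector that could then be passed to any SVM implementation. The scoring parameters (match, mismatch, linear gap penalty), the function names, and the choice to represent a protein as a vector of scores are assumptions made for this example; the abstract states only that a pairwise similarity algorithm takes the place of the HMM.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Simplified scoring (assumed for illustration): fixed match/mismatch
    values and a linear gap penalty, not a substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def pairwise_features(seq, reference_seqs):
    """Hypothetical helper: one similarity score per reference sequence."""
    return [smith_waterman_score(seq, ref) for ref in reference_seqs]


if __name__ == "__main__":
    references = ["ACGT", "CCCC", "GT"]  # toy sequences, not real proteins
    print(pairwise_features("ACGT", references))
```

Each sequence, whether used for training or testing, would be mapped to a vector of the same length (one entry per reference sequence), so a standard SVM can be trained on these vectors without a family-specific generative model.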

References

S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, A basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.
S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman, Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389--3402, 1997.
T. L. Bailey and W. N. Grundy, Classifying proteins by family using the product of
P. Baldi, Y. Chauvin, T. Hunkapiller and M. A. McClure, Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences of the United States of America, 91(3):1059--1063, 1994.
C. Bishop, Neural Networks for Pattern Recognition. Oxford UP, Oxford, UK, 1995.
S. E. Brenner, P. Koehl and M. Levitt, The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:254--256, 2000.
M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, Jr. and D. Haussler, Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Sciences of the United States of America, 97(1):262--267, 2000.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge UP, 2000.
S. R. Eddy, Multiple Alignment Using Hidden Markov Models. In C. Rawlings, editor, Proceedings of the Third Annual International Conference on Intelligent Systems for Molecular Biology, pages 114--120. AAAI Press, 1995.
M. Gribskov, R. Lüthy and D. Eisenberg, Profile Analysis. Methods in Enzymology, 183:146--159, 1990.
M. Gribskov and N. L. Robinson, Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching. Computers and Chemistry, 20(1):25--33, 1996.
W. N. Grundy, Family-based Homology Detection via Pairwise Sequence Comparison. In S. Istrail, P. Pevzner and M. Waterman, editors, Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 94--100. ACM, April 1999.
D. Haussler, Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, CA, July 1999.
S. Henikoff and J. G. Henikoff, Embedding strategies for effective use of information from multiple sequence alignments. Protein Science, 6(3):698--705, 1997.
T. Jaakkola, M. Diekhans and D. Haussler, Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Seventh Annual International Conference on Intelligent Systems for Molecular Biology, pages 149--158, Menlo Park, CA, 1999. AAAI Press.
T. Jaakkola, M. Diekhans and D. Haussler, A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1--2):95--114, 2000.
K. Karplus, C. Barrett and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846--56, 1998.
A. Krogh, M. Brown, I. Mian, K. Sjolander and D. Haussler, Hidden Markov Models in Computational Biology: Applications to Protein Modeling. Journal of Molecular Biology, 235:1501--1531, 1994.
C. Leslie, E. Eskin and W. S. Noble, The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. To Appear.
A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536--540, 1995.
J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard and C. Chothia, Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology, 284:1201--1210, 1998.
W. R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183:63--98, 1990.
S. L. Salzberg, On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317--328, 1997.
T. Smith and M. Waterman, Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
J. D. Thompson, D. G. Higgins and T. J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673--4680, 1994.
V. N. Vapnik, Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998.
C. Watkins, Dynamic alignment kernels. In A. J. Smola, P. Bartlett, B. Schölkopf and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.

Published in
RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
April 2002
341 pages
ISBN:1581134983
DOI:10.1145/565196
Editors:
Gene Myers, Celera, USA
Sridhar Hannenhalli, Celera, USA
David Sankoff, University of Montréal, Canada
Sorin Istrail, Celera, USA
Pavel Pevzner, University of California at San Diego, USA
Michael Waterman, University of California, USA
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Acceptance Rates
RECOMB '02 Paper Acceptance Rate: 35 of 118 submissions, 30%
Overall Acceptance Rate: 148 of 538 submissions, 28%