skip to main content
research-article

A brief survey on sequence classification

Published:09 November 2010Publication History
Skip Abstract Section

Abstract

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

References

  1. Embl nucleotide sequence database homepage: http://www.ebi.ac.uk/embl/.Google ScholarGoogle Scholar
  2. Entrez protein database homepage: http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein.Google ScholarGoogle Scholar
  3. Genbank homepage: http://www.ncbi.nlm.nih.gov/Genbank/.Google ScholarGoogle Scholar
  4. Time series data library webpage: http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.Google ScholarGoogle Scholar
  5. C. C. Aggarwal. On effective classification of strings with wavelets. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 163--172, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipmanl. Basic local alignment search tool.J.Mol.Biol., 215:403--410, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  7. Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden markov support vector machines. In ICML '03: The Twentieth International Conference on Machine Learning, pages 3--10, 2003.Google ScholarGoogle Scholar
  8. B. Anibal, S. M. Aranzazu, and R. J. Jose. Early fault classification in dynamic systems using case-based reasoning. Lecture notes in computer science, 4177:211--220, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian. Traffic classification on the fly. Computer Communication Review, 36(2):23--26, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. Blekas, D. I. Fotiadis, and A. Likas. Motif-based protein sequence classification using neural networks. Journal of Computational Biology, 12(1):64--82, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  11. B. Cheng, J. Carbonell, and J.Klein-Seetharaman. Protein classification based on text document classification techniques. Proteins, 1(58):855--970, 2005.Google ScholarGoogle Scholar
  12. N. A. Chuzhanova, A. J. Jones, and S. Margetts. Feature selection for genetic sequence classification. Bioinformatics, 14(2):139--143, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  13. M. Deshpande and G. Karypis. Evaluation of techniques for classifying biological sequences. In PAKDD '02: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 417--431, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. J. R. Diez, C. A. González, and H. Boström. Boosting interval based literals. Intell. Data Anal., 5(3):245--262, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, 1(2):1542--1552, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. Dong and P. Jian. Sequence Data Mining, pages 47--65. Springer US, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Chapter 3. Markov Chain and Hidden Markov Model. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, pages 47--65. Cambridge University Press, 1998.Google ScholarGoogle Scholar
  18. O. Duskin and D. G. Feitelson. Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals. In WSCD09: Proceedings of the 2009 workshop on Web Search Click Data, pages 15--19, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119--139, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 369--376, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. M. P. Griffin and J. R. Moorman. Toward the early diagnosis of neonatal sepsis and sepsis-like illness using novel heart rate analysis. PEDIATRICS, 107(1):97--104, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  22. X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with gap constraints. Knowl. Inf. Syst., 11(3):259--286, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. W. Kadous. Temporal classification: extending the classification paradigm to multivariate time series. PhD thesis, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. W. Kadous and C. Sammut. Classification of multivariate time series and structured data using constructive induction. Machine Learning, 58(2-3):179--216, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. L. Kaján, A. Kertész-Farkas, D. Franklin, N. Ivanova, A. Kocsor, and S. Pongor. Application of a simple likelihood ratio approximant to protein sequence classification. Bioinformatics, 22(23):2865--2869, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: a survey and empirical demonstration. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 102--111, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification and clustering homepage: http://www.cs.ucr.edu/~eamonn/time_series_data/, 2006.Google ScholarGoogle Scholar
  28. E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining applications. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 285--289, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng. Some effective techniques for naive bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11):1457--1466, Nov. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Kudenko and H. Hirsh. Feature generation for sequence categorization. In AAAI '98/IAAI '98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pages 733--738, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282--289, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. T. W. Lam, W.-K. Sung, S.-L. Tam, C.-K. Wong, and S.-M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791--797, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. T. Lane and C. E. Brodley. Temporal sequence learning and data reduction for anomaly detection. ACM Trans. Inf. Syst. Secur., 2(3):295--331, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. N. Lesh, M. J. Zaki, and M. Ogihara. Mining features for sequence classification. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 342--346, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 566--575, 2002.Google ScholarGoogle Scholar
  36. C. S. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5:1435--1455, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. D. Lewis. Naive (bayes) at forty: The independence assumption in information retrieval. In ECML' 98: The 10th European Conference on Machine Learning, pages 4--15, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. C. Li, L. Khan, and B. Prabhakaran. Real-time classification of variable length multi-attribute motions. Knowl. Inf. Syst., 10(2):163--183, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. M. Li and R. Sleep. A robust approach to sequence classification. In ICTAI '05: Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence, pages 197--201, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. J. Lin, E. J. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2):107--144, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285--318, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. X. Liu, P. Zhang, and D. Zeng. Sequence matching for suspicious activity detection in anti-money laundering. In PAISI, PACCF and SOCO '08: Proceedings of the IEEE ISI 2008 PAISI, PACCF, and SOCO international workshops on Intelligence and Security Informatics, pages 50--61, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. J. C. H.Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419--444, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J.Mol.Biol., 48:443--453, 1970.Google ScholarGoogle ScholarCross RefCross Ref
  45. L. A. Newberg. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics, 24(16):1772--1778, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103--134, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. L. Rabiner. A tutorial on HMM and selected applications in speech recognition. In IEEE, pages 257--286, 1998.Google ScholarGoogle Scholar
  48. C. A. Ratanamahatana and E. J. Keogh. Making time-series classification more accurate using learned constraints. In SDM '04: SIAM International Conference on Data Mining, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  49. H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682--1689, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, pages 171--192. The MIT press, 2004.Google ScholarGoogle Scholar
  51. F. Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1--47, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. L. Brinkman. Frequent-subsequence-based prediction of outer membrane proteins. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436--445, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. T. Smith and M. Waterman. Identification of common molecular subsequences. J.Mol.Biol., 147:195--197, 1981.Google ScholarGoogle ScholarCross RefCross Ref
  54. S. Sonnenburg, G. Rätsch, and C. Schäfer. Learning interpretable SVMs for biological sequence classification. In RECOMB '05: The Ninth Annual International Conference on Research in Computational Molecular Biology, pages 389--407, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic sequence svm classifiers. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 848--855, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. P. K. Srivastava, D. K. Desai, S. Nandi, and A. M. Lynn. HMM-ModE-Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics, 8(104), 2007.Google ScholarGoogle Scholar
  57. A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In ICDM, pages 521--528, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. P.-N. Tan and V. Kumar. Discovery of web robot sessions based on their navigational patterns. Data Min. Knowl. Discov., 6(1):9--35, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. L. Wei and E. Keogh. Semi-supervised time series classification. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 748--753, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. J. Weston, C. S. Leslie, D. Zhou, A. Elisseeff, and W. S. Noble. Semi-supervised protein classification using cluster kernels. In NIPS, 2003.Google ScholarGoogle Scholar
  61. X. Xi, E. Keogh, C. Shelton, L. Wei, and C. A. Ratanamahatana. Fast time series classification using numerosity reduction. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 1033--1040, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Z. Xing, J. Pei, G. Dong, and P. S. Yu. Mining sequence classifiers for early prediction. In SDM'08: Proceedings of the 2008 SIAM international conference on data mining, pages 644--655, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  63. Z. Xing, J. Pei, and P. S. Yu. Early classification on time series: A nearest neighbor approach. In IJCAI'09: Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1297--1302, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. O. Yakhnenko, A. Silvescu, and V. Honavar. Discriminatively trained markov model for sequence classification. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 498--505, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. L. Ye and E. Keogh. Time series shapeletes: A new primitive for data mining. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. S. Zhong. Semi-supervised sequence classification with Hmms. IJPRAI, 19(2):165--182, 2005.Google ScholarGoogle Scholar
  67. S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labelled classification using maximum entropy method. In SIGIR, pages 274--281, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM SIGKDD Explorations Newsletter
    ACM SIGKDD Explorations Newsletter  Volume 12, Issue 1
    June 2010
    77 pages
    ISSN:1931-0145
    EISSN:1931-0153
    DOI:10.1145/1882471
    Issue’s Table of Contents

    Copyright © 2010 Authors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 9 November 2010

    Check for updates

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader