Abstract
Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.
- Embl nucleotide sequence database homepage: http://www.ebi.ac.uk/embl/.Google Scholar
- Entrez protein database homepage: http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein.Google Scholar
- Genbank homepage: http://www.ncbi.nlm.nih.gov/Genbank/.Google Scholar
- Time series data library webpage: http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.Google Scholar
- C. C. Aggarwal. On effective classification of strings with wavelets. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 163--172, 2002. Google ScholarDigital Library
- S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipmanl. Basic local alignment search tool.J.Mol.Biol., 215:403--410, 1990.Google ScholarCross Ref
- Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden markov support vector machines. In ICML '03: The Twentieth International Conference on Machine Learning, pages 3--10, 2003.Google Scholar
- B. Anibal, S. M. Aranzazu, and R. J. Jose. Early fault classification in dynamic systems using case-based reasoning. Lecture notes in computer science, 4177:211--220, 2005. Google ScholarDigital Library
- L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian. Traffic classification on the fly. Computer Communication Review, 36(2):23--26, 2006. Google ScholarDigital Library
- K. Blekas, D. I. Fotiadis, and A. Likas. Motif-based protein sequence classification using neural networks. Journal of Computational Biology, 12(1):64--82, 2005.Google ScholarCross Ref
- B. Cheng, J. Carbonell, and J.Klein-Seetharaman. Protein classification based on text document classification techniques. Proteins, 1(58):855--970, 2005.Google Scholar
- N. A. Chuzhanova, A. J. Jones, and S. Margetts. Feature selection for genetic sequence classification. Bioinformatics, 14(2):139--143, 1998.Google ScholarCross Ref
- M. Deshpande and G. Karypis. Evaluation of techniques for classifying biological sequences. In PAKDD '02: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 417--431, 2002. Google ScholarDigital Library
- J. J. R. Diez, C. A. González, and H. Boström. Boosting interval based literals. Intell. Data Anal., 5(3):245--262, 2001. Google ScholarDigital Library
- H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, 1(2):1542--1552, 2008. Google ScholarDigital Library
- G. Dong and P. Jian. Sequence Data Mining, pages 47--65. Springer US, 2007. Google ScholarDigital Library
- R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Chapter 3. Markov Chain and Hidden Markov Model. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, pages 47--65. Cambridge University Press, 1998.Google Scholar
- O. Duskin and D. G. Feitelson. Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals. In WSCD09: Proceedings of the 2009 workshop on Web Search Click Data, pages 15--19, 2009. Google ScholarDigital Library
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119--139, 1997. Google ScholarDigital Library
- A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 369--376, 2006. Google ScholarDigital Library
- M. P. Griffin and J. R. Moorman. Toward the early diagnosis of neonatal sepsis and sepsis-like illness using novel heart rate analysis. PEDIATRICS, 107(1):97--104, 2001.Google ScholarCross Ref
- X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with gap constraints. Knowl. Inf. Syst., 11(3):259--286, 2007. Google ScholarDigital Library
- M. W. Kadous. Temporal classification: extending the classification paradigm to multivariate time series. PhD thesis, 2002. Google ScholarDigital Library
- M. W. Kadous and C. Sammut. Classification of multivariate time series and structured data using constructive induction. Machine Learning, 58(2-3):179--216, 2005. Google ScholarDigital Library
- L. Kaján, A. Kertész-Farkas, D. Franklin, N. Ivanova, A. Kocsor, and S. Pongor. Application of a simple likelihood ratio approximant to protein sequence classification. Bioinformatics, 22(23):2865--2869, 2006. Google ScholarDigital Library
- E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: a survey and empirical demonstration. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 102--111, 2002. Google ScholarDigital Library
- E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification and clustering homepage: http://www.cs.ucr.edu/~eamonn/time_series_data/, 2006.Google Scholar
- E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining applications. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 285--289, 2000. Google ScholarDigital Library
- S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng. Some effective techniques for naive bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11):1457--1466, Nov. 2006. Google ScholarDigital Library
- D. Kudenko and H. Hirsh. Feature generation for sequence categorization. In AAAI '98/IAAI '98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pages 733--738, 1998. Google ScholarDigital Library
- J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282--289, 2001. Google ScholarDigital Library
- T. W. Lam, W.-K. Sung, S.-L. Tam, C.-K. Wong, and S.-M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791--797, 2008. Google ScholarDigital Library
- T. Lane and C. E. Brodley. Temporal sequence learning and data reduction for anomaly detection. ACM Trans. Inf. Syst. Secur., 2(3):295--331, 1999. Google ScholarDigital Library
- N. Lesh, M. J. Zaki, and M. Ogihara. Mining features for sequence classification. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 342--346, 1999. Google ScholarDigital Library
- C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 566--575, 2002.Google Scholar
- C. S. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5:1435--1455, 2004. Google ScholarDigital Library
- D. D. Lewis. Naive (bayes) at forty: The independence assumption in information retrieval. In ECML' 98: The 10th European Conference on Machine Learning, pages 4--15, 1998. Google ScholarDigital Library
- C. Li, L. Khan, and B. Prabhakaran. Real-time classification of variable length multi-attribute motions. Knowl. Inf. Syst., 10(2):163--183, 2006. Google ScholarDigital Library
- M. Li and R. Sleep. A robust approach to sequence classification. In ICTAI '05: Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence, pages 197--201, 2005. Google ScholarDigital Library
- J. Lin, E. J. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2):107--144, 2007. Google ScholarDigital Library
- N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285--318, 1987. Google ScholarDigital Library
- X. Liu, P. Zhang, and D. Zeng. Sequence matching for suspicious activity detection in anti-money laundering. In PAISI, PACCF and SOCO '08: Proceedings of the IEEE ISI 2008 PAISI, PACCF, and SOCO international workshops on Intelligence and Security Informatics, pages 50--61, 2008. Google ScholarDigital Library
- H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. J. C. H.Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419--444, 2002. Google ScholarDigital Library
- S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J.Mol.Biol., 48:443--453, 1970.Google ScholarCross Ref
- L. A. Newberg. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics, 24(16):1772--1778, 2008. Google ScholarDigital Library
- K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103--134, 2000. Google ScholarDigital Library
- L. Rabiner. A tutorial on HMM and selected applications in speech recognition. In IEEE, pages 257--286, 1998.Google Scholar
- C. A. Ratanamahatana and E. J. Keogh. Making time-series classification more accurate using learned constraints. In SDM '04: SIAM International Conference on Data Mining, 2004.Google ScholarCross Ref
- H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682--1689, 2004. Google ScholarDigital Library
- B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, pages 171--192. The MIT press, 2004.Google Scholar
- F. Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1--47, 2002. Google ScholarDigital Library
- R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. L. Brinkman. Frequent-subsequence-based prediction of outer membrane proteins. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436--445, 2003. Google ScholarDigital Library
- T. Smith and M. Waterman. Identification of common molecular subsequences. J.Mol.Biol., 147:195--197, 1981.Google ScholarCross Ref
- S. Sonnenburg, G. Rätsch, and C. Schäfer. Learning interpretable SVMs for biological sequence classification. In RECOMB '05: The Ninth Annual International Conference on Research in Computational Molecular Biology, pages 389--407, 2005. Google ScholarDigital Library
- S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic sequence svm classifiers. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 848--855, 2005. Google ScholarDigital Library
- P. K. Srivastava, D. K. Desai, S. Nandi, and A. M. Lynn. HMM-ModE-Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics, 8(104), 2007.Google Scholar
- A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In ICDM, pages 521--528, 2001. Google ScholarDigital Library
- P.-N. Tan and V. Kumar. Discovery of web robot sessions based on their navigational patterns. Data Min. Knowl. Discov., 6(1):9--35, 2002. Google ScholarDigital Library
- L. Wei and E. Keogh. Semi-supervised time series classification. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 748--753, 2006. Google ScholarDigital Library
- J. Weston, C. S. Leslie, D. Zhou, A. Elisseeff, and W. S. Noble. Semi-supervised protein classification using cluster kernels. In NIPS, 2003.Google Scholar
- X. Xi, E. Keogh, C. Shelton, L. Wei, and C. A. Ratanamahatana. Fast time series classification using numerosity reduction. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 1033--1040, 2006. Google ScholarDigital Library
- Z. Xing, J. Pei, G. Dong, and P. S. Yu. Mining sequence classifiers for early prediction. In SDM'08: Proceedings of the 2008 SIAM international conference on data mining, pages 644--655, 2008.Google ScholarCross Ref
- Z. Xing, J. Pei, and P. S. Yu. Early classification on time series: A nearest neighbor approach. In IJCAI'09: Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1297--1302, 2009. Google ScholarDigital Library
- O. Yakhnenko, A. Silvescu, and V. Honavar. Discriminatively trained markov model for sequence classification. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 498--505, 2005. Google ScholarDigital Library
- L. Ye and E. Keogh. Time series shapeletes: A new primitive for data mining. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009. Google ScholarDigital Library
- S. Zhong. Semi-supervised sequence classification with Hmms. IJPRAI, 19(2):165--182, 2005.Google Scholar
- S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labelled classification using maximum entropy method. In SIGIR, pages 274--281, 2005. Google ScholarDigital Library
Recommendations
Protein sequence classification through relevant sequence mining and bayes classifiers
EPIA'05: Proceedings of the 12th Portuguese conference on Progress in Artificial IntelligenceWe tackle the problem of sequence classification using relevant subsequences found in a dataset of protein labelled sequences. A subsequence is relevant if it is frequent and has a minimal length. For each query sequence a vector of features is ...
A PSO-AB classifier for solving sequence classification problems
The proposed sequential pattern mining-based sequence classification method. A two-stage SPM-based sequence classification method is proposed.Compact sequential patterns can efficiently represent important features.A particle swarm optimization-AdaBoost ...
Pattern Based Sequence Classification
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the ...
Comments