Skip to main content
Log in

Data clustering in life sciences

  • Review
  • Published:
Molecular Biotechnology Aims and scope Submit manuscript

Abstract

Clustering has a wide range of applications in life sciences and over the years has been used in many areas ranging from the analysis of clinical information, phylogeny, genomics, and proteomics. The primary goal of this article is to provide an overview of the various issues involved in clustering large biological datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences. We also provide a brief introduction to Cluto, a general purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within life sciences.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Linial, M., Linial, N., Tishby, J., and Golan, Y. (1997) Global self-organization of all known protein sequences reveals inherent biological structures. J. Mol. Biol. 268, 539–556.

    Article  PubMed  CAS  Google Scholar 

  2. Enright, A. J. and Ouzounis, C. A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16(5), 451–457.

    Article  PubMed  CAS  Google Scholar 

  3. Mian, I. S. and Dubchak, I. (2000) Representing and reasoning about protein families using generative and discriminative methods. J. Comput. Biol. 7(6), 849–862.

    Article  PubMed  CAS  Google Scholar 

  4. Spang, R. and Vingron, M. (2001) Limits of homology of detection by pairwise sequence comparison. Bioinformatics 17(4), 338–342.

    Article  PubMed  CAS  Google Scholar 

  5. Kriventseva, E., Biswas, M., and Apweiler, R. (2001) Clustering and analysis of protein families. Curr. Opin. Struct. Biol. 11, 334–339.

    Article  PubMed  CAS  Google Scholar 

  6. Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and stucture patterns. J. Comput. Biol. 7, 685–716.

    Article  PubMed  CAS  Google Scholar 

  7. Shatsky, M., Fligelman, Z. Y., Nussinov, R., and Wolfson, H. J. (2000) Alignment of flexible protein structures, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (Altman, R., et al., eds.), AAAI Press, Menlo, Park, CA, pp. 329–343.

    Google Scholar 

  8. Lee, R. C. T. (1981) Clustering analysis and its applications, in Advances in Information Systems Science (Toum, J. T., ed.), Plenum, New York.

    Google Scholar 

  9. Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall, 1998.

  10. Cheeseman, P. and Stutz, J. (1996) Bayesian classification (autoclass): theory and results, in Advances in Knowledge Discovery and Data Mining (Fayyard, U. M., Piatetsky-Shapiro, G., Smith, P., and Uthurusamy, R., eds.), AAAI/MIT Press, pp. 153–180.

  11. Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.

    Google Scholar 

  12. Jackson, J. E. (1991) A User’s Guide to Principal Components, John Wiley & Sons, New York.

    Google Scholar 

  13. Ng, R. and Han, J. (1994) Efficient and effective clustering method for spatial data mining, in Proceedings of the 20th VLDB Conference (Bocca, J. B., Jarke, M., and Zaniolo, C., eds.), San Francisco, CA, pp. 144–155.

  14. Berry, M. W., Dumais, S. T., and O’Brien, G. W. (1995) Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 573–595.

    Article  Google Scholar 

  15. Zhang, T., Ramakrishnan, R., and Linvy, M. (1996) Birch: an efficient data clustering method for large databases, in Proceedings of 1996 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 103–114.

    Chapter  Google Scholar 

  16. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Simoudis, E., Han, J., and Fayyad, U., eds.), AAAI Press, Menlo Park, CA, pp. 226–231.

    Google Scholar 

  17. Wang, X., Wang, J. T. L., Shasha, D., Shapiro, B., Dikshitulu, S., Rigoutsos, I., and Zhang, K. (1997) Automated discovery of active motifs in three dimensional molecules, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., eds.), AAAI Press, Menlo Park, CA, pp. 89–95.

    Google Scholar 

  18. Guha, S., Rastogi, R., and Shim, K. (1998) CURE: an efficient clustering algorithm for large databases, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 73–84.

    Chapter  Google Scholar 

  19. Guha, S., Rastogi, R., and Shim, K. (1999) ROCK: a robust clustering algorithm for categorical attributes, in Proceedings of the 15th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 512–521.

    Google Scholar 

  20. Karypis, G., Han, E. H., and Kumar, V. (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput. 32(8), 68–75.

    Google Scholar 

  21. Jain, A. K., Murty, M. N., and Flynn, P. J. (1999) Data clustering: a review. ACM Comput. Surv. 31(3), 264–323.

    Article  Google Scholar 

  22. Han, J., Kamber, M., and Tung, A. K. H. (2001) Spatial clustering methods in data mining: a survey, in Geographic Data Mining and Knowledge Discovery (Miller, H. and Han, J., eds.), Taylor & Francis.

  23. MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Symposium Math. Statist. Prob., pp. 281–297.

  24. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the em algorithm. J. Roy. Stat. Soc. 39.

  25. Zzhn, K. (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. (C-20), 68–86.

    Article  Google Scholar 

  26. Han, E. G., Karypis, G., Kumar, V., and Mobasher, B. (1998) Hypergraph based clustering in high-dimensional data sets: a summary of results. Bull. Tech. Committee Data Eng. 21(1).

  27. Strehl, A. and Ghosh, J. (2000) Scalable approach to balanced, high-dimensional clustering of market-baskets, in Proceedings of HiPC, Springer Verlag, pp. 525–536.

  28. Boley, D. (1998) Principal direction divisive partitioning. Data Mining Knowl. Discov. 2(4),

  29. Duda, R. O., Hart, P. E., and Stork, D. G. (2001) Pattern Classification, John Wiley & Sons, New York.

    Google Scholar 

  30. Zhao, Y. and Karypis, G. (2001) Criterion functions for document clustering: experiments and analysis. Technical report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis. Available at http://cs.umn.edu/~karypis/publications.

  31. Sneath, P. H. and Sokal, R. R. (1973) Numerical Taxonomy, Freeman, London, UK.

    Google Scholar 

  32. King, B. (1967) Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101.

    Article  Google Scholar 

  33. Johnson, M. S., Sutcliffe, M. J., and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from 3-dimensional structures of proteins. J. Mol. Evol. 30, 43–59.

    Article  PubMed  CAS  Google Scholar 

  34. Taylor, W. R., Flores, T. P., and Orengo, C. A. (1994) Multiple protein structure alignment. Protein Sci. 3, 1858–1870.

    Article  PubMed  CAS  Google Scholar 

  35. Jonyer, I., Holder, L. B., and Cook, D. J. (2000) Graph-based hierarchical conceptual clustering in structural databases. Int. J. Artific. Intell. Tools.

  36. Eddy, S. R. (1996) Hidden markov models. Curr. Opin. Struct. Biol. 6, 361–365.

    Article  PubMed  CAS  Google Scholar 

  37. Eddy, S. R. (1998) Profile hidden markov models. Bioinformatics 14, 755–763.

    Article  PubMed  CAS  Google Scholar 

  38. Aggarwal, C. C., Gates, S. C., and Yu, P. S. (1999) On the merits of building categorization systems by supervised clustering, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 352–356.

    Chapter  Google Scholar 

  39. Cutting, D. R., Pedersen, J. O., Karger, D. R., and Tukey, J. W. (1992) Scatter/gather: a cluster-based approach to browsing large document collections, in Proceedings of the ACM SIGIR, ACM Press, New York, NY, pp. 318–329.

    Google Scholar 

  40. Larsen, B. and Aone, C. (1999) Fast and effective text mining using linear-time document clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 16–22.

    Chapter  Google Scholar 

  41. Steinbach, M., Karypis, G., and Kumar, V. (2000) A comparison of document clustering techniques, in KDD Workshop on Text Mining, ACM Press, New York, NY, pp. 109–110.

    Google Scholar 

  42. Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998) Automatic sub- space clustering of high dimensional data for data mining applications, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 94–105.

    Chapter  Google Scholar 

  43. Burdick, D., Calimlim, M., and Gehrke, J. (2001) Mafia: a maximal frequent itemset algorithm for transactional databases, in Proceedings of the 17th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 443–452.

    Chapter  Google Scholar 

  44. Nagesh, H., Goil, S., and Choudhary, A. (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report TR #9906-010, Northwestern University.

  45. Huang, Z. (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining, in Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona.

  46. Ganti, V., Gehrke, J., and Ramakrishnan, R., (1999) Cactus-clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 73–83.

    Chapter  Google Scholar 

  47. Gibson, D., Kleinberg, J., and Rahavan, P. (1998) Clustering categorical data: an approach based on dynamical systems, in Proceedings of the 24th International Conference on Very Large Databases (Gupta, A., Shmueli, O., and Widom, J., eds.), pp. 311–323, San Francisco, CA, Morgan Kaufmann.

    Google Scholar 

  48. Ryu, T. W. and Eick, C. F. (1998) A unified similarity measure for attributes with set or bag of values for database clustering, in The 6th International Workshop on Rough Sets, Data Mining and Granular Computing, Research Triangle Park, NC.

    Google Scholar 

  49. Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, NY.

    Google Scholar 

  50. Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, 345–352.

    Google Scholar 

  51. Schwartz, R. M. and Dayhoff, M. O. (1978) Matrices for detecting distant relationships. Atlas Protein Sequence Struct. 5, 353–358.

    Google Scholar 

  52. Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10,915–10,919.

    Article  CAS  Google Scholar 

  53. Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.

    Article  PubMed  CAS  Google Scholar 

  54. Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.

    Article  PubMed  CAS  Google Scholar 

  55. Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435–1441.

    Article  PubMed  CAS  Google Scholar 

  56. Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.

    Article  PubMed  CAS  Google Scholar 

  57. Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

    PubMed  CAS  Google Scholar 

  58. Yona, G., Linial, N., and Linial, M. (2000) Protomap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 28, 49–55.

    Article  PubMed  CAS  Google Scholar 

  59. Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., and Schrader, R. (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17, 935–941.

    Article  PubMed  CAS  Google Scholar 

  60. Johnson, M. S. and Lehtonen, J. V. (2000) Comparison of protein three dimensional structure, in Bioinformatics: Sequences, Structure and Databanks (Higgins, D. and Taylor, W., eds.), Oxford University Press, 2000.

  61. Grindley, H. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707–721.

    Article  PubMed  CAS  Google Scholar 

  62. Mitchell, E. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1989) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151–166.

    Article  Google Scholar 

  63. Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123.

    Article  PubMed  CAS  Google Scholar 

  64. Kleywegt, G. J. and Jones, T. A. (1997) Detecting folding motifs and similarities in protein structures. Methods Enzymol. 277, 525–545.

    Article  PubMed  CAS  Google Scholar 

  65. Calinski, T. and Harabasz, J. (1974) A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.

    Google Scholar 

  66. Ben-Hur, A., Elisseeff, A., and Guyon, I. (2000) A stability based method for discovering structure in clustered data, in Pacific Symposium on Biocomputing, vol. 7, World Scientific Press, Singapore, pp. 6–17.

    Google Scholar 

  67. Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001) Validating clustering for gene expression data. Bioinformatics 17(4), 309–318.

    Article  PubMed  CAS  Google Scholar 

  68. Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P., and Adams, C. L. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555, 556.

    Article  PubMed  CAS  Google Scholar 

  69. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270.

  70. Brazma, A., Robinson, A., Cameron, G., and Ashburner, M. (2000) One-stop shop for microarray data. Nature 403, 699–701.

    Article  PubMed  CAS  Google Scholar 

  71. Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for cDNA microarray data, in Proceedings of SPIE International Biomedical Optics Symposium (Bittner, M. L., Chen, Y., Dorsel, A. N., and Dougherty, E. R., eds.), vol. 4266, SPIE, Bellingham, WA, pp. 141–152.

    Google Scholar 

  72. D’Haeseleer, P., Fuhrman, S., Wen, X., and Somogyi, R. (1998) Mining the gene expression matrix: inferring gene relationships from large scale gene expression data, in, Information Processing in Cells and Tissues, Plenum, New York, pp. 203–212.

    Google Scholar 

  73. Eisen, M. (2000) Cluster 2.11 and treeview 1.50 http://rana.lbl.gov/Eisen Software.htm.

  74. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907–2912.

    Article  PubMed  CAS  Google Scholar 

  75. Kohonen, T. Self-Organizing Maps, Springer-Verlag, Berlin, 1995.

    Google Scholar 

  76. Ben-Dor, A. and Yakhini, Z. (1999) Clustering gene expression patterns, in Proceedings of the 3rd Annual International Conference on Research in Computational Molecular Biology, ACM Press, New York, NY, pp. 33–42.

    Chapter  Google Scholar 

  77. Shamir, R. and Sharan, R. (2000) Click: a clustering algorithm for gene expression analysis, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (Altman, R., et al., eds.), AAAI Press, Menlo Park, CA, pp. 307–316.

    Google Scholar 

  78. Zhao, Y. and Karypis, G. (2002) Comparison of agglomerative and partitional document clustering algorithms, in SIAM (2002) Workshop on Clustering High-Dimensional Data and Its Applications. Arlington, VA. Also available as technical report #02-014, University of Minnesota.

  79. Karypis, G. and Kumar, V. (1998) hMeTiS 1.5: a hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/~karypis/metis

  80. Karypis, G. and Kumar, V. (1998) MeTiS4.0: unstructured graph partitioning and sparse matrix ordering system. Technical report, department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/∼karypis/metis

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to George Karypis.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhao, Y., Karypis, G. Data clustering in life sciences. Mol Biotechnol 31, 55–80 (2005). https://doi.org/10.1385/MB:31:1:055

Download citation

  • Issue Date:

  • DOI: https://doi.org/10.1385/MB:31:1:055

Index Entries

Navigation