Molecular Biotechnology

, Volume 31, Issue 1, pp 55–80 | Cite as

Data clustering in life sciences

  • Ying Zhao
  • George Karypis


Clustering has a wide range of applications in life sciences and over the years has been used in many areas ranging from the analysis of clinical information, phylogeny, genomics, and proteomics. The primary goal of this article is to provide an overview of the various issues involved in clustering large biological datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences. We also provide a brief introduction to Cluto, a general purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within life sciences.

Index Entries

Clustering algorithms similarity between objects microarray data Cluto 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Linial, M., Linial, N., Tishby, J., and Golan, Y. (1997) Global self-organization of all known protein sequences reveals inherent biological structures. J. Mol. Biol. 268, 539–556.PubMedCrossRefGoogle Scholar
  2. 2.
    Enright, A. J. and Ouzounis, C. A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16(5), 451–457.PubMedCrossRefGoogle Scholar
  3. 3.
    Mian, I. S. and Dubchak, I. (2000) Representing and reasoning about protein families using generative and discriminative methods. J. Comput. Biol. 7(6), 849–862.PubMedCrossRefGoogle Scholar
  4. 4.
    Spang, R. and Vingron, M. (2001) Limits of homology of detection by pairwise sequence comparison. Bioinformatics 17(4), 338–342.PubMedCrossRefGoogle Scholar
  5. 5.
    Kriventseva, E., Biswas, M., and Apweiler, R. (2001) Clustering and analysis of protein families. Curr. Opin. Struct. Biol. 11, 334–339.PubMedCrossRefGoogle Scholar
  6. 6.
    Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and stucture patterns. J. Comput. Biol. 7, 685–716.PubMedCrossRefGoogle Scholar
  7. 7.
    Shatsky, M., Fligelman, Z. Y., Nussinov, R., and Wolfson, H. J. (2000) Alignment of flexible protein structures, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (Altman, R., et al., eds.), AAAI Press, Menlo, Park, CA, pp. 329–343.Google Scholar
  8. 8.
    Lee, R. C. T. (1981) Clustering analysis and its applications, in Advances in Information Systems Science (Toum, J. T., ed.), Plenum, New York.Google Scholar
  9. 9.
    Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall, 1998.Google Scholar
  10. 10.
    Cheeseman, P. and Stutz, J. (1996) Bayesian classification (autoclass): theory and results, in Advances in Knowledge Discovery and Data Mining (Fayyard, U. M., Piatetsky-Shapiro, G., Smith, P., and Uthurusamy, R., eds.), AAAI/MIT Press, pp. 153–180.Google Scholar
  11. 11.
    Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.Google Scholar
  12. 12.
    Jackson, J. E. (1991) A User’s Guide to Principal Components, John Wiley & Sons, New York.Google Scholar
  13. 13.
    Ng, R. and Han, J. (1994) Efficient and effective clustering method for spatial data mining, in Proceedings of the 20th VLDB Conference (Bocca, J. B., Jarke, M., and Zaniolo, C., eds.), San Francisco, CA, pp. 144–155.Google Scholar
  14. 14.
    Berry, M. W., Dumais, S. T., and O’Brien, G. W. (1995) Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 573–595.CrossRefGoogle Scholar
  15. 15.
    Zhang, T., Ramakrishnan, R., and Linvy, M. (1996) Birch: an efficient data clustering method for large databases, in Proceedings of 1996 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 103–114.CrossRefGoogle Scholar
  16. 16.
    Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Simoudis, E., Han, J., and Fayyad, U., eds.), AAAI Press, Menlo Park, CA, pp. 226–231.Google Scholar
  17. 17.
    Wang, X., Wang, J. T. L., Shasha, D., Shapiro, B., Dikshitulu, S., Rigoutsos, I., and Zhang, K. (1997) Automated discovery of active motifs in three dimensional molecules, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., eds.), AAAI Press, Menlo Park, CA, pp. 89–95.Google Scholar
  18. 18.
    Guha, S., Rastogi, R., and Shim, K. (1998) CURE: an efficient clustering algorithm for large databases, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 73–84.CrossRefGoogle Scholar
  19. 19.
    Guha, S., Rastogi, R., and Shim, K. (1999) ROCK: a robust clustering algorithm for categorical attributes, in Proceedings of the 15th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 512–521.Google Scholar
  20. 20.
    Karypis, G., Han, E. H., and Kumar, V. (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput. 32(8), 68–75.Google Scholar
  21. 21.
    Jain, A. K., Murty, M. N., and Flynn, P. J. (1999) Data clustering: a review. ACM Comput. Surv. 31(3), 264–323.CrossRefGoogle Scholar
  22. 22.
    Han, J., Kamber, M., and Tung, A. K. H. (2001) Spatial clustering methods in data mining: a survey, in Geographic Data Mining and Knowledge Discovery (Miller, H. and Han, J., eds.), Taylor & Francis.Google Scholar
  23. 23.
    MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Symposium Math. Statist. Prob., pp. 281–297.Google Scholar
  24. 24.
    Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the em algorithm. J. Roy. Stat. Soc. 39.Google Scholar
  25. 25.
    Zzhn, K. (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. (C-20), 68–86.CrossRefGoogle Scholar
  26. 26.
    Han, E. G., Karypis, G., Kumar, V., and Mobasher, B. (1998) Hypergraph based clustering in high-dimensional data sets: a summary of results. Bull. Tech. Committee Data Eng. 21(1).Google Scholar
  27. 27.
    Strehl, A. and Ghosh, J. (2000) Scalable approach to balanced, high-dimensional clustering of market-baskets, in Proceedings of HiPC, Springer Verlag, pp. 525–536.Google Scholar
  28. 28.
    Boley, D. (1998) Principal direction divisive partitioning. Data Mining Knowl. Discov. 2(4),Google Scholar
  29. 29.
    Duda, R. O., Hart, P. E., and Stork, D. G. (2001) Pattern Classification, John Wiley & Sons, New York.Google Scholar
  30. 30.
    Zhao, Y. and Karypis, G. (2001) Criterion functions for document clustering: experiments and analysis. Technical report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis. Available at Scholar
  31. 31.
    Sneath, P. H. and Sokal, R. R. (1973) Numerical Taxonomy, Freeman, London, UK.Google Scholar
  32. 32.
    King, B. (1967) Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101.CrossRefGoogle Scholar
  33. 33.
    Johnson, M. S., Sutcliffe, M. J., and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from 3-dimensional structures of proteins. J. Mol. Evol. 30, 43–59.PubMedCrossRefGoogle Scholar
  34. 34.
    Taylor, W. R., Flores, T. P., and Orengo, C. A. (1994) Multiple protein structure alignment. Protein Sci. 3, 1858–1870.PubMedCrossRefGoogle Scholar
  35. 35.
    Jonyer, I., Holder, L. B., and Cook, D. J. (2000) Graph-based hierarchical conceptual clustering in structural databases. Int. J. Artific. Intell. Tools. Google Scholar
  36. 36.
    Eddy, S. R. (1996) Hidden markov models. Curr. Opin. Struct. Biol. 6, 361–365.PubMedCrossRefGoogle Scholar
  37. 37.
    Eddy, S. R. (1998) Profile hidden markov models. Bioinformatics 14, 755–763.PubMedCrossRefGoogle Scholar
  38. 38.
    Aggarwal, C. C., Gates, S. C., and Yu, P. S. (1999) On the merits of building categorization systems by supervised clustering, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 352–356.CrossRefGoogle Scholar
  39. 39.
    Cutting, D. R., Pedersen, J. O., Karger, D. R., and Tukey, J. W. (1992) Scatter/gather: a cluster-based approach to browsing large document collections, in Proceedings of the ACM SIGIR, ACM Press, New York, NY, pp. 318–329.Google Scholar
  40. 40.
    Larsen, B. and Aone, C. (1999) Fast and effective text mining using linear-time document clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 16–22.CrossRefGoogle Scholar
  41. 41.
    Steinbach, M., Karypis, G., and Kumar, V. (2000) A comparison of document clustering techniques, in KDD Workshop on Text Mining, ACM Press, New York, NY, pp. 109–110.Google Scholar
  42. 42.
    Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998) Automatic sub- space clustering of high dimensional data for data mining applications, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 94–105.CrossRefGoogle Scholar
  43. 43.
    Burdick, D., Calimlim, M., and Gehrke, J. (2001) Mafia: a maximal frequent itemset algorithm for transactional databases, in Proceedings of the 17th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 443–452.CrossRefGoogle Scholar
  44. 44.
    Nagesh, H., Goil, S., and Choudhary, A. (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report TR #9906-010, Northwestern University.Google Scholar
  45. 45.
    Huang, Z. (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining, in Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona.Google Scholar
  46. 46.
    Ganti, V., Gehrke, J., and Ramakrishnan, R., (1999) Cactus-clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 73–83.CrossRefGoogle Scholar
  47. 47.
    Gibson, D., Kleinberg, J., and Rahavan, P. (1998) Clustering categorical data: an approach based on dynamical systems, in Proceedings of the 24th International Conference on Very Large Databases (Gupta, A., Shmueli, O., and Widom, J., eds.), pp. 311–323, San Francisco, CA, Morgan Kaufmann.Google Scholar
  48. 48.
    Ryu, T. W. and Eick, C. F. (1998) A unified similarity measure for attributes with set or bag of values for database clustering, in The 6th International Workshop on Rough Sets, Data Mining and Granular Computing, Research Triangle Park, NC.Google Scholar
  49. 49.
    Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, NY.Google Scholar
  50. 50.
    Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, 345–352.Google Scholar
  51. 51.
    Schwartz, R. M. and Dayhoff, M. O. (1978) Matrices for detecting distant relationships. Atlas Protein Sequence Struct. 5, 353–358.Google Scholar
  52. 52.
    Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10,915–10,919.CrossRefGoogle Scholar
  53. 53.
    Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.PubMedCrossRefGoogle Scholar
  54. 54.
    Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.PubMedCrossRefGoogle Scholar
  55. 55.
    Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435–1441.PubMedCrossRefGoogle Scholar
  56. 56.
    Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.PubMedCrossRefGoogle Scholar
  57. 57.
    Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.PubMedGoogle Scholar
  58. 58.
    Yona, G., Linial, N., and Linial, M. (2000) Protomap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 28, 49–55.PubMedCrossRefGoogle Scholar
  59. 59.
    Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., and Schrader, R. (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17, 935–941.PubMedCrossRefGoogle Scholar
  60. 60.
    Johnson, M. S. and Lehtonen, J. V. (2000) Comparison of protein three dimensional structure, in Bioinformatics: Sequences, Structure and Databanks (Higgins, D. and Taylor, W., eds.), Oxford University Press, 2000.Google Scholar
  61. 61.
    Grindley, H. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707–721.PubMedCrossRefGoogle Scholar
  62. 62.
    Mitchell, E. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1989) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151–166.CrossRefGoogle Scholar
  63. 63.
    Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123.PubMedCrossRefGoogle Scholar
  64. 64.
    Kleywegt, G. J. and Jones, T. A. (1997) Detecting folding motifs and similarities in protein structures. Methods Enzymol. 277, 525–545.CrossRefPubMedGoogle Scholar
  65. 65.
    Calinski, T. and Harabasz, J. (1974) A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.Google Scholar
  66. 66.
    Ben-Hur, A., Elisseeff, A., and Guyon, I. (2000) A stability based method for discovering structure in clustered data, in Pacific Symposium on Biocomputing, vol. 7, World Scientific Press, Singapore, pp. 6–17.Google Scholar
  67. 67.
    Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001) Validating clustering for gene expression data. Bioinformatics 17(4), 309–318.PubMedCrossRefGoogle Scholar
  68. 68.
    Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P., and Adams, C. L. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555, 556.PubMedCrossRefGoogle Scholar
  69. 69.
    Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270.Google Scholar
  70. 70.
    Brazma, A., Robinson, A., Cameron, G., and Ashburner, M. (2000) One-stop shop for microarray data. Nature 403, 699–701.PubMedCrossRefGoogle Scholar
  71. 71.
    Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for cDNA microarray data, in Proceedings of SPIE International Biomedical Optics Symposium (Bittner, M. L., Chen, Y., Dorsel, A. N., and Dougherty, E. R., eds.), vol. 4266, SPIE, Bellingham, WA, pp. 141–152.Google Scholar
  72. 72.
    D’Haeseleer, P., Fuhrman, S., Wen, X., and Somogyi, R. (1998) Mining the gene expression matrix: inferring gene relationships from large scale gene expression data, in, Information Processing in Cells and Tissues, Plenum, New York, pp. 203–212.Google Scholar
  73. 73.
    Eisen, M. (2000) Cluster 2.11 and treeview 1.50 Software.htm.Google Scholar
  74. 74.
    Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907–2912.PubMedCrossRefGoogle Scholar
  75. 75.
    Kohonen, T. Self-Organizing Maps, Springer-Verlag, Berlin, 1995.Google Scholar
  76. 76.
    Ben-Dor, A. and Yakhini, Z. (1999) Clustering gene expression patterns, in Proceedings of the 3rd Annual International Conference on Research in Computational Molecular Biology, ACM Press, New York, NY, pp. 33–42.CrossRefGoogle Scholar
  77. 77.
    Shamir, R. and Sharan, R. (2000) Click: a clustering algorithm for gene expression analysis, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (Altman, R., et al., eds.), AAAI Press, Menlo Park, CA, pp. 307–316.Google Scholar
  78. 78.
    Zhao, Y. and Karypis, G. (2002) Comparison of agglomerative and partitional document clustering algorithms, in SIAM (2002) Workshop on Clustering High-Dimensional Data and Its Applications. Arlington, VA. Also available as technical report #02-014, University of Minnesota.Google Scholar
  79. 79.
    Karypis, G. and Kumar, V. (1998) hMeTiS 1.5: a hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota. Available at Scholar
  80. 80.
    Karypis, G. and Kumar, V. (1998) MeTiS4.0: unstructured graph partitioning and sparse matrix ordering system. Technical report, department of Computer Science, University of Minnesota. Available at∼karypis/metisGoogle Scholar

Copyright information

© Humana Press Inc 2005

Authors and Affiliations

  1. 1.Department of Computer Science and EngineeringUniversity of Minnesota, and Digital Technology Center and Army HPC Research CenterMinneapolis

Personalised recommendations