Abstract
Clustering has a wide range of applications in life sciences and over the years has been used in many areas ranging from the analysis of clinical information, phylogeny, genomics, and proteomics. The primary goal of this article is to provide an overview of the various issues involved in clustering large biological datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences. We also provide a brief introduction to Cluto, a general purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within life sciences.
Similar content being viewed by others
References
Linial, M., Linial, N., Tishby, J., and Golan, Y. (1997) Global self-organization of all known protein sequences reveals inherent biological structures. J. Mol. Biol. 268, 539–556.
Enright, A. J. and Ouzounis, C. A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16(5), 451–457.
Mian, I. S. and Dubchak, I. (2000) Representing and reasoning about protein families using generative and discriminative methods. J. Comput. Biol. 7(6), 849–862.
Spang, R. and Vingron, M. (2001) Limits of homology of detection by pairwise sequence comparison. Bioinformatics 17(4), 338–342.
Kriventseva, E., Biswas, M., and Apweiler, R. (2001) Clustering and analysis of protein families. Curr. Opin. Struct. Biol. 11, 334–339.
Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and stucture patterns. J. Comput. Biol. 7, 685–716.
Shatsky, M., Fligelman, Z. Y., Nussinov, R., and Wolfson, H. J. (2000) Alignment of flexible protein structures, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (Altman, R., et al., eds.), AAAI Press, Menlo, Park, CA, pp. 329–343.
Lee, R. C. T. (1981) Clustering analysis and its applications, in Advances in Information Systems Science (Toum, J. T., ed.), Plenum, New York.
Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall, 1998.
Cheeseman, P. and Stutz, J. (1996) Bayesian classification (autoclass): theory and results, in Advances in Knowledge Discovery and Data Mining (Fayyard, U. M., Piatetsky-Shapiro, G., Smith, P., and Uthurusamy, R., eds.), AAAI/MIT Press, pp. 153–180.
Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.
Jackson, J. E. (1991) A User’s Guide to Principal Components, John Wiley & Sons, New York.
Ng, R. and Han, J. (1994) Efficient and effective clustering method for spatial data mining, in Proceedings of the 20th VLDB Conference (Bocca, J. B., Jarke, M., and Zaniolo, C., eds.), San Francisco, CA, pp. 144–155.
Berry, M. W., Dumais, S. T., and O’Brien, G. W. (1995) Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 573–595.
Zhang, T., Ramakrishnan, R., and Linvy, M. (1996) Birch: an efficient data clustering method for large databases, in Proceedings of 1996 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 103–114.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Simoudis, E., Han, J., and Fayyad, U., eds.), AAAI Press, Menlo Park, CA, pp. 226–231.
Wang, X., Wang, J. T. L., Shasha, D., Shapiro, B., Dikshitulu, S., Rigoutsos, I., and Zhang, K. (1997) Automated discovery of active motifs in three dimensional molecules, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., eds.), AAAI Press, Menlo Park, CA, pp. 89–95.
Guha, S., Rastogi, R., and Shim, K. (1998) CURE: an efficient clustering algorithm for large databases, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 73–84.
Guha, S., Rastogi, R., and Shim, K. (1999) ROCK: a robust clustering algorithm for categorical attributes, in Proceedings of the 15th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 512–521.
Karypis, G., Han, E. H., and Kumar, V. (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput. 32(8), 68–75.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999) Data clustering: a review. ACM Comput. Surv. 31(3), 264–323.
Han, J., Kamber, M., and Tung, A. K. H. (2001) Spatial clustering methods in data mining: a survey, in Geographic Data Mining and Knowledge Discovery (Miller, H. and Han, J., eds.), Taylor & Francis.
MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Symposium Math. Statist. Prob., pp. 281–297.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the em algorithm. J. Roy. Stat. Soc. 39.
Zzhn, K. (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. (C-20), 68–86.
Han, E. G., Karypis, G., Kumar, V., and Mobasher, B. (1998) Hypergraph based clustering in high-dimensional data sets: a summary of results. Bull. Tech. Committee Data Eng. 21(1).
Strehl, A. and Ghosh, J. (2000) Scalable approach to balanced, high-dimensional clustering of market-baskets, in Proceedings of HiPC, Springer Verlag, pp. 525–536.
Boley, D. (1998) Principal direction divisive partitioning. Data Mining Knowl. Discov. 2(4),
Duda, R. O., Hart, P. E., and Stork, D. G. (2001) Pattern Classification, John Wiley & Sons, New York.
Zhao, Y. and Karypis, G. (2001) Criterion functions for document clustering: experiments and analysis. Technical report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis. Available at http://cs.umn.edu/~karypis/publications.
Sneath, P. H. and Sokal, R. R. (1973) Numerical Taxonomy, Freeman, London, UK.
King, B. (1967) Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101.
Johnson, M. S., Sutcliffe, M. J., and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from 3-dimensional structures of proteins. J. Mol. Evol. 30, 43–59.
Taylor, W. R., Flores, T. P., and Orengo, C. A. (1994) Multiple protein structure alignment. Protein Sci. 3, 1858–1870.
Jonyer, I., Holder, L. B., and Cook, D. J. (2000) Graph-based hierarchical conceptual clustering in structural databases. Int. J. Artific. Intell. Tools.
Eddy, S. R. (1996) Hidden markov models. Curr. Opin. Struct. Biol. 6, 361–365.
Eddy, S. R. (1998) Profile hidden markov models. Bioinformatics 14, 755–763.
Aggarwal, C. C., Gates, S. C., and Yu, P. S. (1999) On the merits of building categorization systems by supervised clustering, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 352–356.
Cutting, D. R., Pedersen, J. O., Karger, D. R., and Tukey, J. W. (1992) Scatter/gather: a cluster-based approach to browsing large document collections, in Proceedings of the ACM SIGIR, ACM Press, New York, NY, pp. 318–329.
Larsen, B. and Aone, C. (1999) Fast and effective text mining using linear-time document clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 16–22.
Steinbach, M., Karypis, G., and Kumar, V. (2000) A comparison of document clustering techniques, in KDD Workshop on Text Mining, ACM Press, New York, NY, pp. 109–110.
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998) Automatic sub- space clustering of high dimensional data for data mining applications, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 94–105.
Burdick, D., Calimlim, M., and Gehrke, J. (2001) Mafia: a maximal frequent itemset algorithm for transactional databases, in Proceedings of the 17th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 443–452.
Nagesh, H., Goil, S., and Choudhary, A. (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report TR #9906-010, Northwestern University.
Huang, Z. (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining, in Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona.
Ganti, V., Gehrke, J., and Ramakrishnan, R., (1999) Cactus-clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 73–83.
Gibson, D., Kleinberg, J., and Rahavan, P. (1998) Clustering categorical data: an approach based on dynamical systems, in Proceedings of the 24th International Conference on Very Large Databases (Gupta, A., Shmueli, O., and Widom, J., eds.), pp. 311–323, San Francisco, CA, Morgan Kaufmann.
Ryu, T. W. and Eick, C. F. (1998) A unified similarity measure for attributes with set or bag of values for database clustering, in The 6th International Workshop on Rough Sets, Data Mining and Granular Computing, Research Triangle Park, NC.
Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, NY.
Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, 345–352.
Schwartz, R. M. and Dayhoff, M. O. (1978) Matrices for detecting distant relationships. Atlas Protein Sequence Struct. 5, 353–358.
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10,915–10,919.
Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435–1441.
Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.
Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Yona, G., Linial, N., and Linial, M. (2000) Protomap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 28, 49–55.
Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., and Schrader, R. (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17, 935–941.
Johnson, M. S. and Lehtonen, J. V. (2000) Comparison of protein three dimensional structure, in Bioinformatics: Sequences, Structure and Databanks (Higgins, D. and Taylor, W., eds.), Oxford University Press, 2000.
Grindley, H. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707–721.
Mitchell, E. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1989) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151–166.
Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123.
Kleywegt, G. J. and Jones, T. A. (1997) Detecting folding motifs and similarities in protein structures. Methods Enzymol. 277, 525–545.
Calinski, T. and Harabasz, J. (1974) A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.
Ben-Hur, A., Elisseeff, A., and Guyon, I. (2000) A stability based method for discovering structure in clustered data, in Pacific Symposium on Biocomputing, vol. 7, World Scientific Press, Singapore, pp. 6–17.
Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001) Validating clustering for gene expression data. Bioinformatics 17(4), 309–318.
Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P., and Adams, C. L. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555, 556.
Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270.
Brazma, A., Robinson, A., Cameron, G., and Ashburner, M. (2000) One-stop shop for microarray data. Nature 403, 699–701.
Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for cDNA microarray data, in Proceedings of SPIE International Biomedical Optics Symposium (Bittner, M. L., Chen, Y., Dorsel, A. N., and Dougherty, E. R., eds.), vol. 4266, SPIE, Bellingham, WA, pp. 141–152.
D’Haeseleer, P., Fuhrman, S., Wen, X., and Somogyi, R. (1998) Mining the gene expression matrix: inferring gene relationships from large scale gene expression data, in, Information Processing in Cells and Tissues, Plenum, New York, pp. 203–212.
Eisen, M. (2000) Cluster 2.11 and treeview 1.50 http://rana.lbl.gov/Eisen Software.htm.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907–2912.
Kohonen, T. Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
Ben-Dor, A. and Yakhini, Z. (1999) Clustering gene expression patterns, in Proceedings of the 3rd Annual International Conference on Research in Computational Molecular Biology, ACM Press, New York, NY, pp. 33–42.
Shamir, R. and Sharan, R. (2000) Click: a clustering algorithm for gene expression analysis, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (Altman, R., et al., eds.), AAAI Press, Menlo Park, CA, pp. 307–316.
Zhao, Y. and Karypis, G. (2002) Comparison of agglomerative and partitional document clustering algorithms, in SIAM (2002) Workshop on Clustering High-Dimensional Data and Its Applications. Arlington, VA. Also available as technical report #02-014, University of Minnesota.
Karypis, G. and Kumar, V. (1998) hMeTiS 1.5: a hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/~karypis/metis
Karypis, G. and Kumar, V. (1998) MeTiS4.0: unstructured graph partitioning and sparse matrix ordering system. Technical report, department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/∼karypis/metis
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Zhao, Y., Karypis, G. Data clustering in life sciences. Mol Biotechnol 31, 55–80 (2005). https://doi.org/10.1385/MB:31:1:055
Issue Date:
DOI: https://doi.org/10.1385/MB:31:1:055