Data clustering in life sciences

Zhao, Ying; Karypis, George

doi:10.1385/MB:31:1:055

Data clustering in life sciences

Review
Published: September 2005

Volume 31, pages 55–80, (2005)
Cite this article

Molecular Biotechnology Aims and scope Submit manuscript

Ying Zhao &
George Karypis¹

431 Accesses
71 Citations
Explore all metrics

Abstract

Clustering has a wide range of applications in life sciences and over the years has been used in many areas ranging from the analysis of clinical information, phylogeny, genomics, and proteomics. The primary goal of this article is to provide an overview of the various issues involved in clustering large biological datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences. We also provide a brief introduction to Cluto, a general purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within life sciences.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Linial, M., Linial, N., Tishby, J., and Golan, Y. (1997) Global self-organization of all known protein sequences reveals inherent biological structures. J. Mol. Biol. 268, 539–556.
Article PubMed CAS Google Scholar
Enright, A. J. and Ouzounis, C. A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16(5), 451–457.
Article PubMed CAS Google Scholar
Mian, I. S. and Dubchak, I. (2000) Representing and reasoning about protein families using generative and discriminative methods. J. Comput. Biol. 7(6), 849–862.
Article PubMed CAS Google Scholar
Spang, R. and Vingron, M. (2001) Limits of homology of detection by pairwise sequence comparison. Bioinformatics 17(4), 338–342.
Article PubMed CAS Google Scholar
Kriventseva, E., Biswas, M., and Apweiler, R. (2001) Clustering and analysis of protein families. Curr. Opin. Struct. Biol. 11, 334–339.
Article PubMed CAS Google Scholar
Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and stucture patterns. J. Comput. Biol. 7, 685–716.
Article PubMed CAS Google Scholar
Shatsky, M., Fligelman, Z. Y., Nussinov, R., and Wolfson, H. J. (2000) Alignment of flexible protein structures, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (Altman, R., et al., eds.), AAAI Press, Menlo, Park, CA, pp. 329–343.
Google Scholar
Lee, R. C. T. (1981) Clustering analysis and its applications, in Advances in Information Systems Science (Toum, J. T., ed.), Plenum, New York.
Google Scholar
Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall, 1998.
Cheeseman, P. and Stutz, J. (1996) Bayesian classification (autoclass): theory and results, in Advances in Knowledge Discovery and Data Mining (Fayyard, U. M., Piatetsky-Shapiro, G., Smith, P., and Uthurusamy, R., eds.), AAAI/MIT Press, pp. 153–180.
Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.
Google Scholar
Jackson, J. E. (1991) A User’s Guide to Principal Components, John Wiley & Sons, New York.
Google Scholar
Ng, R. and Han, J. (1994) Efficient and effective clustering method for spatial data mining, in Proceedings of the 20th VLDB Conference (Bocca, J. B., Jarke, M., and Zaniolo, C., eds.), San Francisco, CA, pp. 144–155.
Berry, M. W., Dumais, S. T., and O’Brien, G. W. (1995) Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 573–595.
Article Google Scholar
Zhang, T., Ramakrishnan, R., and Linvy, M. (1996) Birch: an efficient data clustering method for large databases, in Proceedings of 1996 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 103–114.
Chapter Google Scholar
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Simoudis, E., Han, J., and Fayyad, U., eds.), AAAI Press, Menlo Park, CA, pp. 226–231.
Google Scholar
Wang, X., Wang, J. T. L., Shasha, D., Shapiro, B., Dikshitulu, S., Rigoutsos, I., and Zhang, K. (1997) Automated discovery of active motifs in three dimensional molecules, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., eds.), AAAI Press, Menlo Park, CA, pp. 89–95.
Google Scholar
Guha, S., Rastogi, R., and Shim, K. (1998) CURE: an efficient clustering algorithm for large databases, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 73–84.
Chapter Google Scholar
Guha, S., Rastogi, R., and Shim, K. (1999) ROCK: a robust clustering algorithm for categorical attributes, in Proceedings of the 15th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 512–521.
Google Scholar
Karypis, G., Han, E. H., and Kumar, V. (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput. 32(8), 68–75.
Google Scholar
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999) Data clustering: a review. ACM Comput. Surv. 31(3), 264–323.
Article Google Scholar
Han, J., Kamber, M., and Tung, A. K. H. (2001) Spatial clustering methods in data mining: a survey, in Geographic Data Mining and Knowledge Discovery (Miller, H. and Han, J., eds.), Taylor & Francis.
MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Symposium Math. Statist. Prob., pp. 281–297.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the em algorithm. J. Roy. Stat. Soc. 39.
Zzhn, K. (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. (C-20), 68–86.
Article Google Scholar
Han, E. G., Karypis, G., Kumar, V., and Mobasher, B. (1998) Hypergraph based clustering in high-dimensional data sets: a summary of results. Bull. Tech. Committee Data Eng. 21(1).
Strehl, A. and Ghosh, J. (2000) Scalable approach to balanced, high-dimensional clustering of market-baskets, in Proceedings of HiPC, Springer Verlag, pp. 525–536.
Boley, D. (1998) Principal direction divisive partitioning. Data Mining Knowl. Discov. 2(4),
Duda, R. O., Hart, P. E., and Stork, D. G. (2001) Pattern Classification, John Wiley & Sons, New York.
Google Scholar
Zhao, Y. and Karypis, G. (2001) Criterion functions for document clustering: experiments and analysis. Technical report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis. Available at http://cs.umn.edu/~karypis/publications.
Sneath, P. H. and Sokal, R. R. (1973) Numerical Taxonomy, Freeman, London, UK.
Google Scholar
King, B. (1967) Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101.
Article Google Scholar
Johnson, M. S., Sutcliffe, M. J., and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from 3-dimensional structures of proteins. J. Mol. Evol. 30, 43–59.
Article PubMed CAS Google Scholar
Taylor, W. R., Flores, T. P., and Orengo, C. A. (1994) Multiple protein structure alignment. Protein Sci. 3, 1858–1870.
Article PubMed CAS Google Scholar
Jonyer, I., Holder, L. B., and Cook, D. J. (2000) Graph-based hierarchical conceptual clustering in structural databases. Int. J. Artific. Intell. Tools.
Eddy, S. R. (1996) Hidden markov models. Curr. Opin. Struct. Biol. 6, 361–365.
Article PubMed CAS Google Scholar
Eddy, S. R. (1998) Profile hidden markov models. Bioinformatics 14, 755–763.
Article PubMed CAS Google Scholar
Aggarwal, C. C., Gates, S. C., and Yu, P. S. (1999) On the merits of building categorization systems by supervised clustering, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 352–356.
Chapter Google Scholar
Cutting, D. R., Pedersen, J. O., Karger, D. R., and Tukey, J. W. (1992) Scatter/gather: a cluster-based approach to browsing large document collections, in Proceedings of the ACM SIGIR, ACM Press, New York, NY, pp. 318–329.
Google Scholar
Larsen, B. and Aone, C. (1999) Fast and effective text mining using linear-time document clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 16–22.
Chapter Google Scholar
Steinbach, M., Karypis, G., and Kumar, V. (2000) A comparison of document clustering techniques, in KDD Workshop on Text Mining, ACM Press, New York, NY, pp. 109–110.
Google Scholar
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998) Automatic sub- space clustering of high dimensional data for data mining applications, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 94–105.
Chapter Google Scholar
Burdick, D., Calimlim, M., and Gehrke, J. (2001) Mafia: a maximal frequent itemset algorithm for transactional databases, in Proceedings of the 17th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 443–452.
Chapter Google Scholar
Nagesh, H., Goil, S., and Choudhary, A. (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report TR #9906-010, Northwestern University.
Huang, Z. (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining, in Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona.
Ganti, V., Gehrke, J., and Ramakrishnan, R., (1999) Cactus-clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 73–83.
Chapter Google Scholar
Gibson, D., Kleinberg, J., and Rahavan, P. (1998) Clustering categorical data: an approach based on dynamical systems, in Proceedings of the 24th International Conference on Very Large Databases (Gupta, A., Shmueli, O., and Widom, J., eds.), pp. 311–323, San Francisco, CA, Morgan Kaufmann.
Google Scholar
Ryu, T. W. and Eick, C. F. (1998) A unified similarity measure for attributes with set or bag of values for database clustering, in The 6th International Workshop on Rough Sets, Data Mining and Granular Computing, Research Triangle Park, NC.
Google Scholar
Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, NY.
Google Scholar
Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, 345–352.
Google Scholar
Schwartz, R. M. and Dayhoff, M. O. (1978) Matrices for detecting distant relationships. Atlas Protein Sequence Struct. 5, 353–358.
Google Scholar
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10,915–10,919.
Article CAS Google Scholar
Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
Article PubMed CAS Google Scholar
Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Article PubMed CAS Google Scholar
Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435–1441.
Article PubMed CAS Google Scholar
Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.
Article PubMed CAS Google Scholar
Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
PubMed CAS Google Scholar
Yona, G., Linial, N., and Linial, M. (2000) Protomap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 28, 49–55.
Article PubMed CAS Google Scholar
Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., and Schrader, R. (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17, 935–941.
Article PubMed CAS Google Scholar
Johnson, M. S. and Lehtonen, J. V. (2000) Comparison of protein three dimensional structure, in Bioinformatics: Sequences, Structure and Databanks (Higgins, D. and Taylor, W., eds.), Oxford University Press, 2000.
Grindley, H. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707–721.
Article PubMed CAS Google Scholar
Mitchell, E. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1989) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151–166.
Article Google Scholar
Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123.
Article PubMed CAS Google Scholar
Kleywegt, G. J. and Jones, T. A. (1997) Detecting folding motifs and similarities in protein structures. Methods Enzymol. 277, 525–545.
Article PubMed CAS Google Scholar
Calinski, T. and Harabasz, J. (1974) A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.
Google Scholar
Ben-Hur, A., Elisseeff, A., and Guyon, I. (2000) A stability based method for discovering structure in clustered data, in Pacific Symposium on Biocomputing, vol. 7, World Scientific Press, Singapore, pp. 6–17.
Google Scholar
Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001) Validating clustering for gene expression data. Bioinformatics 17(4), 309–318.
Article PubMed CAS Google Scholar
Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P., and Adams, C. L. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555, 556.
Article PubMed CAS Google Scholar
Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270.
Brazma, A., Robinson, A., Cameron, G., and Ashburner, M. (2000) One-stop shop for microarray data. Nature 403, 699–701.
Article PubMed CAS Google Scholar
Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for cDNA microarray data, in Proceedings of SPIE International Biomedical Optics Symposium (Bittner, M. L., Chen, Y., Dorsel, A. N., and Dougherty, E. R., eds.), vol. 4266, SPIE, Bellingham, WA, pp. 141–152.
Google Scholar
D’Haeseleer, P., Fuhrman, S., Wen, X., and Somogyi, R. (1998) Mining the gene expression matrix: inferring gene relationships from large scale gene expression data, in, Information Processing in Cells and Tissues, Plenum, New York, pp. 203–212.
Google Scholar
Eisen, M. (2000) Cluster 2.11 and treeview 1.50 http://rana.lbl.gov/Eisen Software.htm.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907–2912.
Article PubMed CAS Google Scholar
Kohonen, T. Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
Google Scholar
Ben-Dor, A. and Yakhini, Z. (1999) Clustering gene expression patterns, in Proceedings of the 3rd Annual International Conference on Research in Computational Molecular Biology, ACM Press, New York, NY, pp. 33–42.
Chapter Google Scholar
Shamir, R. and Sharan, R. (2000) Click: a clustering algorithm for gene expression analysis, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (Altman, R., et al., eds.), AAAI Press, Menlo Park, CA, pp. 307–316.
Google Scholar
Zhao, Y. and Karypis, G. (2002) Comparison of agglomerative and partitional document clustering algorithms, in SIAM (2002) Workshop on Clustering High-Dimensional Data and Its Applications. Arlington, VA. Also available as technical report #02-014, University of Minnesota.
Karypis, G. and Kumar, V. (1998) hMeTiS 1.5: a hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/~karypis/metis
Karypis, G. and Kumar, V. (1998) MeTiS4.0: unstructured graph partitioning and sparse matrix ordering system. Technical report, department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/∼karypis/metis

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Minnesota, and Digital Technology Center and Army HPC Research Center, 55455, Minneapolis, MN
George Karypis

Authors

Ying Zhao
View author publications
You can also search for this author in PubMed Google Scholar
George Karypis
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to George Karypis.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhao, Y., Karypis, G. Data clustering in life sciences. Mol Biotechnol 31, 55–80 (2005). https://doi.org/10.1385/MB:31:1:055

Download citation

Issue Date: September 2005
DOI: https://doi.org/10.1385/MB:31:1:055

Index Entries

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Data clustering in life sciences

Abstract

Access this article

Similar content being viewed by others

Guiding biomedical clustering with ClustEval

Comparing the performance of biomedical clustering methods

Clustering Clinical Data in R

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Index Entries

Navigation

Data clustering in life sciences

Abstract

Access this article

Similar content being viewed by others

Guiding biomedical clustering with ClustEval

Comparing the performance of biomedical clustering methods

Clustering Clinical Data in R

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Index Entries

Search

Navigation