An Overview of Clustering Applied to Molecular Biology

Nugent, Rebecca; Meila, Marina

doi:10.1007/978-1-60761-580-4_12

Part of the book series: Methods in Molecular Biology (MIMB, volume 620)

Abstract

In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method’s assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing both the methods and their performance. The parametric and nonparametric approaches presented differ in whether they require the number of clusters to be known in advance and in the shapes of the clusters they estimate. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.
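To make the idea of comparing two clustering solutions concrete, here is a minimal sketch in Python. The abstract does not say which comparison measures the chapter covers, so the choice of the Rand index (the fraction of observation pairs on which two clusterings agree) is an assumption made for illustration. The function name `rand_index` and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings place the two
    observations together, or both place them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    agreements = 0
    for i, j in pairs:
        together_in_a = labels_a[i] == labels_a[j]
        together_in_b = labels_b[i] == labels_b[j]
        if together_in_a == together_in_b:
            agreements += 1
    return agreements / len(pairs)


# Hypothetical cluster labels for six observations from two runs
# (illustrative values only, not data from the chapter).
run_1 = [0, 0, 0, 1, 1, 1]
run_2 = ["b", "b", "a", "a", "a", "a"]
relabelled_run_1 = [1, 1, 1, 0, 0, 0]

print(rand_index(run_1, run_2))
print(rand_index(run_1, relabelled_run_1))
```

Only co-membership of pairs enters the calculation, so the index does not change when cluster labels are renamed. This matters when the two solutions come from different methods or from different starting values. The plain Rand index is not corrected for chance agreement; chance-adjusted variants exist but are not sketched here.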

Author information

Authors and Affiliations

Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA
Rebecca Nugent
Department of Statistics, University of Washington, Seattle, WA, USA
Marina Meila

Editor information

Editors and Affiliations

Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Heejung Bang
Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Xi Kathy Zhou
Journal of Experimental Medicine, Rockefeller University Press, First Ave. 1114, New York, 10021, New York, USA
Heather L. van Epps
Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Madhu Mazumdar

About this protocol

Cite this protocol

Nugent, R., Meila, M. (2010). An Overview of Clustering Applied to Molecular Biology. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_12

DOI: https://doi.org/10.1007/978-1-60761-580-4_12
Published: 15 December 2009
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60761-578-1
Online ISBN: 978-1-60761-580-4
eBook Packages: Springer Protocols
