Abstract
In molecular biology, we are often interested in determining the group structure in, for example, a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method's assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute-based and similarity-based clustering, describing the methods and their performance. The parametric and nonparametric approaches presented differ in whether the number of clusters must be known in advance and in the shapes of the clusters they estimate. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.
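As a rough illustration of what comparing two clustering solutions can involve, the sketch below computes two widely used pair-counting measures, the Rand index and its chance-adjusted variant. That these particular measures match the chapter's treatment is an assumption; the function name `rand_indices` and the toy labelings are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two labelings of the same items.

    Illustrative helper, not code from the chapter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        raise ValueError("need at least two observations")

    # Number of pairs placed in the same cluster by both clusterings,
    # by clustering A, and by clustering B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Rand index: share of pairs on which the clusterings agree
    # (together in both, or apart in both).
    agreements = total_pairs + 2 * together_both - together_a - together_b
    rand = agreements / total_pairs

    # Adjusted Rand index: subtract the agreement expected by chance.
    expected = together_a * together_b / total_pairs
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        adjusted = 1.0  # degenerate case, e.g. both clusterings are trivial
    else:
        adjusted = (together_both - expected) / (max_index - expected)
    return rand, adjusted


if __name__ == "__main__":
    # Same grouping under different label names: the solutions are identical.
    print(rand_indices([0, 0, 1, 1], ["b", "b", "a", "a"]))
    # Two groupings that cut across each other.
    print(rand_indices([0, 0, 1, 1], [0, 1, 0, 1]))
```

Both measures ignore the label names themselves, so solutions from different methods (or from different starting values of the same method) can be compared directly. In the second toy call the clusterings agree on only 2 of the 6 pairs, giving a Rand index of 1/3 and an adjusted value of about -0.5, that is, less agreement than expected by chance.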
Copyright information
© 2010 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Nugent, R., Meila, M. (2010). An Overview of Clustering Applied to Molecular Biology. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_12
DOI: https://doi.org/10.1007/978-1-60761-580-4_12
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60761-578-1
Online ISBN: 978-1-60761-580-4
eBook Packages: Springer Protocols