Statistical Methods in Molecular Biology pp 369-404
An Overview of Clustering Applied to Molecular Biology
Abstract
In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method’s assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing both the methods and their performance. The parametric and nonparametric approaches presented differ in whether they require the number of clusters to be known in advance and in the shapes of the clusters they estimate. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.
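As a small illustration of what comparing two clustering solutions can look like, the sketch below computes the unadjusted Rand index, i.e., the fraction of observation pairs that two partitions treat the same way (both together or both apart). This is one common choice assumed here for illustration, not necessarily the measure the chapter recommends; the function name `rand_index` and the label vectors are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two partitions agree.

    A pair "agrees" when both partitions put the two observations in the
    same cluster, or both put them in different clusters. Cluster label
    values themselves do not matter, only co-membership.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same observations")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    if total == 0:
        raise ValueError("need at least two observations")
    return agree / total


# Hypothetical labels for six observations from two clustering runs
# (e.g., two methods, or one method with different starting values).
run_1_labels = [0, 0, 0, 1, 1, 1]
run_2_labels = [1, 1, 0, 0, 0, 0]

print(rand_index(run_1_labels, run_2_labels))
```

The index equals 1 when the two partitions are identical up to relabeling. Because it is not corrected for chance agreement, adjusted variants are often preferred in practice.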
Key words
Cluster analysis, K-means, model-based clustering, EM algorithm, similarity-based clustering, spectral clustering, nonparametric clustering, hierarchical clustering, biclustering, comparing partitions