Skip to main content

An Overview of Clustering Applied to Molecular Biology

  • Protocol
  • First Online:
Statistical Methods in Molecular Biology

Part of the book series: Methods in Molecular Biology ((MIMB,volume 620))

Abstract

In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method’s assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing both the methods and their performance. The parametric and nonparametric approaches presented vary in whether or not they require knowing the number of clusters in advance as well as the shapes of the estimated clusters. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Protocol
USD 49.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 149.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 199.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 219.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Getz, G, Levine, E, and Domany, E (2000). Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences, 97(22): 12079–12084.

    Article  CAS  Google Scholar 

  2. Lo, K., Brinkman R., and Gottardo, R. (2008). Automated gating of flow cytometry data via robust model-based clustering. Cytometry, Part A, 73A: 321–332.

    Article  Google Scholar 

  3. Gottardo, R. and Lo, K. (2008). flowClust Bioconductor package

    Google Scholar 

  4. Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137.

    Article  Google Scholar 

  5. Mardia, K., Kent, J.T., and Bibby, J.M. (1979). Multivariate Analysis. Academic Press, New york.

    Google Scholar 

  6. Fraley, C. and Raftery, A. (1998). How many clusters? Which clustering method? answers via model-based cluster analysis. The Computer Journal, 41:578–588.

    Article  Google Scholar 

  7. McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York

    Google Scholar 

  8. Dean, N., Murphy, T.B., and Downey, G. (2006). Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C 55(1):1–14.

    Article  Google Scholar 

  9. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1–38.

    Google Scholar 

  10. Banfield, J. D. and Raftery, A. E. (1993). Model-based gaussian and non-gaussian clustering. Biometrics, 49:803–821.

    Article  Google Scholar 

  11. Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining effect. In A. J. Cole, (ed.) Numerical Taxonomy, Academic Press, New York, 282–311.

    Google Scholar 

  12. Hartigan, J. A. (1975) Clustering Algorithms. Wiley, New York

    Google Scholar 

  13. Hartigan, J. A. (1981) Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76:388–394.

    Article  Google Scholar 

  14. Hartigan, J. A. (1985) Statistical theory in clustering. Journal of Classification, 2:63–76.

    Article  Google Scholar 

  15. Silverman, B.W. (1981) Using kernel density estimate to investigate multimodality. Journal of the Royal Statistical Society, Series B, 43:97–99.

    Google Scholar 

  16. Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. Chapman & Hall, New York.

    Google Scholar 

  17. Wand, M. P. (1994) Fast computation of multivariate kernel estimators. Journal of Computational and Graphical Statistics, 3: 433–445.

    Google Scholar 

  18. Stuetzle, W. (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification. 20:25–47.

    Article  Google Scholar 

  19. Stuetzle, W. and Nugent, R. (2009). A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, in Press. The Fast Track version is at DOI:10.1198/jcgs.2009.070409

    Google Scholar 

  20. Comăaniciu, D. and Meer, P. (1999). Mean-shift analysis and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17: 790–799.

    Google Scholar 

  21. Fukunaga, K. and Hostetler, L. D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions Information Theory, IT-21:32–40.

    Article  Google Scholar 

  22. Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transaction. on Pattern Analysis and Machine Intelligence, 17(8):790–799.

    Article  Google Scholar 

  23. Carreira-Perpiñan, M. A. (2007). Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5): 767–776.

    Article  PubMed  Google Scholar 

  24. Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265.

    Google Scholar 

  25. Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York. (Springer Series in Statistics).

    Google Scholar 

  26. Gan, G., Ma, C., and Wu, J. (2007) Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA.

    Book  Google Scholar 

  27. Lafon, S. and Lee, A. (2006) Diffusion Maps and Course-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9): 1393–1403.

    Article  PubMed  Google Scholar 

  28. Meila, M. and Shi, J. (2001b). A random walks view of spectral segmentation. In Jaakkola, T. and Richardson, T. (eds.), Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS), January 4–7, 2001, Key West, Florida.

    Google Scholar 

  29. Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Dietterich, T. G., Becker, S., and Ghahramani, Z. (ed.), Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA.

    Google Scholar 

  30. Meilăa, M. and Shi, J. (2001a). Learning segmentation by random walks. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, Cambridge, MA. pp. 873–879.

    Google Scholar 

  31. Ding, C. and He, X. (2004). K-means clustering via principal component analysis. In Brodley, C. E., (ed.), Proceedings of the International Machine Learning Conference (ICML). Morgan Kauffman.

    Google Scholar 

  32. Bach, F. and Jordan, M. I. (2006). Learning spectral clustering with applications to speech separation. Journal of Machine Learning Research, 7:1963–2001.

    Google Scholar 

  33. Meilăa, M., Shortreed, S., and Xu, L. (2005). Regularized spectral learning. In Cowell, R. and Ghahramani, Z., (eds.), Proceedings of the Artificial Intelligence and Statistics Workshop(AISTATS 05).

    Google Scholar 

  34. Shortreed, S. and Meilăa, M. (2005). Unsupervised spectral learning. In Jaakkola, T. and Bachhus, F. (ed.), Proceedings of the 21st Conference on Uncertainty in AI, AUAI Press, Arlington, Virginia, pp. 534–544.

    Google Scholar 

  35. Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. Science, 315:973–976.

    Article  Google Scholar 

  36. Cheng, Y. and Church, G.M. (2000). Biclustering of Expression Data. Proceedings of the International Conference on Intelligent Systems in Molecular Biology. 8:93–103.

    CAS  Google Scholar 

  37. Lazzeroni, L. and Owen, A. (2000). Plaid Models for Gene Expression Data. Statistica Sinica 12:61–86.

    Google Scholar 

  38. Friedman, J. and Meulman, J. (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society, 66: 815–849.

    Article  Google Scholar 

  39. Raftery, A.E. and Dean, N. (2006) Variable Selection for Model-Based Clustering Journal of the American Statistical Association, 101(473) 168–178.

    Article  CAS  Google Scholar 

  40. Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.

    Article  Google Scholar 

  41. Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pp. 6–17.

    Google Scholar 

  42. Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193–218.

    Article  Google Scholar 

  43. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850.

    Article  Google Scholar 

  44. Wallace, D. L. (1983). Comment. Journal of the American Statistical Association, 78(383):569–576.

    Google Scholar 

  45. Meilăa, M. (2005). Comparing clusterings – an axiomatic view. In Wrobel, S. and De Raedt, L. (eds.), Proceedings of the International Machine Learning Conference (ICML). ACM Press, New York.

    Google Scholar 

  46. Steinley, D. L. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386–396. Simulations of some adjusted indices and of misclassification error.

    Google Scholar 

  47. Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.

    Book  Google Scholar 

  48. Meilăa, M. (2007). Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98:873–895.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Nugent, R., Meila, M. (2010). An Overview of Clustering Applied to Molecular Biology. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_12

Download citation

  • DOI: https://doi.org/10.1007/978-1-60761-580-4_12

  • Published:

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-60761-578-1

  • Online ISBN: 978-1-60761-580-4

  • eBook Packages: Springer Protocols

Publish with us

Policies and ethics