Clustering of Gene Expression Data Via Normal Mixture Models

  • G. J. McLachlan
  • L. K. Flack
  • S. K. Ng
  • K. Wang
Part of the Methods in Molecular Biology book series (MIMB, volume 972)


There are two distinct but related clustering problems with microarray data. One problem concerns the clustering of the tissue samples (gene signatures) on the basis of the genes; the other concerns the clustering of the genes on the basis of the tissues (gene profiles). The clusters of tissues so obtained in the first problem can play a useful role in the discovery and understanding of new subclasses of diseases. The clusters of genes obtained in the second problem can be used to search for genetic pathways or groups of genes that might be regulated together. Also, in the first problem, we may wish first to summarize the information in the very large number of genes by clustering them into groups (of hyperspherical shape), which can be represented by some metagenes, such as the group sample means. We can then carry out the clustering of the tissues in terms of these metagenes. We focus here on mixtures of normals to provide a model-based clustering of tissue samples (gene signatures) and of gene profiles.
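The two-stage idea described above (cluster the genes, summarize each gene group by its sample mean as a metagene, then cluster the tissues on the metagenes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names fit_spherical_mixture and metagenes are hypothetical, the data are synthetic, the numbers of components are taken as known, and a plain normal mixture with spherical component covariances fitted by EM stands in for the more elaborate models named in the key words (mixtures of common factor analyzers, mixtures of linear mixed-effects models).

```python
import numpy as np

def fit_spherical_mixture(X, g, n_iter=100, seed=0):
    """EM for a g-component normal mixture with covariances var_i * I.

    X is (n observations) x (p variables).
    Returns (posterior probabilities, hard cluster labels).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Farthest-point seeding of the means (an illustrative choice of start).
    centers = [X[rng.integers(n)]]
    for _ in range(1, g):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    mu = np.array(centers, dtype=float)      # component means, (g, p)
    var = np.full(g, X.var() + 1e-6)         # spherical variances
    pi = np.full(g, 1.0 / g)                 # mixing proportions
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(pi) - 0.5 * p * np.log(2 * np.pi * var) - 0.5 * d2 / var
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        tau = np.exp(log_r)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: update proportions, means, and variances.
        nk = tau.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (tau.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (tau * d2).sum(axis=0) / (nk * p) + 1e-6
    return tau, tau.argmax(axis=1)

def metagenes(expr, gene_labels, g):
    """Group sample means: returns a tissues x (non-empty groups) matrix."""
    groups = [k for k in range(g) if np.any(gene_labels == k)]
    return np.column_stack([expr[gene_labels == k].mean(axis=0) for k in groups])

# Synthetic example: 60 genes x 20 tissues, two tissue classes of 10 each.
rng = np.random.default_rng(1)
pattern = np.zeros((60, 20))
pattern[:20, :10] = 3.0      # genes 0-19 elevated in the first tissue class
pattern[20:40, 10:] = 3.0    # genes 20-39 elevated in the second tissue class
expr = pattern + rng.normal(scale=0.3, size=pattern.shape)

# Stage 1: cluster the gene profiles (rows) into three groups.
_, gene_labels = fit_spherical_mixture(expr, g=3)
# Stage 2: summarize each gene group by its metagene, then cluster the tissues.
meta = metagenes(expr, gene_labels, g=3)
tau_t, tissue_labels = fit_spherical_mixture(meta, g=2)
print(tissue_labels)
```

The spherical covariance structure is used here only because the text speaks of gene groups of hyperspherical shape; on real microarray data the number of components would have to be chosen (for example by an information criterion) and the EM algorithm started from several initial values.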

Key words

Clustering of tissue samples; Clustering of gene profiles; Model-based methods; Normal mixture models; Mixtures of common factor analyzers; Mixtures of linear mixed-effects models



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • G. J. McLachlan (1), corresponding author
  • L. K. Flack (2)
  • S. K. Ng (3)
  • K. Wang (4)

  1. Department of Mathematics, University of Queensland, Brisbane, Australia
  2. Department of Rheumatology, University of NSW, Ryde, Australia
  3. School of Medicine, Griffith University, Meadowbrook, Australia
  4. University of Queensland, Melbourne, Australia
