A new clustering method of gene expression data based on multivariate Gaussian mixture models

  • Original Paper
  • Signal, Image and Video Processing

Abstract

Clustering gene expression data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Many clustering methods have been applied to gene clustering. This paper proposes a new method for clustering gene expression data based on an improved expectation-maximization (EM) method for multivariate Gaussian mixture models. To reduce the classical EM algorithm's over-reliance on initialization, we propose a remove-and-add initialization and apply a random perturbation to the solution before continuing the EM iterations. The number of clusters is estimated with the quasi-Akaike information criterion. The improved EM method is tested and compared with several other clustering methods on simulated and real gene expression data sets. Our results indicate that the improved EM clustering method outperforms the other clustering algorithms compared and can be widely used for gene clustering.
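
The general idea of perturbing an EM solution and choosing the number of clusters with an information criterion can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes diagonal covariances, omits the remove-and-add initialization (the abstract does not describe it in detail), and assumes the common form QAIC = -2 ln L / c_hat + 2K, where c_hat is a variance inflation factor. All function names and default values (perturbed_em, select_k, scale, n_restarts) are hypothetical.

```python
import numpy as np

def log_gauss(X, mu, var):
    """Log density of each point under each diagonal Gaussian; shape (n, k)."""
    diff = X[:, None, :] - mu[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / var[None], axis=2)
                   + np.sum(np.log(2 * np.pi * var), axis=1)[None, :])

def e_step(X, mu, var, pi):
    """Return responsibilities (n, k) and the total log-likelihood."""
    log_p = np.log(pi)[None, :] + log_gauss(X, mu, var)
    m = log_p.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
    return np.exp(log_p - log_norm), float(log_norm.sum())

def em(X, mu, var, pi, n_iter=50):
    """Classical EM for a diagonal-covariance Gaussian mixture."""
    for _ in range(n_iter):
        r, _ = e_step(X, mu, var, pi)
        nk = r.sum(axis=0) + 1e-10            # guard against empty clusters
        pi = nk / nk.sum()
        mu = (r.T @ X) / nk[:, None]
        var = np.maximum((r.T @ X ** 2) / nk[:, None] - mu ** 2, 1e-6)
    _, ll = e_step(X, mu, var, pi)
    return mu, var, pi, ll

def perturbed_em(X, k, n_restarts=5, scale=0.1, seed=0):
    """Run EM, then repeatedly perturb the means and continue EM,
    keeping a perturbed solution only if its log-likelihood is higher."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    best = em(X, mu, var, np.full(k, 1.0 / k))
    for _ in range(n_restarts):
        noise = scale * X.std(axis=0) * rng.standard_normal((k, d))
        cand = em(X, best[0] + noise, best[1], best[2])
        if cand[3] > best[3]:
            best = cand
    return best

def qaic(loglik, n_params, c_hat=1.0):
    """Assumed QAIC form; with c_hat = 1 it reduces to the ordinary AIC."""
    return -2.0 * loglik / c_hat + 2.0 * n_params

def select_k(X, k_values, c_hat=1.0):
    """Fit one mixture per candidate k and return the k with the lowest QAIC."""
    d = X.shape[1]
    scores = {}
    for k in k_values:
        *_, ll = perturbed_em(X, k)
        # k*d means + k*d variances + (k - 1) free mixing weights
        scores[k] = qaic(ll, 2 * k * d + (k - 1), c_hat)
    return min(scores, key=scores.get), scores
```

Calling select_k(X, range(1, 6)) on an (n_genes, n_conditions) array returns the candidate cluster count with the lowest score together with all scores. Full covariance matrices, as implied by "multivariate Gaussian mixture models", would change the parameter count and the M-step.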

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61402204), the Jiangsu Province Natural Science Foundation (Nos. BK20130529, BK2012209), the Research Fund for the Doctoral Program of Higher Education of China (No. 20113227110010), the research foundation for talented scholars, Jiangsu University (No. 14JDG141), and the science and technology program of Zhenjiang city (SH20140110).

Author information

Corresponding author

Correspondence to Zhe Liu.

About this article

Cite this article

Liu, Z., Song, Yq., Xie, Ch. et al. A new clustering method of gene expression data based on multivariate Gaussian mixture models. SIViP 10, 359–368 (2016). https://doi.org/10.1007/s11760-015-0749-5

  • DOI: https://doi.org/10.1007/s11760-015-0749-5
