Hybrid Mixture Model for Subpopulation Identification

Statistics in Biosciences

Abstract

Personalized medicine aims to identify patients who have a good or poor prognosis for overall disease outcomes, or for the therapeutic efficacy of a specific treatment. A well-established approach is to select a set of biomarkers using statistical methods and then apply a classification algorithm to identify patient subgroups for treatment selection. However, classification can produce false positives and false negatives, which lead to incorrect treatment assignment. In this paper, we propose a hybrid mixture model that takes uncertainty in class labels into consideration, where the class labels are modeled by a Bernoulli random variable. An EM algorithm was developed to estimate the model parameters, and a parametric bootstrap method was used to test the significance of the predictive variables associated with subgroup membership. Simulation experiments showed that, on average, the proposed method identified the subpopulations more accurately than the Naïve Bayes classifier and logistic regression. A breast cancer dataset was analyzed to illustrate the proposed hybrid mixture model.
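
The abstract does not state the model equations, so the following is only a minimal sketch of the general idea of fitting a mixture by EM when the recorded class labels may be wrong. It is not the authors' hybrid mixture model. It assumes a single continuous biomarker, two Gaussian components, and a symmetric label-flip probability `eps`; the function name `em_noisy_labels` and all parameter names are hypothetical.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def em_noisy_labels(x, y, n_iter=200, tol=1e-8):
    """EM for a two-component Gaussian mixture with possibly mislabeled classes.

    Assumed model (illustrative only, not taken from the paper):
      z_i ~ Bernoulli(pi)             latent true class
      x_i | z_i ~ N(mu[z_i], sigma[z_i]^2)
      y_i = z_i with prob. 1 - eps, flipped with prob. eps
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)

    # Initialise from the observed (possibly wrong) labels.
    pi = y.mean()
    mu = np.array([x[y == 0].mean(), x[y == 1].mean()])
    sigma = np.array([x[y == 0].std(), x[y == 1].std()])
    eps = 0.1
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: posterior probability that the true class is 1,
        # combining the biomarker density with the label likelihood.
        p_y_given_1 = np.where(y == 1, 1.0 - eps, eps)
        p_y_given_0 = np.where(y == 0, 1.0 - eps, eps)
        w1 = pi * normal_pdf(x, mu[1], sigma[1]) * p_y_given_1
        w0 = (1.0 - pi) * normal_pdf(x, mu[0], sigma[0]) * p_y_given_0
        total = w0 + w1
        r = w1 / total
        ll = np.log(total).sum()  # observed-data log-likelihood

        # M-step: weighted updates of all parameters.
        pi = r.mean()
        mu = np.array([np.sum((1.0 - r) * x) / np.sum(1.0 - r),
                       np.sum(r * x) / np.sum(r)])
        sigma = np.array([
            np.sqrt(np.sum((1.0 - r) * (x - mu[0]) ** 2) / np.sum(1.0 - r)),
            np.sqrt(np.sum(r * (x - mu[1]) ** 2) / np.sum(r)),
        ])
        # Expected fraction of records whose label disagrees with the true class.
        eps = np.mean(r * (y == 0) + (1.0 - r) * (y == 1))

        if ll - prev_ll < tol:
            break
        prev_ll = ll

    # "posterior" holds the responsibilities from the last E-step.
    return {"pi": pi, "mu": mu, "sigma": sigma, "eps": eps, "posterior": r}
```

Thresholding `posterior` at 0.5 gives a subgroup assignment that uses both the biomarker value and the recorded label. A parametric bootstrap of the kind the abstract mentions could, under the same assumptions, repeatedly simulate data from a fitted null model and refit with this function, but the specific test statistic used in the paper is not given here.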

Author information

Corresponding author

Correspondence to James J. Chen.

Additional information

The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.

About this article

Cite this article

Chen, HC., Chen, J.J. Hybrid Mixture Model for Subpopulation Identification. Stat Biosci 8, 28–42 (2016). https://doi.org/10.1007/s12561-015-9131-y

  • DOI: https://doi.org/10.1007/s12561-015-9131-y
