Abstract
Personalized medicine aims to identify patients with a good or poor prognosis, either for overall disease outcome or for the efficacy of a specific treatment. A well-established approach is to select a set of biomarkers with statistical methods and then apply a classification algorithm to identify patient subgroups for treatment selection. However, classification can produce false positives and false negatives, which lead to incorrect treatment assignment. In this paper, we propose a hybrid mixture model that takes uncertainty in the class labels into account by modeling the class labels as Bernoulli random variables. An EM algorithm was developed to estimate the model parameters, and a parametric bootstrap method was used to test the significance of the predictive variables associated with subgroup membership. In simulation experiments, the proposed method achieved, on average, higher accuracy in identifying the subpopulations than the naïve Bayes classifier and logistic regression. A breast cancer dataset was analyzed to illustrate the proposed hybrid mixture model.
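To make the idea of EM estimation with Bernoulli class labels concrete, the following is a minimal sketch of EM for a generic two-component univariate Gaussian mixture. It is not the hybrid mixture model of this paper, whose likelihood is not given in the abstract. The Gaussian components, the median-split initialization, and the names `normal_pdf` and `em_two_gaussians` are all assumptions made for illustration.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def em_two_gaussians(x, n_iter=500, tol=1e-8):
    """EM for a two-component Gaussian mixture (illustrative assumption,
    not the paper's model). The latent label of each observation is
    Bernoulli(pi); pi is the prior probability of component 1."""
    x = np.asarray(x, dtype=float)
    # Crude initialization: split the sample at its median.
    med = np.median(x)
    low, high = x[x <= med], x[x > med]
    pi = 0.5
    mu = np.array([low.mean(), high.mean()])
    sd = np.maximum(np.array([low.std(), high.std()]), 1e-6)
    prev_loglik = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each point is in component 1.
        dens0 = (1.0 - pi) * normal_pdf(x, mu[0], sd[0])
        dens1 = pi * normal_pdf(x, mu[1], sd[1])
        total = dens0 + dens1
        resp = dens1 / total
        # Stop when the observed-data log-likelihood no longer improves.
        loglik = np.log(total).sum()
        if loglik - prev_loglik < tol:
            break
        prev_loglik = loglik
        # M-step: weighted updates of the mixing proportion, means, and SDs.
        pi = resp.mean()
        w = np.vstack([1.0 - resp, resp])
        mu = (w @ x) / w.sum(axis=1)
        var = (w * (x - mu[:, None]) ** 2).sum(axis=1) / w.sum(axis=1)
        sd = np.maximum(np.sqrt(var), 1e-6)
    return pi, mu, sd, resp


# Simulated example: two subpopulations with means 0 and 5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
pi_hat, mu_hat, sd_hat, resp = em_two_gaussians(x)
labels = (resp > 0.5).astype(int)  # hard subgroup assignment
print(pi_hat, mu_hat, sd_hat)
```

In this sketch the posterior probabilities `resp` play the role of soft subgroup memberships. Thresholding them at 0.5 is only one possible way to assign patients to subgroups.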
Additional information
The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.
About this article
Cite this article
Chen, HC., Chen, J.J. Hybrid Mixture Model for Subpopulation Identification. Stat Biosci 8, 28–42 (2016). https://doi.org/10.1007/s12561-015-9131-y
DOI: https://doi.org/10.1007/s12561-015-9131-y