Hybrid Mixture Model for Subpopulation Identification

Chen, Hung-Chia; Chen, James J.

doi:10.1007/s12561-015-9131-y

Published: 01 May 2015

Statistics in Biosciences, Volume 8, pages 28–42 (2016)

Hung-Chia Chen¹ &
James J. Chen¹

Abstract

Personalized medicine aims to identify patients who have a good or poor prognosis for overall disease outcome or for the therapeutic efficacy of a specific treatment. A well-established approach is to select a set of biomarkers using statistical methods and then apply a classification algorithm to assign patients to subgroups for treatment selection. However, classification can produce false positives and false negatives, resulting in incorrect treatment assignment. In this paper, we propose a hybrid mixture model that takes uncertainty in the class labels into account by modeling each label as a Bernoulli random variable. An EM algorithm was developed to estimate the model parameters, and a parametric bootstrap method was used to test the significance of the predictive variables associated with subgroup membership. In simulation experiments, the proposed method had, on average, higher accuracy in identifying the subpopulations than the Naïve Bayes classifier and logistic regression. A breast cancer dataset was analyzed to illustrate the proposed hybrid mixture model.
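As an illustration only, the following Python sketch shows the generic EM iteration for a two-component normal mixture in which each observation carries a latent Bernoulli class label. It is not the authors' hybrid model, whose details are not given in the abstract; the function names (`simulate`, `em_two_component`), the univariate normal components, the starting values, and the simulated data are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simulate(n, p, mu, sd, seed=0):
    """Draw labels Z_i ~ Bernoulli(p), then X_i | Z_i = k ~ N(mu[k], sd[k]^2)."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < p else 0 for _ in range(n)]
    xs = [rng.gauss(mu[z], sd[z]) for z in labels]
    return xs, labels


def em_two_component(xs, n_iter=500, tol=1e-8):
    """Fit a two-component normal mixture by EM (hypothetical helper).

    Returns (p, mu, sd, post) where p estimates P(Z = 1) and
    post[i] approximates P(Z_i = 1 | x_i).
    """
    n = len(xs)
    ordered = sorted(xs)
    half = n // 2
    # Crude starting values: means of the lower and upper halves of the data.
    mu = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    centre = sum(xs) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in xs) / n)
    sd = [spread, spread]
    p = 0.5
    post = [0.5] * n
    prev = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each observation has label 1.
        loglik = 0.0
        for i, x in enumerate(xs):
            d0 = (1.0 - p) * normal_pdf(x, mu[0], sd[0])
            d1 = p * normal_pdf(x, mu[1], sd[1])
            total = max(d0 + d1, 1e-300)  # guard against underflow
            post[i] = d1 / total
            loglik += math.log(total)
        # M-step: weighted updates of the Bernoulli probability, means, and SDs.
        w1 = sum(post)
        w0 = n - w1
        p = w1 / n
        mu[0] = sum((1.0 - r) * x for r, x in zip(post, xs)) / w0
        mu[1] = sum(r * x for r, x in zip(post, xs)) / w1
        var0 = sum((1.0 - r) * (x - mu[0]) ** 2 for r, x in zip(post, xs)) / w0
        var1 = sum(r * (x - mu[1]) ** 2 for r, x in zip(post, xs)) / w1
        sd[0] = max(math.sqrt(var0), 1e-6)
        sd[1] = max(math.sqrt(var1), 1e-6)
        # Stop once the log-likelihood no longer improves meaningfully.
        if loglik - prev < tol:
            break
        prev = loglik
    return p, mu, sd, post
```

A subject would be assigned to subgroup 1 when its posterior probability exceeds 0.5, for example `pred = [1 if r > 0.5 else 0 for r in post]`. A parametric bootstrap test of the kind the abstract mentions would, under the same assumptions, refit the model to datasets simulated from the fitted null model and compare the observed test statistic with the resulting distribution.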

Author information

Authors and Affiliations

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, HFT-20, Jefferson, AR, 72079, USA
Hung-Chia Chen & James J. Chen

Corresponding author

Correspondence to James J. Chen.

Additional information

The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.

About this article

Cite this article

Chen, HC., Chen, J.J. Hybrid Mixture Model for Subpopulation Identification. Stat Biosci 8, 28–42 (2016). https://doi.org/10.1007/s12561-015-9131-y

Received: 11 January 2014
Revised: 09 March 2015
Accepted: 20 April 2015
Published: 01 May 2015
Issue Date: June 2016
DOI: https://doi.org/10.1007/s12561-015-9131-y
