Can matching improve the performance of boosting for identifying important genes in observational studies?

  • Original Paper
Computational Statistics

Abstract

When two groups of individuals are compared with respect to gene expression, potentially confounding variables often differ between the groups. Matching is an established approach for obtaining comparable groups, which then allows univariate tests for each gene. Alternatively, the confounders can be incorporated directly into a multivariable regression model for adjustment. In contrast to univariate tests, such models can consider all genes simultaneously. Aiming to combine the advantages of matching and multivariable modeling, we consider a matching-based boosting procedure for fitting risk prediction models in two-group settings. This may make it possible to identify and automatically remove problematic observations that might negatively affect the regression model. We therefore compare, in a simulation study with different covariate correlation structures, how well the combination of matching and boosting identifies important covariates relative to boosting alone. Furthermore, we analyze the prediction performance of these approaches on two gene expression microarray studies. The first study comprises patients with B-cell and T-cell type acute lymphoblastic leukemia, and the second comprises patients with acute megakaryoblastic leukemia. While the matching component can in principle guard against problematic observations, the combined approach was found to improve neither the identification of important covariates nor prediction performance. Therefore, a combination of the two approaches cannot be recommended. Adjustment for potential confounders was found to provide the best performance, i.e., a pure multivariable regression modeling strategy seems promising even in the presence of considerable heterogeneity.
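To make the idea of combining matching with boosting more concrete, the following is a minimal Python sketch and not the procedure used in the article. It assumes a single hypothetical confounder (called age here), greedy 1:1 nearest-neighbour matching with a caliper, and a plain componentwise gradient boosting fit of a logistic model; the function names `greedy_match` and `componentwise_boost`, the simulated data, and all parameter values are illustrative assumptions.

```python
import numpy as np


def greedy_match(conf_a, conf_b, caliper):
    """Greedy 1:1 nearest-neighbour matching on one confounder (illustrative).

    Returns index pairs (i, j) with |conf_a[i] - conf_b[j]| <= caliper.
    Observations without a close enough partner stay unmatched.
    """
    free_b = list(range(len(conf_b)))
    pairs = []
    for i, value in enumerate(conf_a):
        if not free_b:
            break
        j = min(free_b, key=lambda k: abs(conf_b[k] - value))
        if abs(conf_b[j] - value) <= caliper:
            pairs.append((i, j))
            free_b.remove(j)
    return pairs


def componentwise_boost(X, y, steps=50, nu=0.1):
    """Componentwise gradient boosting for a logistic model, y in {0, 1}.

    Assumes no covariate column is constant. In each step only the single
    covariate that best fits the current residuals is updated, so many
    coefficients stay exactly zero.
    """
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize covariates
    n, p = X.shape
    beta = np.zeros(p)
    intercept = np.log(y.mean() / (1 - y.mean()))
    for _ in range(steps):
        prob = 1 / (1 + np.exp(-(intercept + X @ beta)))
        resid = y - prob                 # negative gradient of the binomial loss
        coef = X.T @ resid / n           # least-squares slope per covariate
        best = np.argmax(np.abs(coef))   # covariate with the best residual fit
        beta[best] += nu * coef[best]    # small step on that covariate only
    return beta


# Simulated two-group data (all values are assumptions for illustration).
rng = np.random.default_rng(0)
n = 40
age_a = rng.normal(50, 10, n)            # hypothetical confounder, group A
age_b = rng.normal(60, 10, n)            # group B is older on average
genes_a = rng.normal(size=(n, 20))
genes_b = rng.normal(size=(n, 20))
genes_a[:, 0] += 1.5                     # gene 0 truly differs between groups

# Match first, then boost on the matched observations only.
pairs = greedy_match(age_a, age_b, caliper=2.0)
idx_a = [i for i, _ in pairs]
idx_b = [j for _, j in pairs]
X = np.vstack([genes_a[idx_a], genes_b[idx_b]])
y = np.concatenate([np.ones(len(idx_a)), np.zeros(len(idx_b))])
beta = componentwise_boost(X, y)
print("matched pairs:", len(pairs))
print("selected genes:", np.flatnonzero(beta))
```

The alternative discussed in the abstract, adjusting for confounders within the regression model, would instead keep all observations and include the confounder as an additional covariate in the boosting fit.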

Author information

Corresponding author

Correspondence to Veronika Reiser.

About this article

Cite this article

Reiser, V., Porzelius, C., Stampf, S. et al. Can matching improve the performance of boosting for identifying important genes in observational studies? Comput Stat 28, 37–49 (2013). https://doi.org/10.1007/s00180-012-0306-4

  • DOI: https://doi.org/10.1007/s00180-012-0306-4
