A Two-Stage Procedure for the Removal of Batch Effects in Microarray Studies

Statistics in Biosciences

Abstract

Different batches are routinely present in microarray studies, and it is well known that such batches commonly introduce non-biological variability that can confound biological differences. These undesired effects must be removed for unbiased inference, which is often done either with normalization methods that do not take into account all the available information or with models that rely on strong parametric assumptions. We have developed a new method for batch effect removal, named ber, which is based on a two-stage procedure for estimating location and scale parameters. Batch effects and biological differences are estimated using a regression approach and bagging, so only mild distributional assumptions are required. We compared ber with other commonly employed methods and showed that ber can lead to higher power in detecting differentially expressed genes. Applying ber to a real microarray study led to interpretable biological results. The method is implemented in the R package ber, available through the CRAN repository.
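As a rough illustration of the two-stage idea described in the abstract, the sketch below adjusts a single gene: stage one fits a regression on batch indicators and biological covariates to estimate location effects, and stage two regresses the squared residuals on the batch indicators to estimate batch-specific scale. This is not the code of the ber package. Bagging is omitted, and the function names (`ols`, `two_stage_adjust`), the toy data, and the choice to rescale residuals to a pooled variance are assumptions made for illustration only.

```python
from math import sqrt

def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def two_stage_adjust(y, batch, bio):
    """y: values of one gene; batch: labels 0..B-1; bio: rows of covariates.

    Assumes the biological covariates are not confounded with batch,
    so that the design matrix has full column rank.
    """
    n_batches = max(batch) + 1
    Z = [[1.0 if b == k else 0.0 for k in range(n_batches)] for b in batch]
    design = [z + list(x) for z, x in zip(Z, bio)]

    # Stage 1 (location): batch levels and biological effects by regression.
    coef = ols(design, y)
    alpha, beta = coef[:n_batches], coef[n_batches:]
    bio_fit = [sum(bj * xj for bj, xj in zip(beta, x)) for x in bio]
    resid = [yi - alpha[b] - f for yi, b, f in zip(y, batch, bio_fit)]

    # Stage 2 (scale): regress squared residuals on the batch indicators,
    # which yields one variance estimate per batch.
    var_b = ols(Z, [e * e for e in resid])
    pooled_var = sum(e * e for e in resid) / len(y)
    level = sum(alpha[b] for b in batch) / len(y)  # common level after adjustment

    # Replace each batch level by the common level and rescale residuals
    # so every batch has the pooled variance (an illustrative choice).
    adjusted = [level + f + e * sqrt(pooled_var / var_b[b])
                for f, e, b in zip(bio_fit, resid, batch)]
    return adjusted, beta

# Toy data: batch 1 is shifted upward and noisier than batch 0.
y     = [5.1, 4.7, 7.4, 6.8, 9.5, 6.9, 12.8, 9.6]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
bio   = [[0], [0], [1], [1], [0], [0], [1], [1]]
adjusted, beta = two_stage_adjust(y, batch, bio)
print([round(v, 3) for v in adjusted])
```

After the adjustment, the two batches share the same level and the same residual variance, while the fitted biological effect is left in the data. In a real study the same fit would be repeated for every gene, and a full rank design is required, which fails when a biological group appears in only one batch.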

Acknowledgements

The author is grateful to two referees for their helpful comments and valuable suggestions. The author thanks Mahmoodi Pezhman, Andrea Zangrando and Pietro Franceschi for their careful reading of the manuscript. This work was supported by Fondazione Città della Speranza.

Author information

Corresponding author

Correspondence to Marco Giordan.

About this article

Cite this article

Giordan, M. A Two-Stage Procedure for the Removal of Batch Effects in Microarray Studies. Stat Biosci 6, 73–84 (2014). https://doi.org/10.1007/s12561-013-9081-1

  • DOI: https://doi.org/10.1007/s12561-013-9081-1
