Statistics in Biosciences, Volume 6, Issue 1, pp 73–84

A Two-Stage Procedure for the Removal of Batch Effects in Microarray Studies

  • Marco Giordan


Different batches are routinely present in microarray studies, and it is well known that non-biological variability, which can confound biological differences, is commonly related to such batches. Removing these undesired effects to obtain unbiased inference is often accomplished either with normalization methods that do not take into account all the available information or with models that rely on strong parametric assumptions. We have developed a new method for batch effect removal, named ber, which is based on a two-stage procedure for the estimation of location and scale parameters. Batch effects and biological differences are estimated using a regression approach and bagging, so only mild distributional assumptions are required. We have compared ber with other commonly employed methods and have shown that ber can lead to higher power in detecting differentially expressed genes. The application of ber to a real microarray study led to interpretable biological results. The method is implemented in the R package ber, available through the CRAN repository.
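As a rough illustration of the location-and-scale idea summarized above, the following Python sketch fits a first regression for batch location effects and a second regression on squared residuals for batch scale effects. It is not the implementation in the ber package: the function name two_stage_batch_adjust, its arguments, and the choice to rescale residuals to a pooled standard deviation are assumptions made for this example, and the bagging step is omitted.

```python
import numpy as np


def two_stage_batch_adjust(Y, batch, X_bio):
    """Hypothetical two-stage location/scale batch adjustment (illustration only).

    Y      : (n_samples, n_genes) expression matrix
    batch  : length-n_samples array of batch labels
    X_bio  : (n_samples, p) biological covariates, without an intercept column
    """
    Y = np.asarray(Y, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)

    # One-hot batch indicators; together they span the intercept.
    B = (batch[:, None] == labels[None, :]).astype(float)
    X_bio = np.asarray(X_bio, dtype=float).reshape(len(batch), -1)
    D = np.hstack([B, X_bio])
    m = B.shape[1]

    # Stage 1 (location): least squares of expression on batch + biology.
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    batch_loc = coef[:m]          # per-batch location effects, (m, n_genes)
    bio_coef = coef[m:]           # biological effects, (p, n_genes)
    resid = Y - D @ coef

    # Stage 2 (scale): regress squared residuals on the batch indicators.
    # With one-hot indicators this gives the per-batch mean squared residual.
    var_coef, *_ = np.linalg.lstsq(B, resid ** 2, rcond=None)
    batch_sd = np.sqrt(np.maximum(var_coef, 1e-12))
    pooled_sd = np.sqrt(np.mean(resid ** 2, axis=0))

    # Replace batch-specific locations by their sample-weighted average and
    # rescale each batch's residuals to the pooled standard deviation.
    grand_loc = B.mean(axis=0) @ batch_loc
    return grand_loc + X_bio @ bio_coef + resid / (B @ batch_sd) * pooled_sd
```

Calling two_stage_batch_adjust(Y, batch, x) with a samples-by-genes matrix Y returns a matrix of the same shape. When the biological covariates are balanced across batches, the adjusted per-gene batch means coincide. If batch and biology are perfectly confounded, the least-squares fit is not uniquely determined and the adjustment should not be trusted.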


Keywords: High dimensional data, Normalization, Gene expression profiling, Bagging



The author is grateful to two referees for their helpful comments and valuable suggestions. The author thanks Mahmoodi Pezhman, Andrea Zangrando and Pietro Franceschi for their careful reading of the manuscript. This work was supported by Fondazione Città della Speranza.



Copyright information

© International Chinese Statistical Association 2013

Authors and Affiliations

  1. Department for Woman and Child’s Health, University of Padua, Padova, Italy
