Abstract
The presence of different batches is routinely observed in microarray studies, and it is well known that non-biological variability, which can confound biological differences, is commonly related to such batches. The removal of these undesired effects, needed for unbiased inference, is often accomplished either with normalization methods that do not take into account all the available information or with models that rely on strong parametric assumptions. We have developed a new method for the removal of batch effects, named ber, which is based on a two-stage procedure for the estimation of location and scale parameters. Batch effects and biological differences are estimated using a regression approach and bagging, so only mild distributional assumptions are required. We have compared ber with other commonly employed methods and shown that ber can lead to higher power in detecting differentially expressed genes. The application of ber to a real microarray study led to interpretable biological results. The method is implemented in the R package ber, available through the CRAN repository.
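To make the idea of a two-stage location and scale adjustment concrete, the following is a minimal sketch for a single gene. It is not the ber implementation: it assumes batch is the only covariate (so least squares on batch indicators reduces to per-batch means), it omits biological covariates and bagging, and the function name `two_stage_adjust` is hypothetical.

```python
from math import sqrt
from statistics import fmean


def two_stage_adjust(values, batches):
    """Illustrative location-and-scale batch adjustment for one gene.

    Assumption: batch is the only covariate, so each regression on
    batch indicators reduces to a per-batch average. This is a
    simplified sketch, not the procedure implemented in the ber package.
    """
    # Stage 1 (location): per-batch mean of the expression values.
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    loc = {b: fmean(vs) for b, vs in by_batch.items()}
    resid = [v - loc[b] for v, b in zip(values, batches)]

    # Stage 2 (scale): per-batch mean of the squared residuals.
    sq_by_batch = {}
    for r, b in zip(resid, batches):
        sq_by_batch.setdefault(b, []).append(r * r)
    scale = {b: sqrt(fmean(rs)) for b, rs in sq_by_batch.items()}

    # Re-centre on the grand mean and re-scale to the pooled spread.
    grand_mean = fmean(values)
    pooled_sd = sqrt(fmean([r * r for r in resid]))
    adjusted = []
    for r, b in zip(resid, batches):
        s = scale[b]
        # Guard against a batch with zero spread.
        adjusted.append(grand_mean + (pooled_sd * r / s if s > 0 else 0.0))
    return adjusted


if __name__ == "__main__":
    expr = [1.0, 3.0, 11.0, 15.0]
    batch = ["a", "a", "b", "b"]
    print(two_stage_adjust(expr, batch))
```

After the adjustment, both batches share the same mean and the same spread. In a real analysis, biological covariates would have to enter the stage 1 regression so that true biological differences are not removed along with the batch effects.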
Acknowledgements
The author is grateful to two referees for their helpful comments and valuable suggestions. The author thanks Mahmoodi Pezhman, Andrea Zangrando and Pietro Franceschi for their careful reading of the manuscript. This work was supported by Fondazione Città della Speranza.
Cite this article
Giordan, M. A Two-Stage Procedure for the Removal of Batch Effects in Microarray Studies. Stat Biosci 6, 73–84 (2014). https://doi.org/10.1007/s12561-013-9081-1
DOI: https://doi.org/10.1007/s12561-013-9081-1