Background

Gene expression microarray technology measures the expression of thousands of genes in one single array by using multiple probes to assay each transcript. It is a widely-used tool for identifying genes whose expression changes in response to specific treatments. There are several concerns with regard to the reliability of DNA microarray technology in the study of diseases.

Microarray experiments are costly and time-consuming. Many studies use multiple arrays, with experiments performed at different times, on different array charges or even with different microarray platforms. The resulting gene expression data could thus be affected by non-biological variables. These systematic differences between the measurements are commonly referred to as “batch effects”. There are several causes for batch effects, as outlined below:

• Ambient conditions during the sample preparation and handling, such as room temperature and ozone levels

• Storage and shipment conditions of the biological samples and/or the arrays

cRNA/cDNA synthesis

• Amplification, labeling, and hybridization protocol: Use of reagents from different batches

• Sites/laboratories: Different laboratories have different operating procedures

• Chip type/charge/platform: The array quality varies from batch to batch

• Washing conditions: Temperature, ionic strength

• Scanner: Type, settings, and the calibration drift [1].

Although some of these batch effects can be minimized or even prevented with appropriate precautions and experimental design, certain batch effects are unavoidable. Some studies require large sample sizes and have to be carried out over many months or sometimes years. In clinical experiments, experiments are often driven by the availability of the samples, which cannot be specifically controlled for in the original study design. Combining data from different batches without removing batch effects can give rise to misleading results, since the bias introduced by the non-biological nature of the batch effects can be strong enough to mask and/or confound true biological differences. Thus, there is a need to identify and remove these masking effects before further processing.

Microarray signal intensity normalization has been widely used to adjust for experimental artifacts between the samples. The effect of the normalization is to increase the precision of multi-array measurements through calibration of the signal intensity distributions. There are several methods for normalization including MAS5, Robust Multi-Array Analysis (RMA), and dChip for Affymetrix chips, median scaling for GE-CodeLink microarrays, and LOWLESS-based methods for cDNA two-color microarrays. Common to all normalization methods is that they are not specifically designed to remove batch effects reflected by systematic differences between two or more groups of samples. Consequently, batch effects may often remain after normalization. However, of thousands of papers dealing with DNA microarrays published in the last 5 years (>32,000), only few address the potential existence of batch effects and/or their correction. Of the 219 papers using microarray data published from January 1 to July 1, 2010, not even ten percent took this issue into account (NCBI GEO database, studies with more than 30 samples) [2].

There are several published approaches to identify and remove batch effects [1, 3]. An Empirical Bayes method called “Combating Batch Effects When Combining Batches of Gene Expression Microarray Data” (ComBat) estimates parameters for location and scale adjustment of each batch for each gene independently [46]. In the present study, this method was applied to a data subset of 24 arrays (of a total of 120) in order to remove batch effects due to different acquiry dates of the microarray analyses. In contrast to the hierarchical clustering dendrogram before batch correction, in which rheumatoid arthritis (RA) and osteoarthritis (OA) patients clustered randomly, batch correction led to a clear separation of the RA and OA groups. Batch-corrected data then allowed an unequivocal identification of differentially expressed genes discriminating between RA and OA patients.

Materials and methods

Patients

Synovial membrane samples were obtained following tissue excision upon joint replacement/synovectomy from RA and OA patients (n = 6 each; all Caucasian; Tables 1 and 2) at the Clinic of Orthopedics, Waldkrankenhaus “Rudolf Elle” (Eisenberg, Germany). Informed patient consent was obtained and the study was approved by the ethics committee of the University Hospital Jena. RA patients were classified according to the American College of Rheumatology (ACR) criteria valid in the sample assessment period [7], OA patients according to the respective criteria for osteoarthritis [8]. Negative purification of primary synovial fibroblasts (SFB) from RA and OA patients (purity: > 98%) was performed as previously described [9].

Table 1 Clinical data of patients
Table 2 Sample features (incl. acquiry date) and clinical diagnosis of RA and OA patients

Cell stimulation and isolation of total RNA

At the end of the fourth passage, the SFB were washed in serum-free DMEM and then stimulated by 10 ng/ml TNF-α or TGF-β1 (PeproTech, Hamburg, Germany) in serum-free DMEM for 0, 1, 2, 4, or 12 hours. At each time point, the medium was removed and the cells were harvested after treatment with trypsin (0.25% in versene; Invitrogen, Karlsruhe, Germany). After washing with phosphate-buffered saline, they were lysed with RLT buffer (Qiagen, Hilden, Germany) and frozen at −70°C. Total RNA was isolated using the RNeasy Kit (Qiagen) according to the supplier's recommendation.

Microarray analysis

Analysis of gene expression was performed using U133 Plus 2.0 RNA microarrays (Affymetrix, Santa Clara, CA, USA; total of 120 microarrays). Labeling of RNA probes, hybridization, and washing were carried out according to the supplier's instructions. Microarrays were analyzed by laser scanning (Hewlett-Packard Gene Scanner). Original data from microarray analysis are accessible through Gene Expression Omnibus series accession number GSE13837 [10].

For annotating the genes, the alternative Chip Definition File (CDF) of Ferrari et al. was used to resolve the problem of choosing reliable and non-contradictory probesets for each transcript [11]. Several publications demonstrated the advantage of such alternative CDFs for the removal of cross-hybridization and other system-based biases.

The microarray data were preprocessed using RMA in the default configuration for background adjustment and normalization.

ComBat

For Batch correction of the patient data (Table 2), the Empirical Bayes' (EB) method ComBat was used (non-parametric prior method) [5]. EB methods are very appealing in microarray analyses because of their ability to robustly handle high-dimensional data derived from small sample sizes. EB methods are primarily designed to “borrow information” from a certain number of genes and/or experimental conditions in order to obtain better estimates or more stable inferences for the expression of all genes. In several papers, EB methods were designed to stabilize the expression values/ratios for genes with extreme values or else the variance of genes or gene groups by shrinking variances across all other genes, possibly diminishing the effects of artifacts in the data [6, 1219]. Johnson et al. extended the EB methods to the problem of adjusting for batch effects in microarray data, which are not addressed by the use of one or several normalization procedures [5]. Johnson et al. published a location and scale (L/S) adjustment method for batch correction, which is available as R-package ComBat at the developer's homepage [20]. In contrast to other L/S methods, this method is the only procedure currently known to robustly adjust batches with small sample sizes.

As other L/S adjustments, ComBat assumes that the batch effects can be modeled by standardizing means and variances across batches. It uses a straightforward L/S adjustment to independently center the mean and standardize the variance for each gene in each batch. This method incorporates systematic batch biases common across several genes to make adjustments on the assumption that phenomena resulting in batch effects often affect many genes in a similar way (i.e., increased expression, higher variability, etc.). To determine the data parameters which describe the particular L/S model [5], ComBat estimates the L/S model parameters that best represent the batch effects by “pooling information” across some or all genes in each batch in order to “shrink” the parameter estimates and thereby reduce the influence of batch effects.

In the present study, a modified method of ComBat was used to correct for batch effects among arrays generated at different dates. The algorithm was modified in order to allow processing of RMA-normalized data instead of dChip-normalized data. The Sample Information File was created as described in the ComBat manual. The creation date was tagged as “batch effect” and the parameters time point (total of 5), disease group (RA and OA), and stimulation (TNF-α and TGF-β1) were marked as covariates for every array.

Limma

To identify differentially expressed genes, the software Limma was used. Limma is an R-package for differential expression analysis of data arising from microarray experiments [21]. The package is designed to analyze complex experiments involving simultaneous comparisons between many RNA targets. The basic concept is to fit a linear model to the expression data for each gene. The expression data from experiments are represented as a data matrix with rows corresponding to probe sets and columns to arrays. For the analysis with linear models, the approach requires two matrices. First, the design matrix was created, which provides a representation of the different RNA targets. Secondly, the contrast matrix was generated, which allows to assign the coefficients defined by the design matrix to contrasts of interests (in our case time point 0 hours, RA versus OA patients, Table 2). Afterwards, the linear model was fitted using the lmFit function, which fully models the systematic part of the data. The actual analysis was analogous to the classic t-test except that the eBayes-function was used which employs the shrunken empirical Bayes estimates [21]. Differentially expressed genes (DEG) were obtained by using the toptable function and user-specific thresholds (i.e., 2-fold changes in gene expression and a q-value with a threshold of 0.05; [22]). The q-value (FDR adjusted p-value) was calculated using the Benjamini and Hochberg’s method [21].

GOstats

The Bioconductor package GOstats shows substantial improvements for testing the association between Gene Ontology (GO) terms and a given gene list [23]. Falcon et al. implemented a conditional hypergeometric (hg) test that uses the relationships among GO terms to address the hierarchical structure of GO. To use this hg test, the gene universe (in this case containing all genes on the microarray) and a list of selected genes from that universe was defined (DEG). After setting a cutoff for the adjusted p-value of the hg test, GOstats returns a GOHyperGResult with a summary of the test performed and the number of significant GO terms. Furthermore, the GOHyperGResult contains the expected gene count and actual gene count for each GO term.

GPower

The Tool GPower 3.1.3 was used for post-hoc calculation of the power (1- ß error probability) of the t-test with the error probability alpha = 0.05, where the sample sizes of the two groups (RA, OA) were 12 each and the respective mean values and SD of the two groups were derived from Additional file 1: Table S1 [24, 25].

Results

Clustering of RA and OA patients before and after batch correction

Starting with RMA-normalized data of the 24 arrays for time point 0, standard hierarchical clustering dendrograms (generated using the R function hclust with euclidean distances) were employed to monitor the effect of the batch correction by ComBat. These dendrograms measure inter-cluster distances between the arrays. The resulting distances can be interpreted as the dissimilarity of the arrays

Before batch correction, the normalized data formed clear clusters for the different batches (7 different acquiry dates for groups of 2 – 4 individual Affymetrix chips; Figure 1a). Except for 2 patients (extreme left and extreme right), two main clusters were formed, separating the microarrays created in 2006 (red shades) from those generated in 2009 (blue shades). Indeed, the company providing the hybridization agents for the microarrays confirmed a change in the chemical components used in the production process in the year 2007. As a consequence of the batch effect, several unique clusters were observed for the RA and OA patients (right and left branches of the dendrogram), but other clusters were mixed and contained members of both groups. In addition, patient EB87 (extreme right cluster) showed the highest dissimilarity with all arrays and was interpreted as an outlier, as also confirmed by Principal Component Analysis (PCA; not shown).

Figure 1
figure 1

Hierarchical clustering of uncorrected and batch-corrected data from time point 0 hours: a) The uncorrected data form clusters reflecting the 7 different acquiry dates (red shades for arrays generated in 2006; blue shades for those generated in 2009; for precise definition of the individual acquiry dates see Table 2 ). In contrast, RA and OA are not grouped. b) The ComBat-corrected data (7 batches) form clusters reflecting the diseases (RA and OA) instead of the acquiry dates.

Following batch correction for the 7 different acquiry dates, the arrays instead formed two distinct clusters for RA and OA in the hierarchical clustering dendrogram (Figure 1b). The effects of the different creation dates were completely removed and the outlier EB87 was integrated, as also confirmed by PCA (not shown). This unequivocal clustering of RA and OA patients was only achieved when correcting for 7 batches and not for 2 yearly batches only (2006 and 2009; Additional file 2: Figure S1).

Strikingly, distinct clustering for RA and OA following batch correction for the 7 different acquiry dates was not only observed at the 0 hour time point (i.e., before stimulation with TNF-α/TGF-β1), but also at all time points after stimulation except for the TGF-β1 late 12 hour time point (Additional file 3: Figure S2).

Comparison of differentially expressed genes before and after batch correction

Limma was used to obtain differentially expressed genes (DEG; ≥ 2-fold change; p-value of ≤ 0.05) for the key question concerning a generic difference between RA and OA patients. For the uncorrected data, 87 genes were extracted (DEG_woBC). The batch-corrected data contained 181 genes (DEG_wBC), with a total overlap of 65.51% (in total 57 genes; DEG_over) between the 2 gene sets. This was illustrated using a Venn plot (Figure 2).

Figure 2
figure 2

Venn Plot for genes differentially expressed between RA and OA at time point 0 hours. BC results in a doubling of differentially expressed genes (87 → 181). A total of 57 genes (65.51%) were represented in the intersection of the two gene lists (DEG_over).

For both data sets (DEG_woBC and DEG_wBC), a gene enrichment analysis was done with GOstats (p-value ≤ 0.05). In both data sets, the highest ranked GO term was “extracellular matrix structural constituent” with a p-value of 5.87E-03 and 3/74 genes for the uncorrected and a p-value of 7.60E-07 and 8/74 genes for the corrected data set (Table 3). In addition, 5 of 7 GO categories identified before batch correction were also overrepresented in the data set after batch correction, indicating that no major information was lost upon batch correction (Table 3). Furthermore, the p-value of the 3 genes from the top-ranked GO term contained in both data sets largely improved following batch correction (by up to 20 orders of magnitude), underlining the validity of the differential expression (Table 4).

Table 3 GO categories overrepresented in the gene set DEG_wBC (time point 0 hours) top-ranked with b) or without BC a)
Table 4 Genes of the top-ranked GO category (time point 0 hours) identified with b) or without BC a)

To analyze the magnitude of changes in either mean or standardized variance induced by the batch correction, variance/mean plots were generated to illustrate the “shrinkage” of the expression values. For this purpose, the variances were standardized (range between 0 and 1) to prevent negative values. Analyzing the 181 genes from the DEG_wBC data set differentially expressed between RA and OA, only marginal changes of the means, but moderate to substantial reductions of the variances were observed for all genes (Figure 3).

Figure 3
figure 3

Means and variances of 181 differentially expressed genes from the DEG_wBC set (time point 0 hours) in RA (a; n = 12; i.e. 6 patients with two replicates each) and OA (b; n = 12; i.e. 6 patients with two replicates each) with (red dots) or without BC (blue dots). There are generally only marginal changes of the means, but moderate to substantial reductions of the variances, as indicated by an exclusively horizontal shift.

The same was true when the genes lost for analysis following batch correction (DEG_lost; Figure 4a,b), the genes of interest from the GO term “extracellular matrix structural constituent” (DEG_interest; Figure 4c,d) or all genes identified without batch correction (DEG_woBC; Additional file 4: Figure S3) were displayed in the variance/mean plot.

Figure 4
figure 4

Means and variances of differentially expressed genes from the DEG_lost set (a,b; for definition see Figure 2 ) and the DEG_interest set (c,d; for definition see Figure 2 ) in RA (a, c; n = 12) and OA (b, d; n = 12) patients with (red dots) or without BC (blue dots). There are generally only marginal changes of the means, but moderate to substantial reductions of the variances.

To further analyze the magnitude of changes in the mean induced by the batch correction, the mean values in both data sets were compared and the changes expressed as percentages of the initial value; in both RA and OA, very limited changes were observed (maximal change 4.087%), either at the time point 0 hours (Table 5) or at any time point following stimulation with TNF-α or TGF-β1 (Additional file 1: Table S1). Despite the limited changes of the means induced by BC, BC resulted in a substantial improvement of the power for the differentiation of RA and OA for all genes of interest (Additional file 5: Table S2).

Table 5 Mean expression values of differentially expressed genes (time point 0 hours)

There was no indication for any type-1 error concerning differential expression of the genes of interest, as demonstrated by permuting the disease status (RA and OA) 20 times with subsequent BC; applying the thresholds 2-FC and p < 0.05, none of the 8 genes of interest were contained in the list of differentially expressed genes for any of the permutations (Additional file 6: Table S3). Also, a complete lack of clustering for RA and OA was observed and, strikingly, the pairs of one patient for the time point 0 were separated in some cases (Additional file 7: Figure S4)

In addition to the differential expression between RA and OA after BC for the 8 genes of interest from the top-ranked GO term at time point 0 hours (Table 4b), these genes also remained differentially expressed between RA and OA at all time points following stimulation with either TNF-α or TGF-β1 (Figure 5, Additional file 8: Table S4).

Figure 5
figure 5

Time courses of genes of interest (DEG_interest; see Table 4 b) in synovial fibroblasts from RA patients (blue and purple) or OA (red and green) stimulated with TNF-α (red and blue) or TGF-β1 (green and purple). There were only marginal differences for the gene expression values with or without BC (see also Additional file 6: Table S1). As expected, there was a clearly different regulation of the expression of 6 of 8 genes (BGN, COL1A1, COL27A1, COL5A2, COL1A2 and COL3A1; for definition of the abbreviations see Table 4) following stimulation of with either TNF-α or TGF-β1; this differential regulation was common for SFB from RA and OA patients (see Additional file 6: Table S1). However, 2 of 8 genes (COL11A1 and TINAGL) were regulated in a similar fashion by TNF-α and TGF-β1 in both RA and OA patients. Strikingly, significant differences between RA and OA patients were observed for all genes of interest already at the time point 0 hours (see Additional file 8: Table S4a). These differences were unaffected by stimulation with either TNF-α or TGF-β1.

Discussion

In the present study, the ComBat method was highly effective in removing batch effects due to different acquiry dates of the microarrays; this was demonstrated by i) unequivocal clustering of the RA and OA patients in the batch-corrected data for almost all time points investigated (shown by hierarchical clustering dendrograms and PCA); ii) integration of the outlier EB87, resulting in reduced standardized variances for a number of genes; iii) identification of a large number of genes differentially expressed between RA and OA with highly significant p-values (up to 5.31E-25); iv) identification of numerous overrepresented GO terms (with an increased number of members and strongly improved p-values).

The 7 different batches (aquiry dates) were clearly grouped into the two main clusters representing microarray analyses from the years 2006 and 2009. These 2 main clusters are a clear example of inevitable batch effects since the supplier of the hybridization agents changed the chemical components used in the production process in the year 2007 (personal communication; Affymetrix). The only possibility to avoid such batch effects would have been to collect all the samples for simultaneous analysis, an approach technically impossible for a total of 120 samples at the time point of analysis (and possibly even today). In addition, unequivocal clustering of RA and OA patients was only achieved when correcting for 7 batches and not only for 2 batches (2006 and 2009), indicating that both the technical change of the supplier and a basic variance among the acquiry dates contributed to the dissimilarity of the arrays. This should be taken into consideration for future microarray analyses.

The batch correction had no major influence on the underlying data, as demonstrated by i) almost unaffected means (maximal change 4.087%) in the variance/mean plots for a number of different gene sets; ii) marginally changed, but mostly reduced standardized variances for the same gene sets; iii) substantial overlap (> 65%) of the genes differentially expressed between RA and OA at time point 0 hours before and after batch correction; iv) preservation of 5/7 of the top-ranked GO categories after batch correction. These findings show that ComBat is highly suitable for batch correction of data derived from small sample sizes and does not lead to inappropriate modification of the underlying data as previously published [2].

The genes up-regulated at time point 0 hours (identified following batch correction) reflected a key feature of RA, i.e., SFB-driven fibrosis of the affected joints [26]. In particular, the enhanced expression of collagens type I and III (COL1A1, COL1A2, COL3A1), as well as biglycan (BGN) by RA-SFB has been previously published [2729]. In the context of RA, collagens types V and XI are less well known as mediators of fibrosis, but are targets of matrix metalloproteinase-mediated proteolytic activity [30, 31]. However, together with collagens type I, II, and III, these collagens are defined as (fibrillar) interstitial collagens [32] and represent major components of the extracellular matrix [33], thus suggesting a potential role in fibrosis also for these proteins. The fibrillar type XXVII collagen (COL27A1) and lipocalin-7 (also known as tubulointerstitial nephritis antigen-like 1; TINAGL1), which were identified as up-regulated molecules in RA-SFB in this study for the first time, are not intrinsic extracellular matrix molecules, but both exhibit differentiation potential. Type XXVII collagen, for example, is predominantly expressed in cartilaginous tissues and generally involved in skeletogenesis [34], whereas matricellular lipocalin-7 appears to be a (positive) regulator of angiogenesis [35], potentially influencing enhanced angiogenesis in the synovial membrane [36]. Taken together, the enhanced formation of these matrix components may contribute to joint fibrosis in an attempt to counteract the progressive destruction of cartilage and bone.

Strikingly, differential expression of the 8 genes of interest from the top-ranked GO term “extracellular matrix structural constituent” (DEG_interest) was not only observed at the time point 0 hours, but also at all time points following stimulation with either TNF-α or TGF-β1. This suggests that RA-SFB show a constitutively altered “rheumatic phenotype”, which is preserved upon stimulation with TNF-α and TGF-β1 (“spacer effect”). Possible reasons for such an “imprinted” RA phenotype may include both genomic changes, e.g., mutations or chromosome aberrations, or epigenetic modifications, e.g. methylation or histone acetylation status [3739].

Conclusion

The present study clearly underscores the necessity of correction and removal of batch effects in the analysis of microarray data. The application of batch correction allowed the unequivocal identification of genes differentially expressed between RA and OA and returned the top-ranked GO category “extracellular matrix structural constituent” (8/74 genes; p-value decreased by 4 orders of magnitude). Batch correction strongly reduced the variance, but only marginally influenced the mean expression levels, i.e., led to reliable results without falsification of the underlying data.

RA-SFB show a constitutively altered “rheumatic phenotype”, which is preserved upon stimulation with TNF-α and TGF-β1, suggesting an “imprinted” RA phenotype.