Methods of Analysis and Meta-Analysis for Identifying Differentially Expressed Genes
Microarray approaches are widely used high-throughput techniques to assess simultaneously the expression of thousands of genes under certain conditions and study the effects of certain treatments, diseases, and developmental stages. The traditional way to perform such experiments is to design oligonucleotide hybridization probes that correspond to specific genes and then measure the expression of the genes in order to determine which of them are up- or down-regulated compared to a condition that is used as a control. However, individual experiments cannot capture the bigger picture of how a biological system works and, therefore, data integration from multiple experimental studies and external data repositories is necessary to understand the function of genes and their expression patterns under certain conditions. Hence, the development of methods for handling, integrating, comparing, interpreting and visualizing microarray data is essential. The selection of an appropriate method for analysing microarray datasets is not an easy task. In this chapter, we provide an overview of the various methods developed for microarray data analysis, as well as suggestions for choosing the appropriate method for microarray meta-analysis.
Key words: Gene expression, Microarrays, Differentially expressed genes, Meta-analysis, Statistical tests, Multiple comparisons
Gene expression microarrays have been used in various applications, including the identification of novel genes associated with certain diseases (most notably cancers), tumor classification, and prediction of patient outcome.
In a microarray experiment, the mRNA levels of thousands of genes are measured simultaneously in tissue samples. One basic method for preparing microarrays is spotting arrays on plates. Each spot on a microarray plate is designed to contain multiple identical copies of single DNA strands, fragments or oligonucleotides that represent specific gene coding regions, referred to as “probes.” Each spot, or a set of spots, corresponds to one gene. The order of the probes on the chip is stored in a computer database so that results can be retrieved easily. Probes are designed to be uniquely complementary to purified RNA or DNA fragments which are fluorescently or radioactively prelabeled. The probes are then hybridized to their corresponding target sequences. The more RNA or DNA fragments attach to a spot, the stronger the signal; thus, the intensity of a set of spots represents the expression of a gene. After thorough washing to remove non-specifically bound sequences, the raw microarray data are obtained by laser scanning or autoradiographic imaging. Microarrays can be fabricated using various technologies.
In spotted microarrays, the probes, which are oligonucleotides, cDNA or small fragments of PCR products that correspond to mRNAs, are “spotted” onto the surface. Spotted microarrays are “customizable”, since the researcher can choose the probes for each experimental study. In oligonucleotide microarrays, with Agilent and Affymetrix being the most popular platforms, the probes are short sequences designed to be complementary to parts of the target sequence, so that a gene is represented by a set of probes (probe-set) instead of a single probe. Contrary to spotted microarrays, the probes are synthesized directly onto the surface. The length of the oligonucleotide sequences depends on the specific experimental needs.
Dual channel (or two-color) microarrays are typically hybridized with cDNA prepared from two samples to be compared (e.g., diseased tissue versus healthy tissue). These samples are labeled with two different fluorophores (e.g., Cy3 and Cy5) with different emission spectra. Relative signal intensities of each fluorophore are used to measure differential gene expression. In single-channel (or one-color) microarrays, contrary to dual-channel microarrays, the samples to be compared are labeled with a single fluorophore. Relative signal intensities for each probe or probe-set reflect the expression level of the labeled target sequence. The main representatives of single-channel microarray platforms are Affymetrix, Illumina and Agilent.
The selection of a microarray platform is done on the basis of cost, chip availability for the species under analysis, genome coverage, the starting amount of RNA required, the quality of array manufacturing, the validity and availability of software tools for image analysis, and intra-platform variability [5, 6].
After hybridization, image analysis is performed, followed by pre-filtering/masking for microarray signal correction. Background signal adjustment is also recommended before scaling. Normalization is performed to adjust microarray data for effects attributed to technological variation. Typical normalization methods include rank invariant normalization, quantile normalization and the LOWESS/LOESS methods. For many types of commercial arrays, suites of R/Bioconductor-based packages, such as RMA (Robust Multi-array Average expression measure) and the MAS 5.0 algorithm, are used to perform consecutive background adjustment and data normalization.
After pre-processing and normalization (and potentially some other steps, such as filtering, imputation of missing values and standardization), we usually end up with an expression matrix that contains the expression values for each probe. The objectives of an analysis may be classified into three broad classes: identification of differentially expressed genes (DEGs), classification/class prediction and clustering.
Clustering of microarray data seeks to group genes based on specific features in a biologically meaningful manner. Clustering operates in an unsupervised manner. There are clustering methods that require the number of clusters to be defined beforehand and methods where the number of clusters is determined automatically. Several clustering methods are available, the most popular being the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), as well as other hierarchical clustering methods for tree-based representations. Evolutionary tree-based algorithms such as Neighbor Joining could also be applied in clustering. In the widely used k-means algorithm, the number of clusters is pre-defined. Another popular clustering algorithm, the Self-Organizing Map (SOM), produces ordered low-dimensional representations of an input data space, and is particularly well suited for exploratory data analysis. Most of the aforementioned methods are implemented in BioConductor, Expander and Hierarchical Clustering Explorer (HCE).
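As an illustration of the pre-specified-k setting described above, the k-means step can be sketched in a few lines of plain Python (all gene names, expression values and starting centroids below are invented toy data; real analyses would use the packages listed above):

```python
# A minimal k-means sketch over toy expression profiles.

def kmeans(points, centroids, n_iter=20):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its members; repeat for n_iter rounds."""
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to each centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Update step: centroid = mean of its assigned profiles.
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels

# Four genes measured in three samples; two clearly separated groups.
profiles = [[1.0, 1.1, 0.9],   # "geneA"
            [1.2, 0.9, 1.0],   # "geneB"
            [5.0, 5.2, 4.9],   # "geneC"
            [5.1, 4.8, 5.0]]   # "geneD"
labels = kmeans(profiles, centroids=[[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]])
```

With these starting centroids the first two genes form one cluster and the last two the other; in practice, k-means is sensitive to initialization, which is one reason the Bioconductor implementations are preferred.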
Classification of microarray data refers to class prediction from gene expression patterns. In classification, the classes, two or more (e.g., healthy individuals vs. diseased), are predefined and a classifier is built to discriminate between the classes in future applications [17, 18], most notably screening and diagnosis. A wide variety of supervised methods have been designed for classification, including Neural Networks, Support Vector Machines, Graphical Models, genetic algorithms, nearest neighbour classifiers, and many other statistical methods such as shrunken centroids and Partial Least Squares and Discriminant Analysis. Due to the large number of features given as input to the various classifiers, a subsequent problem is to select the subset of features that can be used efficiently by the classifier. This problem is known as feature selection in machine learning and statistics. A great number of feature selection methods tailored for microarray studies have been developed. Comparisons of such methods in gene expression classification can be found in several excellent reviews and evaluation studies [27, 28, 29].
The topic of this review is the description of methodologies for the identification of DEGs. The main objective is to identify which genes are differentially expressed, that is, up- or down-regulated, under different conditions. Ideally, the identification of DEGs is a simple procedure reduced to a statistical test for the equality of means (e.g., t-test, see below). Microarray datasets, however, are characterised by several key distinctive features such as small numbers of samples, large numbers of variables and excessive amounts of noise; therefore, several advanced statistical methods have been proposed to handle these issues efficiently. Moreover, the generation of similar datasets from various laboratories highlights the need for combining these datasets in order to increase the sample size. This approach, which is termed “meta-analysis” in the medical literature, has become increasingly popular in recent years, and a variety of meta-analysis methods have been developed. In this review, we explore the statistical methods for analysis and meta-analysis of DEGs arising from microarray experiments by focusing on the simple case of comparing only two classes (diseased vs. healthy, treated vs. non-treated, etc.). First, the analysis methods for detecting DEGs, starting with the well-known t-test, as well as the various modifications of these methods proposed for different microarray datasets, are described. Afterwards, the methods for meta-analysis of microarray datasets are presented. Moreover, the related software implementations are listed, as well as the novel variants of these methods. Examples of microarray data analysis and meta-analysis are also presented.
2.1 Methods of Analysis of Differentially Expressed Genes
Earlier microarray publications assessed differential expression merely in terms of fold-change (FC), with an FC of ±2 being considered a reliable cut-off value. However, FC cut-off values neither take intra-dataset variability into account nor ensure reproducibility. Moreover, FC-based ranking is not adequate, since a gene with larger variance in its expression values has a higher probability of attaining a large FC value by chance. It has also been suggested that FC-based methods result in lists of DEGs that are more reproducible. However, reproducibility does not signify accuracy, and the question of whether to use FC to identify DEGs is essentially biological, rather than statistical.
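The traditional FC criterion can be made concrete with a short sketch (the expression values are invented; note that an FC of ±2 on the raw scale corresponds to |log2 FC| ≥ 1):

```python
import math

def log2_fold_change(mean_treated, mean_control):
    """log2 ratio of mean expression between two conditions.
    |log2 FC| >= 1 reproduces the traditional 'FC of +/-2' cut-off."""
    return math.log2(mean_treated / mean_control)

# Toy mean intensities for one gene in two conditions.
fc = log2_fold_change(mean_treated=200.0, mean_control=50.0)
# 200/50 is a 4-fold up-regulation, i.e., log2 FC = 2.0
```

As the text cautions, such a cut-off ignores the within-group variance entirely, which is what the statistical tests below address.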
In this analysis (the paired case), subscripts 1 and 2 denote the two conditions, the mean difference of which is assumed to be zero under the null hypothesis. The t-statistic is compared against a t-distribution with n − 1 degrees of freedom, where n is the number of sample pairs.
In the unpaired (two-sample) case, the t-statistic is compared against a t-distribution with n1 + n2 − 2 degrees of freedom.
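The t statistics referred to above can be written out explicitly; this reconstruction uses standard textbook notation (paired design first, then the two-sample pooled-variance case) and is not copied from the original figures:

```latex
% Paired design: \bar{d} is the mean of the n per-pair differences,
% s_d their standard deviation
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad \mathit{df} = n - 1

% Two-sample design with pooled variance
t = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}}},
\qquad
S_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2},
\qquad \mathit{df} = n_1 + n_2 - 2
```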
A drawback of the t-test in microarray data analysis is that most microarray experiments contain only a few samples in each group (n1 and n2), so the assumption of normality cannot be relied upon. Thus, several alternatives to the t-test have been proposed in the literature.
2.1.2 Resampling Methods
Bootstrap [31, 32] is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample. Bootstrap provides an ideal alternative method when no formula for the sampling distribution is available or when available formulas make inappropriate assumptions (e.g., small sample size, non-normal distribution). The accuracy of bootstrapping depends on the number of observations in the original sample and the number of replications. A crudely estimated sampling distribution is adequate to calculate, for instance, a standard error; a better estimate is needed for constructing a 95% confidence interval. There are various methods for constructing a Bootstrap confidence interval from the resampled statistics, such as the normal approximation method, the bias-corrected method, the percentile method and the t-percentile method. Generally, replications of the order of 1000 produce very accurate estimates, although more may be needed for the accurate estimation of p-values. Only 50–200 replications are needed for estimating standard errors, though this may have implications in meta-analysis (see below). Various methods have been proposed for estimating the adequate number of replications [34, 35]. The Bootstrap has been applied in microarray experiments and empirical evidence suggests that it produces accurate estimates, at least for moderate sample sizes. For really small sample sizes (i.e., <10), various modifications to the standard bootstrap method have been proposed [37, 38].
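A minimal percentile-bootstrap sketch for one gene, assuming the difference of group means as the statistic of interest (the expression values are invented toy data):

```python
import random

def bootstrap_diff_ci(x, y, n_rep=1000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the difference
    of means of two groups, resampling each group with replacement."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_rep):
        bx = [rng.choice(x) for _ in x]   # resample group 1
        by = [rng.choice(y) for _ in y]   # resample group 2
        diffs.append(sum(bx) / len(bx) - sum(by) / len(by))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_rep)]
    hi = diffs[int((1 - alpha / 2) * n_rep) - 1]
    return lo, hi

# Toy log-expression values for one gene in two conditions.
treated = [8.1, 7.9, 8.4, 8.2, 8.0, 8.3]
control = [5.0, 5.3, 4.9, 5.1, 5.2, 4.8]
lo, hi = bootstrap_diff_ci(treated, control)
# An interval excluding zero suggests differential expression.
```

The percentile method shown here is the simplest of the interval constructions mentioned above; the bias-corrected and t-percentile variants require more bookkeeping but follow the same resampling loop.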
A conceptually different resampling method is the permutation test. This is a type of statistical significance test where the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic following rearrangements of the labels on the observations. If under the null hypothesis the labels are exchangeable, then the resulting tests generate exact significance levels. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Fisher and Pitman in the 1930s (reviewed in Kaiser). For small samples, all possible permutations can be evaluated; however, for sample sizes >15, a random sample of the permutations is used instead, hence the name Monte Carlo permutation test. An important assumption underlying a permutation test is that under the null hypothesis the observations are exchangeable. A consequence of this is that tests of difference in location (e.g., t-test) require equal variance. In this respect, the permutation t-test has the same weakness as the classical Student’s t-test (i.e., the Behrens–Fisher problem). In general, since the permutation test computes a p-value by counting the times the test statistic is larger than the observed one, a large number of replications is required (typically of the order of 1000 or more). Permutation tests have been used for the analysis of microarray data. When sample sizes are very small, the number of distinct permutations can be severely restricted, and combining the permutation-derived test statistics across genes has been proposed. However, since the null distribution of the test statistics under permutation is not the same for all genes, this can have a negative impact on p-value estimation.
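For small samples the exact (exhaustive) permutation test described above can be written directly; this sketch uses a pooled-variance t statistic and invented toy values, and enumerates all label rearrangements rather than Monte Carlo sampling:

```python
import itertools
import math

def t_stat(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / (sp * math.sqrt(1 / nx + 1 / ny))

def exact_permutation_p(x, y):
    """Exact two-sided p-value: enumerate every relabelling of the
    pooled observations and count statistics at least as extreme."""
    pooled, nx = x + y, len(x)
    observed = abs(t_stat(x, y))
    count = total = 0
    for idx in itertools.combinations(range(len(pooled)), nx):
        px = [pooled[i] for i in idx]
        py = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(t_stat(px, py)) >= observed:
            count += 1
    return count / total

# Toy values: 4 vs. 4 samples, C(8,4) = 70 distinct relabellings.
p = exact_permutation_p([8.1, 7.9, 8.4, 8.2], [5.0, 5.3, 4.9, 5.1])
```

With complete separation of the groups, only the observed split and its mirror image reach the observed |t|, so the smallest attainable p-value here is 2/70, illustrating the granularity problem with very small samples noted above.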
Bootstrap and permutation methods are readily available in major statistical packages like Stata and R. There are various implementations of the Bootstrap available in Stata (bootstrap command) and in R (boot package). Permutation can also be performed using the permute and permtest (for paired observations) commands in Stata, as well as the perm package in R. In the Supplement at http://www.compgen.org/tools/microarrays we give examples of performing the bootstrap and permutation t-test in Stata.
2.1.3 Bayesian Methods
The Bayesian methods provide an intuitively appealing framework for handling most of the problems encountered in the analysis of microarray data. Several Bayesian methods have been developed to replace the t-test, which is one of the simplest and most widely used statistical methods in microarray expression data analysis. These Bayesian methods share some common features but also have marked differences according to various criteria, especially in the prior distribution for the hyperparameters. Moreover, some of these methods are oriented towards hypothesis testing, relying on the Bayes Factor to compare the null against the alternative hypothesis [44, 45, 46, 47]; other methods are oriented towards parameter estimation and compute credible intervals for the parameters of interest, for example the difference of the means [48, 49]. One of the advantages of the t-test is that its simplicity allows in many cases a closed-form expression to be derived, especially for the Bayes Factor [44, 45, 46, 47], whereas other methods rely on Markov Chain Monte Carlo (MCMC) to sample from the posterior distribution [48, 49]. Another major advantage of the Bayesian methods is that within the Bayesian framework, one can not only incorporate the uncertainty regarding the parameters and the small sample size, but also handle multiple testing, which is very important in microarray analysis [44, 50, 51].
There are several software implementations available of the aforementioned Bayesian methods. For instance, the Bayes Factor method of Rouder and coworkers, which is known as the Jeffreys–Zellner–Siow (JZS) t-test, is available as a web-calculator (http://pcl.missouri.edu/bayesfactor), as well as an R package (https://cran.r-project.org/web/packages/BayesFactor/index.html). The Savage–Dickey (SD) t-test, proposed by Wetzels and coworkers, is inspired by the JZS t-test and retains its key concepts. It is, however, applicable to a wider range of statistical problems, since it allows researchers to test order restrictions and applies to two-sample situations with unequal variance. The SD t-test is also implemented in an R package that uses WinBUGS (http://www.ruudwetzels.com/sdtest). Finally, the BEST (Bayesian Estimation Supersedes the t-test) software package provides a Bayesian alternative to the t-test, offering much richer information than a simple p-value, such as complete distributions of credible values for the effect size, the difference of means between groups, the difference of standard deviations, and the normality of the data within the groups. The BEST package is implemented in R (http://www.indiana.edu/~kruschke/BEST/) and is also available as an online calculator (http://sumsar.net/best_online/). Moreover, the BEST method is implemented in the Bayesian First Aid package (https://github.com/rasmusab/bayesian_first_aid), which aims to provide user-friendly Bayesian alternatives to the most widely used estimation commands.
2.1.4 Penalized t-Test
This method is implemented in the web-server Cyber-T (http://cybert.ics.uci.edu/) and in R (http://cybert.ics.uci.edu/). The parameter ν0 represents the degree of confidence in the background variance σ02 versus the empirical variance. In Cyber-T, the value of ν0 is user defined; the smaller the value of n, the larger ν0 should be. A simple rule of thumb is to assume that K > 2 observations are needed to properly estimate the standard deviation and to keep n + ν0 = K. This allows a more flexible treatment of situations in which the number n of available data points varies from gene to gene. The default value of K is 10. In effect, under this approach the empirical variance is supplemented by ν0 “pseudo-observations” with a background variance σ02. For σ0, one could use the standard deviation of the entire dataset or of particular categories of genes. Cyber-T uses a flexible approach under which the background standard deviation is estimated by pooling together all the neighbouring genes contained in a window of size w (the default value of w is 101, corresponding to 50 genes in the immediate neighbourhood of the gene under consideration).
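In the commonly cited formulation of Baldi and Long, the regularized (posterior) variance combining the background and empirical variances takes the form below; this is a reconstruction in the notation used above, not copied from the original:

```latex
\sigma^2 \;=\; \frac{\nu_0\,\sigma_0^2 \;+\; (n-1)\,s^2}{\nu_0 + n - 2}
```

Here s² is the empirical gene-wise variance, so ν0 plays exactly the role of a count of pseudo-observations: as ν0 grows, the estimate is pulled towards the background variance σ0².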
This differs slightly from the previous statistics in that the penalty a is applied to the sample standard deviation S rather than to the sample variance S2. Tusher et al., in the so-called “Significance Analysis of Microarrays (SAM)” method, chose a to minimize the coefficient of variation of the absolute t-values, while Efron et al. used a equal to the 90th percentile of the S values. These choices are based on empirical rather than theoretical considerations. SAM is one of the oldest and most widely used methods; it is available as an Excel plugin at http://statweb.stanford.edu/~tibs/SAM/, and is also implemented in several R packages (samr, ema).
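For gene i, the SAM statistic takes the commonly cited form below, with the fudge factor s0 chosen as described above (a reconstruction in standard notation):

```latex
d_i \;=\; \frac{\bar{x}_{1i} - \bar{x}_{2i}}{s_i + s_0}
```

where s_i is the gene-wise (pooled) standard error of the mean difference; adding the constant s0 to the denominator prevents genes with near-zero variance from dominating the ranking.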
RAM is based on comparisons between a set of ranked t-statistics and a set of ranked Z-values (a set of ranked estimated null scores) yielded by a “randomly splitting” (RS) approach instead of the permutation approach used by SAM. Results obtained from simulated and real microarray data revealed that, compared to SAM, RAM is more efficient in the identification of DEGs under undesirable conditions such as a small sample size, a large fudge factor, or a mixture distribution of noise.
The regularised t-statistics have many desirable properties. In particular, they are easily computed, have a natural interpretation, and are less computationally intensive than the full Bayesian methods and the resampling approaches. Moreover, simulation studies have shown that regularised t-statistics are superior to the ordinary t-statistic for detecting DEGs, even when the sample size is very small (n < 10). The penalized t-statistics can also be extended in several ways to apply to more general experimental situations. A disadvantage is that the null distribution of the modified t-statistic is not standard. Baldi and Long, as well as Smyth, rely on a modified t-distribution with adjusted degrees of freedom. Methods such as SAM, on the other hand, use permutations in order to calculate the False Discovery Rate (FDR, see below).
2.1.5 Other Methods
As we have already mentioned, earlier microarray publications estimated differential expression of genes based solely on FC. The moderated t-tests, on the other hand, borrow information across genes; they perform better, providing estimates of statistical significance and results more in line with FC rankings. However, even these contemporary statistical tests permit genes with relatively small FCs to be considered statistically significant, probably due to a very small denominator in the t-statistic formula.
Hence, it is becoming increasingly necessary in the literature that DEGs meet both p-value and FC criteria. Several authors require that genes satisfy an acceptable level of statistical significance and then rank significant genes by FC with an arbitrarily set cut-off. There are also authors who first apply a FC cut-off and then rank genes according to their p-value. Other authors declare genes as differentially expressed on the basis that they simultaneously show a FC larger than a given threshold value and satisfy the criterion for p-value . Such combined criteria are suggested to identify more biologically relevant sets of genes and even provide a much better inter-platform agreement compared to FC and p-values alone .
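A simple filter combining both criteria can be sketched as follows (the gene names, p-values and fold changes are invented; here significance is applied first and survivors are ranked by FC, one of the orderings described above):

```python
def select_degs(genes, p_cut=0.05, lfc_cut=1.0):
    """Keep genes passing both an adjusted p-value cut-off and an
    absolute log2 fold-change cut-off; rank survivors by |log2 FC|.
    `genes` maps gene id -> (adjusted p-value, log2 fold change)."""
    hits = {g: (p, lfc) for g, (p, lfc) in genes.items()
            if p <= p_cut and abs(lfc) >= lfc_cut}
    return sorted(hits, key=lambda g: -abs(hits[g][1]))

toy = {"geneA": (0.001, 2.5),   # significant, large FC   -> kept
       "geneB": (0.003, 0.2),   # significant, tiny FC    -> dropped
       "geneC": (0.400, 3.0),   # large FC, not significant -> dropped
       "geneD": (0.010, -1.4)}  # significant down-regulation -> kept
degs = select_degs(toy)
```

The cut-off values here are arbitrary, which is precisely the weakness that TREAT (next paragraph) addresses by folding the FC threshold into the hypothesis test itself.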
TREAT (t-tests relative to a threshold) introduces statistical formalism to these approaches. This method is an extension of the empirical Bayesian moderated t-statistic presented by Smyth (i.e., limma), and can be used to test whether the true differential gene expression is greater than a given threshold value. By including the FC threshold value of interest in a formal hypothesis test, the method achieves reliable p-values for identifying genes whose differential expression is biologically relevant. TREAT has been shown to perform well on both real and simulated data.
The Rank Products (RP) method is available as an R package (RankProd), and is also supported by the webserver RankProdIt (http://strep-microarray.sbs.surrey.ac.uk/RankProducts/).
Exact calculation and permutation methods have been proposed for determining the statistical significance of the RP statistic. These approaches have serious limitations, as they are computationally demanding. Approximation methods have also been proposed, but these usually provide inaccurate estimates in the tail of the p-value distribution. Recently, however, a method to determine upper bounds and accurate approximate p-values of the RP statistic has been developed, decreasing the computational time significantly. The R code for this method is available at http://www.ru.nl/publish/pages/726696/rankprodbounds.zip.
The RP method has been reported to perform more reliably and consistently than SAM, even on highly noisy data. In realistic simulated microarray datasets, RP is more robust and accurate for sorting genes based on differential expression than t-statistics, especially for replicate numbers n < 10. This method performs particularly well on data contaminated by abnormal random noise and heterogeneous samples. RP, however, assumes equal measurement variance for all genes and tends to give overly optimistic p-values when this assumption does not apply. Therefore, appropriate variance-stabilizing normalization should be performed on the data prior to calculating the RP values. If applicable, another rank-based variant of RP, that is, average ranks, provides a suitable alternative with comparable performance.
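The core RP computation is easy to sketch: rank the genes by fold change within each replicate and take the geometric mean of each gene's ranks (the fold-change values below are invented toy data; the real RankProd package also computes p-values and FDR for the statistic):

```python
import math

def rank_products(fold_changes):
    """fold_changes: one list per replicate, each holding one FC value
    per gene. Returns the RP statistic per gene, i.e., the geometric
    mean of the gene's rank (1 = most up-regulated) across replicates."""
    n_genes = len(fold_changes[0])
    ranks_per_rep = []
    for rep in fold_changes:
        # Rank genes within this replicate, largest FC first.
        order = sorted(range(n_genes), key=lambda g: -rep[g])
        ranks = [0] * n_genes
        for pos, g in enumerate(order):
            ranks[g] = pos + 1
        ranks_per_rep.append(ranks)
    k = len(fold_changes)
    return [math.prod(rep[g] for rep in ranks_per_rep) ** (1 / k)
            for g in range(n_genes)]

# Three replicates, four genes; gene 0 is consistently top-ranked
# (up-regulated) and gene 2 consistently bottom-ranked.
rp = rank_products([[4.0, 1.1, 0.5, 1.0],
                    [3.5, 0.9, 0.6, 1.2],
                    [5.0, 1.0, 0.4, 1.1]])
```

A gene that sits at the top of every replicate's list gets the smallest possible RP value, which is exactly the intuition behind the method described above.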
2.2 Meta-Analysis of Microarrays
Meta-analysis is the statistical technique for combining data from multiple independent but related studies. In particular, meta-analysis can be used to identify a treatment effect that is consistent among studies. In case the treatment effect varies among studies, meta-analysis may be used to identify the cause of this variation. Hypotheses cannot be inferred and validated based solely on the results of a single study, as the results typically vary between studies; instead, data across studies should be combined. Meta-analysis applies universal formulas to a number of different studies. Nowadays, the GEO (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) databases provide the option to compare the normalized raw data across many experiments and organisms, thereby allowing comparative gene expression profiling.
In this section, we provide a practical guide that could enable the reader to make informed decisions on how to conduct a meta-analysis of microarray data.
Issue 1: Selection of Appropriate Microarray Datasets
The first, and most critical, step in an experimental study is to clearly state its objectives. Meta-analysis enables the identification of DEGs among multiple samples in order to improve classification within and across platforms, detect redundancy across diverse datasets, identify differentially co-expressed genes, and infer networks of genetic interactions. The second step of meta-analysis is to set eligibility criteria, either biological (e.g., tissue type, disease) or technical (e.g., one-channel versus two-channel detection, density of microarrays, technological platform). Based on these criteria, literature searches are performed, using appropriate key terms, to retrieve relevant studies. These studies can be complemented by microarray data available in public databases that conform to the MIAME (Minimum Information About a Microarray Experiment) guidelines.
Issue 2: Data Acquisition from Studies
The genes found to be differentially expressed in a given study constitute the published gene lists (PGLs), which are either included in the main text or provided as supplementary material. The gene expression data matrices (GEDM) contain preprocessed expression values for each probe-set in each sample. The published GEDM cannot be used directly as input for meta-analysis because of the different algorithms used for processing raw data in the original studies, which may generate heterogeneous, non-comparable results.
Issue 3: Preprocessing of Datasets from Diverse Platforms
To enable consistent analysis of all datasets, bias introduced by the preprocessing algorithms should be eliminated. To this end, feature-level extraction output (FLEO) files, such as CEL files, should be obtained and converted to GEDM suitable for meta-analysis. Multiple studies from the same platform should be preprocessed using a single algorithm. In case the studies are conducted on different platforms, it is recommended that they be preprocessed with comparable algorithms so that the results can be combined.
Issue 4: Promiscuous Hybridization between Probes and Genes
The datasets are annotated using UniGene or RefSeq gene identifiers, collectively referred to as GeneIDs. Multiple probes can hybridize with the same GeneID, as UniGene represents a cluster of sequences that correspond to a unique gene. Conversely, one non-specific probe can cross-hybridize with multiple GeneIDs due to imperfect specificity. There are also probes with inadequate sequence information that cannot hybridize with any GeneID. One approach to resolve the “many-to-many” relationships between probes and genes is to include in the meta-analysis only probes that are associated with a single gene, and exclude the promiscuous probes that are associated with more than one gene; however, important information can be lost. Averaging the expression profiles prior to meta-analysis is not recommended either, given that probe binding affinity differences affect the gene expression measurements. It is therefore recommended to apply descriptive statistics, thereby reducing the “many-to-many” into “one-to-one” relationship between probe and GeneID for each study [66, 67, 68].
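One possible descriptive-statistic reduction of the "many-to-many" relationship is sketched below, collapsing a gene's probes to their median expression (the probe IDs, mapping and values are invented; the median is one choice among the descriptive statistics the text alludes to, and promiscuous probes are here counted towards every gene they map to):

```python
from statistics import median

def collapse_probes(probe_values, probe_to_gene):
    """Reduce probe-level expression to one value per GeneID by
    taking the median over each gene's probes. Probes mapping to no
    gene are dropped; probes mapping to several genes contribute to
    each of them."""
    per_gene = {}
    for probe, value in probe_values.items():
        for gene in probe_to_gene.get(probe, []):
            per_gene.setdefault(gene, []).append(value)
    return {gene: median(vals) for gene, vals in per_gene.items()}

# Toy expression values and probe-to-GeneID mapping.
expr = {"p1": 7.0, "p2": 9.0, "p3": 8.0, "p4": 3.0}
mapping = {"p1": ["GENE1"], "p2": ["GENE1"],   # two probes, one gene
           "p3": ["GENE1", "GENE2"],           # promiscuous probe
           "p4": []}                           # unannotated probe
collapsed = collapse_probes(expr, mapping)
```

Whether to drop promiscuous probes entirely or, as here, to include them everywhere is exactly the trade-off between losing information and blurring gene-specific signal discussed above.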
Issue 5: Choosing a Meta-Analysis Technique
The choice of meta-analysis techniques depends on the type of response (e.g., binary, continuous, survival). In this review, we focus on the two-class comparison of microarrays, where the objective is to identify genes expressed differentially between two different conditions. In such cases, there are three broad categories of statistical methods for meta-analysis, which make use of effect sizes, p-values and ranks, respectively.
2.2.1 Effect Size
In the case of a matched design (e.g., use of same individuals before and after treatment), there is a very similar formula , except that the natural unit of deviation is the standard deviation of the difference scores, and so this is the value that is likely to be reported or calculated from the data.
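In standard meta-analysis notation, the two-group standardized mean difference and its matched-design analogue can be written as follows (a reconstruction, not copied from the original):

```latex
% Independent groups: Cohen's d with the pooled standard deviation S_p,
% and the usual large-sample approximation to its variance
d = \frac{\bar{x}_1 - \bar{x}_2}{S_p},
\qquad
\operatorname{var}(d) \approx \frac{1}{n_1} + \frac{1}{n_2}
  + \frac{d^2}{2(n_1+n_2)}

% Matched design: standardize by the SD of the difference scores
d = \frac{\bar{d}_{\mathrm{diff}}}{s_{\mathrm{diff}}}
```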
As we have already noted, this approach is based on common practices in meta-analysis and thus it was advocated early in the literature [73, 74]. However, to handle the problem of small sample size and non-normal data, most authors suggest a type of correction for calculating the statistical significance. Therefore, instead of relying on the normal approximation, they propose the permutation test. Although Choi and coworkers suggest permutations to calculate p-values, a faster solution is offered in the Bioconductor package GeneMeta, which assumes a normal distribution on the z-scores after checking the reliability of this hypothesis with a Q–Q plot. In general, all the aforementioned resampling methods can be used, with bootstrapping probably being the most advantageous, since it requires a smaller number of replications. The bootstrap or the permutation methods can also be used in different settings. One option would be to perform an analysis for each study separately, obtain a corrected estimate of variance and then use this to calculate the weights for the meta-analysis. Another option would be to perform the analysis in a single step using the resampling strategy (bootstrap or permutation) in a stratified manner, in which the studies are treated as strata.
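The random-effects pooling step itself can be sketched with the DerSimonian–Laird estimator, applied per gene across studies (the effect sizes and variances below are invented toy numbers; packages such as GeneMeta and metaMA wrap this kind of computation):

```python
def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects pooling of per-study effect
    sizes for one gene. Returns (pooled effect, pooled variance)."""
    w = [1 / v for v in variances]                     # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    # Cochran's Q and the method-of-moments between-study variance.
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Re-weight including the between-study component tau2.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return pooled, 1 / sum(w_star)

# Toy standardized effect sizes for one gene across three studies.
effect, var = random_effects_meta([1.2, 0.9, 1.5], [0.04, 0.05, 0.06])
```

When tau2 is estimated as zero, the result coincides with the fixed-effect (inverse-variance) pooled estimate, so the same function covers both settings.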
All the standard methods reported above can be easily used with this effect size and its variance. The ratio of means has also been used for data other than gene expression, and, in general, it performs well even in small samples. Lately, the ratio of geometric means has also been proposed, especially for skewed data, and its application in the meta-analysis of gene expression data could also be investigated. The points mentioned above regarding bootstrap and permutation are also applicable to this effect size.
The aforementioned methods, since they are standard methods for meta-analysis, can be easily extended to a Bayesian framework. Several studies have been performed to this end, and source code to fit the models is available [79, 80]. In general, Conlon and coworkers [79, 80] use in their models a structure similar to the one Gottardo and coworkers use in their model for single studies; an additional level is added, though, to account for multiple studies. The main problem with the Bayesian methods is the increased computational complexity and time needed to perform the analysis, especially when a large number of genes is investigated, which perhaps limits their applicability. The WinBUGS code to fit the models of Conlon and coworkers is available at http://people.math.umass.edu/~conlon/research/BayesPoolMicro/.
Finally, another promising approach is to use the moderated effect sizes calculated by methods such as limma, instead of the typical effect sizes, in a traditional meta-analysis. This is a two-step method that relies in the first step on an advanced, regularized t-test. Then, using the relation t = d√n, a traditional random-effects meta-analysis is performed. A further modification is that, instead of the usual approximation for the variance of d, the exact calculation given by Hedges is used. This approach is implemented in the R package metaMA (https://cran.r-project.org/web/packages/metaMA/index.html).
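For illustration, the conversion from a (moderated) t statistic to a standardized effect size might be sketched as below. This sketch assumes the familiar two-sample relation d = t·√(1/n1 + 1/n2) and the usual large-sample variance approximation; metaMA itself substitutes the exact Hedges variance in this step.

```python
import math

def t_to_hedges_g(t, n1, n2):
    """Convert a (moderated) t statistic to Hedges' g and an approximate variance.

    Applies the exact small-sample correction
    J(m) = Gamma(m/2) / (sqrt(m/2) * Gamma((m-1)/2)), with m = n1 + n2 - 2.
    The variance is the usual large-sample approximation, not the exact
    Hedges expression used by metaMA.
    """
    m = n1 + n2 - 2
    d = t * math.sqrt(1 / n1 + 1 / n2)
    j = math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2)) / math.sqrt(m / 2)
    g = j * d
    var = (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))
    return g, var
```

The resulting (g, var) pairs can then be pooled with any standard random-effects routine.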
Several major meta-analysis methods for DEG analysis, including fixed-effects and random-effects methods, as well as methods for combining p-values and ranks (see next sections), are implemented in R packages such as GeneMeta and metaMA. The most complete package, however, is MetaDE, which also offers functionality for preprocessing the data, as well as for displaying the results graphically. Stata lacks a meta-analysis command dedicated to microarrays, but several of the methods mentioned here can be easily implemented. As a proof of concept, we describe in the Appendix several approaches for performing random-effects meta-analysis. One approach consists of performing the analysis for each study separately (using the bootstrap or permutation) and then combining the results in the usual way. Another is to perform the meta-analysis in a single step and run the bootstrap or permutation simulation as a wrapper; both should then be performed in a stratified manner, treating the studies as strata.
Another class of meta-analysis methods consists of methods that combine ranks. There are several different approaches, their common denominator being that if the same gene repeatedly appears at the top of lists of up- or down-regulated genes in replicate experiments, the gene is more likely to be truly differentially expressed. The Rank Product method, which we have already described in the context of a single study, uses FC to rank genes and calculates the product of ranks across samples and studies. A similar method, Rank Sum, uses the sum of ranks instead, but all other calculations are identical. The RankProd software is available at: https://www.bioconductor.org/packages/release/bioc/html/RankProd.html.
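The core rank-product computation can be sketched in a few lines (an illustrative Python sketch with hypothetical data; the published method and the RankProd package add a permutation-based significance assessment on top of this):

```python
def rank_products(fold_changes):
    """Rank Product: geometric mean of a gene's ranks across replicates/studies.

    fold_changes: one dict per replicate/study mapping gene -> fold change.
    Rank 1 is the most up-regulated gene; small rank products flag genes
    that sit consistently near the top of every list.
    """
    k = len(fold_changes)
    products = {}
    for fc in fold_changes:
        ranked = sorted(fc, key=fc.get, reverse=True)  # rank by decreasing FC
        for rank, gene in enumerate(ranked, start=1):
            products[gene] = products.get(gene, 1.0) * rank
    return {gene: p ** (1.0 / k) for gene, p in products.items()}
```

Replacing the product with a sum of ranks turns this into the Rank Sum variant mentioned above.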
A related method, termed METRADISC (Meta-analysis of Rank Discovery Datasets), is based on the same principle but is more general [84, 85]. The ranking within each study is performed with any available method (FC, t-test, p-value, etc.), and then the average rank of a particular gene across studies is calculated. The overall mean can be weighted or unweighted; the weighted overall mean resembles the traditional methods for meta-analysis. The between-study heterogeneity of the study-specific ranks can also be computed. METRADISC is implemented in R (http://www.inside-r.org/node/155959) and is also available as a stand-alone application (http://biomath.med.uth.gr/). The rank-based methods are quite robust and can combine studies analysed with different methods. However, their statistical inferences are based on Monte Carlo permutation tests, which may be time-consuming.
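The two METRADISC summary statistics, the average rank and the rank heterogeneity, can be sketched as follows (an illustrative Python sketch; in METRADISC proper, the significance of both statistics is assessed by Monte Carlo permutation of the within-study ranks, and the average may be weighted):

```python
def metradisc_stats(ranks_per_study):
    """Unweighted average rank and rank heterogeneity per gene.

    ranks_per_study: list of dicts mapping gene -> within-study rank,
    obtained by any criterion (FC, t-test, p-value, ...).
    """
    genes = set().union(*ranks_per_study)
    out = {}
    for gene in genes:
        rs = [study[gene] for study in ranks_per_study if gene in study]
        rbar = sum(rs) / len(rs)                 # average rank across studies
        q = sum((r - rbar) ** 2 for r in rs)     # between-study heterogeneity
        out[gene] = (rbar, q)
    return out
```

A gene with a low average rank and low heterogeneity is consistently top-ranked; high heterogeneity flags genes that behave differently across studies.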
The rank-based methods offer several advantages compared to traditional approaches, including the FC criterion: fewer model assumptions and robustness to noisy data and/or low numbers of replicates. These methods overcome heterogeneity across multiple datasets and combine them to achieve increased sensitivity and reliability. Of particular note, they do not require the simultaneous normalization of multiple datasets with the same technique, thereby resolving a key issue in microarray meta-analysis pre-processing. Moreover, because the rank-based methods transform the actual expression values into ranks, they can integrate datasets produced by a wide variety of platforms (Affymetrix oligonucleotide arrays, two-color cDNA arrays, etc.). Finally, the rank-based methods are quite general and can therefore be applied to other types of data, such as proteomics or genetic association data.
2.2.3 Combination of p-values
Interestingly, this is the exact formula of the QFAST method of Bailey and Gribskov, presented independently a few years earlier. Source code implementing the TPM can be obtained from http://statgen.ncsu.edu/zaykin/tpm/. The different approaches for combining p-values have been compared in several evaluation studies [93, 94]. Most of the methods presented in this section are implemented in the metap command, available in both Stata and R.
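As a worked illustration of p-value combination, Fisher's method can be sketched without any statistics library: the statistic X = -2·Σ ln p_i follows a chi-square distribution with 2k degrees of freedom, and for even degrees of freedom the chi-square survival function has the closed form exp(-x/2)·Σ_{i<k} (x/2)^i / i!, which is essentially the product-of-p-values expansion that the QFAST formula exploits.

```python
import math

def fisher_combined_p(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi-square with 2k df.

    Uses the closed-form chi-square survival function for even df:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2) / i   # accumulate (x/2)^i / i!
        total += term
    return math.exp(-x / 2) * total
```

For a single p-value the method returns it unchanged, a useful sanity check.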
Using this (hypothetical) effect size and its variance, standard methods for random-effects meta-analysis can easily be applied. This approach requires only the Z-score, which can either be obtained directly or calculated from the p-value, the direction of the association, and the number of replicates for each condition. This simple approach inherits all the desirable properties of Stouffer's method and, at the same time, performs optimal weighting, quantifies the association, and enables random-effects meta-analysis to account for between-studies heterogeneity. If the original data are analysed with standard methods, the estimated d's are accurate. If, however, a modified version of the t-test or a resampling method is used for statistical significance, some discrepancies may be expected; nevertheless, the Z-score and the statistical significance (p-value) of the overall effect remain accurate. A Stata program that implements this method and compares it against other methods for combining p-values is given in the Supplement.
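The weighted Stouffer combination at the heart of this approach is short enough to sketch directly (illustrative Python; weights would typically be square roots of sample sizes):

```python
import math

def stouffer_weighted(zscores, weights):
    """Weighted Stouffer combination: Z = sum(w_i * z_i) / sqrt(sum(w_i^2))."""
    num = sum(w * z for w, z in zip(weights, zscores))
    return num / math.sqrt(sum(w * w for w in weights))

def z_to_p(z):
    """One-sided p-value from a Z-score (standard normal survival function)."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

The combined Z can then be converted back to an overall p-value with `z_to_p`, or treated as an effect size in a random-effects model as described above.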
2.3 Multiple Comparisons
A typical microarray experiment measures the expression of several thousand genes simultaneously across different conditions. When searching for potential DEGs between two conditions, each gene is treated independently and a t-test (or any other test described above) is usually performed on each gene separately. The number of false positives (i.e., genes falsely declared as DEGs) is proportional to the number of tests performed and the critical significance level (p-value cut-off). When a t-test is performed, the null hypothesis (H0) is usually that of no difference in the gene's expression levels between the conditions, whereas the alternative hypothesis (H1) is that the expression levels differ. If the p-value is less than the chosen significance level, the null hypothesis is rejected. Assuming the null hypothesis holds for every gene, if 10,000 genes are tested at a 5% significance level, 500 genes might be declared significant by chance alone. It is therefore important to correct the p-values when performing statistical tests on a group of genes; this is the purpose of multiple testing correction methods.
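The arithmetic of the example above, and the simplest correction (Bonferroni), can be sketched as follows (illustrative Python; function names are ours):

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true:
    each of the n_tests tests rejects with probability alpha."""
    return n_tests * alpha

def bonferroni_reject(pvalues, alpha=0.05):
    """Bonferroni correction: test each gene at alpha / (number of genes),
    which controls the family-wise error rate at alpha."""
    threshold = alpha / len(pvalues)
    return [p < threshold for p in pvalues]
```

For 10,000 genes at the 5% level, `expected_false_positives(10000, 0.05)` gives the 500 chance findings mentioned above, while Bonferroni would require each gene to pass a 0.05/10000 = 5e-6 threshold.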
Other popular multiple comparison correction methods that control the FDR in microarray analysis and meta-analysis are those proposed by Benjamini and Yekutieli, Benjamini and Liu [103, 104], and Benjamini, Krieger, and Yekutieli. The methods described above are implemented in the multproc command in Stata and in the multcomp package in R.
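These FDR-controlling variants build on the Benjamini–Hochberg step-up procedure [101], whose logic the short sketch below illustrates (an illustrative Python sketch; function and variable names are ours):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the hypotheses with the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Note the step-up character: once the largest qualifying rank is found, all smaller p-values are rejected too, even those that individually miss their own threshold.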
3 Closing Remarks
Microarray experiments enable researchers to analyze a vast amount of genetic information in a single experimental run: the expression of thousands of genes can be measured simultaneously under specific conditions. The use of DNA microarrays is very promising for understanding the effects of genes on diseases and for drug discovery and development. Microarray experiments combined with bioinformatics analysis can reveal a great deal about a biological system and its dynamics. Such approaches, though, like any other emerging technology, come with shortcomings and disadvantages.
One of the main problems with microarray experiments is the lack of standardization. As a result, data collected from different microarray platforms cannot be compared accurately, or even be replicated. In an evaluation study, Ioannidis and coworkers found that a large proportion of published studies could not be reproduced, either partially or completely. This was mainly attributed to data unavailability and to incomplete annotation or specification of data processing and analysis. The authors called for stricter publication rules that would enforce public data availability and explicit description of data processing and analysis. The issue of comparing data generated by different platforms has long been under investigation, and filtering of probes has been shown to significantly improve intra-platform data comparability.
Methods for combining different datasets in a meta-analysis can help researchers alleviate some of the problems mentioned above. However, issues such as the lack of standardization remain important obstacles to the development of such methods. In this chapter, we presented the available methods and provided information about their various implementations. Several recent studies compare the different methods [110, 111, 112]. Notably, the lack of standardization is also apparent in the literature on microarray meta-analysis itself, since different methods, and combinations of methods, have been used. We have shown that, especially in the case of meta-analysis of effect sizes, various combinations of methods can be used. The final choice depends on the available software, the number of genes analysed, the number of studies, and the different platforms to be combined. We have also shown that a meta-analysis of p-values can be performed under a random-effects model. It is also worth mentioning that, apart from the commonly used DerSimonian and Laird estimator, there are many other methods for calculating the between-studies variance in a random-effects meta-analysis; some of these are better suited to small and heterogeneous samples. Moreover, there are approximate Bayesian methods for meta-analysis that do not rely on simulations and are hence faster. Taken together, these points suggest that there is plenty of room for improvement in microarray meta-analysis methodology and software, both in terms of accuracy and speed. A recent systematic search in PubMed resulted in the empirical evaluation of 333 articles on microarray meta-analysis studies.
The results of this evaluation were very interesting: apart from the three general classes of methods presented earlier (effect sizes, ranks, p-values), a large proportion of the published studies were found to have used the inappropriate approach of pooling datasets. This is a well-known issue in the meta-analysis literature; pooling datasets simply to create a larger one is not recommended, since it can lead to various types of bias. Vote counting, in which one counts the number of studies in which a gene was declared significant, is another commonly used approach that is also not recommended.
In this chapter, we presented a review of the methodological issues pertaining to microarray data analysis and meta-analysis. The relevant microarray databases and available software were also presented. Moreover, statistical methods of microarray data analysis were illustrated by a case study.
- 26. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
- 30. Witten D, Tibshirani R (2007) A comparison of fold-change and the t-statistic for microarray data analysis. Analysis 1776:58–85
- 42. StataCorp (2013) Stata statistical software: release 13. StataCorp LP, College Station, TX
- 43. R Core Team (2016) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
- 55. Lönnstedt I, Speed T (2002) Replicated microarray data. Stat Sin 12:31–46
- 69. Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. L. Erlbaum, Hillsdale, New Jersey
- 70. Petitti DB (1994) Meta-analysis, decision analysis and cost-effectiveness analysis. Monographs in epidemiology and biostatistics, vol 24. Oxford University Press, Oxford
- 77. Friedrich JO, Adhikari NK, Beyene J (2012) Ratio of geometric means to analyze continuous outcomes in meta-analysis: comparison to mean differences and ratio of arithmetic means using empiric data and simulation. Stat Med 31(17):1857–1886. https://doi.org/10.1002/sim.4501
- 87. Fisher RA (1946) Statistical methods for research workers, 10th edn. Oliver and Boyd, Edinburgh
- 94. Cousins RD (2007) Annotated bibliography of some papers on combining significances or p-values. arXiv preprint arXiv:07052209
- 95. Stouffer SA, Suchman EA, De Vinney L et al (1951) The American soldier: adjustment during army life. Studies in social psychology in World War II, vol 1. Princeton University Press, Princeton
- 97. Dudoit S, Yang YH, Callow MJ et al (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578
- 98. Sidak Z (1967) Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc 62:626–633
- 100. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70
- 101. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 57(1):289–300
- 103. Benjamini Y, Liu W (1999) A distribution-free multiple test procedure that controls the false discovery rate. Tel Aviv University, Tel Aviv