Beyond the one-way ANOVA for ’omics data
Abstract
Background
With ever increasing accessibility to high throughput technologies, more complex treatment structures can be assessed in a variety of ’omics applications. This adds an extra layer of complexity to the analysis and interpretation, in particular when inferential univariate methods are applied en masse. It is well known that mass univariate testing suffers from multiplicity issues and, although this has been well documented for simple comparative tests, few approaches have focussed on more complex explanatory structures.
Results
Two frameworks are introduced incorporating corrections for multiplicity whilst maintaining appropriate structure in the explanatory variables. Within this paradigm, a choice has to be made as to whether multiplicity corrections should be applied to the saturated model, putting emphasis on controlling the rate of false positives, or to the predictive model, where emphasis is on model selection. This choice has implications for both the ranking and selection of the response variables identified as differentially expressed. The theoretical difference is demonstrated between the two approaches along with an empirical study of lipid composition in Arabidopsis under differing levels of salt stress.
Conclusions
Multiplicity corrections have an inherent weakness when the full explanatory structure is not properly incorporated. Although a unifying ‘single best’ recommendation is not provided, two reasonable alternatives are provided and the applicability of these approaches is discussed for different scenarios where the aims of analysis will differ. The key result is that the point at which multiplicity is incorporated into the analysis will fundamentally change the interpretation of the results, and the choice of approach should therefore be driven by the specific aims of the experiment.
Keywords
Multiplicity, Model selection, ’omics, ANOVA
Abbreviations
ANOVA: Analysis of variance
BH: Benjamini-Hochberg
FDR: False discovery rate
FWER: Family-wise error rate
MRF: Model, rank, filter
OFDR: Overall false discovery rate
PFER: Per-family error rate
RFM: Rank, filter, model
RCBD: Randomized complete block design
Background
There are two main avenues of analytical methods that can be applied to such ’omics data: multivariate methods and mass univariate methods (see e.g. [1, 2]). Each has its own advantages and disadvantages, but in combination they can elicit a great deal of insight into the underlying mechanisms that generated the data [3, 4, 5, 6, 7]. In this paper, we predominantly focus on the issues that arise in the latter methods, the mass univariate approach, but would like to emphasise that these methods should add to the insight obtained from multivariate approaches rather than act as a replacement.
Mass univariate analysis
In contrast to multivariate approaches, mass univariate methods investigate each response variable independently. Although the covariance structure between response variables is not accounted for in such an analysis, this approach will provide information about the variability in individual response variables, and, in particular, how it relates to components of the explanatory structure. For example, the sets of treatment conditions that are statistically significant for each response can be identified. This is particularly advantageous when an experiment includes complex explanatory structures with multiple explanatory terms. Standard statistical techniques such as ANOVA (analysis of variance), REML (restricted maximum likelihood) and regression can be used to investigate both the biological size and statistical significance of different explanatory model components in an inferential framework. Moreover, these analysis approaches can cope with complex treatment structures (such as the factorial treatment structure in the above example), unbalanced design structures and missing values.
The linear model for a single response variable can be written as
\(y_{i} = X_{i}\beta + \varepsilon_{i}, \quad i = 1, \ldots, n, \qquad (1)\)
where n is the number of observations, y_{i} is the ith observation of a single response variable y, X_{i} is the ith row of the design matrix X, β is the vector of coefficients to be estimated and ε_{i} are independent error terms. Fitting and analysing such models is a completely standard statistical approach using the methods identified above.
where y_{ji} is the ith observation on the jth treatment. Once fitted, the estimated parameters of this model can be assessed. Specifically, using either a two-sample t-test (for J=2) or an F-test (for J>2), the null hypothesis of no difference in response in the J different treatments can be tested. When testing at a pre-specified significance level, α, of 0.05, the probability of a type I error (the false positive rate) is controlled at 5%. However, it is well known that applying such a univariate analysis approach, en masse, to multiple response variables, will result in an unacceptably high type I error rate over the whole experiment due to the issue of multiple testing (also referred to as multiplicity), described below.
Multiple testing
which when the individual level of significance is α=0.05, gives α^{∗}=0.40. Thus, to control the overall probability over R tests (the number of response variables) of at least one type I error at a specified level α, the significance level for each test becomes 1−(1−α)^{1/R}, which in practice is often approximated by the slightly smaller, and hence slightly more conservative, value α/R, the Bonferroni correction. This adjustment controls the family-wise error rate (FWER), defined to be the probability of at least one type I error. Alternative ways of controlling the type I error in a multiplicity framework include controlling the per-family error rate (PFER), which is defined to be the expected number of type I errors, and the false discovery rate (FDR), defined as the expected proportion of false discoveries. A plethora of methods for controlling the type I error, through differing multiple testing corrections, can be found in the literature (see [8, 9] and [10] for reviews). These often have different aims, such as which error rate is to be controlled, whether or not tests can be considered independent and whether the chosen error rate should be controlled in the strong or weak sense [8].
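As a minimal sketch of the arithmetic above, the following Python computes the family-wise error rate over R independent tests and the two per-test significance levels discussed. The function names are illustrative only, and R = 10 is an assumption, chosen because it gives a family-wise error rate close to the quoted value of 0.40 at α = 0.05.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one type I error over n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def exact_per_test_level(alpha: float, n_tests: int) -> float:
    """Per-test level 1 - (1 - alpha)^(1/R) that holds the FWER at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def bonferroni_per_test_level(alpha: float, n_tests: int) -> float:
    """Bonferroni per-test level alpha / R."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    n_tests = 10  # assumed number of response variables, for illustration only
    print("FWER without correction:", round(familywise_error_rate(alpha, n_tests), 3))
    print("Exact per-test level:   ", exact_per_test_level(alpha, n_tests))
    print("Bonferroni level:       ", bonferroni_per_test_level(alpha, n_tests))
```

Under these assumptions the Bonferroni level is never larger than the exact level, which is why it errs on the conservative side.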
It is reasonable in an ’omics framework to assume there will be substantial positive dependence between response variables. This means that any correction based on the assumption of independent tests will be highly conservative and will consequently reduce the power of the test for fixed sample sizes. Alternative procedures are available that account for this dependence, for example [11] employed resampling methods and [12, 13, 14] adjusted the significance level based on an effective number of independent tests N_{eff}<N.
Where there is only a single explanatory variable in the model, as in Eq. 2, any of the above multiplicity corrections can be incorporated without difficulty to ensure the overall type I error rate of the whole experiment does not become overinflated.
Incorporating explanatory structures
However, there is a lack of coherence within the literature when explanatory structures become more complex. As the cost of ’omics experiments decreases, experimenters are increasingly generating ’omics datasets from experiments with more complex designs, including both crossed and nested treatment structures. Examples can be found in transcriptomics [15, 16, 17], proteomics [18, 19] and metabolomics (as exemplified from the open source database PMR [20]). In all of these scenarios, mass univariate analysis can be applied to investigate the effect of the explanatory structure on a response-by-response basis.
where y_{jki} is the response due to the ith observation on the jth genotype for the kth salt stress, a_{j} is the effect associated with genotype j, b_{k} is the effect associated with salt level k, (ab)_{jk} is the interaction effect of genotype j and salt level k and ε are independent error terms.
The two-way ANOVA table for a single lipid response variable (TAG 52.6) with treatment factors genotype and salt
Term           Df  Sum Sq  Mean Sq  F value  Pr(>F)
Genotype        3  27.503    9.168   38.230  < 0.001
Salt            2  11.852    5.926   24.711  < 0.001
Genotype:Salt   6   2.328    0.388    1.618    0.171
Residual       36   8.633    0.240
The corresponding one-way unstructured ANOVA table for a single lipid response variable (TAG 52.6), where Genotype:Salt is a single explanatory term corresponding to all combinations of genotype and salt conditions
Term           Df  Sum Sq  Mean Sq  F value  Pr(>F)
Genotype:Salt  11  41.682    3.789   15.802  < 0.001
Residual       36   8.633    0.240
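The link between the two tables can be checked with a few lines of arithmetic. The sketch below uses only the degrees of freedom and sums of squares reported for TAG 52.6. It pools the three treatment terms of the two-way table into the single unstructured treatment term of the one-way table and recomputes the F ratios. Variable and function names are illustrative.

```python
# Degrees of freedom and sums of squares copied from the two-way table for TAG 52.6.
two_way = {
    "Genotype": (3, 27.503),
    "Salt": (2, 11.852),
    "Genotype:Salt": (6, 2.328),
}
residual_df, residual_ss = 36, 8.633


def f_ratio(df: int, ss: float, res_df: int, res_ss: float) -> float:
    """Variance ratio: mean square of a term over the residual mean square."""
    return (ss / df) / (res_ss / res_df)


# F value for each structured term of the two-way analysis.
f_values = {
    term: f_ratio(df, ss, residual_df, residual_ss)
    for term, (df, ss) in two_way.items()
}

# Pooling all treatment terms gives the unstructured one-way treatment term.
one_way_df = sum(df for df, _ in two_way.values())
one_way_ss = sum(ss for _, ss in two_way.values())
one_way_f = f_ratio(one_way_df, one_way_ss, residual_df, residual_ss)

if __name__ == "__main__":
    for term, value in f_values.items():
        print(f"{term}: F = {value:.3f}")
    print(f"One-way: df = {one_way_df}, SS = {one_way_ss:.3f}, F = {one_way_f:.3f}")
```

Up to rounding in the reported sums of squares, the pooled term reproduces the 11 degrees of freedom and the F value of the one-way table, with the residual line unchanged.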

identify which response variables show significant levels of variation due to the explanatory model (taking account of multiplicity), and

assess how these response variables are important by identifying the components of the explanatory model showing significant effects.
In the following, we present a general paradigm of analysis, through which we develop the theory behind two methods for incorporating a multiplicity correction within the general linear model framework. This is supported by a simulation study before demonstrating the methods on an application in lipidomics, where we investigate the biological understanding that can be gained from such an approach. We end with a discussion on the methods, highlighting current limitations and possible extensions.
Methods
Typically, linear models as defined in (1) are used both to estimate the effects due to individual explanatory variables (or the interactions between two or more explanatory variables), and to test the statistical significance of these effects. For the analysis of a designed experiment, it is conventional to fit the saturated (or maximal) model, and it is important to make a distinction between this model and the predictive model, used to generate predictions for different combinations of the explanatory variables (usually factors). The saturated model includes all of the explanatory terms associated with the factors included in the design of the experiment, and provides the basis for assessing the statistical significance of each of these terms (blocking factors, main effects of treatment factors and interaction effects between two or more treatment factors). Having determined the statistically significant effects relative to the estimated background (observation-to-observation) variability, the predictive model can be formed, containing all statistically significant terms plus terms that are marginal to these (i.e. both main effects must be included in the predictive model if the associated interaction effect is statistically significant). Where an experiment is carefully designed to be balanced and orthogonal, this selection of the predictive model is straightforward as all terms in the saturated model will be independent. For more general explanatory structures, it may be more difficult to define a saturated model, so that the fitted model will usually be the predictive model.
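The marginality rule described above can be illustrated with a short sketch. It assumes that model selection is done by simple thresholding of the p-values from the saturated model, which is only one possible selection process, and that interaction terms are named by joining factor names with ":" in a consistent order. The function names and the example p-values are hypothetical; 0.0005 stands in for a reported value of "< 0.001".

```python
from itertools import combinations


def marginal_terms(term: str) -> set[str]:
    """Return all lower-order terms marginal to `term`, e.g. 'A:B' -> {'A', 'B'}."""
    factors = term.split(":")
    lower = set()
    for order in range(1, len(factors)):
        for combo in combinations(factors, order):
            lower.add(":".join(combo))
    return lower


def predictive_model(p_values: dict[str, float], alpha: float = 0.05) -> set[str]:
    """Keep terms with p < alpha, then add every term marginal to a kept term."""
    selected = {term for term, p in p_values.items() if p < alpha}
    for term in list(selected):
        selected |= marginal_terms(term)
    return selected


if __name__ == "__main__":
    # Non-significant interaction: only the two main effects are retained.
    print(predictive_model({"Genotype": 0.0005, "Salt": 0.0005, "Genotype:Salt": 0.171}))
    # Significant interaction: both main effects are retained by marginality.
    print(predictive_model({"Genotype": 0.4, "Salt": 0.3, "Genotype:Salt": 0.01}))
```

Blocking factors are ignored here for brevity; in a designed experiment they would normally be retained regardless of their significance.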

A ranking of responses in order of significance [RANK]

A filtering process to discard non-significant responses (incorporating corrections for multiplicity) [FILTER]

A model selection step to define the predictive model (i.e. to choose the important terms in the explanatory structure) for each response [MODEL].
The order in which these steps are implemented will depend upon the aims of the study.
Approach 1: rank, filter, model (RFM)
1. For each of the R response variables, fit the linear model with a single one-way (unstructured) treatment term and calculate the associated one-way ANOVA as in example (4) to obtain an overall test of significance at this first step.
2. Rank the responses based on the significance of this overall test and apply a multiplicity correction of choice to this set of R tests.
3. Filter out the responses deemed non-significant after the multiplicity correction.
4. For the remaining response variables, apply a model selection process to the full explanatory structure, generating a predictive model for each significant response variable.
To be explicit, let p^{(1)},...,p^{(R)} be the unadjusted p-values for the responses 1...R of the one-way F-test in step 1. After a multiplicity correction in step 2, the adjusted p-values are given by min(1,c^{(1)}p^{(1)}),..., min(1,c^{(R)}p^{(R)}), where c^{(r)} is the adjustment applied to the p-value for response r (e.g. under a Bonferroni correction c^{(r)}=R, ∀r∈{1...R}). Equivalently, the significance level, the probability of a type I error, has been adjusted by a factor of 1/c^{(r)}, which under a Bonferroni correction is 1/R. Thus, each response is assessed against a specific significance level α/c^{(r)}, r=1...R. This response-specific significance level is used for the filtering in step 3 and is carried through as the significance level for the model selection in step 4.
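The ranking and filtering steps can be sketched as follows, taking the unadjusted one-way p-values as given. The Bonferroni adjustment uses c^{(r)} = R for every response, and the Benjamini-Hochberg adjustment is written in its usual step-up form. The function names and the example p-values are hypothetical and are not taken from the study.

```python
def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Adjusted p-values min(1, R * p) for R tests."""
    r = len(p_values)
    return [min(1.0, r * p) for p in p_values]


def bh_adjust(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    r = len(p_values)
    order = sorted(range(r), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * r
    running_min = 1.0
    for rank in range(r, 0, -1):  # walk from the largest p-value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * r / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_responses(p_values: list[float], adjust, alpha: float = 0.05) -> list[int]:
    """Indices of responses whose adjusted p-value does not exceed alpha."""
    return [i for i, p in enumerate(adjust(p_values)) if p <= alpha]


if __name__ == "__main__":
    one_way_p = [0.001, 0.008, 0.02, 0.03, 0.6]  # hypothetical unadjusted p-values
    print("Bonferroni keeps:", filter_responses(one_way_p, bonferroni_adjust))
    print("BH keeps:        ", filter_responses(one_way_p, bh_adjust))
```

In this illustration the Bonferroni correction retains fewer responses than the BH correction, so fewer responses would reach the model selection step.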
Approach 2: model, rank, filter (MRF)
1. For each response variable, apply a model selection process to the full explanatory structure. This will then define specific explanatory structures for each response, yielding the predictive model for each of the R responses. Note, no multiplicity correction is applied at this step as the aim is to construct a predictive model for each response based on a consistent measure of significance.
2. For each response-specific predictive model, obtain the associated one-way (unstructured) treatment term and fit the corresponding linear model that has a single treatment term to obtain an overall test of significance through a one-way ANOVA.
3. Rank the responses based on the significance of this overall test and apply a multiplicity correction of choice to this set of R tests.
4. Filter out the response variables deemed non-significant after the multiplicity correction.
This is depicted in Fig. 2b. As with RFM, this will result in different responses having different significant explanatory variables and hence can be used to classify responses according to their final explanatory structure. However, in contrast to RFM, the model selection step in MRF, regardless of the method used, will be consistent across all response variables, whilst the overall test for significance will differ. Since this test is applied after model selection, models for different response variables will have differing numbers of explanatory terms and, as such, the overall F-test will have response-specific numerator and denominator degrees of freedom.
A comparison
Although the general framework is very similar for incorporating multiplicity corrections and model selection in these two approaches, they give fundamentally different classifications of the response variables. The approach that should be used will depend on the scenario in question. In the case of balanced orthogonal designs, where all model terms are either blocking or treatment factors, the RFM approach seems favourable as the multiplicity correction is applied to the saturated model. In a regression framework, where model terms may include observational variables, intuitively model selection is more important, motivating the use of MRF which puts more emphasis on the model selection process.
A general two-way ANOVA table consisting of a factorial treatment structure between factors A and B
Term      df                  SS        MS                                                                            F
A         (t_{A}−1)           SS_{A}    \(\tfrac{1}{(t_{\mathrm{A}} - 1)}{SS}_{\mathrm{A}}\)                           MS_{A}/MS_{res}
B         (t_{B}−1)           SS_{B}    \(\tfrac{1}{(t_{\mathrm{B}} - 1)}{SS}_{\mathrm{B}}\)                           MS_{B}/MS_{res}
A:B       (t_{A}−1)(t_{B}−1)  SS_{AB}   \(\tfrac{1}{(t_{\mathrm{A}} - 1)(t_{\mathrm{B}} - 1)}{SS}_{\text{AB}}\)        MS_{AB}/MS_{res}
Residual  N−t_{A}t_{B}        SS_{res}  \(\tfrac{1}{(N - t_{\mathrm{A}}t_{\mathrm{B}})}{SS}_{\text{res}}\)
where under the null hypothesis that there is no difference between the t_{A}×t_{B} different treatments, \(\mathrm{F}_{A*B} \sim F_{t_{\mathrm{A}}t_{\mathrm{B}} - 1,\, N - t_{\mathrm{A}}t_{\mathrm{B}}}\).
on t_{A}+t_{B}−1 and N−(t_{A}+t_{B}) degrees of freedom.
Thus, the overall test statistic under RFM (given by F_{A∗B}) can be directly related to the overall test statistic under MRF (given by F_{A+B}), see Section 1 of the Supplementary Material in Additional file 1 for details. This relationship is shown in Fig. 2c for t_{A}=4, t_{B}=3, N=36 and an overall F-statistic of F_{A∗B}=2, which implies an associated p-value of p_{A∗B}=0.075. Thus, regardless of the significance of any individual main effects, under RFM this response variable would be filtered out as being non-significant. However, as can be seen in Fig. 2c, when the interaction term is sufficiently non-significant (and consequently one or more main effects are highly significant since the overall variance ratio is fixed at 2), MRF would correctly identify a significant response variable. Thus, RFM is a conservative approach that has the potential for prematurely filtering out responses.
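The quoted p-value can be checked numerically. With t_{A}=4, t_{B}=3 and N=36, the overall test under RFM has t_{A}t_{B}−1 = 11 and N−t_{A}t_{B} = 24 degrees of freedom. The sketch below evaluates the upper tail of the F distribution using only the Python standard library, through a continued-fraction evaluation of the regularised incomplete beta function. This is one standard numerical route and not necessarily how the value was originally computed; in practice a library routine such as `scipy.stats.f.sf` would usually be preferred. All function names are illustrative.

```python
from math import exp, lgamma, log


def _beta_cf(a: float, b: float, x: float, max_iter: int = 300, eps: float = 3e-14) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularised_incomplete_beta(a: float, b: float, x: float) -> float:
    """I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(f_value: float, df_num: float, df_den: float) -> float:
    """P(F > f_value) for an F distribution on (df_num, df_den) degrees of freedom."""
    x = df_den / (df_den + df_num * f_value)
    return regularised_incomplete_beta(df_den / 2.0, df_num / 2.0, x)


if __name__ == "__main__":
    t_a, t_b, n_obs = 4, 3, 36
    print(f_upper_tail(2.0, t_a * t_b - 1, n_obs - t_a * t_b))
```

The printed value should lie close to the 0.075 quoted above, which is larger than 0.05 before any multiplicity correction is applied.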
By moving the filtering step of the RFM approach to after the model selection step (i.e. a rank, model, filter (RMF) approach as described in Additional file 1 of the Supplementary Material), this conservativeness can be mitigated. However, ranking in the first step will, in general, result in a more conservative multiplicity correction than the MRF approach, which applies multiplicity corrections to the one-way test of the predictive model and is therefore associated with larger residual degrees of freedom.
Results
The methods derived above are demonstrated and compared through a comprehensive simulation study before being applied to the lipidomics dataset previously described. For this dataset, we also consider in detail the interpretation of such analyses through the presentation and visualisation of the output.
Simulation study
1. RFM, with Benjamini-Hochberg (BH) multiplicity correction controlling the FDR at 5% and model selection via F-tests
2. MRF, with Benjamini-Hochberg multiplicity correction controlling the FDR at 5% and model selection via F-tests
3. RFM, with Bonferroni multiplicity correction controlling the FWER at 5% and model selection via F-tests
4. MRF, with Bonferroni multiplicity correction controlling the FWER at 5% and model selection via F-tests
To compare the RFM and MRF approaches, we investigated the observed error rates within the simulation study. Throughout, we focus our discussion on the analysis under a BH multiplicity correction. Further discussion under a Bonferroni multiplicity correction is given in Section 3.3 of the Supplementary Material in Additional file 1.
Case study: Lipidomics
When both RFM and MRF are applied to the lipidomics dataset described previously, under a BH correction to control the FDR, only minor differences are seen in the number of lipids identified as showing differential expression. Specifically, of the 131 lipids, 129 were identified as showing differential expression between treatments under RFM, whilst under MRF all 131 lipids were identified. However, discrepancies between the methods become apparent when assessing the selected predictive model, with around 4% fewer lipids found to have a significant interaction term under RFM compared with MRF. These differences were exacerbated with the application of a more stringent Bonferroni correction (not shown).
Although the identification of ‘significant’ response variables is useful, greater insight can always be gained by coupling the notion of statistical significance to biological relevance. Within the ’omics framework, it is common to use high throughput technologies to identify a list of the most ‘interesting’ response variables to investigate further. In the simplest case, where comparisons are only made between two treatment conditions, comparisons are often visualised through volcano plots, which show the relationship between statistical significance (often scaled logarithmically) and biological relevance (often presented in terms of fold change to a baseline condition). This enables easy identification of the ‘most interesting’ responses that are both statistically significant and biologically relevant. However, for more complex explanatory (treatment) structures, it is far from clear how to identify ‘interesting’ responses.
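For the two-condition case, the coordinates of a volcano plot can be sketched as below. The sketch assumes strictly positive mean responses on the original scale, so that a fold change relative to the baseline is defined. The function names and the thresholds used to flag a response (a two-fold change and a significance level of 0.05) are illustrative assumptions, not values taken from the study.

```python
import math


def volcano_point(mean_treatment: float, mean_baseline: float, p_value: float) -> tuple[float, float]:
    """Return (log2 fold change to baseline, -log10 p-value) for one response."""
    log2_fold_change = math.log2(mean_treatment / mean_baseline)
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p


def is_interesting(point: tuple[float, float], min_abs_log2_fc: float = 1.0, alpha: float = 0.05) -> bool:
    """Flag a response that is both large in fold change and statistically significant."""
    log2_fold_change, neg_log10_p = point
    return abs(log2_fold_change) >= min_abs_log2_fc and neg_log10_p >= -math.log10(alpha)


if __name__ == "__main__":
    # Hypothetical response: a four-fold increase with p = 0.001.
    point = volcano_point(8.0, 2.0, 0.001)
    print(point, is_interesting(point))
```

With more than two treatment conditions there is no single fold change or single p-value per response, which is the difficulty raised above.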
Overall assessment
In addition, these approaches can be used to characterise the response variables through the identified predictive model. For example, the lipidomics response variables can be categorised into 8 different groups, where each group contains response variables with the same predictive model. As shown in the simulation study, the predictive models identified by the two approaches differ and so different categorisations of the response variables are obtained as shown in Fig. 5c and e.
More meaningful insight can be obtained by coupling these overall measures of significance with measures of biological relevance. One approach to obtaining measures of biological relevance is through predictions obtained from the predictive model selected in the two-step analysis procedures. Since an overall assessment of significance based on the saturated model (as under RFM) may have very little relevance to predictions from the predictive model, the overall assessment of significance based on the predictive model (as under MRF) should be used.
It is important to emphasise that the (biologically relevant measures of) differences are not necessarily statistically significant, and that the aim here is to produce a ranking considering a measure of overall statistical significance coupled with a notion of overall biological relevance. For example, lipid X30.0 is considered to be important as it shows both differential expression (a statistically significant oneway analysis) and the largest fold increase in any treatment condition compared to the baseline of no salt stress in Columbia.
Marginal assessment
1. Assess individual model terms through the appropriately structured ANOVA (or equivalent analysis framework),
2. Incorporate a set of orthogonal contrasts into the treatment structure of the linear model and assess these terms (e.g. through ANOVA),
3. Compare predicted means through pairwise t-tests (also referred to as multiple comparisons or post-hoc tests).
In what follows, we consider extensions of only the first two approaches to the mass univariate framework. There are a multitude of reasons for not considering the third (see, for example, references and discussion in Chapter 8 of [27]).
Since such a marginal assessment will be focussed on the statistical significance of individual model terms, multiplicity corrections should be applied before the predictive model selection. Consequently, for such marginal assessments, the RFM approach is more appropriate.
In addition, the adjusted p-values for each treatment term can be obtained for each response variable. This is shown for the lipidomic data in Fig. 7b, where lipids can be identified according to the most significant treatment factor. For instance, the majority of PC lipids are seen to have more significant effects associated with salt stress, while MGDG lipids have more significant effects associated with differences in genotype.
The above assessment of individual factors extends the analysis obtained from ANOVA to the mass univariate framework. In a similar way to a single univariate analysis, a greater level of detail can be obtained by decomposing the explanatory model structure. If each term within the linear model can be parameterised into a set of one degree of freedom terms (or contrasts), then marginal assessment reduces to a set of hypothesis tests that can each be directly related to a treatment difference, and hence traditional volcano plots can be used. In practice, this will rarely occur but, as exemplified in the lipidomics experiment, contrasts can be included to extract a greater level of detail. For example, rather than simply analysing the factor associated with three levels of salt stress, this factor can be decomposed into two parts in order to test for evidence of a linear trend and deviations from such a trend in the quantitative levels of salt stress. Similarly, the factor associated with four different genotypes of Arabidopsis can be decomposed into two parts in order to test for a) differences in ecotypes (Eutrema vs. Others) and b) differences in genotypes within the same ecotype (Columbia, Ta0 and Shadara). These effects are shown in Fig. 7c and again groups of responses showing similar patterns can be extracted.
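The decomposition of a three-level factor into a linear trend and a deviation from that trend can be sketched with orthogonal contrasts. The sketch assumes equally spaced levels and equal replication, which may not match the actual salt levels used in the experiment, and the treatment means are hypothetical. Under these assumptions the two single degree of freedom sums of squares add up to the two degree of freedom treatment sum of squares.

```python
def contrast_ss(means: list[float], coefficients: tuple[int, ...], replicates: int) -> float:
    """Sum of squares for one contrast among equally replicated treatment means."""
    estimate = sum(c * m for c, m in zip(coefficients, means))
    return replicates * estimate ** 2 / sum(c * c for c in coefficients)


def treatment_ss(means: list[float], replicates: int) -> float:
    """Treatment sum of squares among equally replicated treatment means."""
    grand_mean = sum(means) / len(means)
    return replicates * sum((m - grand_mean) ** 2 for m in means)


# Orthogonal polynomial contrasts for three equally spaced levels (an assumption).
LINEAR = (-1, 0, 1)
DEVIATION = (1, -2, 1)

if __name__ == "__main__":
    salt_means = [1.0, 2.0, 4.0]  # hypothetical means at the three salt levels
    replicates = 4                # hypothetical number of observations per level
    ss_linear = contrast_ss(salt_means, LINEAR, replicates)
    ss_deviation = contrast_ss(salt_means, DEVIATION, replicates)
    print(ss_linear, ss_deviation, treatment_ss(salt_means, replicates))
```

Each contrast sum of squares can then be compared with the residual mean square in the usual way, giving one test per contrast and per response.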
It is clear that as more complex models are considered this approach of marginal assessment becomes more involved. However, the generalised visualisations extending the interpretation of ANOVA into three dimensions (term, effect size, response) can greatly aid the biological interpretation.
Discussion
In this paper, two methods have been introduced for incorporating multiplicity corrections in a mass univariate analysis for non-trivial explanatory (treatment) structures. Each approach has its own advantages and the choice between them will depend upon the context of any analysis. The RFM approach has been shown to be conservative in identifying statistically significant response variables, as the multiplicity correction is applied to the one-way test of the saturated model, whereas the MRF approach overcomes this conservativeness through an inflation of the residual degrees of freedom due to a one-way test of the predictive model. Inflating the residual degrees of freedom under MRF may not always be desirable and an alternative would be to partition the treatment sums of squares into two parts, one associated with the one-way treatment structure of the predictive model and the other associated with a ‘lack-of-fit’ term. In this way, a one-way test of the predictive model can be defined whilst maintaining the residual degrees of freedom for which the experiment was designed.
The approach to use will be influenced by the type of analysis used downstream. Overall assessments (a single assessment per response), such as a single ranking, a classification by predictive model or a categorisation of differentially expressed vs. non-differentially expressed response variables, naturally fit within the MRF framework. In comparison, marginal assessments (multiple independent assessments per response), such as individual treatment effects or an incorporation of contrast effects, are more naturally expressed in the RFM framework.
Alternative approaches
In this paper, attention has been focussed on controlling the multiplicity of tests of the full explanatory model over the R response variables, i.e. at the response variable level within an experiment, through a stagewise hypothesis testing procedure. Alternatives to this stagewise paradigm can be found in the literature, but have limitations to the response-level interpretation. Specifically, rather than controlling the FDR at the response level, an alternative would be to correct for multiplicity over the R responses for each of the p explanatory terms, i.e. at the explanatory term level within an experiment.
The naïve approach might consider that for p explanatory terms over R response variables, a total of R×p tests are required. As such, a global multiplicity correction over this full set of R×p tests could be applied. However, this approach will be vastly over-conservative. Moreover, for all but the simplest form of correction, interpretation becomes difficult as each of the explanatory terms for a single response variable will be assessed at a different level of significance as, for example, under a Benjamini-Hochberg correction. This prevents the use of a model-based interpretation that combines statistical significance with biological relevance as in the examples above.
In orthogonal balanced designs, it is conceivable that controlling multiplicity for each explanatory term separately is desirable. Specifically, p separate multiplicity corrections can be applied, one for each of the p explanatory terms. This then results in the identification of groups of responses that are statistically significant for each separate explanatory term as implemented in [15]. As with the approaches derived in this paper, an additional model selection step can be included to base the multiplicity corrections on the predictive rather than saturated models. Applying these approaches to the simulated data in scenario 1 (Section 4 of the Supplementary Material in Additional file 1), large discrepancies can be seen, particularly in the assessment of significance of the main effects. This is unsurprising, since this Model, Subset, Filter (MSF) approach does not respect the ‘bottom-up’ marginality interpretation of the ANOVA, with main effects potentially omitted without the prior removal of the associated interaction terms. Moreover, the final interpretation becomes limited, as again one cannot analyse the output in a model-based framework as the assessment of individual terms is not consistent within a response variable.
Both the global multiplicity corrections (for all R×p tests) and separate multiplicity corrections (over all R tests for each of the p terms) are incorporated in the limma package [24] for differential expression analysis of RNA-seq and microarray data.
Extensions
This paper has focussed attention on designed experiments where all terms within a linear model are factors and the associated analysis can be extracted through ANOVA. In practice, many ’omics datasets may be better suited to alternative univariate analysis approaches such as linear mixed models (to account for unbalanced designs), linear and generalized linear models (to account for covariates or non-normal responses) and hierarchical models (to account for known dependence structures). Regardless of the complexity of the individual univariate technique, so long as there is a clear definition of a saturated and predictive model, the principles introduced in this paper will hold.
Conclusions
Mass univariate approaches provide a valuable complement to multivariate techniques to analyse and interpret ’omics data. In particular, univariate approaches are often well developed for problematic data, for instance, in dealing with missing values, unequal replication, unbalanced designs and autocorrelated error structures, which can be difficult to incorporate in a multivariate setting. Moreover, mass univariate approaches enable statistical significance and biological relevance to be assessed at an individual response variable level in addition to the profile level assessment obtained through a multivariate analysis.
However, as demonstrated in this paper, when analysing ’omics data through a mass univariate approach, a choice of procedures for incorporating multiplicity corrections is available. To gain a deeper understanding of the mechanisms underlying a particular response variable, it is particularly important to assess the full treatment structure of the design rather than the simplified one-way analysis common in the literature. When coupled with an approach to control the multiplicity rate, different analysis approaches give more or less influence to either fitting the predictive model or correcting for multiplicity. Thus, it is important to be aware of the implicit choice that is being made.
It is often the case that as the number of responses increases, ‘simple’ and ‘easy to interpret’ analyses are preferred. This often comes at the cost of statistical rigor, but equally the interpretation of model-based approaches can be cumbersome. However, models generally provide a deeper insight, for instance, defining the classification of responses into subgroups based on the statistical significance of terms in a model or clustering responses based on predictions in different conditions. Two different procedures for obtaining the predictive model whilst also incorporating multiplicity corrections have been introduced and illustrated on data. The approach to use will be driven by the specific aims of the analysis, namely whether a marginal or overall assessment is most appropriate.
Since this choice in methods arises due to the presence of non-trivial (more complex) explanatory treatment structures, this consideration will become increasingly important as the ability to include more complex treatment structures and collect observational covariates becomes more accessible due to the influx of accessible ’omics technologies.