The multivariate analysis of variance (MANOVA) is a natural generalization of the univariate analysis of variance (ANOVA) to multidimensional observations. That is, two or more possibly correlated dependent variables are simultaneously analyzed using the known values of one or more factors. There are several reasons why a researcher should consider employing a MANOVA instead of separate ANOVAs for each dependent variable. First, a MANOVA provides a more detailed description of the phenomenon under investigation and a better chance of discovering the overall impact of the treatment effect. Second, a MANOVA controls the overall alpha level at the desired level. Third, a MANOVA can increase statistical power. Although the MANOVA offers these advantages, it has some disadvantages as well. One is that a MANOVA is harder to interpret than an ANOVA. Another is the degrees of freedom (df) lost. For this reason, if the hypotheses of interest are univariate in nature, it may be preferable to conduct separate ANOVAs. In practice, however, researchers often find that the response variables are correlated (Farivar, Cunningham, & Hays, 2007).

Many psychological processes are too complex to be measured by a single response variable. For this reason, the MANOVA has been one of the most widely used multivariate methods in published behavioral research. In fact, the technique has been applied successfully in such diverse psychological areas as basic psychological processes (Lindemann & Bekkering, 2009), clinical psychology (Crowell et al., 2008), developmental psychology (Gathercole, Pickering, Ambridge, & Wearing, 2004), educational psychology (Blote, Van der Burg, & Klein, 2001), neuropsychology (Holtzer et al., 2007), and personality and social psychology (Jonas & Sassenberg, 2006). However, as has been pointed out in numerous multivariate texts, including Anderson (2003), Johnson and Wichern (2008), and Hair, Black, Babin, and Anderson (2010), in order for the MANOVA to work appropriately, the statistical hypotheses must be tested on the assumption that the observations are independently and normally distributed with a common covariance matrix.

For testing linear hypotheses in a multivariate design setup, the most often used test statistics are Lawley–Hotelling’s trace, Pillai–Bartlett’s trace, and Wilks’s lambda criterion. Several studies have been conducted to investigate the extent to which MANOVA tests are robust to violations of the underlying assumptions (Finch & Davenport, 2009; Hakstian, Roed, & Lind, 1979; Ito, 1980; Mardia, 1971; Olson, 1974; Sheehan-Holt, 1998). The general conclusion to be drawn from these studies is that for equal group sizes, violations of normality and equality of covariance matrices produce only slightly distorted error rates. However, empirical findings also indicate that for unequal group sizes, the MANOVA criteria result in biased (i.e., conservative or liberal) tests when assumptions are not met, particularly when homoscedasticity is violated. Therefore, before employing these test criteria, it is important to check the tenability of the MANOVA assumptions. Naik (2003) and Jamshidian and Schott (2007) provide reviews of some of the common procedures adopted in practice. Although studies of the robustness properties of these test criteria have been confined to the one-way MANOVA, it seems reasonable to extend their conclusions to the more complex multifactor MANOVA, in light of what has been found in univariate settings (e.g., Vallejo, Fernández, & Livacic-Rojas, 2010a).

In order to avoid the negative impact that heterogeneity of covariance matrices has on multivariate test criteria, several researchers have developed generalizations of the univariate solutions to the Behrens–Fisher problem for testing differences in mean vectors. Among them, Johansen (1980) generalized Welch’s univariate approximate df solution to analyze multivariate data collected in one-way and multiway layouts. Coombs and Algina (1996) extended the univariate Brown–Forsythe (BF) approach to analyze multivariate data collected in a one-way layout, and Vallejo and Ato (2006) and Vallejo, Arnau, and Ato (2007) extended the approximate df BF procedure to the univariate and multivariate repeated measures context. Algina (1994) had previously presented its applications in the context of a split-plot design. Brunner, Dette, and Munk (1997) developed a robust solution for the analysis of univariate heteroscedastic factorial designs based on Box’s (1954) method of matching moments. A rank version of this approximation was used by Brunner, Munzel, and Puri (2002) for the nonparametric Behrens–Fisher problem in the multivariate context. A completely different approach for dealing with unequal covariance matrices is based on the concept of generalized p values proposed by Gamage, Mathew, and Weerahandi (2004). Another option for handling heteroscedastic data directly is to use bootstrapping and permutation resampling methods.

Finch and Davenport (2009) recently compared the performance of the approximate F MANOVA test criteria listed above with their Monte Carlo permutation test analogs, as implemented through the SAS Institute’s (2008) PROC GLM program, when the underlying assumptions were not met. The results indicated that both the approximate F and Monte Carlo tests were liberal when the normality and covariance homogeneity assumptions were simultaneously violated, and, in some cases, they were substantially liberal. More recently, Krishnamoorthy and Lu (2010) compared, under normality, Johansen’s test, the generalized inference procedure proposed by Gamage et al. (2004), and the parametric bootstrap test. They concluded that the latter approach is the only test that appears to be satisfactory for all of the conditions considered. The Johansen procedure performs satisfactorily when the sample sizes are moderate, while the generalized inference procedure appears to be liberal even when the sample sizes are moderately large. Therefore, the parametric bootstrap test could be applied in situations where the Johansen test cannot. However, the computational effort is considerable, and under some departures from normality, the parametric bootstrap test is not always robust.

Vallejo, Fernández, and Livacic-Rojas (2010a) investigated the joint impact of violating variance homogeneity and normality on the performance of the Brunner et al. (1997) and Johansen (1980) procedures for analyzing univariate factorial designs. The results indicated that both approaches consistently controlled the rates of error when the shape of the distribution was symmetric. However, these methods can be affected by asymmetric distributions. Later, Vallejo, Ato, and Fernández (2010b) demonstrated that it is possible to achieve robustness, even when data are extremely heterogeneous and severely skewed, by applying the original Brunner et al. (1997) approach in conjunction with Hall’s transformation. Previously, Vallejo and Ato (2006) had corrected the multivariate BF approach in a manner similar to that employed by Mehrotra (1997) to correct the univariate BF approach. In this work, Vallejo and Ato found that the modified BF approach (referred to hereafter as MBF) was typically robust under conditions where the BF approach was not. The results in Brunner et al. (1997) and Vallejo and Ato, presented in the context of univariate factorial designs or in the context of multivariate analysis of data collected in split-plot designs, can also be used for the problem of comparing several multivariate normal mean vectors when the covariance matrices are arbitrary.

Accordingly, the purpose of the present study was twofold: (1) to show how the Brunner et al. (1997) and Vallejo and Ato (2006) methods can be adapted to analyze multivariate data collected in two-way layouts and (2) to examine the operating characteristics (i.e., Type I and Type II error rates) of the new approximate df solutions for testing omnibus main and interaction effects when the normality and covariance homogeneity assumptions are violated. In addition to examining the behavior of these procedures, for comparative purposes, Johansen’s (1980) test and extensions of the linear model to accommodate multivariate heterogeneous data as implemented through the SAS PROC MIXED module were also studied. The former may serve as a benchmark, since it has been found to be generally robust under conditions similar to those investigated in our article. An advantage of the PROC MIXED approach, however, is that it can easily accommodate incomplete data and will tend to produce correct analyses provided the data are missing at random and the distributional assumptions are met. Also, the usual MANOVA test using Wilks’s lambda criterion available in SAS PROC GLM is included, in spite of its well-known conservatism or liberalism, because users of factorial designs rely on this approach almost exclusively. In all cases, nonorthogonal two-way MANOVA designs, based on the unweighted multivariate linear model, were used.

Definition of the statistical tests

Consider a multivariate two-way layout where factor A has j = 1,…,a levels and factor B has k = 1,…,b levels with i = 1,…,n jk subjects per cell (j, k). In terms of the full rank (FR) or cell means model, the response vector \( {{\text{y}}_{ijk}} = \left( {{Y_{ijk1}},...,{Y_{ijkp}}} \right)\prime \) for the ith subject within the jkth cell of the factorial design is modeled by \( {{\text{y}}_{ijk}} = {\mu_{jk}} + {{\text{e}}_{ijk}}, \) where μ jk represents the population mean vector of the jkth cell and the e ijk are independent random error vectors, each having a p-variate normal distribution with \( E\left( {{{\text{e}}_{ijk}}} \right) = {0} \) and \( V\left( {{{\text{e}}_{ijk}}} \right) = {\Sigma_{jk}}. \) In some cases, the response vector y ijk is modeled by \( {{\text{y}}_{ijk}} = \mu + {\alpha_j} + {\beta_k} + {\gamma_{jk}} + {{\text{e}}_{ijk}}, \) where the parameter vector μ is the overall mean, α j (β k ) is the effect of the jth (kth) level of A (B), and γ jk is the joint effect of treatment levels j and k. The link between the FR model and the less-than-full-rank (LFR) model is \( {\mu_{jk}} = \mu + {\alpha_j} + {\beta_k} + {\gamma_{jk}}, \) with the constraints \( \sum\nolimits_{j = 1}^a {{\alpha_j}} = \sum\nolimits_{k = 1}^b {{\beta_k} = } \sum\nolimits_{j = 1}^a {{\gamma_{jk}}} = \sum\nolimits_{k = 1}^b {{\gamma_{jk}} = 0}, \) since the LFR model is overparameterized (i.e., the number of cell means is less than the number of model parameters to be estimated).

By stacking the subvectors \( {{\text{y}}_{111}},...,{{\text{y}}_{nab}} \), both models can be considered in a unified framework under a multivariate linear model

$$ {\text{Y}} = {\text{X}}\,{\text{B}} + {\text{E}}, $$
(1)

where Y is an n × p matrix of observed data, X is an n × q fixed design matrix with rank R = R(X) = ab < q (= 1 + a + b + ab) for the LFR model and R(X) = ab = q for the FR model, B is a q × p matrix that contains the unknown fixed effects common to all participants, and E is an n × p matrix of random errors. We assume that the rows of Y are normally and independently distributed within each of the ab cells or treatment combinations, with mean vector μ jk and covariance matrix Σ jk . The unbiased estimators of Σ jk are \( {\hat{\Sigma }_{jk}} = 1/\left( {{n_{jk}} - 1} \right)\,{{\text{E}}_{jk}}, \) where \( {{\text{E}}_{jk}} = {\text{Y}}_{jk}^\prime \,{{\text{Y}}_{jk}} - {\hat{\beta '}_{jk}}\,{\text{X}}_{jk}^\prime \,{{\text{Y}}_{jk}} \) are distributed independently as Wishart \( {W_p}\left( {{n_{jk}} - 1,\,{\Sigma_{jk}}} \right) \) and \( {\hat{\beta }_{jk}} \) is the ordinary least squares estimator of the vector β jk (Nel, 1997). We also assume that \( {n_{jk}} - 1 \geqslant p \), so that \( {\mathbf{\hat{\Sigma }}}_{jk}^{ - 1} \) exists with probability one.
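As an illustration, the per-cell estimator \( {\hat{\Sigma }_{jk}} = {{\text{E}}_{jk}}/\left( {{n_{jk}} - 1} \right) \) can be sketched in a few lines of Python (the data here are simulated and purely illustrative; the function name is our own):

```python
import numpy as np

def cell_cov(Y):
    """Unbiased covariance estimate for one cell: Sigma_hat = E/(n - 1),
    where E = Y'Y - n * ybar ybar' is the within-cell error SSCP matrix."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    ybar = Y.mean(axis=0)
    E = Y.T @ Y - n * np.outer(ybar, ybar)   # error SSCP matrix E_jk
    return E / (n - 1)

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 3))                 # n_jk = 10 subjects, p = 3 responses
S = cell_cov(Y)
# agrees with the standard unbiased estimator computed by np.cov
assert np.allclose(S, np.cov(Y, rowvar=False))
```

In a full two-way layout, this computation would simply be repeated over the ab cells to obtain the blocks of the estimated dispersion matrix used below.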

From an FR model perspective, the general linear hypothesis can be written using matrix notation as follows:

$$ {H_0}(t){:}\,\,{{\text{R}}_t}\mu = {0},\,\,\,\,\,\left( {t = A,\,\,B,\,\,{\text{or}}\,\,AB} \right)\,, $$
(2)

where R t is the contrast matrix associated with the specific hypothesis, \( \mu = \left( {{\mu_{11}},...,{\mu_{1b}},\,{\mu_{21}},...,{\mu_{ab}}} \right)\prime \) is a column vector of population cell means, and 0 is the null vector. Throughout the article, the matrices R t used to test the multivariate null hypotheses H 0(A), H 0(B), and H 0(AB) are defined as \( {{\text{R}}_A} = {{\text{C}}_A} \otimes {{\text{I}}_p}, \) \( {{\text{R}}_B} = {{\text{C}}_B} \otimes {{\text{I}}_p}, \) and \( {{\text{R}}_{AB}} = {{\text{C}}_{AB}} \otimes {{\text{I}}_p}, \) where \( {{\text{C}}_A} = \left[ {\left( {\left( {{1_{a - 1}}| - {{\text{I}}_{a - 1}}} \right) \otimes \left( {1/b} \right){1}_b^\prime } \right)} \right], \) \( {{\text{C}}_B} = \left[ {\left( {1/a} \right)1_a^\prime \otimes \left( {{1_{b - 1}}| - {{\text{I}}_{b - 1}}} \right)} \right], \) and \( {{\text{C}}_{AB}} = \left[ {\left( {{{1}_{a - 1}}| - {{\text{I}}_{a - 1}}} \right) \otimes \left( {{{1}_{b - 1}}| - {{\text{I}}_{b - 1}}} \right)} \right] \) are matrices of between-subjects contrasts with full row rank, 1 a(b) is an a (b)-dimensional vector of ones, I a(b) is an a (b)-dimensional identity matrix, the symbol | represents the augmented matrix obtained by appending the columns of two given matrices (.|.), and the circle times operator \( \left( \otimes \right) \) denotes the direct (Kronecker) product of the matrices. Accordingly, R t forms a set of linearly independent contrasts among the levels of the factors for each dependent variable.
Similarly, the contrast vectors for conducting multivariate multiple comparisons involving both marginal means and interaction or tetrad contrasts are defined as \( {{\text{r}}_A} = \left( {{{\text{p}}_{jj\prime }} \otimes {1}_k^\prime } \right) \otimes {{\text{I}}_p}, \) \( {{\text{r}}_B} = \left( {1_j^\prime \otimes {{\text{p}}_{kk'}}} \right) \otimes {{\text{I}}_p}, \) and \( {{\text{r}}_{AB}} = \left( {{{\text{p}}_{jj\prime }} \otimes {{\text{p}}_{kk\prime }}} \right) \otimes {{\text{I}}_p}, \) where \( {{\text{p}}_{jj\prime }} \) contains the coefficients that contrast the jth and j'th row means and \( {{\text{p}}_{kk\prime }} \) contains the coefficients that contrast the kth and k'th column means.
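The Kronecker-product construction of C A, C B, C AB, and R t can be sketched as follows (a minimal Python illustration; the dimensions a = 3, b = 4, p = 2 and the function name are arbitrary choices for the example):

```python
import numpy as np

def contrast_matrices(a, b, p):
    """Between-subjects contrast matrices for a two-way layout and their
    expansion R_t = C_t (x) I_p over the p dependent variables."""
    ones = lambda m: np.ones((m, 1))
    # (1_{m-1} | -I_{m-1}): (m-1) x m contrasts of level 1 against the rest
    D = lambda m: np.hstack([ones(m - 1), -np.eye(m - 1)])
    C_A  = np.kron(D(a), ones(b).T / b)      # main effect of A (averages over B)
    C_B  = np.kron(ones(a).T / a, D(b))      # main effect of B (averages over A)
    C_AB = np.kron(D(a), D(b))               # A x B interaction (tetrad contrasts)
    Ip = np.eye(p)
    return {t: (C, np.kron(C, Ip)) for t, C in
            [("A", C_A), ("B", C_B), ("AB", C_AB)]}

mats = contrast_matrices(a=3, b=4, p=2)
C_A, R_A = mats["A"]
assert C_A.shape == (2, 12)        # (a - 1) x ab
assert R_A.shape == (4, 24)        # (a - 1)p x abp
# each row of a contrast matrix must sum to zero
assert np.allclose(C_A.sum(axis=1), 0)
```

The same dictionary lookup yields the B and AB matrices, whose row counts are (b − 1)p and (a − 1)(b − 1)p, respectively.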

Multivariate version of the Welch–James type statistic (WJ)

It is well known in the literature (see, e.g., Maxwell & Delaney, 2004) that the usual ANOVA can become very sensitive to departures from the homogeneity assumption, particularly in the unbalanced case. To overcome this problem, Welch (1951) and James (1951) suggested weighting the terms appearing in the F ratio to account for the different population variances and estimating the denominator df from the data. Johansen (1980) extended the Welch–James null distribution results to the regression setting. According to Lix, Algina, and Keselman (2003), the multivariate WJ-type statistic given by Johansen is

$$ {T_{W{J_t}}} = \left( {{{\text{R}}_t}\,\hat{\mu }} \right)\prime \,{\left( {{{\text{R}}_t}\,\hat{\Omega }\,{\text{R}}_t^\prime } \right)^{ - 1}}\,\left( {{{\text{R}}_t}\,\hat{\mu }} \right)/c, $$
(3)

where \( \hat{\mu } = \left( {{{\hat{\mu }}_{11}} \ldots \,{{\hat{\mu }}_{1b}},\,{{\hat{\mu }}_{21}} \ldots \,{{\hat{\mu }}_{ab}}} \right)\prime, \;\,\hat{\Omega } = diag\,\left( {{{\hat{\Sigma }}_{11}}/{n_{11}} \ldots \,{{\hat{\Sigma }}_{ab}}/{n_{ab}}} \right),\;\,c = R\left( {{{\text{R}}_t}} \right) + 2A - \left( {6A} \right)/\left[ {R\left( {{{\text{R}}_t}} \right) + 2} \right], \) and \( A = \frac{1}{2}\sum\nolimits_{jk} {\left[ {tr{{\left( {\hat{\Omega }{\text{R}}_t^\prime {{\left( {{{\text{R}}_t}\hat{\Omega }{\text{R}}_t^\prime } \right)}^{ - 1}}{{\text{R}}_t}{{\text{Q}}_{jk}}} \right)}^2} + t{r^2}\left( {\hat{\Omega }{\text{R}}_t^\prime {{\left( {{{\text{R}}_t}\hat{\Omega }{\text{R}}_t^\prime } \right)}^{ - 1}}{{\text{R}}_t}{{\text{Q}}_{jk}}} \right)} \right]/\left( {{n_{jk}} - 1} \right)}. \) The matrix Q jk is a block diagonal matrix of dimension abp associated with X jk , such that the (q, r)th block of Q jk is I p × p if q = r = jk and is zero otherwise, and \( tr\left( \cdot \right) \) denotes the trace of a square matrix.

Johansen (1980) showed that, under \( {H_0}(t){:}\,\,{{\text{R}}_t}\mu = {0},\;{T_{W{J_t}}} \) is distributed approximately as an F random variable with numerator df \( {\hat{f}_1} = R\left( {{{\text{R}}_t}} \right) \) and denominator df \( {\hat{f}_2} = {\hat{f}_1}\,\left( {{{\hat{f}}_1} + 2} \right)\,/\,3\,A. \) Thus, H 0(t) is rejected if \( {T_{W{J_t}}} \geqslant {F_{^{1 - \alpha }}}\left( {{{\hat{f}}_1},{{\hat{f}}_2}} \right). \)
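The WJ computation can be sketched in Python for concreteness; this is an illustrative implementation of Eq. 3 and the df formulas above under our own naming, not the authors' code. The two-cell example tests H 0: μ 1 = μ 2 with R = (I p | −I p), the simplest multivariate Behrens–Fisher setting:

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import f as f_dist

def welch_james(means, covs, ns, R):
    """Johansen's WJ-type test of H0: R mu = 0 with unequal cell covariances.
    means: list of p-vectors; covs: list of p x p matrices; ns: cell sizes;
    R: full-row-rank contrast matrix over the stacked mean vector."""
    mu = np.concatenate(means)
    p = len(means[0])
    Omega = block_diag(*[S / n for S, n in zip(covs, ns)])  # diag(Sigma_jk / n_jk)
    M = R @ Omega @ R.T
    Minv = np.linalg.inv(M)
    T = (R @ mu) @ Minv @ (R @ mu)             # (R mu)'(R Omega R')^{-1}(R mu)
    f1 = R.shape[0]                            # numerator df = rank(R)
    A = 0.0
    for j, (S, n) in enumerate(zip(covs, ns)):
        Q = np.zeros_like(Omega)               # block-diagonal selector Q_jk
        Q[j*p:(j+1)*p, j*p:(j+1)*p] = np.eye(p)
        G = Omega @ R.T @ Minv @ R @ Q
        A += (np.trace(G @ G) + np.trace(G) ** 2) / (2 * (n - 1))
    c = f1 + 2 * A - 6 * A / (f1 + 2)
    f2 = f1 * (f1 + 2) / (3 * A)
    return T / c, f1, f2, f_dist.sf(T / c, f1, f2)

# two-cell example with heteroscedastic covariances and H0 true
rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, size=(15, 2))
g2 = rng.normal(0.0, 3.0, size=(25, 2))
R = np.hstack([np.eye(2), -np.eye(2)])         # H0: mu_1 - mu_2 = 0
stat, f1, f2, pval = welch_james(
    [g1.mean(axis=0), g2.mean(axis=0)],
    [np.cov(g1, rowvar=False), np.cov(g2, rowvar=False)],
    [15, 25], R)
```

For the factorial hypotheses of Eq. 2, one would simply pass the ab cell summaries and the appropriate R A, R B, or R AB matrix.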

Multivariate version of the ANOVA-type statistic (BDM)

For the analysis of univariate heteroscedastic factorial designs, Brunner et al. (1997) proposed a modification of the quadratic forms used in the usual ANOVA, with estimated df based on the classical moment-matching approach of Box (1954). In this section, we extend the ANOVA-type statistic to the case of multivariate heteroscedastic factorial designs. The corresponding multivariate version of the ANOVA-type statistic is given by

$$ {F_{{{N_{t}}}}} = \frac{{N\widehat{\mu }\prime {{\text{P}}_{t}}\widehat{\mu }}}{{{\text{tr}}\left( {{\text{P}}_{t}^{ * }\widehat{{\text{V}}}} \right)}} = \frac{{{{\widehat{{\text{Q}}}}_{1}}}}{{{{\widehat{{\text{Q}}}}_{2}}}}, $$
(4)

where \( N = \sum\nolimits_{j = 1}^a {\sum\nolimits_{\,k = 1}^b {{n_{jk}}} } \) is the total sample size, \( {{\text{P}}_t} = {\text{R}}_t^\prime {\left( {{{\text{R}}_t}{\text{R}}_t^\prime } \right)^{ - 1}}{{\text{R}}_t} \) denotes the projection matrix onto the space spanned by the rows of R t , \( {\text{P}}_t^* \) denotes the diagonal matrix formed from the diagonal elements of P t , and \( \widehat{{\text{V}}} = N\widehat{\Omega }. \)

The distribution of the ANOVA-type statistic F N can be approximated by a \( {\chi^2} \) distribution or by an F distribution with estimated df. For small sample sizes, Brunner et al. (1997) suggested approximating the distribution of F N using the central F distribution with estimated numerator df f 1 and denominator df \( {f_2}. \) This is equivalent to using a ratio of two independent chi-square variables, each divided by its df—that is, \( {F_N}\, \approx \,\left( {\chi_{{f_1}}^2/{f_1}} \right)/\left( {\chi_{{f_2}}^2/{f_2}} \right) = F\,\left( {{f_1},\,\,{f_2}} \right). \)

Using results from Brunner et al. (1997) and well-known theorems on the distribution of quadratic forms (see, e.g., Box, 1954; Mathai & Provost, 1992), it can be shown that under \( {{\text{R}}_t}\mu = {0} \), the distribution of the quadratic form \( {\widehat{{\text{Q}}}_1} = N\hat{\mu }\prime {{\text{P}}_t}\hat{\mu } \) in Eq. 4 is asymptotically equivalent to that of the quantity \( Z = \sum\nolimits_{j = 1}^{\,a} {\sum\nolimits_{\,k = 1}^{\,b} {\sum\nolimits_{\,l = 1}^{\,p} {\lambda_{jk}^{(l)}Z_{jk}^{(l)}} } }, \) where the Zs are mutually independent random variables, each distributed as chi-square with one df, and the λs are the real nonzero latent roots of the matrix P t V. The above sum is approximately distributed as \( {c_1}\chi_{{f_1}}^2, \) where c 1 is a constant and \( \chi_{{f_1}}^2 \) is a central chi-square variable with f 1 df; c 1 and f 1 are determined in such a way that the first two moments (i.e., expectation and variance) of Z equal those of \( {c_1}\chi_{{f_1}}^2, \) respectively. In particular, the numerator df is obtained by solving the following equations simultaneously:

$$ \begin{array}{*{20}{c}} {E\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{(l)}Z_{jk}^{(l)}} } } } \right) = \sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{(l)}} } } = E\left( {{c_1}\chi_{{f_1}}^2} \right) = {c_1}{f_1}} \\ {V\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{(l)}Z_{jk}^{(l)}} } } } \right) = 2\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{{{(l)}^2}}} } } = V\left( {{c_1}\chi_{{f_1}}^2} \right) = 2c_1^2{f_1}} \\ \end{array}; $$
(5)

—that is,

$$ {f_1} = \frac{{{{\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{(l)}} } } } \right)}^{\,2}}}}{{\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{{{(l)}^2}}} } } }}. $$
(6)

Thus, it follows that \( {\widehat{{\text{Q}}}_1}/{c_1}{f_1}\, \approx {c_1}\chi_{{f_1}}^2/{c_1}{f_1}\, \approx \chi_{{f_1}}^2/{f_1}. \) If V were known, f 1 could easily be computed using \( \sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{(l)}} } } = {\text{tr}}\left( {{{\text{P}}_t}{\text{V}}} \right) \) and \( \sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\sum\nolimits_{l = 1}^p {\lambda_{jk}^{{{(l)}^2}}} } } = {\text{tr}}\left( {{{\text{P}}_t}{\text{V}}{{\text{P}}_t}{\text{V}}} \right); \) in practice, V is replaced by its estimate \( \widehat{{\text{V}}}. \)
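In code, the trace identities above make the numerator-df computation a one-liner (an illustrative Python sketch with toy matrices; the function name is ours):

```python
import numpy as np

def bdm_numerator_df(P, V):
    """Box-type numerator df: f1 = tr(P V)^2 / tr(P V P V)."""
    PV = P @ V
    return np.trace(PV) ** 2 / np.trace(PV @ PV)

# sanity check: when P V has r equal nonzero eigenvalues (the homoscedastic
# case), the moment-matching df recovers r exactly
P = np.diag([1.0, 1.0, 0.0])    # toy rank-2 projection matrix
V = 2.0 * np.eye(3)             # homoscedastic covariance
f1 = bdm_numerator_df(P, V)     # equals 2.0 here
```

Under heteroscedasticity the eigenvalues of P t V spread out and f 1 drops below the rank of P t, which is exactly the Box correction at work.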

Again applying the moment-matching principle, it is also possible to determine the estimated denominator df. Now let \( {{\text{Q}}_2} = {\text{tr}}\,\left( {{\text{P}}_t^* {\text{V}}} \right)\,, \) where

$$ {\text{tr}}\left( {{\text{P}}_{t}^{*}{\text{V}}} \right) = N\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\frac{1}{{{n_{jk}}\left( {{n_{jk}} - 1} \right)}}\sum\limits_{i = 1}^{{n_{jk}}} {\left( {{{\text{y}}_{ijk}} - {{\overline {\text{y}} }_{jk}}} \right)\prime {\text{P}}_{ll}^{*}\left( {{{\text{y}}_{ijk}} - {{\overline {\text{y}} }_{jk}}} \right)} } }, $$

and \( {\text{P}}_{ll}^* \) is the jkth p-dimensional submatrix of \( {\text{P}}_t^* = \left\{ {{\text{P}}_{ll}^*} \right\}. \) The distribution of \( {{\text{Q}}_2} = {\text{tr}}\,\left( {{\text{P}}_t^*{\text{V}}} \right) \) is approximated by the distribution of \( {c_2}\chi_{{f_2}}^2/{f_2} \) such that the first two moments coincide; that is,

$$ E\left( {{{\text{Q}}_2}} \right) = N\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {{\text{tr}}\left( {{\text{P}}_{ll}^*{{\text{V}}_{jk}}n_{jk}^{ - 1}} \right)} } = E\left( {{c_2}\chi_{{f_2}}^2/{f_2}} \right) = {c_2} $$
(7)
$$ V\left( {{{\text{Q}}_2}} \right) = 2{N^2}\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {\left[ {{\text{vec}}{{\left( {{\text{P}}_{ll}^*} \right)}^\prime }{\text{V}}_{jk}^2{\text{vec}}\left( {{\text{P}}_{ll}^*} \right)} \right]{{\left[ {n_{jk}^2\left( {{n_{jk}} - 1} \right)} \right]}^{ - 1}}} } = V\left( {{c_2}\chi_{{f_2}}^2/{f_2}} \right) = 2c_2^2/{f_2}. $$

Solving these equations, we have

$$ {f_2} = \frac{{{{\left[ {\sum {_{j = 1}^a} \sum {_{k = 1}^b} {\text{tr}}\left( {{\text{P}}_{ll}^*{{\text{V}}_{jk}}n_{jk}^{ - 1}} \right)} \right]}^2}}}{{{N^2}\sum {_{j = 1}^a} \sum {_{k = 1}^b\left[ {{\text{vec}}{{\left( {{\text{P}}_{ll}^*} \right)}^\prime }{\text{V}}_{jk}^2{\text{vec}}\left( {{\text{P}}_{ll}^*} \right)} \right]{{\left[ {n_{jk}^2\left( {{n_{jk}} - 1} \right)} \right]}^{ - {1}}}} }} $$
(8)

If V is known, f 2 can easily be computed using \( N\sum {_{{j = 1}}^{a}} \sum {_{{k = 1}}^{b}} {\text{tr}}\left( {{\text{P}}_{{ll}}^{*}{{\text{V}}_{{jk}}}n_{{jk}}^{{ - 1}}} \right) = {\text{tr}}\left( {{\text{P}}_{t}^{*}\widehat{{\text{V}}}} \right) \) and \( {N^{2}}\sum {_{{j = 1}}^{a}} \sum {_{{k = 1}}^{b}\left[ {{\text{vec}}\left( {{\text{P}}_{{ll}}^{*}} \right)\prime {\text{V}}_{{jk}}^{2}{\text{vec}}\left( {{\text{P}}_{{ll}}^{*}} \right)} \right]{{\left[ {n_{{jk}}^{2}\left( {{n_{{jk}}} - 1} \right)} \right]}^{{ - {\text{1}}}}} = {\text{tr}}\left( {\widehat{{\text{V}}}{\text{P}}_{t}^{{*2}}\widehat{{\text{V}}}{{\text{D}}^{ * }}} \right)} \) where D* is a diagonal matrix with \( 1/\left( {{n_{jk}} - 1} \right) \) as its jkth diagonal element.
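The denominator-df computation can be sketched the same way (illustrative Python; the diagonal toy matrices below encode two cells with n jk = 11, and the function name is ours):

```python
import numpy as np

def bdm_denominator_df(Pstar, V, Dstar):
    """f2 = tr(P* V)^2 / tr(V P*^2 V D*), with D* = diag(1/(n_jk - 1))
    expanded to match the dimension of V."""
    num = np.trace(Pstar @ V) ** 2
    den = np.trace(V @ Pstar @ Pstar @ V @ Dstar)
    return num / den

# balanced homoscedastic check: two cells with n_jk = 11 each should give
# f2 = sum(n_jk - 1) = 20, the classical pooled error df
Pstar = 0.5 * np.eye(2)      # diagonal elements of a two-level projection
V = np.eye(2)                # equal cell covariances (p = 1 for simplicity)
Dstar = np.eye(2) / 10.0     # 1/(n_jk - 1) on the diagonal
f2 = bdm_denominator_df(Pstar, V, Dstar)
```

Because both numerator and denominator are quadratic in V, the result is invariant to the scale factor N carried by \( \widehat{{\text{V}}} \), which makes the toy check with V = I legitimate.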

In addition to approximating the null distribution of F N by the central \( F\left( {{{\hat{f}}_1},\,{{\hat{f}}_2}} \right) \) distribution with estimated df based on Eqs. 6 and 8, respectively (hereafter referred to as BDM), we also focus on an F approximation with estimated numerator df based on Eq. 6 and denominator df = ∞ (hereafter referred to as BDM*), as suggested by Brunner et al. (2002) in a nonparametric framework.

Multivariate version of the modified Brown–Forsythe test statistic (MBF)

Practical implementation of the MBF procedure requires estimation of the df of the approximate central p-dimensional Wishart distribution, which can be derived by equating the first two moments of the quadratic form associated with the tth source of variation in model (1) to those of the central Wishart distribution. A detailed explanation of the so-called multivariate Satterthwaite approximation can be found in Vallejo and Ato (2006) in the context of univariate repeated measures. Applying the approach of these authors, Wilks's lambda criterion for testing \( {H_0}(t){:}\,{{\text{C}}_t}\mu = {0} \) against \( {H_1}(t):\,{{\text{C}}_t}\mu \ne {0} \) can be expressed as the following ratio of two determinants: \( \Lambda = |{\text{E}}_t^*|/|\left( {{{\text{H}}_t} + {\text{E}}_t^*} \right)|, \) where the hypothesis matrix (H t ) and the error matrix \( \left( {{\text{E}}_t^* } \right) \) are determined by

$$ {{\text{H}}_{t}} = \left( {{{\text{C}}_{t}}\widehat{{\text{B}}}{\text{A}}} \right)\prime {\left[ {{{\text{C}}_{t}}{{\left( {{\text{X'X}}} \right)}^{{ - 1}}}{\text{C}}_{t}^{\prime }} \right]^{{ - 1}}}\left( {{{\text{C}}_{t}}\widehat{{\text{B}}}{\text{A}}} \right), $$
(9)

and

$$ {\text{E}}_t^* = \left( {{{\hat{f}}_e}/{{\hat{f}}_h}} \right)\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {c_t^\bullet } } {\hat{\Sigma }_{jk}}, $$
(10)

where \( {\hat{f}_e} \) and \( {\hat{f}_h} \) are the approximate df for \( {\text{E}}_t^* \) and H t , respectively, \( c_t^\bullet \) is the jkth diagonal element of the matrix \( {\text{C}}_t^\prime {\left( {{{\text{C}}_t}{\text{D}}{\text{C}}_t^\prime } \right)^{ - 1}}{{\text{C}}_t}{\text{D}}, \) D is a diagonal matrix with \( 1/{n_{jk}} \) as its jkth diagonal element, and A is a p-dimensional identity matrix. Using results from Vallejo and Ato, the approximations to the error and hypothesis df, respectively, are

$$ {\hat{f}_e} = \frac{{{{\left( {{\text{tr}}\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {c_t^\bullet {{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} } } \right)}^2} + {\text{tr}}{{\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {c_t^\bullet {{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} } } \right)}^2}}}{{\sum\limits_{j = 1}^a {\sum\limits_{k = 1}^b {\frac{1}{{{n_{jk}} - 1}}\left[ {{\text{t}}{{\text{r}}^2}\left( {c_t^\bullet {{\hat{\Sigma }}_{jk}}{{\hat{\Xi }}^{ - 1}}} \right) + {\text{tr}}{{\left( {c_t^\bullet {{\hat{\Sigma }}_{jk}}{{\hat{\Xi }}^{ - 1}}} \right)}^{\,2}}} \right]} } }} $$
(11)

and

$$ {\hat{f}_h} = \frac{{{{\left( {{\text{tr}}\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {c_t^\bullet {{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} } } \right)}^2} + {\text{tr}}{{\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {c_t^\bullet {{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} } } \right)}^2}}}{{\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {{M_t}} } + {\text{t}}{{\text{r}}^2}\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {c_t{{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} } } \right) + {\text{tr}}{{\left( {\sum\nolimits_{j = 1}^a {\sum\nolimits_{k = 1}^b {{c_t}{{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} } } \right)}^2}}}, $$
(12)

where \( {{\text{M}}_t} = \left[ {{\text{t}}{{\text{r}}^2}\left( {{{\hat{\Sigma }}_{jk}}{{\hat{\Xi }}^{ - 1}}} \right) + {\text{tr}}{{\left( {{{\hat{\Sigma }}_{jk }}{{\hat{\Xi }}^{ - 1}}} \right)}^2}} \right]\left( {c_t^\bullet - {c_t}} \right) \), \( \hat{\Xi } = \left( {c_{11}^\bullet {{\hat{\Sigma }}_{11}} + ... + c_{ab}^\bullet {{\hat{\Sigma }}_{ab}}} \right), \) and \( {c_t} = {n_{jk}}/N \).

Using the transformation of Wilks’s lambda criterion to an F statistic, \( {H_0}(t):{{\text{C}}_t}\mu = {0} \) is rejected if

$$ {F_{{\text{MB}}{{\text{F}}_t}}} = \frac{{1 - {{\hat{\Lambda }}^{1/\hat{s}}}}}{{{{\hat{\Lambda }}^{1/\hat{s}}}}}\frac{{{{\hat{f}}_2}}}{{{{\hat{f}}_1}}} \geqslant {F_{^{1 - \alpha }}}\left( {{{\hat{f}}_1},{{\hat{f}}_2}} \right), $$
(13)

where \( {\hat{f}_1} = t{\hat{f}_h},\;{\hat{f}_2} = \left[ {{{\hat{f}}_e} - \left( {t - {{\hat{f}}_h} + 1} \right)/2} \right]\hat{s} - \left( {t{{\hat{f}}_h} - 2} \right)/2, \) and \( \hat{s} = {\left[ {\left( {{t^2}\hat{f}_h^2 - 4} \right)/\left( {{t^2} + \hat{f}_h^2 - 5} \right)} \right]^{\tfrac{1}{2}}}, \) with t equal to the dimension of \( {\text{E}}_t^*. \) Here, \( {F_{^{1 - \alpha }}}\left( {{{\hat{f}}_1},\,{{\hat{f}}_2}} \right) \) is the upper 1 − α critical value of an F distribution. In addition, we also consider an F approximation with numerator df = R(C t ) and estimated denominator df based on Eq. 12 (hereafter referred to as MBF*).
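The Λ-to-F transformation in Eq. 13 is straightforward to sketch (illustrative Python; here \( {\hat{f}_h} \), \( {\hat{f}_e} \), and t are supplied directly rather than estimated from data, and the sketch assumes t² + f h² > 5 so that ŝ is defined):

```python
import numpy as np

def wilks_to_F(Lam, fh, fe, t):
    """Transform Wilks's lambda into an approximate F statistic (Eq. 13).
    fh, fe: hypothesis and error df; t: dimension of the error matrix E*."""
    s = np.sqrt((t**2 * fh**2 - 4) / (t**2 + fh**2 - 5))
    f1 = t * fh
    f2 = (fe - (t - fh + 1) / 2) * s - (t * fh - 2) / 2
    L = Lam ** (1 / s)
    return (1 - L) / L * (f2 / f1), f1, f2

# Lambda = 1 (hypothesis determinant contributes nothing) must give F = 0
F, f1, f2 = wilks_to_F(1.0, fh=2.0, fe=20.0, t=2)
```

As Λ shrinks toward zero, the ratio (1 − Λ^{1/ŝ})/Λ^{1/ŝ} grows without bound, so small values of Wilks's lambda map to large F values, as expected.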

Multivariate version of the general linear model (MLM)

In its classical form, the model defined in Eq. 1 assumes that the different rows of E are distributed mutually independently and that the p elements in any row follow a multivariate normal distribution with mean vector 0 and unknown covariance matrix Σ. In other words, while the errors in the lth column of E are independently distributed as \( N\left( {0,{\sigma_{ll}}{{\text{I}}_N}} \right) \), the rows of E have a common covariance matrix Σ that is assumed to be positive definite. Since this model is often not very realistic, suppose that we relax the constant covariance assumption and assume, instead, that the covariance matrices differ from cell to cell. In this case, the matrix V can be modeled as \( {\text{V}} = {\text{diag}}\,\left\{ {{{\text{T}}_{11}},...,{{\text{T}}_{ab}}} \right\}, \) where \( {{\text{T}}_{jk}} = {\Sigma_{jk}} \otimes {{\text{I}}_{{n_{jk}}}} \).

When covariance matrices are unequal across design cells, the appropriate estimator of B is not the ordinary least squares estimator. For known V, the appropriate estimator of B is the generalized least squares (GLS) estimator given by

$$ \widehat{{\text{B}}} = {\left( {{\text{X}}\prime {{\text{V}}^{{ - 1}}}{\text{X}}} \right)^{{ - 1}}}{\text{X}}\prime {{\text{V}}^{{ - 1}}}{\text{Y}}, $$
(14)

with dispersion matrix

$$ V\left( {\widehat{{\text{B}}}} \right) = {\left( {{\text{X}}\prime {{\text{V}}^{{ - 1}}}{\text{X}}} \right)^{{ - 1}}} $$
(15)

If V is unknown, one may estimate B by replacing V with its estimate \( \widehat{{\text{V}}} \) in Eq. 14, where \( \widehat{{\text{V}}} \) is V with its T parameters replaced by their maximum likelihood estimators. Likewise, the variance of the resulting estimator is usually obtained by replacing V with \( \widehat{{\text{V}}} \) in Eq. 15. Although a number of estimation strategies are available, the present article uses REML estimation as implemented through the SAS Institute’s (2008) PROC MIXED program.
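For known V, Eqs. 14 and 15 can be computed directly; the sketch below (illustrative Python, simulated data) chooses V proportional to the identity so that the GLS estimator visibly reduces to ordinary least squares:

```python
import numpy as np

def gls(X, Y, V):
    """Generalized least squares: B = (X' V^{-1} X)^{-1} X' V^{-1} Y (Eq. 14),
    with dispersion matrix (X' V^{-1} X)^{-1} (Eq. 15)."""
    Vi = np.linalg.inv(V)
    XtVi = X.T @ Vi
    disp = np.linalg.inv(XtVi @ X)   # dispersion of B_hat
    B = disp @ XtVi @ Y
    return B, disp

# when V = sigma^2 I, GLS coincides with ordinary least squares
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(12), rng.normal(size=12)])
Y = rng.normal(size=(12, 2))
B_gls, _ = gls(X, Y, 4.0 * np.eye(12))
B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
assert np.allclose(B_gls, B_ols)
```

With a heteroscedastic V, the two estimators differ, and the GLS weighting is what makes the mixed-model approach appropriate when covariance matrices vary across cells.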

Following Vallejo et al. (2007), the multivariate linear model (hereafter referred to as MLM) described in Eq. 1 can be fitted using PROC MIXED by stacking the p n-vectors y i into a single np × 1 vector \( \widetilde{{\text{y}}}, \) the p q-vectors β i into a pq × 1 vector \( \tilde{\beta }, \) and the p n-vectors e i into an np × 1 vector \( \widetilde{{\text{e}}}. \) The resultant vector form can be written as

$$ \widetilde{{\text{y}}} = \widetilde{{\text{X}}}\widetilde{\beta } + \widetilde{{\text{e}}} $$
(16)

where \( \widetilde{{\text{X}}} = {{\text{I}}_{p}} \otimes {\text{X}} \) is an np × pq design matrix with \( R\left( {\widetilde{{\text{X}}}} \right) = abp < pq \) for the LFR model and \( R\left( {\widetilde{{\text{X}}}} \right) = abp = pq \) for the FR model. Basically, the multivariate structure is modeled in the PROC MIXED framework by recasting the multivariate model as a univariate one and including an additional level for the multiple outcomes.
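The stacking in Eq. 16 amounts to a Kronecker product; a minimal Python sketch (the dimensions are toy values chosen for the example, not from the article):

```python
import numpy as np

# toy FR cell-means design: ab = 6 cells, n_jk = 2 subjects each, p = 2 responses
ab, n_per, p = 6, 2, 2
X = np.kron(np.eye(ab), np.ones((n_per, 1)))   # n x q cell-means design matrix
X_tilde = np.kron(np.eye(p), X)                # np x pq stacked design, I_p (x) X

assert X_tilde.shape == (p * X.shape[0], p * X.shape[1])
# full column rank abp, as required for the FR model
assert np.linalg.matrix_rank(X_tilde) == p * ab
```

Each of the p diagonal blocks of \( \widetilde{{\text{X}}} \) is a copy of X, one per outcome, which is exactly the "extra level for the multiple outcomes" that the PROC MIXED recasting introduces.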

An approximate Wald-type F statistic for testing any null hypothesis of the form \( {H_0}:{{\text{R}}_t}\tilde{\beta } = {0} \) is given by

$$ {F_{t}} = \frac{1}{{{f_{1}}}}\left( {{{\text{R}}_{t}}\widetilde{{\text{b}}}} \right)\prime {\left[ {{{\text{R}}_{t}}V\left( {\widetilde{{\text{b}}}} \right){\text{R}}_{t}^{\prime }} \right]^{{ - 1}}}\left( {{{\text{R}}_{t}}\widetilde{{\text{b}}}} \right), $$
(17)

where \( \text{R}_t \) is a matrix of contrasts of rank \( f_1 \), \( \widetilde{\text{b}} \) is the empirical GLS estimator of \( \tilde{\beta} \), and \( V( \widetilde{\text{b}} ) \) is the covariance matrix of \( \widetilde{\text{b}} \). It is known that \( V( \widetilde{\text{b}} ) \) tends to underestimate the true sampling variability of \( \tilde{\beta} \) because it does not accommodate the variability in the estimate of \( V( \widetilde{\text{y}} ) \) when the covariance structure of \( \widetilde{\text{b}} \) is estimated and the associated Wald-type statistics are computed. This is especially important when the sample size is not large enough to support likelihood-based inference and complex covariance structures are used (Kenward, 2001).
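A minimal numeric sketch of the statistic in Eq. 17 may help. The contrast matrix, estimate, and covariance below are toy values chosen for illustration, not the output of an actual GLS fit.

```python
import numpy as np

# Toy evaluation of the Wald-type statistic in Eq. 17:
# F_t = (1/f1) (R b)' [R V(b) R']^{-1} (R b),
# where f1 is the rank of the contrast matrix R.
def wald_f(R, b, V):
    f1 = np.linalg.matrix_rank(R)
    Rb = R @ b
    middle = np.linalg.inv(R @ V @ R.T)
    return float(Rb @ middle @ Rb) / f1

R = np.array([[1.0, -1.0, 0.0]])   # single contrast, rank f1 = 1
b = np.array([2.0, 0.0, 0.0])      # toy GLS estimate
V = np.eye(3)                      # toy covariance matrix of b

print(wald_f(R, b, V))   # 2.0
```

In practice \( V(\widetilde{\text{b}}) \) would be the (Kenward–Roger adjusted) estimated covariance matrix rather than an identity.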

To deal with the effect of bias in the estimated standard errors on inference about the mean structure, Kenward and Roger (1997, 2009) provide a method that can be applied with any fixed-effects and covariance structure. This solution obtains an adjusted covariance matrix of the fixed effects by inflating the conventional estimated covariance matrix of \( \widetilde{\text{b}} \), and then derives an appropriate F approximation by replacing, in Eq. 12, \( V( \widetilde{\text{b}} ) \) with this adjusted estimator and estimating the denominator df for the generalized F-statistic based on it. The generalized F-statistic approximately follows an F-distribution with numerator df \( f_1 = R( \text{R}_t ) \) and denominator df \( f_2 \) estimated with the method proposed by Kenward and Roger. A sample syntax for undertaking the MLM analysis is given in the Appendix.

A Monte Carlo study

In order to evaluate the operating characteristics of the proposed approaches, we carried out two simulation studies using a 3 × 4 factorial MANOVA design. We chose to investigate designs with three and four factor levels because these are typical of designs encountered in education and psychology studies. In fact, our preliminary review of the major Spanish psychological journals between January 2001 and December 2008 confirms that the number of levels in a factor typically ranges between two and six, with a mode of three.

The first study focused on comparing the robustness of the approaches when the normality and covariance homogeneity assumptions did not hold. With this aim the following five variables were manipulated:

  1.

    Number of dependent variables. Since some of the procedures examined in this study have been shown to be affected by the number of dependent variables in other contexts (Brunner et al., 2002; Krishnamoorthy & Lu, 2010), we collected Type I error rates when the number of dependent variables was two, three, and four. We initially planned to investigate p = 6; however, preliminary simulations indicated that using SAS PROC MIXED took an inordinate amount of time when p = 6. It is doubtful that this change substantially affected the results.

  2.

    Total sample size. Since the procedures may be affected by the sample size, the performance of the test statistics was investigated using two different global sample size conditions: N = 108 and N = 216. Only these two sizes were examined, because we felt that they would suffice to provide a comparison between the procedures for small and moderate sample sizes.

  3.

    Degree of sample size imbalance. For each sample size condition, both a null and a moderate degree of cell size inequality were explored, as indexed by a coefficient of sample size variation (C), where \( C = (1/\bar{n})\left[ \sum\nolimits_{jk} ( n_{jk} - \bar{n} )^2 / ab \right]^{1/2} \), with \( \bar{n} \) being the average size of the cells. For N = 108 and C = 0, the size of each cell was \( n_{jk} = 9 \); for N = 108 and C = 0.35, the unequal cell sizes were \( n_{1k} = (6, 6, 7, 8) \), \( n_{2k} = (6, 8, 10, 11) \), and \( n_{3k} = (7, 10, 12, 17) \). When N = 216, the sizes of these cells were doubled.

  4.

    Relationship of cell sizes to dispersion matrices. Both positive and negative relationships between cell sizes and dispersion matrices were investigated. In the positive relationship, the larger cell sample sizes were associated with the larger dispersion matrices; in the negative relationship, the smaller cell sample sizes were associated with the larger dispersion matrices. The heterogeneous covariance matrices we consider here were obtained as follows: \( \Sigma_{jk} = (2 \times j \times k)/2 \). Details on these structures when the number of dependent variables was four (p = 4) are shown in Table 1.

  5.

    Population distribution shape. While the examined procedures are based on the normality assumption, real data rarely conform to normality (Micceri, 1989). Therefore, in order to investigate the possible effects of the shape of the distribution on the robustness of the procedures, we generated data from normal and nonnormal distributions, both symmetric and asymmetric, using the methods described in Vallejo, Fernández, Livacic–Rojas, and Tuero–Herrero (2011). Specifically, besides the multivariate normal distribution with univariate skewness (γ1) and kurtosis (γ2) equal to zero, the data were obtained from a symmetric light-tailed multivariate distribution with shape parameters equivalent to those of a Laplace distribution (i.e., γ1 = 0; γ2 = 3), from an asymmetric light-tailed multivariate distribution with shape parameters equivalent to those of an exponential distribution (i.e., γ1 = 2; γ2 = 6), and from an asymmetric heavy-tailed multivariate distribution with skewness and kurtosis values of 4 and 42, respectively. These values are well within the range of skew and kurtosis encountered in applied psychological research by Micceri.
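Two of the design conditions above can be checked numerically. The sketch below recomputes the sample-size variation coefficient C for the unequal-cell condition quoted in item 3 and, under one reading of the heterogeneity rule in item 4 (each cell's dispersion matrix being a common base matrix, hypothetical here, scaled in proportion to \( j \times k \)), recovers the 12:1 largest-to-smallest ratio mentioned later in the article.

```python
import numpy as np

# Cell sizes quoted for N = 108, C = 0.35 (a*b = 12 cells).
n_jk = np.array([6, 6, 7, 8,     # n_1k
                 6, 8, 10, 11,   # n_2k
                 7, 10, 12, 17]) # n_3k
n_bar = n_jk.mean()              # average cell size (9.0)
C = np.sqrt(((n_jk - n_bar) ** 2).mean()) / n_bar

# One reading of the heterogeneity rule: Sigma_jk proportional to j*k,
# applied to a hypothetical 2 x 2 base matrix.
base = np.array([[1.0, 0.6],
                 [0.6, 1.0]])
scales = np.array([j * k for j in (1, 2, 3) for k in (1, 2, 3, 4)])
sigmas = [s * base for s in scales]   # the 12 cell dispersion matrices

print(n_jk.sum())                     # 108
print(round(C, 2))                    # 0.35
print(scales.max() / scales.min())    # 12.0
```

The base matrix and the exact scaling constant are assumptions for illustration; the article's own matrices are given in Table 1.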

Table 1 Parameter values of the dispersion matrices used to generate the data and types of population mean configurations examined

The second study focused on comparing the sensitivity of the approaches for detecting main effects and interactions in two-way MANOVA designs, but only for those methods whose validity does not depend on their underlying assumptions being met. The data sets were generated in the same way as in the first study, except that different correlation structures among the dependent variables were employed. Because the power of the multivariate test criteria has been shown to be sensitive not only to effect size but also to the direction and size of the correlations among the dependent variables, we chose to use small and moderate positive correlations, in which all pairwise correlations are .3 and .6, respectively, and negative correlations, in which all pairwise correlations between the dependent variables are -.3 or -.6 for p = 2 and -.3 for p = 4. The different negative correlation values chosen are due to the fact that, for example, a multivariate normal distribution with p = 4 and all pairwise correlations of -.6 would not have a positive definite covariance matrix.
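The positive-definiteness constraint on the negative correlations is easy to verify: an equicorrelation matrix with all pairwise correlations \( \rho \) has smallest eigenvalue \( 1 + (p - 1)\rho \), which must be positive. A small check:

```python
import numpy as np

# Why the negative-correlation condition uses -.3 rather than -.6 for
# p = 4: an equicorrelation matrix with common pairwise correlation rho
# has smallest eigenvalue 1 + (p - 1) * rho.
def equicorr(p, rho):
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))

def is_pos_def(m):
    return bool(np.linalg.eigvalsh(m).min() > 0)

print(is_pos_def(equicorr(2, -0.6)))   # True  : allowed for p = 2
print(is_pos_def(equicorr(4, -0.3)))   # True  : allowed for p = 4
print(is_pos_def(equicorr(4, -0.6)))   # False : 1 + 3*(-.6) = -0.8 < 0
```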

When our interest focused on investigating the sensitivity of the tests for detecting the real effect of the rows (variable A), the cell means corresponding to the simple main effect of B at \( a_1 \) took on the value 0, and those corresponding to the simple main effect of B at \( a_2 \) were based on this value plus the appropriate effect size (i.e., .1, .2, .3, .4, .5, .6, .7, .8, .9, or 1). The remaining cell means were calculated as the mean for factor B at \( a_2 \) plus the appropriate effect size value. On the other hand, when our interest was centered on calculating the power of the columns (variable B), the cell means corresponding to the simple main effect of A at \( b_1 \) took on the value 0, and those corresponding to the simple main effects of A at \( b_2 \) and A at \( b_3 \) were based on this value plus the appropriate effect size. The remaining cell means were calculated as the mean for A at levels \( b_2 \) and \( b_3 \) plus the appropriate effect size value. Finally, when our interest resided in estimating the power of the AB interaction, the cell means at the intersections of levels \( a_1b_4 \), \( a_2b_1 \), and \( a_3b_1 \) took on the value 0, and those corresponding to the simple main effects of A at \( b_2 \) and A at \( b_3 \) were based on this value plus the appropriate effect size. The remaining cell means were calculated as the mean for A at levels \( b_2 \) and \( b_3 \) plus the appropriate effect size value. Details of the procedure described are summarized in Table 1.

Ten thousand replications of each condition were performed using a .05 nominal alpha level. The proportions of rejections under the null and alternative hypotheses were computed to evaluate the operating characteristics of the methods investigated. A SAS/IML macro was used for all calculations.
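The bookkeeping just described, many replications at a nominal alpha with the proportion of rejections recorded, can be sketched with a deliberately simple stand-in test. The z-test below is a toy univariate surrogate, not one of the multivariate statistics compared in the article.

```python
import numpy as np

# Sketch of the error-rate bookkeeping: 10,000 replications at nominal
# alpha = .05, with the proportion of rejections recorded. A two-sided
# z-test on a normal mean (known sigma = 1) stands in for the article's
# multivariate statistics.
rng = np.random.default_rng(2021)
reps, n = 10_000, 30

samples = rng.standard_normal((reps, n))   # data under H0: mu = 0
z = samples.mean(axis=1) * np.sqrt(n)      # one z-statistic per replication
rejections = np.abs(z) > 1.959964          # two-sided .05 critical value
empirical_alpha = rejections.mean()

print(empirical_alpha)   # should land near .05
```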

Results of the robustness study

We used Bradley’s (1978) liberal criterion to facilitate the comparison between our results and those obtained by other researchers in similar studies. According to this criterion, tests whose empirical error rate \( \hat{\alpha} \) lies in the interval \( .5\alpha \leqslant \hat{\alpha} \leqslant 1.5\alpha \) are considered robust. Therefore, for the nominal α level employed in this study (α = 5%), the interval used for defining the robustness of the tests was \( 2.5 \leqslant \hat{\alpha} \leqslant 7.5 \). Correspondingly, a test procedure was considered too liberal if its estimated Type I error rate was greater than 7.5% and too conservative if the error rate was smaller than 2.5%.
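The criterion reduces to a three-way classification of empirical rates, which can be written out directly:

```python
# Bradley's (1978) liberal criterion as used in the text: with nominal
# alpha expressed in percent, a test is robust if its empirical rate
# falls in [.5*alpha, 1.5*alpha], liberal above it, conservative below.
def bradley_label(rate_pct, alpha_pct=5.0):
    lo, hi = 0.5 * alpha_pct, 1.5 * alpha_pct
    if rate_pct > hi:
        return "liberal"
    if rate_pct < lo:
        return "conservative"
    return "robust"

print(bradley_label(5.3))   # robust
print(bradley_label(9.1))   # liberal
print(bradley_label(1.8))   # conservative
```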

In Tables 2, 3 and 4, we present the simulated α-levels (in percentages) for the evaluated multivariate test statistics. Summarizing the simulations, we found the following:

  1.

    When data were obtained from multivariate symmetric distributions, both normal (i.e., γ1 = 0; γ2 = 0) and nonnormal (i.e., γ1 = 0; γ2 = 3), the descriptive analyses indicated that the error rates for the multivariate versions of the ANOVA-type statistic (i.e., BDM and BDM*), the multivariate versions of the modified Brown–Forsythe procedure (i.e., MBF and MBF*), and the MLM approach were contained within the robustness interval for all the investigated conditions. For p = 2, the WJ procedure also maintained Type I error rates close to the nominal value. However, for p = 3 and p = 4, it tended to have an inflated Type I error rate that improved as the sample size increased. On the other hand, it was also found that the MANOVA test using the transformation of Wilks’s lambda criterion to an F statistic showed robustness properties similar to those of the ANOVA test, although it became less robust as the dimensionality increased.

  2.

    When data were obtained from an asymmetric light-tailed multivariate distribution (i.e., γ1 = 2; γ2 = 6), the descriptive analyses indicated that the error rates for the BDM, BDM*, and MBF* procedures were contained within the robustness interval for all the investigated conditions. The rates of Type I error for the MBF procedure were frequently conservative. The tendency to be conservative was worse when sample sizes were small. On the other hand, the WJ and MLM procedures were occasionally liberal for the test of the interaction effect when the smaller cell sample sizes were associated with the larger dispersion matrices, with the degree of liberalness decreasing with increases in the sample sizes. With respect to the MANOVA test, the results shown in the tables indicate a substantial effect of the presence of covariance heterogeneity and a very slight effect of the absence of multivariate normality.

  3.

    When data were obtained from an asymmetric heavy-tailed multivariate distribution (i.e., γ1 = 4; γ2 = 42), the descriptive analyses indicated that the error rates for the BDM* and MBF* procedures were contained within the robustness interval for all the investigated conditions. The rates of Type I error for the BDM and MBF procedures were frequently very conservative, while the rates of the WJ and MLM procedures were occasionally liberal for the test of the interaction effect. Also, not surprisingly, the pattern of results obtained with the MANOVA test was not affected by the shape of the distribution. Once more, the critical variable was the absence of covariance homogeneity.

Table 2 Empirical Type I error rates for the 3 × 4 factorial MANOVA design (p = 2 & rho = .6)
Table 3 Empirical Type I error rates for the 3 × 4 factorial MANOVA design (p = 3 & rho = .6)
Table 4 Empirical Type I error rates for the 3 × 4 factorial MANOVA design (p = 4 & rho = .6)

Results of the study of power

When data were obtained from asymmetric multivariate distributions, both light-tailed (i.e., γ1 = 2; γ2 = 6) and heavy-tailed (i.e., γ1 = 4; γ2 = 42), error rates of the BDM* and MBF* procedures were scarcely affected by the lack of homogeneity. It is also clear from Tables 2, 3 and 4 that the MLM approach controls Type I error rates in most of the investigated conditions. The other methods could not be recommended because of their conservatism or liberalism. Therefore, we focused on comparing the BDM*, MBF*, and MLM approaches with regard to their sensitivity in detecting differences between nonnull mean vectors of the two-way MANOVA design.

To assess the power of the BDM*, MBF*, and MLM tests, we manipulated the number of dependent variables (p = 2 or p = 4), total sample size (N = 108 or N = 216), degree of sample size imbalance (C = 0 or C = .35), relationship of cell sizes to dispersion matrices (positive or negative), effect size (.1, .2, …, 1), direction and size of the correlations among the dependent variables (±.3 or ±.6), and population distribution shape (normal or exponential). Preliminary simulations suggested that little would be learned beyond the conclusions reached using the normal and exponential distributions, so performance under the Laplace distribution was not examined. The power of the three analytical procedures was estimated at each effect size for unweighted row and column effects and for tests of the interaction. For tests of the interaction, the power curves, averaged across total sample size, degree of sample size imbalance, and relationship of cell sizes to dispersion matrices, are displayed graphically (Figs. 1 and 2).
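The way such a power curve is traced can be sketched with the same toy z-test used earlier for the Type I error bookkeeping: for each effect size, record the proportion of rejections over many replications. This is a univariate stand-in, not the multivariate statistics whose curves appear in Figs. 1 and 2.

```python
import numpy as np

# Tracing an empirical power curve: for each effect size d, estimate the
# rejection rate of a two-sided z-test (known sigma = 1) under H1: mu = d.
rng = np.random.default_rng(7)
reps, n = 10_000, 30
effect_sizes = np.arange(0.1, 1.05, 0.1)   # .1, .2, ..., 1

power = []
for d in effect_sizes:
    data = rng.standard_normal((reps, n)) + d   # data under H1
    z = data.mean(axis=1) * np.sqrt(n)
    power.append(float((np.abs(z) > 1.959964).mean()))

# Power should rise with effect size.
print(power[0] < power[4] < power[-1])   # True
```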

Fig. 1

Estimated power curves corresponding to the interaction (p = 2), averaging across the total sample sizes and type of pairing of cell sizes, for the robust multivariate tests in comparison: MBF*, BDM*, and MLM. The underlying distribution is multivariate normal (above) or multivariate exponential (below) with positive correlation (left) or negative correlation (right)

Fig. 2

Estimated power curves corresponding to the interaction (p = 4), averaging across the total sample sizes and type of pairing of cell sizes, for the robust multivariate tests in comparison: MBF*, BDM*, and MLM. The underlying distribution is multivariate normal (above) or multivariate exponential (below) with positive correlation (left) or negative correlation (right)

The power results of the Monte Carlo experiments displayed in Figs. 1 and 2 can be summarized as follows:

  1.

    Overall, the analytical procedures performed better when the total sample size and/or the effect sizes increased. The procedures also performed generally better when group sizes and covariance matrices were positively paired than when they were negatively paired. Averaging across all the manipulated conditions, the overall success was 22.4% for the BDM*, 36.3% for the MBF*, and 45.0% for the MLM.

  2.

    For data with positively correlated variables, the differences between the MBF* and BDM* were less than 1 percentage point regardless of number of dependent variables, population distribution shape, or size of the correlations among the dependent variables investigated. With positive correlation, MLM turned out to be uniformly superior to its competitors, correctly rejecting the null hypothesis 29.3% of the time. The percentage of rejections achieved by the rest of the approaches was 20.7% with BDM* and 21.4% with MBF*.

  3.

    For data with negatively correlated variables, the MBF* and MLM procedures performed much better than the BDM* approach, which was generally little affected by the direction of the correlations among the dependent variables. With negative correlation, MLM turned out to be uniformly superior to its competitors, correctly rejecting the null hypothesis 60.7% of the time. The percentage of rejections achieved by the rest of the approaches was 24.1% with BDM* and 51.2% with MBF*.

  4.

    When the different variables were positively correlated, the results show that the power of all three tests declined slightly as the number of dependent variables increased, with differences ranging from 4.5 to 6.7 percentage points. In the case of positively correlated variables, we also found that as the correlation between dependent variables increased, the power of the tests decreased slightly, with differences ranging from 0.7 to 2.8 percentage points. In contrast, when the different variables were negatively correlated, the power of the MBF* and MLM tests increased moderately as the number of dependent variables increased, with differences ranging from 13.2 to 14.5 percentage points. In addition, stronger relationships among the variables were associated with higher power, particularly for smaller numbers of dependent variables, with differences ranging from 14.0 to 17.2 percentage points.

  5.

    Finally, the simulated power curves displayed in Figs. 1 and 2 reveal that the pattern of power for the MLM approach was slightly affected by the form of the distribution. Specifically, when data were generated from an asymmetric distribution with a moderate degree of kurtosis (i.e., exponential-type data), the MLM approach was slightly more powerful than under the multivariate normal distribution, although the averaged power differences did not reach 5 percentage points. In contrast, the power rates of the BDM* and MBF* tests were scarcely affected by the form of the distribution.

Basically, the pattern of results found when our interest was centered on comparing the sensitivity of the methods for detecting the row and column effects was qualitatively similar to the pattern shown in Figs. 1 and 2 for the interaction. Hence, in order to conserve space, the results are neither presented nor discussed here but are available from the first author.

Discussion and recommendations

The operating characteristics of the three approximate df solutions based on ordinary least squares (i.e., MBF, BDM, and WJ) were compared with an empirical form of generalized least squares method (i.e., MLM) under violations of normality and covariance homogeneity in a multivariate factorial design setup. The BDM and WJ approaches used in this article constitute a direct extension of the works of Brunner et al. (1997) and Johansen (1980), respectively. In the specific context of univariate factorial designs, previous studies had revealed that the two procedures were generally robust to violations of underlying assumptions (Keselman, Carriere, & Lix, 1995; Vallejo et al., 2010a, b). On the other hand, MBF and MLM have also been proven to be robust when they have been used to analyze longitudinal data with unequal dispersion matrices (Vallejo & Ato, 2006; Vallejo et al., 2007). Our results are consistent with those obtained in the aforementioned studies; furthermore, they provide new findings that help researchers in the selection of feasible alternatives for testing main and interaction effects.

With respect to robustness, it is noteworthy that of the evaluated methods, only the BDM* and MBF* procedures performed satisfactorily under every condition manipulated. When the data were extracted from multivariate symmetric distributions, the BDM, BDM*, MBF, MBF*, and MLM approaches were much closer to the nominal alpha level over the variety of conditions considered in the experiments than were the WJ and PROC GLM tests. Rates of Type I error for the PROC GLM were very conservative (below 0.6%) or very liberal (above 23%) when cell sizes and covariance matrices were positively or negatively paired, respectively, while the performance of the WJ procedure for tests of the interaction was clearly affected by the type of pairing of cell sizes and covariance matrices, as well as the number of dependent variables and the total sample size. However, for the null hypothesis of zero row and column effects, the WJ procedure provided adequate control of the Type I error rates for all the situations considered.

When data were obtained from multivariate asymmetric distributions, our results also indicated that both the BDM* and MBF* procedures were able to provide very effective Type I error control, while the rates for the BDM and MBF approaches were often conservative, especially when the data were generated from extremely skewed distributions. It is also important to note that the PROC GLM results were again markedly affected by the lack of homogeneity among the covariance matrices and scarcely by the lack of normality, while the remaining procedures studied (i.e., MLM and WJ) tended to yield liberal values for tests of the interaction under negatively paired conditions. On the whole, the results revealed that the MLM approach provided better Type I error control than did the WJ test, although it, too, occasionally became liberal. Although previous findings (Krishnamoorthy & Lu, 2010) indicated that error rates for the WJ were frequently well controlled for the one-way MANOVA, the data from this study demonstrate that the WJ procedure was often liberal when interactive effects were involved in the design, particularly when the number of dependent variables increased and total sample size was small, an observation made by Lix et al. (2003) in the context of multivariate repeated measures data.

With respect to power, the results of our study for tests of the interaction, commonly the most interesting issue for researchers (Maas & Snijders, 2003), showed that the MLM procedure was uniformly more powerful than its most direct competitors. In fact, averaging across all the conditions manipulated, the overall success was 22.4% for the BDM*, 36.3% for the MBF*, and 45.0% for the MLM. The BDM* and MBF* procedures behaved similarly for data with positively correlated variables regardless of the number of dependent variables, the population distribution shape, or the size of the correlations among the dependent variables investigated. However, the BDM* procedure always gave less power than did the MBF* procedure for data with negatively correlated variables. Of course, these power differences are not at the expense of high Type I error rates, since both procedures provide similar control of the error rates. Hence, on the basis of our results and those reported by Bathke, Harrar, and Madden (2008), we recommend using the modified-type ANOVA statistic with caution.

Given that the MLM approach was only occasionally liberal when cell sizes and covariance matrices were negatively paired, and then by only a small amount, and that the MLM was uniformly more powerful than the BDM* and MBF* tests, the MLM can be recommended for most applied researchers. Only if it were crucial to maintain strict alpha control would a researcher choose BDM* or MBF* over MLM. Consequently, on the basis of the power analysis, it is reasonable to consider that the MLM approach constitutes an appropriate solution for analyzing two-way MANOVA models when the underlying assumptions are not met. In addition, it should not be forgotten that this approach can easily be performed with standard statistical packages, such as SAS (SAS Institute, 2008) PROC MIXED.

Notwithstanding the above, one reviewer noted that the degree of covariance heterogeneity used in this article seems small. As was mentioned above, the largest and smallest covariance matrices differed by a 12:1 ratio. This ratio may seem considerable; however, Keselman et al. (1998) noted that in a review of articles published in prominent education and psychology journals, ratios as large as 24:1 and 29:1 were observed in factorial completely randomized designs. In response, we reanalyzed the robustness of the BDM*, MBF*, and MLM methods by varying the degree of covariance heterogeneity. To have some reassurance that these methods will perform well for data encountered in practice, we examined their performance when the unequal covariance matrices were in the ratios 16:1, 25:1, and 36:1 and the data were obtained from the multivariate normal and multivariate nonnormal distributions considered in this article. As was expected, the results obtained using the heteroscedastic methods based on ordinary least squares (i.e., MBF*, BDM*) were very similar to those reported in Tables 2, 3 and 4, so the details are not reported here but are available from the authors upon request. Perhaps the most important result of this reanalysis was that under normality, the empirical form of the generalized least squares method (i.e., MLM) might be a reasonable strategy. If moderate to severe skewness is present, however (e.g., γ1 = 2; γ2 = 6 or γ1 = 4; γ2 = 42), the MLM method can, on occasion, be nonrobust.

To determine the generalizability of our findings, we also re-examined the robustness of the BDM*, MBF*, and MLM methods by varying the degree of kurtosis (i.e., γ1 = 0 and γ2 = 12, 24, 36, and 54). The results obtained using three dependent variables (p = 3) and a ratio of (largest to smallest) standard deviations of 5 (these results are available from the authors upon request) show that Type I error rates for the symmetric moderate-tailed distributions (γ1 = 0 and γ2 = 12 and 24) are reasonably close to .05, but also that the methods become conservative as the distributions become more heavy-tailed. In fact, for symmetric heavy-tailed distributions (γ1 = 0 and γ2 = 36 and 54), the three methods tended to produce conservative results. Without entering into a more extended discussion, it is important for the interpretation of the results to note that the MBF* method was more conservative for negative than for positive pairing conditions, while the BDM* and MLM methods were frequently conservative for several of the conditions of positive and negative pairings of group sizes and covariance matrices, and their interaction error rates were almost always below those for the MBF* method. It is also important to note that with kurtosis values equal to or greater than 36, the sample size required by these methods to achieve robustness could be much larger than researchers are likely to have available to them in the behavioral sciences.

To conclude, we would like to add four comments. First, the pattern of results was very clear and consistent, so we feel that they could be generalized to a wider range of conditions. Nevertheless, readers should keep in mind that our findings are suggestive as to what might happen in several heteroscedastic and nonnormal cases, but not definitive for all types of heterogeneity and nonnormality. Second, if the distribution from which the data are sampled is heavy-tailed, the adverse effects of nonnormality can probably be overcome by substituting robust measures of location (e.g., trimmed mean) and scale (e.g., Winsorized covariance matrices) for the usual mean and covariance matrices. According to Keselman, Algina, Lix, Wilcox, and Deering (2008), one argument for the trimmed mean is that it can have a substantial advantage in terms of accuracy of estimation when sampling from heavy-tailed symmetric distributions without altering the hypothesis tested, because it represents the center of the data. In particular, Wilcox (2005) has shown that with modest sample sizes, the 20% trimmed mean performs well in many situations because it is able to handle a high proportion of outliers. Third, if the distribution from which the data are sampled is skewed, choosing a completely valid alternative is complex. One possible solution could be to use a predetermined amount of trimming and symmetrizing transformations (see Luh & Guo, 2004; Vallejo et al., 2010b) to contend with skewness, and heteroscedastic statistics to contend with heterogeneity (e.g., the modified versions of the MBF and BDM methods). It is also possible to use asymmetric trimming, determining the amount of trimming in each tail on the basis of descriptive measures of distribution shape.
Another option for overcoming the deleterious effects of nonnormality is to use permutation and bootstrapping methods, not forgetting the solution that involves replacing the original data with ranks and then applying the heteroscedastic techniques considered in this article. Lastly, in future research, it would also be informative to examine the performance of the linear model using techniques that allow error distributions other than the normal and relax the requirement of constant variability. The SAS PROC GLIMMIX module, a generalization of the MIXED and GENMOD procedures, allows researchers to fit a sequence of models to data when normality and covariance assumptions are not necessarily satisfied (see Schabenberger, 2007, for details and suggestions). Nonparametric resampling methods, such as the bootstrap and permutation, as well as the MLM approach conducted through SAS PROC GLIMMIX, have not been evaluated in comparative studies, but they may be able to offer competitive results.