QANOVA: quantile-based permutation methods for general factorial designs

Population means and standard deviations are the most common estimands to quantify effects in factorial layouts. In fact, most statistical procedures in such designs are built toward inferring means or contrasts thereof. For more robust analyses, we consider the population median, the interquartile range (IQR) and more general quantile combinations as estimands in which we formulate null hypotheses and calculate compatible confidence regions. Based upon simultaneous multivariate central limit theorems and corresponding resampling results, we derive asymptotically correct procedures in general, potentially heteroscedastic, factorial designs with univariate endpoints. Special cases cover robust tests for the population median or the IQR in arbitrary crossed one-, two- and higher-way layouts with potentially heteroscedastic error distributions. In extensive simulations, we analyze their small sample properties and also conduct an illustrating data analysis comparing children’s height and weight from different countries.


Introduction
Factorial designs are popular in various fields such as ecology, biomedicine and psychology (Cassidy et al. 2008;Mehta et al. 2010;Kurz et al. 2015) as they allow us to study interaction effects between different factors alongside their main effects. In fact, Lubsen and Pocock (1994) pointed out that "it is desirable for reports of factorial trials to include estimates of the interaction between the treatments." The ANOVA-F-test is the most common tool for this but suffers from restrictive assumptions such as homoscedasticity and normality. Thus, several tests have been developed that allow for non-normal errors or are valid for heteroscedastic one-and two-way or even more general factorial designs (Johansen 1980;Brunner et al. 1997;Bathke et al. 2009;Pauly et al. 2015;Friedrich et al. 2017a, b;Harrar et al. 2019).
All these procedures describe effects by (contrasts of) means. This is in line with a phenomenon observed in various areas: Comparisons are mainly based upon means or variances but not on their robust counterparts. This can be explained in part by the simplicity and elegance gained by using linear or, under independence, additive statistics. Nevertheless, it contradicts the important role of statistics based on quantiles, like the median and the interquartile range (IQR), in data exploration and modeling, e.g., in boxplots or summary statistics. The interest in analyzing quantiles has led to the development of quantile regression, which is commonly established nowadays (Koenker and Hallock 2001;Koenker et al. 2019). However, as, e.g., stressed by Beyerlein (2014) "it appears to be quite underused in medical research." One reason may be that, although there exist several approaches for specific designs (Sen 1962;Potthoff 1963;Fung 1980;Hettmansperger and McKean 2010;Fried and Dehling 2011;Chung and Romano 2013), there does not exist an equal abundance of methods based on quantiles for general factorial designs. There are procedures, at least for the median, but they often require strong distributional assumptions (as symmetry) or, at least, an extension to factorial designs is missing. Therefore, the main aims of the present paper are to develop inference procedures (tests and compatible confidence regions) (i) for the median, the IQR or any linear combination of quantiles. (ii) for factorial designs to study robust main and interaction effects. (iii) for general heterogeneous or heteroscedastic models beyond normality. (iv) being theoretically valid and performing satisfactorily for finite samples.
To achieve these goals, we combine and extend the ideas of Chung and Romano (2013) (tests for equality of medians in one-way ANOVA models) and Pauly et al. (2015) (mean-based testing procedures in general factorial designs) to (simultaneously) infer arbitrary linear contrasts of general quantiles. In view of (ii) and (iv), we follow the idea of permuting studentized Wald-type statistics to obtain methods that are finitely correct in case of exchangeable data (e.g., under the null hypothesis of equal means/medians in the classic F-ANOVA normal model) but also asymptotically valid for general non-exchangeable settings. This alluring technique has originally been developed for special two-sample models (Neuhaus 1993;Janssen 1997;Janssen and Pauls 2003;Pauly 2011) and has recently displayed its full strength to obtain accurate methods in one-way (Chung and Romano 2013) and more general factorial designs Friedrich et al. 2017a;Smaga 2017;Harrar et al. 2019).
However, to derive the fore-mentioned theoretical evidence in our general quantilebased approaches we could not employ the methods derived in the previously mentioned papers. In fact, to overcome some technical difficulties that occur when jointly permuting sample quantiles, we had to take a detour in which we extended some results for general permutation empirical processes and uniform Hadamard differentiability (van der Vaart and Wellner 1996) that are of own mathematical interest. Anyhow, this finally results in (i)-(iv), i.e., a flexible toolbox for inferring contrasts of different quantiles in factorial designs. In the special case of the median and its bootstrap-based variance estimator, we obtain the one-way permutation test derived in Chung and Romano (2013).
Outline: We first introduce the model, estimators for population quantiles and how to formulate null hypotheses in them to test for certain main or interaction effects. In Sect. 3, we state the theory to handle the joint asymptotics for sample quantiles and their covariance matrix estimators. As the latter are crucial to obtain the correct dependency structure, we study different approaches: kernel density estimators, bootstrapping or certain interval estimates. As they are mostly only known for the sample median, we explain in Sects. 3.1-3.3 how to extend them to our general situations. From these findings, we deduce three different asymptotically valid testing procedures. To improve their small sample performance, we consider their respective permutation versions in Sect. 4, prove asymptotic exactness and analyze their power under local and fixed alternatives. To compare the small sample behavior of the resulting tests, we conducted extensive simulations presented in Sect. 5. Finally, we illustrate the new methodology by analyzing a recent dataset on height and weight of children from different countries in Sect. 6. All proofs details to higher-way layouts and additional simulation results are deferred to supplement.

The setup
We consider a general model given by mutually independent random variables, e.g., corresponding to the outcome from independent patients in randomized clinical trials, with absolutely continuous distribution functions F i and densities f i . This setup allows to incorporate factorial structure of different kinds by adequately splitting up indices. To accept this, consider, e.g., a two-way design with factors A (a levels) and B (b levels). Setting k = a · b, we split up the group index i = (i 1 , i 2 ) and model observations as X i 1 i 2 j ∼ F i 1 i 2 (i 1 = 1, . . . , a; i 2 = 1, . . . , b). Factorial designs of more complexity can be incorporated similarly ).
Having the model fixed, we now turn to the parameters of interest: Choosing m ∈ N different probabilities 0 < p 1 < . . . < p m < 1, we want to study inference methods for the corresponding quantiles Pooling them in q = (q 1 , . . . , q k ) = (q 11 , . . . , q 1m , q 21 , . . . , q km ) , we are particularly interested in testing the QANOVA null hypothesis H 0 : Hq = 0 r for a contrast matrix H ∈ R r ×km of interest. Here, H is called a contrast matrix if H1 km = 0 r , where 1 d and 0 d are vectors in R d consisting of 1's and 0's only. Choosing the con-trast matrices in line with the design and the question of interest allows us to test various hypotheses about main and interaction effects. Moreover, we want to point out that respective confidence regions for corresponding contrasts of quantiles can be obtained straightforwardly by inverting the test procedures. In what follows, we will therefore focus on hypothesis testing but provide some exemplary confidence intervals in the context of the illustrative data analyses in Sect. 6. Turning back to the null hypothesis, we recall from general ANOVA that it is convenient to re-formulate it as H 0 : Tq = 0 km for the unique projection matrix T = H (HH ) + H (Brunner et al. 1997;Pauly et al. 2015;Smaga 2017). Here, A + denotes the Moore-Penrose inverse of the matrix A. In fact, both matrices, H and T, describe the same null hypothesis, while T has preferable properties as being symmetric and idempotent. To infer H 0 , we propose sensitive test statistics in the vector of corresponding sample quantiles. To introduce them, let 1{X i j ≤ t} denote the group-specific and pooled empirical distribution function, respectively, where n = k i=1 n i . Let X (i) 1:n i ≤ . . . ≤ X (i) n i :n i be the order statistics of group i. Then, the natural estimator of the quantile q ir is Examples of specific hypotheses To give some examples of hypotheses covered within this framework, we first consider a one-way design. For m = 1, we obtain the k-sample null hypothesis Here, I k ∈ R k×k denotes the unit matrix, J k = 1 k 1 k and we suppressed the second index of the quantiles (m = 1). Choosing p 1 = 1/2 gives the null hypothesis of equal medians which reduces to the null hypothesis of equal means in case of symmetric error distributions. Setting k = ab, we consider a two-way design with factors A (having levels i 1 = 1, . . . , a) and B (with levels i 2 = 1, . . . , b) and suppose that we like to formulate main and interaction effects in terms of quantiles: Here, ⊗ is the Kronecker product andq i 1 · ,q ·i 2 andq ·· are the means over the dotted indices. The latter hypotheses can also be described more lucid by utilizing an additive effects notation. To this end, we decompose the quantile q i 1 i 2 = q μ + q α i 1 + q β i 2 + q αβ i 1 i 2 from group (i 1 , i 2 ) into a general effect q μ , main effects q α i 1 and q β i 2 as well as an interaction effect q αβ i 1 i 2 assuming the usual side conditions i 1 q α Then, the null hypotheses can be written as This methodology can be straightforwardly extended to higher-way layouts as described in supplementary material.
Beyond working with specific quantiles, it is also possible to infer hypotheses about linear combinations c q i = m r =1 c r q ir of quantiles. Here, c ∈ R k is an arbitrary vector, e.g., choosing c 1 = −c 2 = −1 for m = 2 and setting p 1 = 0.25 and p 2 = 0.75 lead to the group-specific interquartile ranges c q i = I Q R i . To obtain similar hypothesis in these parameters as above, the contrast matrix has to be specified to H = H ⊗ (c 1 , . . . , c r ), where H is one of the aforementioned contrast matrices. For example, H = P k together with the previous choices for c and p 1 , p 2 gives the null hypothesis {I Q R 1 = · · · = I Q R k } of equal IQRs among all k groups. However, the framework is much more flexible and even allows to infer hypotheses about IQRs and medians simultaneously by choosing p 1 = 0.5, p 2 = 0.25 and p 3 = 0.75 together with adequate contrast matrices.

Asymptotic results
To establish the joined asymptotic theory for the sample quantiles and their covariance matrix estimators, we assume non-vanishing groups throughout: Recall that the sample median will be asymptotically normal if the underlying density is positive and continuous in a neighborhood of the true median. This statement can be extended to the multivariate case (Serfling 2009), e.g., under the following assumption, which we consider throughout.
Assumption 1 Let F i be continuously differentiable at q ir with positive derivative f i (q ir ) > 0 for every r = 1, . . . , m and i = 1, . . . , k.
Proposition 1 (Theorem B in Sec. 2.3.3 of Serfling (2009) where Z i is a zero-mean, multivariate normal distributed random variable with nonsingular covariance matrix (i) given by its entries In general, the covariance matrix is unknown and, thus, needs to be estimated. Let us suppose, for a moment, that a consistent estimator (i) for (i) is available. Then, we could define a Wald-type statistic for testing H 0 : where ⊕ denotes the direct sum. By Proposition 1, the limiting covariance matrix (i) is positive definite, which implies that the Moore-Penrose inverse (Rao and Mitra 1971, Theorem 9.2.2). We summarize this as Thus, comparing S n (T) with the (1 − α)-quantile of the limiting null distribution defines an asymptotic exact level α test ϕ n = 1{S n (T) > χ 2 rank(T),1−α }. As Proposition 1 is not restricted to the null hypothesis, we can even deduce that n −1 S n (T) always converges in probability to (Tq) (T T) + Tq. Since Tq = 0 km implies (Tq) (T T ) + Tq > 0 (see Supplement for a verification) consistency follows.
It remains to find appropriate estimators (i) for the unknown covariance matrices. For that purpose, we examine different strategies: "Brute force" via plug-in of a kernel density estimator into (6) or using a different approach that first estimates the diagonal elements (i) aa and then employs their following relationship with the remaining matrix elements: In the latter case, we consider two ways for estimating the variances (i) aa : Via bootstrapping (Efron 1979) or with the interval estimator proposed in Price and Bonett (2001). In the following, we explain all three possibilities in detail.

Kernel estimator
A popular way to estimate densities is so-called kernel density estimators, which are based on a Lebesgue density K : R → [0, ∞) with K (x) dx = 1 and a bandwidth h n → 0. For more flexibility, we allow for different choices within the groups and add the corresponding group index, i.e., we work with K i and h ni . Then, the kernel density estimator for f i is given by Nadaraya (1965) proved strong uniform consistency of (9), i.e., we have Assumption 2 Let K i be of bounded variation and f i be uniformly continuous. Furthermore, suppose that ∞ n=1 exp(−γ nh 2 ni ) converges for any choice of γ .
Here, the convergence of the series ∞ n=1 exp(−γ nh ni ) is, e.g., implied by choosing h n,i = n −θ i for some θ ∈ (0, 1/2). We further note that Schuster (1969) discussed necessary and sufficient conditions for the stated uniform consistency. In particular, all f i need to be uniformly continuous. Moreover, the conditions on the bandwidths can be weakened when the kernel fulfills additional regularity conditions (Silverman 1978). Anyhow, combining Proposition 1 and the strong consistency of (9) yields consistency of the plug-in covariance matrix estimators.

Bootstrap estimator
In their one-way tests for equality of medians, Chung and Romano (2013) used the bootstrap approach of Efron (1979) to estimate the sample median's asymptotic variance. We adopt this idea for general quantiles. Therefore, for every group i, let X * i1 , . . . , X * in i denote a bootstrap sample (drawn with replacement) from the observations X i = (X i j ) j=1,...,n i . From this, we can calculate bootstrap versions of all previous estimators which we indicate by a superscript * . The mean squared error of the bootstrapped sample quantile q * ir given the data can be explicitly calculated using (3) Following Efron (1979), the probabilities P i j can be rewritten to where B n, p denotes a binomial distributed random variable with size parameter n and success probability p. In contrast to the standard jackknife method, the bootstrap median variance estimator ( σ * i (1/2)) 2 converges to 1/(4 f 2 i (F −1 i (1/2))) as desired (Efron 1979). Moreover, a detailed proof for strong consistency of this estimator was given by Ghosh et al. (1984) under

Interval-based estimator
McKean and Schrader (1984) introduced an estimator for the sample median standard deviation based on a standardized confidence interval. Later, Price and Bonett (2001) suggested to modify this estimator to improve its performance in small sample size settings. Both estimators are consistent (Price and Bonett 2001) and can compete with the aforementioned bootstrap approach in simulations (McKean and Schrader 1984; Price and Bonett 2001) with a slightly better performance of the Price-Bonnet modification. While both papers only treat the median, extensions to general quantiles follow intuitively and have already been used, e.g., for the 25%-and 75%-quantile in Bonett (2006). The (extended) McKean-Schrader estimator for the standard deviation of the pth sample quantile, p ∈ (0, 1), is given by are the lower and upper limits of binomial intervals. Here, z α/2 denotes the (1−α/2)-quantile of the standard normal distribution. Typically, α = 0.05 is chosen leading to z α/2 ≈ 1.96. A brief discussion on the effect of the choice α on the estimator can be found in Price and Bonett (2001). In fact, the Price-Bonnet modification concerns the choice of α: They propose to replace it in the denominator by the following finite sample correction (for ease of notation we suppressed the dependency on i) Clearly, α * n ( p) → α by the central limit theorem. For large sample sizes, the benefit of the correction is negligible and may even lead to computational problems due to n i j 1, especially for j ≈ n i /2. Thus, we only use the modifications for sample sizes smaller than 100 and recommend to set α * n ( p) = α for larger values (n i > 100). Moreover, the simulations of Price and Bonett (2001) reveal that additionally adding 2n −1/2 i to the denominator results in a slight reduction of bias and mean squared error. Altogether, we thus define their extended estimator for the respective standard deviation as As explained above, this estimator is consistent for the variance, leading to:

Lemma 3
We have for all i = 1, . . . , k and a, b = 1, . . . , m: Utilizing the different choices of covariance estimators results in three different versions of the asymptotic test ϕ n . However, simulation results (Sect. 5) exhibit serious issues for small to moderate sample sizes which may be due to a rather poor χ 2approximation to the test statistic. To tackle this problem, we propose the initially mentioned technique of permuting studentized statistics.

Permutation test
For a better finite sample performance, it is often advisable to replace the asymptotic critical value of the test, here the (1 − α)-quantile of the χ 2 rank(T) -distribution, by a resampling-based critical value. For the current problem, we promote the permutation approach, which leads to a finitely exact test under exchangeability, i.e., under H 0 : F 1 = . . . = F k . Moreover, the proper studentization within the Waldtype statistic makes it possible to transfer the consistency and asymptotic exactness (under H 0 : Tq = 0) of the tests ϕ n to their permutation versions. To explain this, let X π = (X π i j ) i=1,...,k; j=1,...,n i be a random permutation of the pooled data X = (X i j ) i=1,...,k; j=1,...,n i . As for Efron's bootstrap, we draw new samples from the pooled data, but now without replacement. In other words, we randomly permute the group memberships of the observations X i j . Pooling the data affects our Assumptions 1 and 2 such that we need to replace the original distribution functions F i and their densities f i by their pooled versions To be concrete, we postulate Assumption 4 Let F be differentiable with uniformly continuous derivative f such that f (F −1 ( p r )) > 0 for all r , and K i be a kernel fulling Assumption 2.
As in Chung and Romano (2013), it turned out that the asymptotic correctness of the permutation approach needs a certain convergence rate in (4): Theorem 3 Under H 0 : Tq = 0 km as well as under H 1 : Tq = 0 km , the permutation version S π n (T) of S n (T) with any of the covariance estimators (10)-(11) always mimics its null distribution asymptotically, i.e., Replacing the critical value χ 2 rank(T),1−α of the asymptotical tests with c π n (α), the (1 − α)-quantile of the conditional distribution function x → P(S π n (T) ≤ x|X), leads to three different permutation tests ϕ π n = 1{S n (T) > c π n (α)}. Under the assumptions given in Theorem 3, it follows that c π n (α) converges in probability to χ 2 rank(T),1−α irrespective whether the null hypothesis is true or not. Thus, we can deduce the asymptotic exactness of the permutation test and its consistency for general fixed alternatives (Janssen and Pauls 2003, Lemma 1 and Theorem 7). In addition, we prove in the next section that the permutation test has an asymptotic relative efficiency of 1 compared to the asymptotic test, i.e., the tests' asymptotic power values coincide for local alternatives.
Local alternatives To study local alternatives, we need to replace Model (1) with its local counterpart given by a triangular array of row-wise independent random variables X ni j ∼ F ni (i = 1, . . . , k; j = 1, . . . , n i ) with absolutely continuous distribution functions F ni , corresponding densities f ni , quantiles q nir and quantile vector q n = (q n11 , . . . , q n1m , q n21 , . . . , q nkm ) . Within this framework, we discuss local alternatives Tq n = O(n −1/2 ), i.e., small perturbations of the null hypotheses, under the following additional regularity conditions: Assumption 5 For every i = 1, . . . , k, let F i be an absolutely continuous distribution function with corresponding density f i . Moreover, set for all n ∈ N and all x ∈ R. (ii) Suppose that f i is continuous and positive at q ir = F −1 i ( p r ) and that f ni converges uniformly to f i in a compact neighborhood around q ir for all r . (iii) For the permutation approach, suppose additionally (12), Assumption 4 and uniform convergence of f ni to f i in a compact neighborhood around q r = F −1 ( p r ) for every r . (1), condition (i) ensures the usual √ n-convergence of F ni to F i . Anyhow, the tests' asymptotic power functions can be described by means of a non-central χ 2 distribution:

Simulations
To assess the tests' small sample performance, we complement our theoretical findings with numerical comparisons. For ease of presentation, we consider 1. A one-way layout in which we like to infer the null hypothesis H 0 : {I Q R 1 = · · · = I Q R 4 } of equal IQRs, i.e., as described at the end of Sect. 2 we choose probabilities p 1 = 0.25 and p 2 = 0.75 and specify the contrast matrix as H = P 4 ⊗ (−1, 1). 2. A 2 × 2 layout in which we test for the presence of main or interaction effects measured in terms of medians, i.e., setting k = a ·b = 2 ·2 we infer the hypotheses H 0 : {H A q = 0 ab } (no main median effect of factor A) and H 0 : In addition, we present detailed simulations for a five-factor model in supplement. Data were simulated within Model (1) where we consider (a) balanced and unbalanced settings given by sample size vectors n 1 = (15, 15, 15, 15) and n 2 = (10, 10, 20, 20), respectively. (b) five different distributions for i j : the standard normal distribution (N 0,1 ), Student's t-distribution with d f = 2, 3 degrees of freedom (t 2 , t 3 ), the Chi-square distribution with d f = 3 (χ 2 3 ) and the standard log-normal distribution (L N 0,1 ). All distributions were centered by subtracting the respective median m i from i j . (c) a homoscedastic setting σ 1 = (σ 1 , . . . , σ 4 ) = (1, 1, 1, 1) and heteroscedastic designs σ 2 = (1, 1.25, 1.5, 1.75) and σ 3 = (1.75, 1.5, 1.25, 1). Together with n 2 , the latter represent a positive and negative pairing, respectively.
The simulations were conducted by means of the computing environment R (R Core Team 2020), version 3.5.0, generating N sim = 5000 simulation runs and N perm = 1999 permutation iterations for each setting. The nominal level was set to α = 5%. We compare the type-1 error rate as well as the power values of our tests below. In both cases, we include all three variance estimation strategies introduced in Sects. 3.1-3.2. For the kernel density estimation, we choose the classical Gaussian kernel with a bandwidth according to Silverman's rule-of-thumb (Silverman 1986, Eq. (3.31)), where we applied the function bw.nrd0 from the R package stats to determine the latter.
In case of the 2 × 2-median design, these methods are additionally compared with the current state-of-the-art tests for regression parameters in quantile regression: From the R package quantreg (Koenker et al. 2019), we choose the rank inversion method by Koenker and Machado (1999) for non-iid errors, the default choice in quantreg, and the wild bootstrap approach of Feng et al. (2011). For a fair comparison, we include the main factors A and B and their interaction in the respective regression model. Hence, regression parameters β A , β B and β AB are estimated, and corresponding p-values for testing H 0 : β A = 0 (no main effect A) and H 0 : β AB = 0 (no interaction effect) are derived by both quantreg approaches.
Type-1 error In this subsection, we discuss the type-1 error control of all procedures. To simulate under the corresponding null hypotheses, we set μ i = μ i 1 i 2 = 0 in the 2 × 2-median-based cases and restrict to the homoscedastic setting σ = σ 1 for the 4-sample IQR testing question. The standard error of the estimated sizes in case of N = 5000 simulation runs is 0.3% if the true type-1 error probability is 5%, i.e., estimated sizes outside the interval [4.4%,5.6%] deviate significantly from the nominal 5% significance level. Table 1 Type-1 error rate in % (nominal level α = 5%) for testing the median null hypothesis of no main effect in the 2 × 2 design for the rank-based (Rank) and wild bootstrap (Wild) quantile regression approach as well as all asymptotic and permutation tests using the interval-based (Int), kernel density (Kern) and bootstrap (Boot) approach for estimating the covariance matrix The observed type-1 error rates for the 2×2-median design are displayed in Table 1 for testing the hypothesis of no main effect. It is readily seen that all asymptotic tests are rather conservative with type-1 errors reaching down to 1.7% for the bootstrapbased and 0.7% for the interval-based approaches, respectively. This conservativeness is less pronounced for the test based upon the kernel density variances estimator that exhibits values between 2.7% and 5.7% and a reasonable good error control in case of the standard normal and χ 2 3 distribution except for the settings with positive variance pairing. In contrast, all permutation methods control the type-1 error level reasonably well except for the situations with a skewed distribution and negative pairing. Here, we find error rates up to 7.2% for the tests based upon the interval-and kernel-based variance estimators. For the two quantile regression methods from the R package quantreg (Koenker et al. 2019), the observations are diverse: The rank-based approach tends to conservative test decisions in case of unbalanced sample sizes with observed error rates in the range 2.5−3.5%. However, in case a balanced homoscedastic design with symmetric errors, a slight liberality (6.4−6.9%) is detected. For all other settings, Table 2 Type-1 error rate in % (nominal level α = 5%) for the four-sample IQR testing problem of our asymptotic and permutation tests using the interval-based (Int), kernel density (Ker) and bootstrap (Boo) approach for estimating the covariance matrix the decisions are accurate. In comparison, the wild bootstrap strategy is liberal for almost all balanced settings (with observed error rates up to 7.7%) and conservative for all positive pairings (2.8−3.7%). Overall, the permutation procedure that uses a bootstrap variance estimator exhibits the most robust type-1 error control with values ranging from 4.7−6.4%. Summarizing the results for the interaction tests presented in supplement, we get a similar impression for the wild bootstrap quantile regression strategy and the six Waldtype procedures. For them, the only major difference is that the permutation methods also exhibit a fairly well error control for the settings with skewed distributions and negative pairing. However, the results for the rank-based quantile regression method are partially different: While the type-1 error rate is still accurate for balanced sample sizes, the decisions become very liberal in the unbalanced scenarios with estimated type-1 error rates between 6.1% and 10.1%.
The type-1 error rates in the situation of the four-sample testing problem of equal IQRs are presented in Table 2. Here, the finite sample behavior of the asymptotic tests becomes even more extreme: For the symmetric distributions, the type-1 error rates are between 0.4% and 1.3% for the interval-based estimator and between 0.3 and 1.2% for the bootstrap approach, i.e., very conservative. In contrast, the decisions for the kernelbased method are quite accurate with values between 3.7% and 5.0%. Switching to skewed distribution, however, the type-1 error rates increase, leading to very liberal decisions in the log-normal case with values up to 10.2% for the kernel-based and 7.5% for the interval-based tests. Here, only the bootstrap-based method remained very conservative. In comparison, all permutation counterparts lead to satisfactory type-1 error control close to the 5% level. Due to the extreme behavior of the asymptotic tests in this setting, we conducted additional simulation results in supplement. Therein, all asymptotic tests for equality of IQRs more or less approach the 5% level for larger group-specific sample sizes n i ≥ 150.
Power behavior Due to the diverse behavior of the asymptotic tests and the rank-based quantile regression method under the null hypotheses and for ease of presentation, we solely focus on permutation tests and the wild bootstrap quantile regression strategy here. The results for the asymptotic tests are presented in supple-  Fig. 1 Power curves for the 2 × 2-median testing problem (first two rows) and for the four-sample IQR testing problem (last row) of the permutation PBK test (dash-dotted), the wild bootstrap quantile regression test (dashed) and the three permutation tests based on interval-based (long-dashed), kernel density (dotted) and bootstrap (solid) covariance matrix estimation, respectively, for n = n 2 , σ = σ 1 and shift alternatives μ = (0, 0, 0, δ) (median) or scale alternatives σ = (1, 1, 1, 1 + δ) (IQR) ment, and apart from their different levels under H 0 , their power curves run almost parallel to the ones of the permutation version.
To achieve a scenario under the alternative in the 2 × 2-median test setting, we disturbed the respective null setup by adding a shift parameter δ = μ 2,2 to the last group. In addition to the aforementioned methods, we considered the permutation Wald-type test (PBK) of Pauly et al. (2015) which was developed for testing means in general factorial designs. Their procedure is implemented in the R package GFD (Friedrich et al. 2017b). For a fair comparison, we included their PBK test just for the cases where mean and median coincide, i.e., for the symmetric distributions. The results for the procedures inferring a main effect are presented in Fig. 1, while the corresponding power curves of the interaction tests are shown in supplement. Studying Fig. 1, we observe that the PBK test leads to higher power values compared to our tests for the normal distribution settings but is less powerful under the t 2 -and t 3distributions. An explanation may be given by the (asymptotic) efficiencies of the location estimators: While the sample mean is more efficient than the sample median under normal distributions, the situation is reversed for the two more heavy-tailed t-distributions. A comparison among the median-based permutation tests shows that the interval-based approach leads to lower power values than the other two methods for both t-distributions, while the bootstrap approach is slightly less powerful than the other two for the skewed log-normal distribution. Under normality, however, the tests' power functions are almost identical. In comparison, the wild bootstrap quantile regression method has considerably less power than all other methods for testing main median effects. The power curves for the interaction effects presented in supplement show a similar pattern for almost all tests. The only exception is the wild bootstrap approach which exhibits a similar power behavior as the permutation tests. Moreover, it is slightly advantageous for shift alternatives with δ > 1.
To obtain alternatives for the four-sample IQR testing problem, we consider scale alternatives σ = (1, 1, 1, 1 + δ). For ease of presentation, we only show the results for normal as well as lognormal distributions here. The resulting power curves are plotted in Fig. 1. We can observe that the kernel density approach leads to lower power values compared to the other two methods.
Recommendation Summarizing the findings, we recommend the use of the permutation methods over their asymptotic counterparts as they show a much better type-1 error control in case of small and moderate sample sizes (n i ≤ 200). However, there is no general recommendation for choosing between the three permutation versions as their power behavior (slightly) differed with respect to underlying settings, e.g., for comparing IQRs the interval-and bootstrap-based approaches performed better, while the kernel method exhibits the largest power for testing medians in a 2 × 2 design with heavy tails. In comparison with quantile regression, the advantage of the proposed factorial design approach is the simple incorporation of interaction effects without a loss in power. This benefit can be seen in the power simulations for the 2 × 2-median design, as shown in Fig. 1, and becomes even more pronounced in higher-way layouts, see Green et al. (2002) and Green (2012) and the additional simulation results in supplement.

Illustrative data analysis
A typical everyday situation in which we are confronted with quantiles is percentile curves for child heights and weights. We re-analyzed growth and weight data of children from five sites (Brazil, India, Guatemala, the Philippines, South Africa), which was provided to us by the COHORTS group (Richter et al. 2012). Both, height and weight, were converted to z-scores regarding the WHO child standards (de Onis et al. 2007). Having a comparison of percentile curves in mind, we test for effects in the 25%-, 50%-and 75%-quantile simultaneously. This also demonstrates the flexibility of the proposed methodology. For illustrative purposes, we focus on the following subgroups: Example 1 We compare the birth weight of firstborns from the countries (factor A) Brazil and South Africa including both genders (factor B). To avoid confounding  Fig. 2 Group-wise boxplots (outliers are not displayed) for the birth weight data from Example 1 (left) and the height data from Example 2 (right) Table 3 For the effect of the country on the birth weight (Example 1) and the maternal height on the height at 2 years, the p values (in %) are shown for our asymptotic and permutation approach using the interval-based (Int), kernel density (Ker) and bootstrap (Boo) strategies for covariance matrix estimation effects regarding age, education or marital status, we restrict our analysis to 30-yearold or younger married mothers with a comparable education level of 9 completed school years. The n = 173 children are divided into n 1 = 65 boys and n 2 = 46 girls from Brazil and n 3 = 36 boys and n 4 = 26 girls from South Africa. We would like to infer whether there are differences between the countries regarding the boys' and girls' birth weight, respectively.

Example 2
We investigate the effect of the mother's height (factor A) on the children's height at the age of 2 years. Both sexes (factor B) are included. We restrict to firstborns of unmarried mothers from the Philippines. For this analysis, we divide the women into the groups "small" and "tall" consisting of the women, respectively, being smaller and taller than the median height of 150cm. The group "small" consists of data for n 1 = 8 boys and n 2 = 13 girls, and in the group "tall," there are data for n 3 = 12 boys and n 4 = 11 girls.
To get a first graphical impression, the group-specific box plots are presented in Fig. 2. In both cases, it appears that factor A (country and maternal height, respectively) leads to a shift of all three empirical quantiles of the children's height and weight. To infer this conjecture, we like to check for a main effect of factor A regarding the three quantiles q i = (q i1 , q i2 , q i3 ) , i = 1, . . . , 4 corresponding to the probabilities ( p 1 , p 2 , p 3 ) = (0.25, 0.5, 0.75) simultaneously. That is, we test H 0 : {q 1 +q 2 = q 3 + q 4 }. The p values of all three asymptotic and permutation tests (ignoring multiplicity) are summarized in Table 3.
It is apparent that the asymptotic and permutation test leads to different decisions at the nominal level α = 5%. In fact, the seemingly present effect from Fig. 2 is Table 4 Point estimates θ for the difference θ i 2 = q 2i 2 − q 1i 2 of the countries' median with respect to sex for Example 1 together with permutation-based 95% confidence intervals Here, Int (interval-based), Ker (kernel density) and Boo (bootstrap) indicate the applied covariance matrix estimation technique not detected by any asymptotic tests with p values around 8-10%. In contrast, the p values of the permutation approaches are, except for the kernel density method in Example 2, less than 5%. To investigate the reasons why these decisions are different, we run additional simulations for the three-quantile testing problem under the sample size settings of Example 2. The results are presented in supplement and may explain the above decisions to some extent: As in Sect. 5, the asymptotic tests are quite conservative with type-1 error rates ranging between 0.8% and 4.2%. Moreover, the permutation kernel density approach is less powerful than the other two permutation methods under shift alternatives for skewed distributions. Beyond hypothesis testing, the theoretical results can also be used to formulate asymptotically valid confidence regions for contrasts of quantiles by inverting the corresponding tests. We exemplify this for the difference between two quantiles as effect parameter of interest. To this end, consider Example 1 and encode factor A (country) and factor B (gender) as follows: i 2 = 1 for the boys, i 2 = 2 for the girls, i 1 = 1 for Brazil and i 1 = 2 for South Africa. Then, for a fixed gender i 2 , the asymptotic correct z-and permutation-(1 − α)-confidence intervals for the difference θ i 2 = q 2i 2 − q 1i 2 of the countries' quantiles are ( q 2i 2 − q 1i 2 ) ± z α/2 √ n σ 2 1i 2 + σ 2 2i 2 , ( q 2i 2 − q 1i 2 ) ± c π ni 2 (α/2) √ n σ 2 1i 2 + σ 2 2i 2 , respectively. Here, σ 2 i 1 i 2 = (i 1 i 2 ) 11 is an estimator for the asymptotic variance of √ n( q i 1 i 2 − q i 1 i 2 ) using one of our strategies from Sects. 3.1-3.3 and c π ni 2 (α/2) is the (1−α/2)-quantile of the permutation distribution of √ n( q π 2i 2 − q π 1i 2 )( σ 2,π 1i 2 + σ 2,π i 2 ) −1/2 . To illustrate the application, we calculated the 95% permutation-based confidence intervals for the median difference separately for gender in Table 4. Ignoring multiplicity, we see that all three permutation procedures agree on a significant difference in the girl's median birth weight (at level α = 5%) but do not find a corresponding effect for the boys.

Discussion
While an abundance of methods exists for inferring means and mean vectors in general heterogeneous factorial designs (Johansen 1980;Brunner et al. 1997;Bathke et al. 2009;Zhang 2012;Konietschke et al. 2015;Pauly et al. 2015;Harrar et al. 2019), there are not so many methods for the analysis of medians or quantiles. To this end, we combined the idea of studentized permutations from heteroscedastic mean-based  and one-way median-based ANOVA (Chung and Romano 2013) to establish flexible methods for inferring quantiles in general factorial designs which we coin QANOVA. In fact, we proposed three permutation methods in Wald-type statistics that only differ in the way the covariance matrix is estimated. All of them are applicable to construct confidence regions and to test null hypotheses about arbitrary contrasts of different quantiles.
The resulting procedures are finitely exact under exchangeability of the data and shown to be asymptotically valid. In doing so, we had to extend some results about permutation empirical processes and uniform Hadamard differentiability (van der Vaart and Wellner 1996) that are of own mathematical interest. From them, we could deduce the asymptotic validity as well as results about the procedures' asymptotics under fixed and local alternatives. In the special case of the median, these results even reveal new insights into one-way permutation test of Chung and Romano (2013).
In addition to the theoretical findings, we analyzed the procedures in extensive simulations. Our results indicate an accurate type-1 error control for the permutation methods in almost all settings. Only in case of skewed distributions and small unbalanced samples with a heteroscedastic negative pairing, a slight liberality was found when testing for main effects. Beyond this, we can recommend all three permutation methods with clear conscience. Currently, we work on implementing them within an R-package. We are confident that the current results can be transferred to questions about related quantile-based estimands, e.g., coefficients of quartile variation (Bonett 2006) as well as to complex ANOVA settings with correlated variables, e.g., quantilebased repeated measurements or complex MANOVA designs.