Introduction

Rank-based nonparametric statistical tests are built on the idea of how often a randomly chosen observation from one distribution is smaller than a randomly chosen observation from another distribution. To measure such effects, the original observations are converted to ranks, which extract information about the empirical distribution functions of the different treatments/groups/samples. Unlike the popular parametric tests, which compare means, rank-based nonparametric tests require virtually no distributional assumptions on the data, making them particularly suitable for studies with non-normal distributions (e.g., reaction time data) and/or small sample sizes. However, despite their clear advantages, nonparametric methods remain largely underused in psychological studies (Field & Wilcox, 2017).

One possible reason for this unpopularity may be the misconception that converting the observed values into ranks leads to a loss of information; in fact, when comparing locations, a loss of efficiency occurs only when the data are exactly or nearly normally distributed. For instance, Lehmann (2009) studied the asymptotic relative efficiency (ARE) of the Mann–Whitney U test compared to the two-sample t test. Here, the ARE is the limit of the ratio of sample sizes required by the two tests being compared to achieve the same level and power. For normal distributions, the Mann–Whitney U test is about 95% as efficient as the t test. As the underlying distribution of the data departs from normality (e.g., becomes skewed, light-tailed, or heavy-tailed), the ARE of the Mann–Whitney U test relative to the t test often exceeds 100% and may increase without bound. That is, in the large-sample case, the Mann–Whitney U test is typically more powerful than the t test.

Another reason why nonparametric tests are less popular may be the difficulty of performing multiple comparisons. Traditionally, nonparametric multiple comparisons of independent samples have been performed in two steps. In the first step, the Kruskal–Wallis test is performed to evaluate the equality of distributions among the treatment groups. When a statistically significant difference is detected, Mann–Whitney U tests are used for post hoc comparisons. Interestingly, however, this two-step procedure can produce paradoxical results; that is, among three or more treatment groups, the pairwise differences may all be statistically significant, yet no group is stochastically dominant. In other words, there is no treatment group from which a random observation tends to be larger than a random observation from any of the other treatment groups. Mathematically, this phenomenon is a consequence of the widely known nontransitive paradoxes. In this paper, we review this situation using a set of modified dice as an example, with a more detailed explanation of stochastic differences. Then, we describe a method that eliminates the paradoxes by defining a reference distribution and comparing each sample to that distribution.

The lack of research on easily interpretable effect sizes for nonparametric multiple comparisons may be yet another reason why they are underused. In normal-based parametric tests, Cohen’s d, which divides the difference of two means by their pooled standard deviation, is often used as an effect size to gauge the practical significance of the results. Supplying effect sizes in addition to (adjusted) p-values is highly recommended: as Cohen (1994) famously argued, with large sample sizes, p-values can indicate statistical significance even when no difference of practical importance is present. Similarly, even when a statistically significant result is found, p-values give little information about how different the samples are. Thus, in this paper, we propose a new multiple comparison procedure that can accommodate various effect sizes to supplement p-values by generalizing the work of Konietschke et al. (2012), providing practical measures of the stochastic differences between samples. The idea resonates well with the statement released by the American Statistical Association (Wasserstein & Lazar, 2016), which strongly encourages practitioners to make decisions using various measures of significance. Furthermore, we suggest a log odds-type effect size similar to Cohen’s d for nonparametric multiple comparisons, allowing users to interpret the results easily.

Even though the importance of effect sizes has been emphasized above, p-values (or some measure of statistical significance) are also likely to remain prevalent. In psychological studies, there are many situations where one hypothesis contains several sub-hypotheses for different contrasts, requiring many tests to be performed. To ensure that research findings are replicable with high probability, the nonparametric multiple comparison procedure for these contrasts (which shall be referred to as a nonparametric multiple contrast testing procedure (MCTP)) proposed in this paper is designed to provide strong control of the family-wise error rate (FWER) asymptotically at some prespecified α ∈ (0,1). That is, for any configuration of true and false null hypotheses, the probability of making at least one type I error is at most α (Pesarin & Salmaso, 2010). An appropriate FWER control provides a safeguard against type I errors at the expense of failing to detect some true effects (Cramer et al., 2016). We give theoretical justifications of the asymptotic strong control of the FWER of the proposed nonparametric MCTP by utilizing the idea of simultaneous test procedures (STPs) proposed by Gabriel (1969).

The contributions made in this paper can be summarized as follows. First, we provide a concise review of key ideas and issues that arise in nonparametric multiple comparisons, including the nontransitive paradoxes and the reference distribution, by expanding the brief explanations given in Konietschke et al. (2012). Then, we propose a new nonparametric MCTP that provides strong control of the FWER and accommodates various nonparametric effect sizes and contrasts. In particular, we discuss the idea of relative effects, effect sizes for the relative effects in multiple comparisons, how to generalize the nonparametric MCTP of Konietschke et al. (2012) to accommodate various effect sizes, theoretical justifications of the strong FWER control, and small-sample approximations. Then, the newly proposed nonparametric MCTP is evaluated through a simulation study and a real-life application. Lastly, conclusions and future work are summarized, and technical details are provided in Appendices A–D. In addition, the proposed method is implemented in the R package ‘nparcomp’ via the function ‘mctp’.

Nontransitive paradoxes

Many nonparametric tests, including the Mann–Whitney U test, measure the so-called relative effect to compare different samples. As a result, nontransitive paradoxes can occur, making the results less interpretable. In this section, we review the relative effect and nontransitive paradoxes, and discuss a way to avoid the paradoxes.

To understand the paradox, let Xi be a random variable from the i-th sample. To measure the stochastic superiority of the i-th sample compared to the j-th sample in the two-sample case, the relative effect, which is defined as

$$ \begin{array}{@{}rcl@{}} p_{ij}=\int F_{i}dF_{j}=\Pr(X_{i}<X_{j})+0.5\Pr(X_{i}=X_{j}), \end{array} $$

is used (see Munzel & Hothorn, 2001; Reiczigel et al., 2005; Wolfsegger & Jaki, 2006; Ryu, 2009; Umlauft et al., 2017). Specifically, if pij > 0.5, the j-th sample is stochastically larger than the i-th sample. Similarly, if pij < 0.5, the j-th sample is stochastically smaller than the i-th sample. If pij = 0.5, the two samples are stochastically equal. In other words, the relative effect pij tells us how likely it is that a random observation from the j-th sample is larger than a random observation from the i-th sample. It is also straightforward to see that pji = 1 − pij. Note that these relative effects have been used as ways of measuring stochastic differences (see Cliff, 1993, and Vargha & Delaney, 2000, for more details).

In the classical parametric setting where means are compared (e.g., the t test), transitivity is preserved. That is, in the case of three samples, if their means μi, i = 1,2,3, are such that μ1 < μ2 and μ2 < μ3, then it must be the case that μ1 < μ3. Surprisingly, however, when relative effects are compared, p21 > 0.5 (the first sample is stochastically larger than the second sample) and p13 > 0.5 (the third sample is stochastically larger than the first sample) do not necessarily imply p23 > 0.5 (the third sample is stochastically larger than the second sample). This paradox, often referred to as the nontransitive paradox, can be better understood by way of an example.

Suppose that there are three fair dice, whose faces have been modified as follows:

  • Die 1 has faces 3,3,4,4,8,8;

  • Die 2 has faces 2,2,6,6,7,7;

  • Die 3 has faces 1,1,5,5,9,9.

Now, suppose that we are trying to find the best of these dice, that is, the one which rolls a higher value most often. A quick calculation shows that Die 1 rolls a higher value than Die 2 5/9 of the time. Similarly, Die 3 beats Die 1 5/9 of the time. Finally, Die 2 beats Die 3 5/9 of the time (see Appendix A). That is, p21 = p13 = p32 = 5/9, which implies that p21 > 0.5 and p13 > 0.5, and yet p23 = 4/9 < 0.5. This rock-paper-scissors-like effect causes problems when deciding which die is the best (in the sense of finding the stochastically largest die). Unless it is apparent which die must be rolled against, there is no way of choosing the best die.
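
These pairwise probabilities are easy to verify by enumerating all face pairings of two fair dice. The following is a minimal sketch in base R; the helper rel_effect() is ours, written for this illustration only.

```r
# Estimate p_ij = Pr(X_i < X_j) + 0.5 Pr(X_i = X_j) by enumerating
# all face pairings of two fair dice.
rel_effect <- function(x, y) {
  mean(outer(x, y, "<")) + 0.5 * mean(outer(x, y, "=="))
}
die1 <- c(3, 3, 4, 4, 8, 8)
die2 <- c(2, 2, 6, 6, 7, 7)
die3 <- c(1, 1, 5, 5, 9, 9)
rel_effect(die2, die1)  # p21 = 5/9: Die 1 tends to beat Die 2
rel_effect(die1, die3)  # p13 = 5/9: Die 3 tends to beat Die 1
rel_effect(die3, die2)  # p32 = 5/9: Die 2 tends to beat Die 3
rel_effect(die2, die3)  # p23 = 4/9: the nontransitive cycle closes
```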

Obviously, the above situation is undesirable when performing pairwise comparisons of multiple samples using relative effects. Specifically, the above example implies that nonparametric tests utilizing pairwise relative effects, such as the Mann–Whitney U test, should not be used for (post hoc) pairwise comparisons. To understand the problem better, view the faces of the three dice as observations from three samples; their estimated relative effects (\(\hat {p}_{ij}\)) are then given by \(\hat {p}_{21} = \hat {p}_{13} = \hat {p}_{32} = 5/9\). Now, suppose that statistically significant stochastic superiority is declared when \(\hat {p}_{ij} > 0.55\). Then, because 5/9 > 0.55, each pairwise comparison tells us that the latter die is stochastically significantly larger, yet the three comparisons together yield contradictory statements because of the paradox.

However, we can solve the problem by defining relative effects using a reference distribution. To understand how the reference distribution works, suppose that we have a second set of the same three dice in a black box. Now, let us draw a die at random from the black box, roll it, and denote the outcome by Y. In other words, Y can be thought of as a random variable representing the face of an 18-faced fair die containing all the faces of the three dice. We call this new die the reference die.

Let us define the relative effect of each die by pi = Pr(Y < Xi) + 0.5Pr(Y = Xi), i = 1,2,3, where the comparisons are made to the common reference die. It can be shown that p1 = p2 = p3 = 1/2, from which we conclude that all three dice are stochastically equal to the reference die (see Appendix A). In this situation, the nontransitive paradox cannot occur because all three dice are compared to the same reference die, so we can decide which die is “larger” by comparing the values of pi. The distribution of the reference die is called the reference distribution, which will be defined more rigorously in the next section.
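
Continuing the dice sketch above, pooling all 18 faces yields the reference die, and each pi can be computed against it:

```r
die1 <- c(3, 3, 4, 4, 8, 8); die2 <- c(2, 2, 6, 6, 7, 7); die3 <- c(1, 1, 5, 5, 9, 9)
Y <- c(die1, die2, die3)  # the 18-faced reference die
sapply(list(die1, die2, die3), function(d)
  mean(outer(Y, d, "<")) + 0.5 * mean(outer(Y, d, "==")))
# 0.5 0.5 0.5 -- every die is stochastically equal to the reference die
```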

The reference distribution

We define the reference distribution by closely following the notation used in Konietschke et al. (2012). Let Xik denote the k-th random variable in the i-th independent sample, which has ni observations, i = 1,…,a, k = 1,…,ni, and let \(N={\sum }^{a}_{i=1}n_{i}\) denote the total number of observations. Moreover, let Fi(x) = Pr(Xik < x) + 0.5Pr(Xik = x), −∞ < x < ∞, be the normalized distribution function for the i-th sample. In general, we only require that

$$X_{ik} \sim F_{i}, k=1,\ldots,n_{i},$$

where the Fi are non-degenerate distribution functions. Specifically, we do not require any relationship between the distributions; some could be exponentially distributed while others may be normally or even binomially distributed. Note that this allows us to consider heteroscedastic samples, discrete data, and/or samples without finite means or variances (e.g., the Cauchy distribution). We denote the vector of all distribution functions by F = (F1,…,Fa).

These Fi on their own cannot easily describe differences among the distributions. To describe differences, let \(G(x)=\frac {1}{a}{\sum }_{i=1}^{a} F_{i}(x)\) be the unweighted mean distribution function. Viewing G as a distribution function in its own right, we call the corresponding composite distribution the reference distribution and use it to define the treatment effects

$$p_{i} = \int G\,dF_{i} = \Pr(Y < X_{ik})+0.5\Pr(Y = X_{ik}), \quad i = 1,\ldots,a,$$

where \(X_{ik} \sim F_{i}\) and \(Y \sim G\). If pi < pj, the values from Fi tend to be smaller than those from Fj. On the other hand, if pi = pj, neither distribution tends to be smaller or larger (Noguchi et al., 2012).

As we saw in the previous section, the reference distribution has many advantages. Most importantly, because every treatment effect pi refers to the same fixed reference distribution, there is no risk of paradoxical conclusions of the kind described in the example above. Furthermore, although the weighted mean distribution with distribution function \(\tilde {G}(x)=\frac {1}{N}{\sum }_{i=1}^{a} n_{i}F_{i}(x)\) has been used in the past, the unweighted mean distribution is recommended because it is independent of the sample sizes and their allocation. Thus, the effects pi can be used in the formulation of null hypotheses because they are model constants (Brunner et al., 2018; Gao et al., 2008; Konietschke et al., 2012).
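
For data, the pi can be estimated by plugging in the normalized empirical distribution functions. The following is a minimal sketch of such an estimator (the mctp function discussed later computes these quantities internally; the helper names here are ours):

```r
# Fhat_i(x) = (1/n_i) sum_k [ I(X_ik < x) + 0.5 I(X_ik = x) ]  (normalized ECDF)
Fhat <- function(s) function(x) sapply(x, function(v) mean(s < v) + 0.5 * mean(s == v))

# phat_i = integral of Ghat dFhat_i = average of Ghat over the i-th sample,
# where Ghat is the unweighted mean of the a normalized ECDFs
phat <- function(samples) {
  Flist <- lapply(samples, Fhat)
  Ghat  <- function(x) rowMeans(sapply(Flist, function(f) f(x)))
  sapply(samples, function(s) mean(Ghat(s)))
}

set.seed(1)
phat(list(rnorm(10), rnorm(12, mean = 1), rexp(15)))  # three heterogeneous samples
```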

Contrast vectors and null hypotheses

Multiple comparisons are made by specifying q contrasts of interest. A contrast is an a-dimensional vector representing the coefficients of the parameters to be used for making comparisons. In general, the contrast vector for the ℓ-th comparison can be written as cℓ = (cℓ1,…,cℓa), a non-zero vector such that \({\sum }_{i=1}^{a} c_{\ell i} = 0\). Without loss of generality, we add one more constraint, \({\sum }_{i=1}^{a} |c_{\ell i}| = 2\), and describe its advantage in the next section.

To specify the parameters to be used for making comparisons, let

$$ \begin{array}{@{}rcl@{}} \textbf{\textit{p}} \!:=\! (p_{1},\ldots,p_{a})^{\prime} = \left( \int GdF_{1},\ldots,\int GdF_{a}\right)^{\prime} = \int Gd{F} \end{array} $$

be the vector of the a relative effects. The vector p is then used to formulate the family of q null hypotheses:

$$\boldsymbol{\Omega}=\{H_{0}^{\ell}\colon\textbf{\textit{c}}^{\prime}_{\ell}\textbf{\textit{p}}=0, \ell=1, \ldots, q \},$$

tested against their respective two-tailed alternatives.

In general, the family of hypotheses can be specified with any set of contrast vectors although, in practice, the choice of which contrasts to use is tremendously important. For example, a standard method of comparing multiple samples is that of all-pairwise comparisons, attributed to Tukey (Gabriel, 1969). This method includes all the null hypotheses of the form pi − pj = 0 for all i < j. In our notation, this is tested using the contrast vectors with cℓi = 1, cℓj = −1, and cℓu = 0 for u∉{i,j}. For example, if we let i = 1, j = 2, and a = 4 for the ℓ-th comparison, we have cℓ = (1,−1,0,0). Another method, attributed to Dunnett, compares every treatment to a single, fixed treatment, usually the control group. Assuming that the fixed treatment is the first sample, this type of contrast contains all the null hypotheses of the form p1 − pj = 0 for all j > 1. That is, the corresponding contrast vector for the ℓ-th comparison has elements cℓ1 = 1, cℓj = −1 for j = ℓ + 1, and 0 otherwise.
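
For concreteness, with a = 4 the two standard families of contrasts can be written as the rows of the following matrices (a sketch in base R; the multcomp package offers contrMat() for generating such matrices automatically):

```r
# Tukey: all pairwise comparisons p_i - p_j, i < j
tukey <- rbind(c(1, -1,  0,  0), c(1,  0, -1,  0), c(1,  0,  0, -1),
               c(0,  1, -1,  0), c(0,  1,  0, -1), c(0,  0,  1, -1))
# Dunnett: every treatment against the first (control) group, p_1 - p_j
dunnett <- rbind(c(1, -1, 0, 0), c(1, 0, -1, 0), c(1, 0, 0, -1))

rowSums(tukey)       # every contrast sums to 0
rowSums(abs(tukey))  # and satisfies sum_i |c_li| = 2
```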

Careful attention should be paid to which contrasts are chosen. Tukey’s all-pairwise comparisons, while certainly thorough, can greatly decrease the power of a test as they may include comparisons not directly of research interest. On the other hand, Dunnett’s many-to-one comparisons can result in a far more powerful test; however, they may not answer every hypothesis of interest. Also, there are many other ways of defining contrasts depending on specific research questions. We therefore favor flexible methods that allow for the use of arbitrary contrast vectors. An application section at the end of this paper includes an example of a nontrivial contrast.

It should be noted that the null hypotheses considered here remain valid in the case of heteroscedasticity. This can be seen by considering normally distributed random variables \(X_{ik} \sim N(\mu _{i},{\sigma _{i}^{2}}), i=1,\ldots ,a; k=1,\ldots ,n_{i}\). Here, the relative effects can be computed from the parameters μi and \({\sigma _{i}^{2}}\) and the cumulative distribution function Φ(⋅) of N(0,1) by

$$ \begin{array}{@{}rcl@{}} p_{j} = \frac{1}{a} \sum\limits_{i=1}^{a} \int \!\!F_{i}dF_{j} = \frac{1}{a} \sum\limits_{i=1}^{a} {\Phi}\!\left( \frac{\mu_{j} - \mu_{i}}{\sqrt{{\sigma_{i}^{2}} + {\sigma_{j}^{2}}}} \right)\!, j = 1,\ldots,a. \end{array} $$

Thus, when μ1 = ⋯ = μa, we have pj = 0.5 and pi = pj even under heteroscedasticity, i.e., \({\sigma _{i}^{2}} \not = {\sigma _{j}^{2}}\). Therefore, testing the null hypotheses H0: pij = 0.5 or H0: p1 = ⋯ = pa is also known as the nonparametric Behrens–Fisher problem (Brunner et al., 2018; Konietschke et al., 2012; Brunner et al., 2002). In general nonparametric models, pi = pj implies neither that the variances nor that the shapes of the underlying distributions are identical. Statistical methods that do not rely on the assumption of equal variances are especially important when the distribution of a statistic under the alternative hypothesis matters, e.g., for the computation of confidence intervals. For a general discussion of heteroscedastic methods and their importance, we refer to the comprehensive textbook by Wilcox (2017).
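
A quick numerical check of the display above (a sketch with arbitrary parameter values of our choosing) confirms that equal means yield pj = 0.5 for all j, no matter how unequal the variances are:

```r
mu    <- c(10, 10, 10, 10)   # equal means
sigma <- c(1, 2, 4, 8)       # strongly heteroscedastic
a     <- length(mu)
sapply(seq_len(a), function(j)
  mean(pnorm((mu[j] - mu) / sqrt(sigma^2 + sigma[j]^2))))
# 0.5 0.5 0.5 0.5
```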

Finally, we note that the general definition of a “treatment effect” depends on the actual research question. Again, the effects of interest considered in this paper are formulated in such a way that different variances (or even higher moments) across the groups are not considered a treatment effect. If “no treatment effect” is instead defined to mean that the treatments have no effect on the data whatsoever, exchangeability of the data may be a more appropriate basis for defining a treatment effect (Pesarin, 2001; Calian et al., 2008; Westfall & Troendle, 2008).

Comparing relative effects

When comparing two samples, it is perhaps most intuitive to consider the difference between their relative effects. That is, the i-th sample is compared to the j-th sample by considering pi − pj. However, this simple effect size may be difficult to interpret because its magnitude is not directly comparable to that of the most popular effect size, Cohen’s d, which is typically given by \(d_{ij} = (\bar {x}_{i} - \bar {x}_{j})/s_{p}\), where sp is the pooled standard deviation.

On the other hand, by letting g(x) = k log[x/(1 − x)] for some k > 0, we obtain

$$ \begin{array}{@{}rcl@{}} g_{\log}(p_{i},p_{j}) = g(p_{i}) - g(p_{j}) = k \log \left[\frac{p_{i}/(1-p_{i})}{p_{j}/(1-p_{j})}\right], \end{array} $$

a constant multiple of the log odds (or log odds ratio). As for the choice of k that makes the distribution of glog closest to that of the standard normal, Haley (1952) suggested k = 1/1.702, which is optimal in the minimax sense (Camilli, 1994). Thus, we adopt Haley’s suggestion in this paper.
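
As a direct transcription of the two formulas above, the two-sample log odds-type effect size can be computed as follows (a minimal sketch; the function names are ours):

```r
g_log <- function(x) (1 / 1.702) * log(x / (1 - x))      # g with Haley's k
effect_log <- function(p_i, p_j) g_log(p_i) - g_log(p_j)
effect_log(0.75, 0.25)  # ~1.29, read on a Cohen's-d-like scale
```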

The log odds-type effect size is appealing because it resembles Cohen’s d. Hasselblad and Hedges (1995) and Chinn (2000) have noted (with a slightly different choice of k) that the distributions of dij and glog(pi,pj) are comparable under the assumption of normality with homogeneous variances. Therefore, the magnitude of glog(pi,pj) may be interpreted by referring to how it would be interpreted in terms of Cohen’s d. In fact, an extensive simulation study by Sánchez-Meca et al. (2003) indicates that the proposed effect size, which is close to the one suggested by Cox (1970), performs well in various situations.

Even though the discussion so far has concerned measuring the difference in the two-sample case, a generalization is required for the multi-sample case. For example, when comparing four samples, one may be interested in comparing the average of the first two samples to that of the last two. That is, the corresponding null hypothesis, assuming additive effects, is given by \(H_{0}^{\ell }\colon (p_{1} + p_{2})/2 - (p_{3} + p_{4})/2 = 0\). To accommodate these nontrivial cases, we need a useful way of expressing the effect size in the form of a comparison of two effects, as illustrated in Tukey (1991).

To achieve this generalization, we first separate each of the q contrast vectors cℓ, ℓ = 1,…,q, into their positive and negative parts. Specifically, let cℓ,1 be the vector whose i-th element is given by cℓ,1,i = max{cℓi,0}. Similarly, let cℓ,2 be the vector whose i-th element is given by cℓ,2,i = max{−cℓi,0}. This implies that cℓ = cℓ,1 − cℓ,2. Also, \({\sum }_{i=1}^{a} c_{\ell i} = 0\) and \({\sum }_{i=1}^{a} |c_{\ell i}| = 2\) imply that \({\sum }_{i=1}^{a} c_{\ell , 1, i} = {\sum }_{i=1}^{a} c_{\ell , 2, i} = 1\). For example, for the average comparison above with the contrast vector cℓ = (1/2,1/2,−1/2,−1/2), we have cℓ,1 = (1/2,1/2,0,0) and cℓ,2 = (0,0,1/2,1/2).

Let us recall the null hypothesis \(H_{0}^{\ell }\colon \textit {\textbf {c}}^{\prime }_{\ell }\textit {\textbf {p}}=0\). Using the notation above, it can be rewritten as \(H_{0}^{\ell }\colon \textit {\textbf {c}}^{\prime }_{\ell , 1}\textit {\textbf {p}} - \textit {\textbf {c}}^{\prime }_{\ell , 2}\textit {\textbf {p}}=0\). Moreover, because g is assumed to be strictly increasing, it is also mathematically equivalent to \(H_{0}^{\ell }\colon g(\textit {\textbf {c}}^{\prime }_{\ell , 1}\textit {\textbf {p}}) - g(\textit {\textbf {c}}^{\prime }_{\ell , 2}\textit {\textbf {p}})=0\), although the latter representation is clearly preferred as it explicitly specifies the effect g to be considered. Here, we have obtained a generalization of the effect size to the multi-sample case given by \(g_{\ell }(\textit {\textbf {p}}) = g(\textit {\textbf {c}}^{\prime }_{\ell , 1}\textit {\textbf {p}}) - g(\textit {\textbf {c}}^{\prime }_{\ell , 2}\textit {\textbf {p}})\). As a consequence, the family of hypotheses we are considering can be more appropriately written as

$$\boldsymbol{\Omega}^{g}=\{H_{0}^{\ell}\colon g_{\ell}(\textbf{\textit{p}})=0, \ell=1, \ldots, q \}.$$

At the same time, it becomes clear why setting the constraint \({\sum }_{i=1}^{a} |c_{\ell i}| = 2\) is effective. Because that constraint implies that \({\sum }_{i=1}^{a} c_{\ell , 1, i} = {\sum }_{i=1}^{a} c_{\ell , 2, i} = 1\), both \(\textit {\textbf {c}}^{\prime }_{\ell , 1}\textit {\textbf {p}}\) and \(\textit {\textbf {c}}^{\prime }_{\ell , 2}\textit {\textbf {p}}\) can be interpreted as weighted averages of p1,…,pa. That also ensures \(\textit {\textbf {c}}^{\prime }_{\ell , 1}\textit {\textbf {p}} \in (0,1)\) and \(\textit {\textbf {c}}^{\prime }_{\ell , 2}\textit {\textbf {p}} \in (0,1)\), implying that the generalization works for any strictly increasing and continuously differentiable g whose domain is (0,1).

As an example, let us apply the transformation glog(x) = k log[x/(1 − x)] to the generalized effect size. Then, we obtain

$$ \begin{array}{@{}rcl@{}} g_{{\log},\ell}(\textbf{\textit{p}}) = g_{\log}(\textbf{\textit{c}}^{\prime}_{\ell, 1}\textbf{\textit{p}}) - g_{\log}(\textbf{\textit{c}}^{\prime}_{\ell, 2}\textbf{\textit{p}}) = k \log \left[\frac{\textbf{\textit{c}}^{\prime}_{\ell, 1}\textbf{\textit{p}}/(1-\textbf{\textit{c}}^{\prime}_{\ell, 1}\textbf{\textit{p}})} {\textbf{\textit{c}}^{\prime}_{\ell, 2}\textbf{\textit{p}}/(1-\textbf{\textit{c}}^{\prime}_{\ell, 2}\textbf{\textit{p}})}\right]. \end{array} $$
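
Putting the pieces together, the generalized log odds-type effect size for a single contrast vector can be sketched as follows (the positive/negative split uses pmax(); the function names and the numeric values of p are ours, for illustration only):

```r
g_log <- function(x) (1 / 1.702) * log(x / (1 - x))
effect_contrast <- function(cl, p) {
  cl1 <- pmax(cl, 0)    # positive part, sums to 1
  cl2 <- pmax(-cl, 0)   # negative part, sums to 1
  g_log(sum(cl1 * p)) - g_log(sum(cl2 * p))
}
# average comparison of the first two samples to the last two:
effect_contrast(c(1/2, 1/2, -1/2, -1/2), p = c(0.55, 0.60, 0.45, 0.40))
```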

In real-life situations, because pi are unknown, they are replaced by their estimators \(\hat {p}_{i}\) (see Konietschke et al., 2012 for details). Let the vector of relative effect estimators be denoted by \(\hat {\textit {\textbf {p}}} = (\hat {p}_{1},\ldots ,\hat {p}_{a})'\). Then, the generalized effect size estimator is given by \(g_{\ell }(\hat {\textit {\textbf {p}}})\).

We note that the effects \(p_{i} = \int GdF_{i}\) involve all of the distributions. Thus, the contrast g(pi) − g(pj) involves not only the distributions Fi and Fj but also all other distributions in the experiment. Therefore, it should always be interpreted as a relative measure within the overall experiment. When the comparison of specific distributions is strictly of interest, the pairwise defined effects \(p_{ij} = \int F_{i}dF_{j}\) may be a better choice. These effects, however, may lead to nontransitive conclusions, as described above.

Test statistics

Ultimately, we are interested in finding a testing procedure that addresses each of the q individual null hypotheses \(H^{\ell }_{0}\colon g_{\ell }(\textit {\textbf {p}})=0\), ℓ = 1,…,q, while properly controlling a prespecified error rate. This type of testing procedure is called a multiple contrast testing procedure (MCTP). In this paper, we consider controlling the most common error rate, the FWER, defined as the probability of rejecting at least one true null hypothesis.

Even though the Bonferroni adjustment is the most common method for controlling the FWER, it is known to be highly conservative, leading to possibly many false non-rejections (Bender & Lange, 1999). Therefore, we first construct q t-type test statistics that are approximately jointly multivariate t-distributed, from which we derive a much less conservative nonparametric MCTP that takes the correlations among the test statistics into account.

The construction of the t-type test statistics is done by an appropriate standardization of the generalized effect size estimators \(g_{\ell }(\hat {\textit {\textbf {p}}})\), ℓ = 1,…,q. Let us define a vector of generalized effect size estimators by \(\textit {\textbf {g}}(\hat {\textit {\textbf {p}}}) = (g_{1}(\hat {\textit {\textbf {p}}}),\ldots ,g_{q}(\hat {\textit {\textbf {p}}}))'\). Then, its standardization can be derived by applying the multivariate delta method to the statistic \(\sqrt {N}(\hat {\textit {\textbf {p}}}-\textit {\textbf {p}})\), which is asymptotically multivariate normal under some mild regularity conditions. In particular, it can be shown that the statistic \(\sqrt {N}[\textit {\textbf {g}}(\hat {\textit {\textbf {p}}})-\textit {\textbf {g}}(\textit {\textbf {p}})]\) is asymptotically multivariate normal with expectation 0 and some covariance matrix denoted by \(\textit {\textbf {V}}^{g}_{N}\) (see Appendix B for details). In other words, the large-sample distribution of \(\textit {\textbf {g}}(\hat {\textit {\textbf {p}}})\) is approximately multivariate normal with expectation g(p) and covariance matrix \(\textit {\textbf {V}}^{g}_{N}/N\). Hence, by considering its marginals, the large-sample distribution of \(g_{\ell }(\hat {\textit {\textbf {p}}})\) is approximately normal with expectation gℓ(p) and variance \(v^{g}_{\ell \ell }/N\), where \(v^{g}_{\ell \ell } = (\textit {\textbf {V}}^{g}_{N})_{\ell \ell }\). By standardization, the asymptotic distribution of \(\sqrt {N}[g_{\ell }(\hat {\textit {\textbf {p}}}) - g_{\ell }(\textit {\textbf {p}})]/\sqrt {v^{g}_{\ell \ell }}\) is standard normal.

The argument above shows that an appropriate t-type test statistic for \(H_{0}^{\ell }\) is given by

$$ \begin{array}{@{}rcl@{}} T^{g}_{\ell} &= \frac{\sqrt{N}[g_{\ell}(\hat{\textbf{\textit{p}}})-g_{\ell}(\textbf{\textit{p}})]}{\sqrt{\hat{v}^{g}_{\ell \ell}}}, \end{array} $$

where we replaced the unknown \(v^{g}_{\ell \ell }\) with its sample estimator \(\hat {v}^{g}_{\ell \ell }\) in the denominator. Under \(H_{0}^{\ell }\), noting that gℓ(p) = 0 and \(g_{\ell }(\hat {\textit {\textbf {p}}}) = g(\textit {\textbf {c}}^{\prime }_{\ell , 1}\hat {\textit {\textbf {p}}}) - g(\textit {\textbf {c}}^{\prime }_{\ell , 2}\hat {\textit {\textbf {p}}})\),

$$ \begin{array}{@{}rcl@{}} T^{g}_{\ell} &= \frac{\sqrt{N}[g(\textbf{\textit{c}}^{\prime}_{\ell, 1}\hat{\textbf{\textit{p}}}) - g(\textbf{\textit{c}}^{\prime}_{\ell, 2}\hat{\textbf{\textit{p}}})]}{\sqrt{\hat{v}^{g}_{\ell \ell}}}. \end{array} $$

To obtain the critical values and adjusted p-values, it is necessary to understand the joint distribution of \(\textit {\textbf {T}}^{g} = ({T^{g}_{1}}, {\ldots } , {T^{g}_{q}})'\) under the global null hypothesis \(H_{0} \colon \bigcap _{\ell =1}^{q} \{g_{\ell }(\textit {\textbf {p}})=0\}\). In the first step, we consider the asymptotic joint distribution of Tg. By Slutsky’s theorem, Tg asymptotically follows a multivariate normal distribution with expectation 0 and correlation matrix Rg, where \((\textit {\textbf {R}}^{g})_{\ell m}= v^{g}_{\ell m}/\sqrt {v^{g}_{\ell \ell }v^{g}_{m m}}\). That is, the critical values and adjusted p-values can be obtained by referring to this multivariate normal distribution. In practice, however, because Rg is unknown, it is replaced by its estimator \(\hat {\textit {\textbf {R}}}^{g}\), where \((\hat {\textit {\textbf {R}}}^{g})_{\ell m} = \hat {v}^{g}_{\ell m}/\sqrt {\hat {v}^{g}_{\ell \ell }\hat {v}^{g}_{m m}}\).

In reality, the asymptotic results are relevant only when large samples are available. Therefore, the results from the previous paragraph are mainly of theoretical interest. At the same time, because small sample sizes frequently occur in psychological studies (Szucs & Ioannidis, 2017), it is highly desirable to have an accurate small-sample approximation of the joint distribution of the test statistics Tg, which will be explored in the next section.

Small-sample approximation, adjusted p-values, and simultaneous confidence intervals

An accurate small-sample approximation of the joint distribution of the test statistics Tg is essential to obtain reliable statistical results. Even though the asymptotic distribution of Tg under H0 is multivariate normal, it is known that the multivariate normal approximation tends to produce liberal results, leading to possibly inflated false rejections. Also, psychological and behavioral data are often heteroscedastic, as emphasized in Wilcox (2017). Moreover, it is well known that the rank-transformed observations are in general heteroscedastic even if the original observations are homoscedastic (Brunner et al., 1997). Thus, we present a better approximation which is robust to heteroscedasticity using the multivariate t-distribution with appropriately modified degrees of freedom. Using the multivariate t-based approximation, we discuss how to obtain a critical value corresponding to a given FWER α, adjusted p-values, and 100(1 − α)% simultaneous confidence intervals (SCIs).

Konietschke et al. (2012) suggested a Box-type approximation (see Box, 1954; Brunner et al., 1997; Gao et al., 2008) for accurate small-sample results. Specifically, following their notation, let \(\hat {\omega }^{2}_{\ell i}\) denote the empirical variances of the variables \(A_{\ell i k} = c_{\ell i}(\widehat {G}(X_{ik})-\tfrac 1a \widehat {F}_{i}(X_{ik})) - {\sum }_{s\not =i} c_{\ell s}\tfrac 1a \widehat {F}_{s}(X_{ik})\), where \(\widehat {G}\) and \(\widehat {F}_{i}\) denote the empirical estimators of G and Fi, respectively (for more details, we refer to p. 750 of Konietschke et al., 2012). Then, an approximate small-sample distribution of Tg with g(x) = x under H0 is given by the q-dimensional t-distribution with expectation 0, correlation matrix \(\hat {\textit {\textbf {R}}}^{g}\), and degrees of freedom ν = max{1,min{ν1,…,νq}}, where

$$\nu_{\ell} = \frac{\left( {\sum}_{i=1}^{a} \hat{\omega}^{2}_{\ell i}/n_{i}\right)^{2}} {{\sum}_{i=1}^{a} \hat{\omega}^{4}_{\ell i}/[{n_{i}^{2}}(n_{i} - 1)]}.$$

For convenience, we denote this distribution by \(\textit {\textbf {t}}(\nu , \textbf {0}, \hat {\textit {\textbf {R}}}^{g})\).

In our case, a slight modification is necessary to accommodate the cases where g(x)≠x. To do so, following the idea of Noguchi and Marmolejo-Ramos (2016), we suggest replacing ν with \(\nu ^{g} = \max \{1, \min \{{\nu ^{g}_{1}},\ldots ,{\nu ^{g}_{q}}\}\}\), where

$$\nu^{g}_{\ell} = \frac{\left( {\sum}_{i=1}^{a} [{\sum}_{t=1}^{2} \{g^{\prime}(\textbf{\textit{c}}^{\prime}_{\ell, t}\hat{\textbf{\textit{p}}})\}^{2} I(c_{\ell, t, i} \!>\! 0)]\hat{\omega}^{2}_{\ell i}/n_{i}\right)^{2}} {{\sum}_{i=1}^{a} [{\sum}_{t=1}^{2} \{g^{\prime}(\textbf{\textit{c}}^{\prime}_{\ell, t}\hat{\textbf{\textit{p}}})\}^{4} I(c_{\ell, t, i} \!>\! 0)]\hat{\omega}^{4}_{\ell i}/[{n_{i}^{2}}(n_{i} - 1)]}.$$

Here, I(cℓ,t,i > 0) = 1 if cℓ,t,i > 0 and 0 otherwise. As a remark, when g(x) = x, \(\nu ^{g}_{\ell } = \nu _{\ell }\) because g′(x) = 1.
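
The degrees-of-freedom formula translates directly into code. Below is a hedged sketch for a single contrast; the empirical variances omega2 and the other inputs are hypothetical placeholders of our choosing:

```r
nu_ell <- function(omega2, n, phat, cl, gprime = function(x) 1) {
  cl1 <- pmax(cl, 0); cl2 <- pmax(-cl, 0)        # positive/negative parts
  g1 <- gprime(sum(cl1 * phat)); g2 <- gprime(sum(cl2 * phat))
  w2 <- g1^2 * (cl1 > 0) + g2^2 * (cl2 > 0)      # numerator weights
  w4 <- g1^4 * (cl1 > 0) + g2^4 * (cl2 > 0)      # denominator weights
  sum(w2 * omega2 / n)^2 / sum(w4 * omega2^2 / (n^2 * (n - 1)))
}
# with the identity g (gprime = 1), the weights reduce to indicators of the
# contrast's support, in line with the remark above
nu_ell(omega2 = c(0.02, 0.03, 0.025, 0.04), n = c(10, 12, 9, 11),
       phat = c(0.45, 0.50, 0.55, 0.50), cl = c(1, -1, 0, 0))
```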

Using νg, an accurate critical value corresponding to FWER = α and adjusted p-values can be computed. Let \(t_{1-\alpha , 2, \nu ^{g}, \hat {\textit {\textbf {R}}}^{g}}\) denote the two-sided equicoordinate (i.e., the quantiles in each dimension coincide) 100(1 − α)-th percentile of \(\textit {\textbf {t}}(\nu ^{g}, \textbf {0}, \hat {\textit {\textbf {R}}}^{g})\), which serves as the critical value. That is, \(H_{0}^{\ell }\) is rejected if and only if \(|T^{g}_{\ell }| > t_{1-\alpha , 2, \nu ^{g}, \hat {\textit {\textbf {R}}}^{g}}\). Moreover, H0 is rejected if and only if \(\max \{|{T^{g}_{1}}|,\ldots ,|{T^{g}_{q}}|\} > t_{1-\alpha , 2, \nu ^{g}, \hat {\textit {\textbf {R}}}^{g}}\).

Multiple comparison procedures having the above properties are known as single-step procedures. In other words, the results for the overall comparison (H0) and the specific contrasts (\(H_{0}^{\ell }\)) are obtained simultaneously and without contradiction, unlike in the popular two-step procedures. That is, rejection of at least one of \(H_{0}^{\ell }\), ℓ = 1,…,q, automatically implies rejection of H0 (a property known as coherence), and similarly, rejection of H0 automatically implies that at least one of \(H_{0}^{\ell }\), ℓ = 1,…,q, is rejected (a property known as consonance) (Gabriel, 1969). Coherence and consonance are not necessarily guaranteed in the popular two-step procedures, making the proposed single-step nonparametric MCTP much more interpretable and practical.

In addition, the adjusted p-values can be computed directly without relying on the Bonferroni adjustment. In particular, the adjusted p-value pℓ corresponding to \(H_{0}^{\ell }\) can be calculated by finding the value pℓ for which \(t_{1-p_{\ell }, 2, \nu ^{g}, \hat {\textit {\textbf {R}}}^{g}}\) equals the observed value of \(|T^{g}_{\ell }|\). The overall adjusted p-value corresponding to H0 can be calculated as p = min{p1,…,pq}. Analogously to the critical value, \(H_{0}^{\ell }\) and H0 are rejected if and only if pℓ < α and p < α, respectively. As a remark, computations of pℓ and \(t_{1-\alpha , 2, \nu ^{g}, \hat {\textit {\textbf {R}}}^{g}}\) can be easily done using the R package mvtnorm (Hothorn et al., 2008).
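
As a sketch of these computations with mvtnorm (the correlation matrix, degrees of freedom, and observed statistics below are hypothetical placeholders):

```r
library(mvtnorm)
Rhat <- matrix(0.4, 3, 3); diag(Rhat) <- 1   # hypothetical correlation matrix
nu   <- 14                                   # hypothetical degrees of freedom
Tobs <- c(2.4, -1.1, 3.0)                    # hypothetical observed T_l^g

# two-sided equicoordinate critical value for FWER alpha = 0.05
crit <- qmvt(0.95, tail = "both.tails", df = nu, corr = Rhat)$quantile

# adjusted p-values: p_l solves t_{1 - p_l} = |T_l^g|
p_adj <- sapply(abs(Tobs), function(t)
  1 - pmvt(lower = rep(-t, 3), upper = rep(t, 3), df = nu, corr = Rhat))
crit; p_adj      # reject H_0^l iff |T_l^g| > crit, equivalently p_adj < 0.05
min(p_adj)       # overall adjusted p-value for the global H_0
```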

We can also use \(t_{1-\alpha , 2, \nu ^{g}, \hat {\textit {\textbf {R}}}^{g}}\) to obtain approximate 100(1 − α)% SCIs for the treatment effects (effect sizes) gℓ(p) (see Appendix D for a derivation). Note that, whereas a traditional 100(1 − α)% confidence interval for a specific gℓ(p) includes gℓ(p) 100(1 − α)% of the time if the experiment is performed repeatedly, the SCIs must contain the entire vector of true population parameters g(p) 100(1 − α)% of the time.

In general, approximate 100(1 − α)% SCIs for the treatment effects gℓ(p), ℓ = 1,…,q, are given by

$$ \left[g_{\ell}(\hat{\textbf{\textit{p}}}) - t_{1-\alpha, 2, \nu^{g}, \hat{\textbf{\textit{R}}}^{g}}\sqrt{\hat{v}^{g}_{\ell\ell}/N}, \ g_{\ell}(\hat{\textbf{\textit{p}}}) + t_{1-\alpha, 2, \nu^{g}, \hat{\textbf{\textit{R}}}^{g}}\sqrt{\hat{v}^{g}_{\ell\ell}/N}\right]\!.$$
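
Continuing the mvtnorm sketch above (reusing crit, with hypothetical estimates and total sample size):

```r
g_hat <- c(0.42, -0.10, 0.63)  # hypothetical g_l(phat)
v_hat <- c(1.8, 2.1, 1.6)      # hypothetical vhat_ll
N     <- 40                    # hypothetical total sample size
cbind(lower = g_hat - crit * sqrt(v_hat / N),
      upper = g_hat + crit * sqrt(v_hat / N))  # approximate 95% SCIs
```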

Simulation study

A simulation study was conducted to compare the sizes and powers of the nonparametric MCTP with the suggested log odds-type effect sizes (referred to as “Log Odds” in this section) to those of the methods suggested in Konietschke et al. (2012). These competing methods use g(x) = x without any additional transformation (referred to as “Student’s t” in this section) and with Fisher’s z-transformation applied to \(\textit {\textbf {c}}^{\prime }_{\ell }\hat {\textit {\textbf {p}}}\) (referred to as “Fisher” in this section). All sizes and powers are calculated using 10,000 Monte Carlo simulations.

To ensure that the simulation study covers typical cases frequently encountered in real-life situations, we use a set of different sample size combinations and distributions with four groups (i.e., a = 4). The sample size combinations (n1,n2,n3,n4) are (10,10,10,10), (7,10,13,16), and (25,20,15,10), covering equal, increasing, and decreasing sample size cases. The distributions used were the normal, (scaled and shifted) Student’s t with 8 degrees of freedom, lognormal, and scaled beta with a scaling factor of 20, hence covering symmetric and asymmetric as well as light- and heavy-tailed distributions. The means were chosen in such a way that (μ1,μ2,μ3,μ4) = (10,10,10,x), where x varies from 10 to 13 in increments of 0.5, while the variances are all set equal to 9. The contrasts were performed via Tukey’s all-pairwise comparisons and Dunnett’s many-to-one comparisons with the first sample being the control group. The FWER is set at α = 0.05.

The results are summarized in graphs for easier comparison. Figure 1 shows, via boxplots, the sizes of the tests corresponding to the cases with (μ1,μ2,μ3,μ4) = (10,10,10,10). Here, size refers to the probability of falsely rejecting the global null hypothesis (H0). Based on the simulations, the Student’s t method tends to be liberal for the equal and increasing sample size combinations, while the Fisher and log odds methods seem slightly conservative for the decreasing sample size combinations. Overall, the Fisher and log odds methods seem more robust to the various sample size combinations than the Student’s t method.

Fig. 1 Size of the test by sample size combinations. The dashed line indicates the significance level of 0.05

For the powers of the tests, each of the 3 × 4 × 2 = 24 cases is compared using power curves. Here, power refers to the probability of correctly rejecting the global null hypothesis (H0). Figures 2 and 3 represent typical situations. That is, the Fisher and log odds methods have very similar power curves, while the Student’s t method appears to be more powerful. However, this result needs to be interpreted carefully because of the liberal nature of the Student’s t method; that is, the apparent power advantage can be explained by the inflated FWER of the Student’s t method. All other results are displayed in the supplementary material.

Fig. 2 Power of the test with Tukey’s all-pairwise comparisons

Fig. 3 Power of the test with Dunnett’s many-to-one comparisons

Based on the observations above, we may summarize that the Fisher and log odds methods seem equally reliable and powerful while the Student’s t method tends to be liberal. As the log odds method directly calculates easily interpretable effect sizes, this method may be preferred in practice.

Even though the above simulations were run for homoscedastic samples, additional simulations were run for heteroscedastic samples to ensure that the above observations still hold. The results showed that, indeed, the Fisher and log odds methods seem equally reliable and powerful while the Student’s t method tends to be liberal. All the details can be found in the supplementary material.

As a remark, Marozzi (2016) considered quantifying the computation error of sizes and powers calculated via Monte Carlo simulations of permutation tests. Assuming that the p-values are computed exactly from the distribution under the null hypothesis, the upper bound of the root mean squared error (RMSE) of the estimated power is \(0.5/\sqrt {MC}\), where MC is the number of Monte Carlo simulations. However, because permutation tests provide estimated p-values, the actual upper bound is close to \(0.6/\sqrt {MC}\), i.e., approximately a 20% increase.

In this paper, because the p-values are estimated via an approximate multivariate t-distribution, we also expect the upper bound of the RMSE to be higher than \(0.5/\sqrt {MC}\). However, because the multivariate t-distribution is considered quite accurate in approximating the distribution of Tg under H0 (Brunner et al., 1997; Konietschke et al., 2012), we postulate that the upper bound of the RMSE remains closer to \(0.5/\sqrt {MC}\) than to \(0.6/\sqrt {MC}\). A more accurate assessment of the computation error will be considered in a future study.

Real-life application

To illustrate the use of the modified nonparametric MCTP, we reanalyzed data from a neuropsychological study. Bocanegra et al. (2015) examined 40 patients with Parkinson’s disease (PD) to determine whether cognitive deficits are language- or semantics-specific. Among them, 23 participants were diagnosed as not suffering from mild cognitive impairment (PD-nMCI) and 17 were diagnosed as suffering from mild cognitive impairment (PD-MCI). Each subgroup was matched with a control group (Control-nMCI and Control-MCI) of equal sample size, similar average age, similar average years of education, and comparable gender ratio (see Table 1 in Bocanegra et al., 2015). Thus, there were 40 PD patients and 40 control participants. For our purposes, we label the relative effects of PD-nMCI, PD-MCI, Control-nMCI, and Control-MCI as p1, p2, p3, and p4, respectively.

The tests the researchers used to evaluate the semantic representation of actions and objects were the Kissing and Dancing Test (KDT) and the Pyramids and Palm Trees Test (PPT). We focus on the data related to the PPT, which consists of 52 cards showing triplets of images: a cue object-picture at the top of each card (e.g., a pyramid) and two semantically related pictures side by side below it (e.g., a palm tree and a pine tree). The participants’ task is to select the picture most closely related to the cue object-picture (in the example above, the correct choice is the palm tree). Normal cognitive functioning is indicated by correctly choosing 47 or more of the 52 cards (i.e., 90% of the trials), while cognitive impairment is reflected in scores lower than 47.

Figure 4 shows the distribution of PPT scores in each of the four groups. Note that the distributions of the control groups are highly left-skewed and that outliers are present at the lower end in the PD groups. Thus, the nonparametric MCTP can be used to obtain reliable conclusions.

Fig. 4 Distribution of PPT scores in the four groups studied in Bocanegra et al. (2015)

Bocanegra et al. (2015) used the two-tailed Mann–Whitney U test with a significance level of 0.05 to evaluate differences between the groups’ adjusted PPT scores. They performed the following comparisons: 1. PD vs. Control, 2. PD-nMCI vs. Control-nMCI, 3. PD-MCI vs. Control-MCI, and 4. PD-nMCI vs. PD-MCI. For the first three tests, they found significant differences with Cohen’s d effect sizes higher than 1. For the fourth test, they did not find a significant difference.

We applied the nonparametric MCTP with the suggested log odds-type effect sizes described in this paper to the same data, and added a fifth comparison not considered in Bocanegra et al. (2015): 5. Control-nMCI vs. Control-MCI. Table 1 shows the explicit hypotheses being tested as well as the contrast vectors used to test them. The statistical results with effect sizes, 95% SCIs, and adjusted p-values are displayed in Table 2.

Table 1 Hypotheses tested for data from Bocanegra et al. (2015)
Table 2 Results of the nonparametric MCTP analyses using data from Bocanegra et al. (2015)
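
For reference, an analysis of this form can be run with the mctp function of nparcomp (Version 3.0 or later). The sketch below assumes a data frame ppt with a numeric score and a factor group whose levels are ordered as PD-nMCI, PD-MCI, Control-nMCI, Control-MCI; the exact argument names should be checked against ?mctp:

```r
library(nparcomp)
# rows correspond to the five contrasts of Table 1 (columns: p1, p2, p3, p4)
CM <- rbind(c(1/2, 1/2, -1/2, -1/2),  # 1. PD vs. Control
            c(1,   0,   -1,    0),    # 2. PD-nMCI vs. Control-nMCI
            c(0,   1,    0,   -1),    # 3. PD-MCI vs. Control-MCI
            c(1,  -1,    0,    0),    # 4. PD-nMCI vs. PD-MCI
            c(0,   0,    1,   -1))    # 5. Control-nMCI vs. Control-MCI
fit <- mctp(score ~ group, data = ppt, type = "UserDefined",
            contrast.matrix = CM, asy.method = "log.odds",
            conf.level = 0.95)
summary(fit)  # effect estimates, 95% SCIs, and adjusted p-values
```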

For three of the comparisons considered in Bocanegra et al. (2015) (PD vs. Control, PD-nMCI vs. Control-nMCI, and PD-MCI vs. Control-MCI), our nonparametric MCTP also found significant differences at α = 0.05, supporting their results. We also found a significant difference between PD-nMCI and PD-MCI, which their analysis did not detect, suggesting a mild effect of MCI when PD patients are compared. Our fifth comparison did not yield a significant result, which strengthens the findings of Bocanegra et al. (2015), in that no difference between the control groups would be expected if no neurological damage is present. In other words, if this comparison had been significant, then three of the pairwise comparisons carried out by them (those involving control groups) could have been influenced by an unknown factor underlying the control groups.

The effect sizes seen in Table 2 are slightly smaller than those found in Bocanegra et al. (2015), but this can be explained by the type of effect size used. Because Cohen’s d is based on a difference of means, it can be inflated by outliers, such as those found in the PD groups. On the other hand, log odds of relative effects are less affected by these outliers. Still, the effect sizes we found are large enough to indicate medium-to-large effects for all tests that showed statistically significant differences.

Conclusions

In this paper, we have provided a comprehensive review of the nonparametric MCTP of Konietschke et al. (2012) and illustrated the advantages it has over traditional hypothesis testing procedures. In particular, the nonparametric MCTP uses an unweighted reference distribution to eliminate the rock-paper-scissors-like possibility of obtaining paradoxical, nontransitive results in multiple comparisons. Also, it provides strong control of the FWER, allowing researchers to control the likelihood of type I errors appropriately. These advantages make the nonparametric MCTP a practical option for performing multiple comparisons without the need for restrictive assumptions on the data.

Another important novel contribution discussed in this paper is a generalization of the nonparametric MCTP of Konietschke et al. (2012) to accommodate various effect size measures. In particular, the log odds-type effect size can be easily interpreted due to its similarity to Cohen’s d. We have also derived a reliable small-sample approximation of the generalized nonparametric MCTP, which is effective in real-life situations where larger samples are unavailable. Using it, we discussed the calculation of adjusted p-values and SCIs for the effect size measures. Furthermore, the generalized nonparametric MCTP retains important theoretical properties of the original nonparametric MCTP, including the strong control of the FWER, and our simulation study indicates that the power and robustness of the two are comparable. Finally, our reanalysis of the neuropsychological study of Bocanegra et al. (2015) illustrates that the suggested nonparametric MCTP facilitates a rigorous understanding of multiple treatment effects. The generalized nonparametric MCTP with the log odds-type effect sizes is implemented in the mctp function of the R package nparcomp, Version 3.0.

Lastly, recall that the nonparametric MCTPs discussed in this paper are single-step procedures that take the correlations among the test statistics into account. Instead of single-step procedures, step-down procedures, such as that of Bonferroni–Holm (Holm, 1979; Pesarin & Salmaso, 2010), can be applied using the unadjusted p-values. On the other hand, step-up procedures, e.g., Hochberg (1988), are often valid only if the joint distribution of the test statistics satisfies a certain multivariate ordering known as multivariate total positivity of order two (MTP2). For general contrasts, the joint distribution of the test statistics does not fulfill this requirement in general. Nevertheless, the investigation of step-up procedures and their validity in the general nonparametric Behrens–Fisher situation will be part of future research.