1 Introduction

Many research questions give rise to a two-sample problem of comparing distributions (or survival functions). Researchers are often interested in one-sided alternative hypotheses, and the notion of stochastic dominance (stochastic ordering) is needed to formulate such hypotheses rigorously. Let X and Y be random variables with survival functions \(S_X(t) = \mathbb {P}\,(X>t)\) and \(S_Y(t) = \mathbb {P}\,(Y>t)\). We say that X stochastically dominates Y if

$$\begin{aligned} S_X(t) \ge S_Y(t) \ \text {for all }t\hbox { with strict inequality for some }t. \end{aligned}$$

We will sometimes skip the word stochastically and we will just say that X dominates Y. If X dominates Y, we write \(X \succ Y\), or equivalently, \(Y \prec X\). If there exist values \(c_1\) and \(c_2\) such that \(S_X(c_1) > S_Y(c_1)\) and \(S_X(c_2) < S_Y(c_2)\), we say that the survival functions of X and Y cross one another. Stochastic dominance induces four possible hypotheses: (i) X and Y have identical survival functions, (ii) X dominates Y, (iii) Y dominates X, and (iv) the survival functions of X and Y cross one another. The concept of stochastic dominance has been employed in many areas such as economics, psychology, and medicine (see, e.g., Davidson and Duclos 2000; Donald and Hsu 2016; Levy 2016; Ashby et al. 1993; Heck and Erdfelder 2016; Petroni and Wolfe 1994; Ledwina and Wyłupek 2012). As noted by Townsend (1990), stochastic dominance implies but is not implied by the same ordering of the means (if the means exist).
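The four cases can be illustrated with a toy numeric check (not from the paper; `classify_dominance` is a hypothetical helper of ours that compares two survival functions on a finite grid only, so it can miss behavior outside the grid):

```python
import numpy as np

def classify_dominance(S_X, S_Y, grid, tol=1e-12):
    """Classify the relation between two survival functions on a finite grid:
    'identical', 'X dominates Y', 'Y dominates X', or 'crossing'."""
    d = np.array([S_X(t) - S_Y(t) for t in grid])
    above, below = bool(np.any(d > tol)), bool(np.any(d < -tol))
    if above and below:
        return "crossing"
    if above:
        return "X dominates Y"
    if below:
        return "Y dominates X"
    return "identical"

grid = np.linspace(0.1, 5.0, 100)
# Exp(1) vs Exp(2): e^{-t} >= e^{-2t} for all t > 0, strictly for t > 0
print(classify_dominance(lambda t: np.exp(-t), lambda t: np.exp(-2 * t), grid))
# -> X dominates Y
# e^{-t} and e^{-t^2} cross at t = 1
print(classify_dominance(lambda t: np.exp(-t), lambda t: np.exp(-t ** 2), grid))
# -> crossing
```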

If we are interested in classifying the stochastic dominance relation (into the four cases specified above) based on observations of two random variables, a common procedure is the following (Whang 2019). First, perform two separate tests

$$\begin{aligned} H_{01}&: S_X(t) \ge S_Y(t) \;\; \text {for all }t, \,\hbox { against its negation and} \\ H_{02}&: S_Y(t) \ge S_X(t) \;\; \text {for all }t, \,\hbox { against its negation.} \end{aligned}$$

Then,

  1. (a) 

    if neither \(H_{01}\) nor \(H_{02}\) is rejected, decide that X and Y have identical survival functions;

  2. (b) 

    if \(H_{01}\) is rejected and \(H_{02}\) is not rejected, decide that X dominates Y;

  3. (c) 

    if \(H_{01}\) is not rejected and \(H_{02}\) is rejected, decide that Y dominates X;

  4. (d) 

    if both \(H_{01}\) and \(H_{02}\) are rejected, decide that the survival functions of X and Y cross.

However, with this procedure it is difficult to control the possible classification errors, e.g., inferring dominance when in fact the survival functions cross (see, e.g., Whang 2019, p. 106). Bennett (2013) proposed a four-hypothesis testing procedure which allows maintaining (asymptotic) control over the various error probabilities.

Most of the existing tests of stochastic dominance assume independent observations (see, e.g., the recent monograph of Whang 2019 and the references therein). However, many experiments involve repeated measurements from each subject and such observations are not independent. For these designs, appropriate statistical methods that account for the dependence structure of the data are needed. Reducing the repeated measurements to single observations by taking their means or medians is not advisable because the available data are not efficiently used (see, e.g., Roy et al. 2019). Such transformations of the observed data will result in different estimates of the survival functions; in particular, the estimated survival function will have fewer jumps and some information will be lost.

We are not aware of a dominance test with four hypotheses which is suitable for data with repeated measurements. Building upon the ideas of Bennett (2013) and Angelov et al. (2019b), we suggest four-decision testing procedures for repeated measurements data. In Sect. 2, we introduce the testing procedures. Section 3 reports a simulation study. In Sect. 4, the suggested procedures are applied to data from an experiment concerning the willingness to pay for a certain environmental improvement. Proofs and auxiliary results are given in the Appendix.

2 Testing procedures

Let us consider the following mutually exclusive hypotheses about the random variables X and Y:

$$\begin{aligned} H_0&: \;\; X\text { and }Y\text { have identical survival functions} , \\ H_{\succ }&: \;\; X\text { dominates }Y , \\ H_{\prec }&: \;\; Y\text { dominates }X , \\ H_{\mathrm {cr}}&: \;\; \text {the survival functions of }X\text { and } Y\text { cross one another} . \end{aligned}$$

We explore a four-hypothesis testing problem with null hypothesis \(H_0\) and three alternative hypotheses: \(H_{\succ }\), \(H_{\prec }\), and \(H_{\mathrm {cr}}\).

The survival and distribution functions of X and Y are denoted

$$\begin{aligned} S_X(t) = \mathbb {P}\,(X>t), \quad&F_X(t) = 1 - S_X(t), \\ S_Y(t) = \mathbb {P}\,(Y>t), \quad&F_Y(t) = 1 - S_Y(t). \end{aligned}$$

Throughout the paper it is assumed that we have observations

$$\begin{aligned} \{ x_{ij}, y_{ij} \}, \quad i=1,\ldots ,n, \quad j=1,\ldots ,k , \end{aligned}$$

where \(x_{ij}\) is the observed value of X for individual/subject i at occasion j and \(y_{ij}\) is the observed value of Y for individual/subject i at occasion j, i.e., \(\{ x_{ij} \}\) are observations from \(S_X\) and \(\{ y_{ij} \}\) are observations from \(S_Y\). We will also use the notation \(\mathbf{z} _1, \ldots , \mathbf{z} _n\), where \(\mathbf{z} _i = (x_{i1}, \ldots , x_{ik}, y_{i1}, \ldots , y_{ik})\), i.e., \(\mathbf{z} _i\) is the vector of observations for individual/subject i. For simplicity, the observations \(\{ x_{ij}, y_{ij} \}\) denote random variables or values of random variables, depending on the context. Note that the vectors \(\mathbf{z} _1, \ldots , \mathbf{z} _n\) are independent (and identically distributed) but the observations \((x_{i1}, \ldots , x_{ik}, y_{i1}, \ldots , y_{ik})\) within each subject i can be correlated.

The empirical distribution function based on the observations \(\{x_{ij}\}\) is

$$\begin{aligned} \widehat{F}_X (t) = (1/kn) \sum _{i,j} \mathbbm {1}\{x_{ij} \le t\} \end{aligned}$$

and the empirical survival function is \(\widehat{S}_X(t) = 1 - \widehat{F}_X (t)\). The functions \(\widehat{F}_Y (t)\) and \(\widehat{S}_Y(t)\) based on \(\{y_{ij}\}\) are defined analogously. Let us denote \(m = 2kn\), \(\{t_1, \ldots , t_{m}\} = \{ x_{ij}, y_{ij} \}\), \(t_1 \le t_2 \le \ldots \le t_{m}\), and \(z^{(+)} = \max \{z, 0\}\) for any real number z. Let \(\widehat{G}(t) = (1/m) \sum _{l} \mathbbm {1}\{t_{l} \le t\}\),

$$\begin{aligned} \psi ^{\bullet }_{\gamma }(t_l)&= {\left\{ \begin{array}{ll} (\widehat{G}(t_l)[1-\widehat{G}(t_l)])^{-1/\gamma } &{} \text {if}\;\; \widehat{G}(t_l) \in (0,1) \\ 0 &{} \text {otherwise,} \end{array}\right. } \\ \psi _{\gamma }(t_l)&= \frac{ \psi ^{\bullet }_{\gamma }(t_l) }{ \sum _{l'=1}^{m} \psi ^{\bullet }_{\gamma }(t_{l'}) }, \end{aligned}$$

where \(\gamma >1\) is some real number.

We utilize the following test statistics:

  • Modified one-sided Cramér–von Mises statistics

    $$\begin{aligned} \begin{aligned} W_{X \succ Y}&= \frac{1}{m} \,\sum _{l=1}^{m} \left( \widehat{S}_X(t_l) - \widehat{S}_Y(t_l) \right) ^{(+)} , \\ W_{X \prec Y}&= \frac{1}{m} \,\sum _{l=1}^{m} \left( \widehat{S}_Y(t_l) - \widehat{S}_X(t_l) \right) ^{(+)} ; \end{aligned} \end{aligned}$$
  • Modified one-sided Anderson–Darling statistics

    $$\begin{aligned} \begin{aligned} A_{X \succ Y}^{\gamma }&= \sum _{l=1}^{m} \psi _{\gamma }(t_l) \left( \widehat{S}_X(t_l) - \widehat{S}_Y(t_l) \right) ^{(+)} , \\ A_{X \prec Y}^{\gamma }&= \sum _{l=1}^{m} \psi _{\gamma }(t_l) \left( \widehat{S}_Y(t_l) - \widehat{S}_X(t_l) \right) ^{(+)} ; \end{aligned} \end{aligned}$$
  • Modified one-sided Kolmogorov–Smirnov statistics

    $$\begin{aligned} \begin{aligned} D_{X \succ Y}&= \sup _{t} \left( \widehat{S}_X(t) - \widehat{S}_Y(t) \right) , \\ D_{X \prec Y}&= \sup _{t} \left( \widehat{S}_Y(t) - \widehat{S}_X(t) \right) . \end{aligned} \end{aligned}$$

Unlike the classical Cramér–von Mises statistic (see Anderson 1962), we use modified versions which do not take the squares of the differences. Our statistics are in fact one-sided versions of the statistic considered by Schmid and Trede (1995), which performs quite similarly to the classical Cramér–von Mises statistic in two-sample tests against general alternatives (see also Schmid and Trede 1996). The classical Anderson–Darling statistic (see Pettitt 1976) is a weighted version of the Cramér–von Mises statistic with weight \(( \widehat{G}(t_l)[1-\widehat{G}(t_l)] )^{-1}\). We use the modified weight \(\psi _{\gamma }(t_l)\) in our Anderson–Darling statistics (in Sect. 3, we investigate the properties of the corresponding tests for \(\gamma =2\) and \(\gamma =3\)). We define the statistics without a normalizing factor because such a factor is not needed for applying our tests.
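The three pairs of statistics can be sketched in NumPy as follows (a hypothetical implementation of ours, not the authors' R code; it assumes the pooled observations are distinct, so that \(\widehat{G}(t_l) = l/m\) on the sorted points — with ties, \(\widehat{G}\) would have to count all equal values):

```python
import numpy as np

def dominance_statistics(x, y, gamma=2.0):
    """One-sided Cramer-von Mises (W), Anderson-Darling (A), and
    Kolmogorov-Smirnov (D) statistics, each in both directions.
    x, y: n x k arrays of repeated measurements."""
    xf, yf = x.ravel(), y.ravel()
    t = np.sort(np.concatenate([xf, yf]))        # pooled points t_1 <= ... <= t_m, m = 2kn
    m = t.size
    S_X = np.array([np.mean(xf > s) for s in t])  # empirical survival functions
    S_Y = np.array([np.mean(yf > s) for s in t])
    d = S_X - S_Y

    # Anderson-Darling weights: psi proportional to (G(1-G))^(-1/gamma), 0 at the edges
    G = np.arange(1, m + 1) / m                  # G_hat(t_l), assuming no ties
    g = G * (1.0 - G)
    w = np.zeros(m)
    w[g > 0] = g[g > 0] ** (-1.0 / gamma)
    psi = w / w.sum()

    return {
        "W_xy": np.mean(np.maximum(d, 0.0)),     # W_{X succ Y}
        "W_yx": np.mean(np.maximum(-d, 0.0)),    # W_{X prec Y}
        "A_xy": np.sum(psi * np.maximum(d, 0.0)),
        "A_yx": np.sum(psi * np.maximum(-d, 0.0)),
        "D_xy": max(d.max(), 0.0),               # sup over t includes t < t_1, where d = 0
        "D_yx": max((-d).max(), 0.0),
    }
```

For two completely separated samples (all x-values above all y-values), the reverse-direction statistics are exactly zero, which matches the one-sided construction.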

We will describe in detail the testing procedure with the statistics \((W_{X \succ Y} , W_{X \prec Y})\); the procedures with the other test statistics are analogous. A four-hypothesis testing problem implies four decision regions defined by four critical values (see Bennett 2013; Heathcote et al. 2010). Let \(w_{1,\alpha }\) and \(w_{2,\alpha }\) be defined so that \(\mathbb {P}\,( W_{X \succ Y} \ge w_{1,\alpha } \,|\, H_0 )=\alpha\) and \(\mathbb {P}\,( W_{X \prec Y} \ge w_{2,\alpha } \,|\, H_0 )=\alpha\). Similarly, \(w_{1,\alpha ^{\star }}\) and \(w_{2,\alpha ^{\star }}\) are such that \(\mathbb {P}\,( W_{X \succ Y} \ge w_{1,\alpha ^{\star }} \,|\, H_0 )=\alpha ^{\star }\) and \(\mathbb {P}\,( W_{X \prec Y} \ge w_{2,\alpha ^{\star }} \,|\, H_0 )=\alpha ^{\star }\), where \(\alpha ^{\star } > \alpha\). We adopt the following decision rule (cf. Bennett 2013; Angelov et al. 2019b).

Decision rule 1

  1. (a) 

    If \(W_{X \succ Y} < w_{1,\alpha }\) and \(W_{X \prec Y} < w_{2,\alpha }\), then retain \(H_0\).

  2. (b) 

    If \(W_{X \succ Y} \ge w_{1,\alpha }\) or \(W_{X \prec Y} \ge w_{2,\alpha }\), then

    1. (i) 

      if \(W_{X \succ Y} \ge w_{1,\alpha }\) and \(W_{X \prec Y} < w_{2,\alpha ^{\star }}\), then accept \(H_{\succ }\);

    2. (ii) 

      if \(W_{X \succ Y} < w_{1,\alpha ^{\star }}\) and \(W_{X \prec Y} \ge w_{2,\alpha }\), then accept \(H_{\prec }\);

    3. (iii) 

      if \(W_{X \succ Y} \ge w_{1,\alpha ^{\star }}\) and \(W_{X \prec Y} \ge w_{2,\alpha ^{\star }}\), then accept \(H_{\mathrm {cr}}\).

The decision rule is illustrated in Fig. 1. Essentially, \(H_{\succ }\) is accepted if \(W_{X \succ Y}\) is large enough and \(W_{X \prec Y}\) is small enough; similarly, \(H_{\prec }\) is accepted if \(W_{X \prec Y}\) is large enough and \(W_{X \succ Y}\) is small enough. The value of \(\alpha ^{\star }\) controls the discrimination between \(H_{\succ }\), \(H_{\prec }\), and \(H_{\mathrm {cr}}\). Increasing \(\alpha ^{\star }\) results in a larger acceptance region for \(H_{\mathrm {cr}}\) and smaller acceptance regions for \(H_{\succ }\) and \(H_{\prec }\).

To obtain the critical values or the corresponding \(p\textit{-}\)values, we employ a permutation-based approach (sometimes called randomization test approach, see Hemerik and Goeman 2021). That is, we generate random permutations of the data \(\mathbf{z} _1, \ldots , \mathbf{z} _n\), calculate the value of the test statistic for each generated permutation, and then use the resulting empirical distribution of the test statistic as an approximation of the null distribution (see Hemerik and Goeman 2018; Lehmann and Romano 2005, Ch. 15; Romano 1989). A random permutation of the data is generated by randomly choosing (with probability 1/2) between \((x_{i1}, \ldots , x_{ik}, y_{i1}, \ldots , y_{ik})\) and \((y_{i1}, \ldots , y_{ik}, x_{i1}, \ldots , x_{ik})\) for each i. The algorithm is given below. Let \((w_1, w_2) = \texttt {TS}\,(\mathbf{z} _1, \ldots , \mathbf{z} _n)\) denote the value of \(( W_{X \succ Y} , W_{X \prec Y} )\) for the observed data \(\mathbf{z} _1, \ldots , \mathbf{z} _n\). Similarly, \((w_1^{[r]}, w_2^{[r]}) = \texttt {TS}\,( \mathbf{z} _1^{[r]}, \ldots , \mathbf{z} _n^{[r]} )\) is the value of \(( W_{X \succ Y} , W_{X \prec Y} )\) for the dataset \(\mathbf{z} _1^{[r]}, \ldots , \mathbf{z} _n^{[r]}\).

[Algorithm (figure a): permutation procedure for approximating the null distribution]
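The random within-subject swaps can be sketched as follows (a minimal Python analogue of the algorithm, not the authors' code; `stat_fn` stands for any bivariate statistic such as \(( W_{X \succ Y} , W_{X \prec Y} )\)):

```python
import numpy as np

def permute_once(x, y, rng):
    """Swap the X- and Y-blocks of each subject independently with probability 1/2."""
    swap = rng.random(x.shape[0]) < 0.5
    xp, yp = x.copy(), y.copy()
    xp[swap], yp[swap] = y[swap], x[swap]
    return xp, yp

def permutation_null(x, y, stat_fn, R=1000, seed=0):
    """Values (w1^[r], w2^[r]) of a bivariate statistic over R random
    within-subject permutations; approximates the null distribution."""
    rng = np.random.default_rng(seed)
    out = np.empty((R, 2))
    for r in range(R):
        out[r] = stat_fn(*permute_once(x, y, rng))
    return out
```

Because each swap exchanges whole within-subject blocks, the dependence structure within each \(\mathbf{z} _i\) is preserved under permutation.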

Let us define

$$\begin{aligned} p_1 = \mathbb {P}\,( W_{X \succ Y} \ge w_1 \,|\, H_0 ) , \quad p_2 = \mathbb {P}\,( W_{X \prec Y} \ge w_2 \,|\, H_0 ) , \end{aligned}$$

which we call marginal \(p\textit{-}\)values. They can be estimated as follows:

$$\begin{aligned} \widetilde{p}_1 = \frac{1 + \sum _{r=1}^R \,\mathbbm {1} \{ w_1^{[r]} \ge w_1 \}}{R+1} , \quad \widetilde{p}_2 = \frac{1 + \sum _{r=1}^R \,\mathbbm {1} \{ w_2^{[r]} \ge w_2 \}}{R+1} \, . \end{aligned}$$

Then, Decision rule 1 can be expressed in terms of \(\widetilde{p}_1\) and \(\widetilde{p}_2\):

Decision rule 1’

  1. (a) 

    If \(\widetilde{p}_1 > \alpha\) and \(\widetilde{p}_2 > \alpha\), then retain \(H_0\).

  2. (b) 

    If \(\widetilde{p}_1 \le \alpha\) or \(\widetilde{p}_2 \le \alpha\), then

    1. (i) 

      if \(\widetilde{p}_1 \le \alpha\) and \(\widetilde{p}_2 > \alpha ^{\star }\), then accept \(H_{\succ }\);

    2. (ii) 

      if \(\widetilde{p}_1 > \alpha ^{\star }\) and \(\widetilde{p}_2 \le \alpha\), then accept \(H_{\prec }\);

    3. (iii) 

      if \(\widetilde{p}_1 \le \alpha ^{\star }\) and \(\widetilde{p}_2 \le \alpha ^{\star }\), then accept \(H_{\mathrm {cr}}\).

It should be noted that borderline cases may occur when the test statistic is close to the border of the decision region (respectively, a marginal \(p\textit{-}\)value is close to one of the thresholds \(\alpha\) and \(\alpha ^{\star }\)). Therefore, it is advisable to report the conclusion of the test together with the marginal \(p\textit{-}\)values \(\widetilde{p}_1\), \(\widetilde{p}_2\) and the thresholds \(\alpha\), \(\alpha ^{\star }\) (see Angelov et al. 2019b).
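The p-value estimates and Decision rule 1' can be sketched together (hypothetical helper names of ours; `null_stats` is assumed to be an \(R \times 2\) array holding the permuted values \((w_1^{[r]}, w_2^{[r]})\)):

```python
import numpy as np

def marginal_p_values(w1, w2, null_stats):
    """p_tilde_j = (1 + #{r : w_j^[r] >= w_j}) / (R + 1)."""
    R = null_stats.shape[0]
    p1 = (1 + np.sum(null_stats[:, 0] >= w1)) / (R + 1)
    p2 = (1 + np.sum(null_stats[:, 1] >= w2)) / (R + 1)
    return p1, p2

def decide(p1, p2, alpha=0.05, alpha_star=0.96):
    """Decision rule 1' expressed with the marginal p-values."""
    if p1 > alpha and p2 > alpha:
        return "retain H0"
    if p1 <= alpha and p2 > alpha_star:
        return "accept X dominates Y"
    if p1 > alpha_star and p2 <= alpha:
        return "accept Y dominates X"
    # when alpha < alpha_star, the remaining case is exactly
    # p1 <= alpha_star and p2 <= alpha_star, i.e., case (iii)
    return "accept crossing"
```

Note that with \(\alpha < \alpha ^{\star }\) the four branches are exhaustive: if neither one-sided p-value exceeds its respective threshold in cases (i) and (ii), both must be at most \(\alpha ^{\star }\).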

In a testing problem involving just a null hypothesis (the hypothesis of no difference) and an alternative hypothesis (the hypothesis of interest), the event of wrongly accepting the alternative hypothesis is called Type I error, while the event of not accepting the alternative when it is true is called Type II error. In our setting, if \(H_{\succ }\) is the hypothesis of interest, false detection of \(H_{\succ }\) (wrongly accepting \(H_{\succ }\)) and non-detection of \(H_{\succ }\) (not accepting the true \(H_{\succ }\)) can be viewed as analogues of Type I error and Type II error, respectively.

Let \(\mathrm {FDP}\) be the probability of a false detection of dominance (\(H_{\succ }\)) and let \(\mathrm {NDP}\) be the probability of a non-detection of dominance (\(H_{\succ }\)). These probabilities can be expressed as follows:

$$\begin{aligned} \mathrm {FDP}&= \mathbb {P}\,( \text{ accept } \; H_{\succ } \,|\, H_0 ) + \mathbb {P}\,( \text{ accept } \; H_{\succ } \,|\, H_{\mathrm {cr}} ) + \mathbb {P}\,( \text{ accept } \; H_{\succ } \,|\, H_{\prec } ) \\&= \mathrm {FDP}_1 + \mathrm {FDP}_2 + \mathrm {FDP}_3 , \\ \mathrm {NDP}&= \mathbb {P}\,( \text{ retain } \; H_0 \,|\, H_{\succ } ) + \mathbb {P}\,( \text{ accept } \; H_{\mathrm {cr}} \,|\, H_{\succ } ) + \mathbb {P}\,( \text{ accept } \; H_{\prec } \,|\, H_{\succ } ) \\&= \mathrm {NDP}_1 + \mathrm {NDP}_2 + \mathrm {NDP}_3 . \end{aligned}$$

The power to detect dominance (\(H_{\succ }\)) is defined as \(\mathbb {P}\,( \text{ accept } \; H_{\succ } \,|\, H_{\succ } ) = 1-\mathrm {NDP}\).

Let \(U_n\) be a generic notation for the test statistics defined above.

Assumption 1

There exist a nonrandom sequence \(\tau _n\) and a nondegenerate random variable U such that \(\tau _n \longrightarrow \infty\) and, under the null hypothesis, \(\tau _n U_n\) converges in distribution to U as \(n \longrightarrow \infty\).

Assumption 2

The distribution function of U is continuous and strictly increasing at \(u_{\alpha }\), where \(\mathbb {P}\,( U \ge u_{\alpha } \,|\, H_0 )=\alpha\).

Assumption 3

The distribution functions \(F_X\) and \(F_Y\) are continuous.

Assumptions similar to Assumption 1 are common in the literature on subsampling (see, e.g., Politis et al. 1999). Assumption 3 is needed only for the Anderson–Darling test, where it is used for showing that \(m\psi _{\gamma }(t_l), \, l<m\), is asymptotically bounded away from zero (almost surely).

Some results concerning the error probabilities \(\mathrm {FDP}\) and \(\mathrm {NDP}\) are established in the following theorems.

Theorem 1

Suppose that Assumptions 1 and 2 are satisfied. Then the following are true for the proposed Cramér–von Mises test.

  1. (a) 

    \(\mathrm {FDP}_1 \le \alpha\).

  2. (b) 

    \(\mathrm {FDP}_2 + \mathrm {FDP}_3 \longrightarrow 0\) as \(n \longrightarrow \infty\).

  3. (c) 

    \(\mathrm {NDP}_1 + \mathrm {NDP}_2 + \mathrm {NDP}_3 \longrightarrow 0\) as \(n \longrightarrow \infty\).

Theorem 2

Suppose that Assumptions 1, 2, and 3 are satisfied. Then the following are true for the proposed Anderson–Darling test.

  1. (a) 

    \(\mathrm {FDP}_1 \le \alpha\).

  2. (b) 

    \(\mathrm {FDP}_2 + \mathrm {FDP}_3 \longrightarrow 0\) as \(n \longrightarrow \infty\).

  3. (c) 

    \(\mathrm {NDP}_1 + \mathrm {NDP}_2 + \mathrm {NDP}_3 \longrightarrow 0\) as \(n \longrightarrow \infty\).

Theorem 3

Suppose that Assumptions 1 and 2 are satisfied. Then the following are true for the proposed Kolmogorov–Smirnov test.

  1. (a) 

    \(\mathrm {FDP}_1 \le \alpha\).

  2. (b) 

    \(\mathrm {FDP}_2 + \mathrm {FDP}_3 \longrightarrow 0\) as \(n \longrightarrow \infty\).

  3. (c) 

    \(\mathrm {NDP}_1 + \mathrm {NDP}_2 + \mathrm {NDP}_3 \longrightarrow 0\) as \(n \longrightarrow \infty\).

Fig. 1 Decision rule

3 Simulation study

3.1 Setup

We conducted simulations to examine the behavior of the suggested tests in terms of false detection of dominance and power to detect dominance.

Let \(\mathbf{Z} = (X_1, \ldots , X_k, Y_1, \ldots , Y_k)\), where each \(X_j, \, j=1, \ldots ,k,\) is distributed like X and each \(Y_j, \, j=1, \ldots ,k,\) is distributed like Y. Let \(\varvec{\mu } = ( \mu _X, \ldots , \mu _X, \mu _Y, \ldots , \mu _Y )\), \(\varvec{\Sigma }_{XY}\) be a \(k \times k\) matrix with entries \(\rho _{XY}\,\sigma _X \sigma _Y\),

$$\begin{aligned}&\varvec{\Sigma }_{X} = \begin{pmatrix} \sigma _X^2 &{} \rho _{12}\,\sigma _X^2 &{} \ldots &{} \rho _{1k}\,\sigma _X^2 \\ \rho _{12}\,\sigma _X^2 &{} \sigma _X^2 &{} \ldots &{} \rho _{2k}\,\sigma _X^2 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho _{1k}\,\sigma _X^2 &{} \rho _{2k}\,\sigma _X^2 &{} \ldots &{} \sigma _X^2 \end{pmatrix} , \;\;\; \varvec{\Sigma }_{Y} = \begin{pmatrix} \sigma _Y^2 &{} \rho _{12}\,\sigma _Y^2 &{} \ldots &{} \rho _{1k}\,\sigma _Y^2 \\ \rho _{12}\,\sigma _Y^2 &{} \sigma _Y^2 &{} \ldots &{} \rho _{2k}\,\sigma _Y^2 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho _{1k}\,\sigma _Y^2 &{} \rho _{2k}\,\sigma _Y^2 &{} \ldots &{} \sigma _Y^2 \end{pmatrix} , \\&\varvec{\Sigma } = \begin{pmatrix} \varvec{\Sigma }_{X} &{} \varvec{\Sigma }_{XY} \\ \varvec{\Sigma }_{XY} &{} \varvec{\Sigma }_{Y} \end{pmatrix} . \end{aligned}$$

Let \(\text{ N } (\mu , \sigma )\) and \(\text{ La } (\mu , \sigma )\) denote, respectively, normal distribution and Laplace distribution with mean \(\mu\) and standard deviation \(\sigma\), while \(\text{ LN } (\mu , \sigma )\) denotes lognormal distribution with parameters \(\mu\) and \(\sigma\) such that \(X \sim \text{ LN } (\mu , \sigma )\) \(\Longleftrightarrow\) \(\log (X) \sim \text{ N } ( \mu , \sigma ).\)

We generated data from the following distributions:

  1. (a) 

    Multivariate normal distribution with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\), \(\mathbf{Z} \sim \text{ MVN } ( \varvec{\mu }, \varvec{\Sigma } )\).

  2. (b) 

    Multivariate lognormal distribution, \(\mathbf{Z} \sim \text{ MVLN } ( \varvec{\mu }, \varvec{\Sigma } )\) \(\Longleftrightarrow\) \(\log (\mathbf{Z} ) \sim \text{ MVN } ( \varvec{\mu }, \varvec{\Sigma } ) .\)

  3. (c) 

    Multivariate Laplace distribution with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\), \(\mathbf{Z} \sim \text{ MVLa } ( \varvec{\mu }, \varvec{\Sigma } )\), see, e.g., Kotz et al. (2001).
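A NumPy sketch of the data-generating setup for cases (a) and (b) (the paper's simulations use R with the MASS and LaplacesDemon packages; the helper names and the specific parameter values below are ours, chosen for illustration):

```python
import numpy as np

def build_sigma(k, sigma_x, sigma_y, rho, rho_xy):
    """Covariance matrix of Z = (X_1, ..., X_k, Y_1, ..., Y_k).
    rho: k x k correlation matrix across occasions (ones on the diagonal)."""
    rho = np.asarray(rho, dtype=float)
    Sigma_X = rho * sigma_x ** 2
    Sigma_Y = rho * sigma_y ** 2
    Sigma_XY = np.full((k, k), rho_xy * sigma_x * sigma_y)
    return np.block([[Sigma_X, Sigma_XY], [Sigma_XY, Sigma_Y]])

k = 3
rho = np.full((k, k), 0.5)
np.fill_diagonal(rho, 1.0)                 # all within-variable correlations 0.5
Sigma = build_sigma(k, sigma_x=1.0, sigma_y=1.0, rho=rho, rho_xy=0.5)

rng = np.random.default_rng(0)
mu = np.concatenate([np.zeros(k), np.full(k, 0.3)])   # mu_X = 0, mu_Y = 0.3 (illustrative)
z = rng.multivariate_normal(mu, Sigma, size=100)      # MVN draws, one row per subject
x, y = z[:, :k], z[:, k:]
# multivariate lognormal draws follow from log(Z) ~ MVN  <=>  Z ~ MVLN:
z_ln = np.exp(z)
```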

Figure 2 depicts survival functions corresponding to some of the scenarios in the simulations. For generating random numbers from the multivariate normal and the multivariate lognormal, the R package MASS was used (see Venables and Ripley 2002), while for the multivariate Laplace distribution, we used the R package LaplacesDemon (see Statisticat 2018).

All computations were performed with R (see R Core Team 2019). The R code can be obtained from the corresponding author upon request. The results are based on 3000 simulated datasets under each setting; the number of generated random permutations for each dataset is \(R = 4000\), and we use \(\alpha = 0.05\) and \(\alpha ^{\star } = 0.96\) (cf. Angelov et al. 2019b).

3.2 Results

Let CvM, AD2, AD3, and KS denote, respectively, the Cramér–von Mises test, the Anderson–Darling test with \(\gamma =2\), the Anderson–Darling test with \(\gamma =3\), and the Kolmogorov–Smirnov test.

Simulation results concerning false detection of dominance when the truth is \(H_0\) are presented in Fig. 3. For all tests and all sample sizes, the probability of a false detection (\(\mathrm {FDP}_1\)) is less than \(\alpha = 0.05\) (in most cases, it does not even exceed \(\alpha /2 = 0.025\)). One should keep in mind that under \(H_0\) three types of erroneous decisions may occur: accepting \(H_{\succ }\), accepting \(H_{\prec }\), and accepting \(H_{\mathrm {cr}}\). The corresponding error probabilities add up to \(\mathbb {P}\,( \text{ reject } \; H_0 \,|\, H_0 )\), which is not greater than \(2\alpha\).

Figure 4 depicts the probability of a false detection of dominance when the truth is \(H_{\mathrm {cr}}\). The probability of a false detection (\(\mathrm {FDP}_2\)) tends to zero as the sample size increases. For smaller sample sizes, \(\mathrm {FDP}_2\) is smallest for the Anderson–Darling test with \(\gamma =2\), followed by the Anderson–Darling test with \(\gamma =3\), the Cramér–von Mises test, and the Kolmogorov–Smirnov test.

Power curves for \(n=70\) are shown in Fig. 5, where the power to detect dominance is plotted against \(\delta = \mu _Y - \mu _X\). We see that the power approaches one as \(\delta\) increases. Overall, the Cramér–von Mises test is the most powerful. For \(\rho _{XY} = 0.8\), the Kolmogorov–Smirnov test has the lowest power, and the Anderson–Darling tests perform quite similarly to the Cramér–von Mises test. For \(\rho _{XY} = 0.2\), the four tests differ little in terms of power.

Let us consider the following scenarios for the correlation structure:

Scenario (3e): \(k=3\), \(\rho _{12} = \rho _{23} = 0.5\), \(\rho _{13} = 0.5\), \(\rho _{XY} = 0.5\);

Scenario (3ar): \(k=3\), \(\rho _{12} = \rho _{23} = 0.62\), \(\rho _{13} = 0.38\), \(\rho _{XY} = 0.5\);

Scenario (2e): \(k=2\), \(\rho _{12} = 0.5\), \(\rho _{XY} = 0.5\);

Scenario (2ar): \(k=2\), \(\rho _{12} = 0.62\), \(\rho _{XY} = 0.5\).

In Scenario (3e), all correlations are equal to 0.5, while in Scenario (3ar), \(\rho _{12}\), \(\rho _{23}\), and \(\rho _{13}\) are in accordance with an autoregressive process of order one. Scenarios (2e) and (2ar) are defined in analogy to (3e) and (3ar) but with \(k=2\). We performed simulations to investigate the power to accept a fixed hypothesis of dominance for different sample sizes, under the scenarios specified above. The results are illustrated in Fig. 6. The power approaches one as the sample size increases. The Cramér–von Mises test has the highest power. The Anderson–Darling test with \(\gamma =3\) is slightly less powerful. For the normal and the lognormal distributions, the Kolmogorov–Smirnov test has the lowest power. For the Laplace distribution, the Anderson–Darling test with \(\gamma =2\) is the least powerful.

The results for the Cramér–von Mises test under the four scenarios are presented in Fig. 7. We see that the power is higher for \(k=3\) than for \(k=2\). Also, for each k, the scenario where all correlations are equal to 0.5 leads to higher power than the other scenario. In order to further investigate how the correlations between \(X_1, \ldots , X_k\) (respectively, \(Y_1, \ldots , Y_k\)) affect power, we considered the following scenarios:

Scenario (3w): \(k=3\), \(\rho _{12} = \rho _{23} = 0.38\), \(\rho _{13} = 0.38\), \(\rho _{XY} = 0.5\);

Scenario (3s): \(k=3\), \(\rho _{12} = \rho _{23} = 0.62\), \(\rho _{13} = 0.62\), \(\rho _{XY} = 0.5\).

The correlations \(\rho _{12}\), \(\rho _{23}\), and \(\rho _{13}\) are 'weak' in Scenario (3w) and 'strong' in Scenario (3s). The results show that the power is higher when the correlations between \(X_1, \ldots , X_k\) (respectively, \(Y_1, \ldots , Y_k\)) are weaker (see Figs. 8 and 9).

In summary, the Cramér–von Mises test is the most powerful. The Anderson–Darling test with \(\gamma =3\) is slightly less powerful but has lower probability of a false detection of dominance for small sample sizes compared with the Cramér–von Mises test.

Fig. 2 Survival functions

Fig. 3 Probability of a false detection of dominance when the truth is \(H_0\). Results for different sample sizes with \(k=3\), \(\rho _{XY} = \rho _{12} = \rho _{13} = \rho _{23} = 0.8\)

Fig. 4 Probability of a false detection of dominance when the truth is \(H_{\mathrm {cr}}\). Results for different sample sizes with \(k=3\), \(\rho _{XY} = \rho _{12} = \rho _{13} = \rho _{23} = 0.8\), under the settings depicted in Fig. 2 (first row)

Fig. 5 Power to detect dominance. Results for different values of \(\delta\) with \(n=70\), \(k=3\); for all distributions \(\mu _X = 0\), for the normal and the Laplace \(\sigma _X = \sigma _Y = 1\), for the lognormal \(\sigma _X = \sigma _Y = 0.6\). In the first row, \(\rho _{XY} = \rho _{12} = \rho _{13} = \rho _{23} = 0.8\), while in the second row, \(\rho _{XY} = \rho _{12} = \rho _{13} = \rho _{23} = 0.2\)

Fig. 6 Power to detect dominance. Results for different sample sizes under Scenarios (3e), (3ar), (2e), (2ar). The underlying survival functions are shown in Fig. 2 (second row)

Fig. 7 Power to detect dominance. Results for the Cramér–von Mises test under Scenarios (3e), (3ar), (2e), (2ar)

Fig. 8 Power to detect dominance. Results for different sample sizes under Scenarios (3w), (3s). The underlying survival functions are shown in Fig. 2 (second row)

Fig. 9 Power to detect dominance. Results for the Cramér–von Mises test under Scenarios (3w), (3s)

4 Real data example

We apply the proposed tests to data from an experiment where participants were asked about their willingness to pay for an improved outdoor sound environment. The dataset is available at Mendeley Data (Angelov et al. 2019a). In a sound laboratory, the participants listened to recordings of outdoor sound environments and had to imagine that each recording was the noise they heard while sitting on their balcony. They were asked how much they would be willing to pay for a noise reduction that would change a given sound environment with road-traffic noise to an environment without the road-traffic noise. Each participant was requested to answer by means of: (i) a self-selected point (SSP), i.e., the amount in Swedish kronor he/she would be willing to pay per month for the improvement, and (ii) a self-selected interval (SSI), i.e., the lowest and highest amounts he/she would be willing to pay. The experiment included five main scenarios (referred to as Scenario 1, 2, 3, 4, and 5) with systematically increasing noise levels: Scenario 1 corresponds to the smallest noise reduction, while Scenario 5 corresponds to the largest.

The following variables are of main interest:

  • \(\texttt {pt}\) is the point answer at the first or the second SSP session.

  • \(\texttt {low}\) and \(\texttt {upp}\) are, respectively, the lower bound and the upper bound of the interval answered at the first or the second SSI session.

  • \(\texttt {mid}\) is the midpoint of the interval answered at the first or the second SSI session.

Each variable was observed under the five scenarios and these are denoted, e.g., \(\texttt {pt[1]}, \ldots , \texttt {pt[5]}\).

Our analysis is based on \(n=59\) participants (just as in Angelov et al. 2019b), \(\alpha = 0.05\), \(\alpha ^{\star } = 0.96\), and \(R = 20000\).

We are interested in whether the survival function of SSP lies between the survival functions of the lower and the upper bounds of SSI. The conducted dominance tests confirm this in most cases (see Table 1).

We also want to find out whether the respondents are willing to pay more for higher levels of noise reduction. This implies that the willingness to pay under Scenario 2 stochastically dominates the willingness to pay under Scenario 1; similarly, the willingness to pay under Scenario 3 stochastically dominates the willingness to pay under Scenario 2, and so on. The empirical survival functions for each consecutive pair of scenarios are displayed in Fig. 10. We conducted dominance tests (see Table 2) and in most cases the tests conclude that the willingness to pay for the higher level of noise reduction dominates the willingness to pay for the lower level.

The four tests in most cases led to the same conclusion. In Angelov et al. (2019b), analogous tests were performed but separately for the first and the second session, while here we apply the new tests for repeated measurements. If we compare the results, we see that the hypothesis of dominance is accepted slightly more often with the new testing procedures, but overall the conclusions are to a great extent similar.

Table 1 Comparison of self-selected points and self-selected intervals. Conclusion of the test together with the marginal \(p\textit{-}\)values \(\widetilde{p}_1\) and \(\widetilde{p}_2\)
Table 2 Comparison of willingness to pay for different levels of noise reduction. Conclusion of the test together with the marginal \(p\textit{-}\)values \(\widetilde{p}_1\) and \(\widetilde{p}_2\)
Fig. 10 Empirical survival functions

5 Concluding remarks

We proposed permutation-based four-decision tests of stochastic dominance for repeated measurements data. We proved under certain regularity conditions that as the sample size increases, the probability to detect dominance tends to one and the probability of a false detection of dominance does not exceed a pre-specified level. Our simulations indicated good performance of the testing procedures for a range of sample sizes.