Tests of stochastic dominance with repeated measurements data

The paper explores a testing problem which involves four hypotheses, that is, based on observations of two random variables X and Y, we wish to discriminate between four possibilities: identical survival functions, stochastic dominance of X over Y, stochastic dominance of Y over X, or crossing survival functions. Four-decision testing procedures for repeated measurements data are proposed. The tests are based on a permutation approach and do not rely on distributional assumptions. One-sided versions of the Cramér–von Mises, Anderson–Darling, and Kolmogorov–Smirnov statistics are utilized. The consistency of the tests is proven. A simulation study shows good power properties and control of false-detection errors. The suggested tests are applied to data from a psychophysical experiment.


Introduction
Many research questions give rise to a two-sample problem of comparing distributions (or survival functions).Often researchers are interested in one-sided alternative hypotheses and the notion of stochastic dominance (stochastic ordering) is needed in order to formulate rigorously such hypotheses.Let X and Y be random variables with survival functions S X (t) = ℙ (X > t) and S Y (t) = ℙ (Y > t) .We say that X sto- chastically dominates Y if We will sometimes skip the word stochastically and we will just say that X dominates Y.If X dominates Y, we write X ≻ Y , or equivalently, Y ≺ X .If there exist val- ues c 1 and c 2 such that S X (c 1 ) > S Y (c 1 ) and S X (c 2 ) < S Y (c 2 ) , we say that the survival functions of X and Y cross one another.Stochastic dominance induces four possible hypotheses: (i) X and Y have identical survival functions, (ii) X dominates Y, (iii) Y dominates X, and (iv) the survival functions of X and Y cross one another.The concept of stochastic dominance has been employed in many areas such as economics, psychology, and medicine (see, e.g., Davidson and Duclos 2000;Donald and Hsu 2016;Levy 2016;Ashby et al. 1993;Heck and Erdfelder 2016;Petroni and Wolfe 1994;Ledwina and Wyłupek 2012).As noted by Townsend (1990), stochastic dominance implies but is not implied by the same ordering of the means (if the means exist).
If we are interested in classifying the stochastic dominance relation (into the four cases specified above) based on observations of two random variables, a common procedure is the following (Whang 2019) However, with this procedure it is difficult to control the possible classification errors, e.g., inferring dominance when in fact the survival functions cross (see, e.g., Whang 2019, p. 106).Bennett (2013) proposed a four-hypothesis testing procedure which allows maintaining (asymptotic) control over the various error probabilities.Most of the existing tests of stochastic dominance assume independent observations (see, e.g., the recent monograph of Whang 2019 and the references therein).However, many experiments involve repeated measurements from each subject and such observations are not independent.For these designs, appropriate statistical methods that account for the dependence structure of the data are needed.Reducing the repeated measurements to single observations by taking their means or medians is not advisable because the available data are not efficiently used (see, e.g., Roy et al. 2019).Such transformations of the observed data will result in different estimates of the survival functions; in particular, the estimated survival function will have fewer jumps and some information will be lost.
We are not aware of a dominance test with four hypotheses which is suitable for data with repeated measurements.Building upon the ideas of Bennett (2013) and Angelov S X (t) ≥ S Y (t) for all t with strict inequality for some t.
H 01 ∶ S X (t) ≥ S Y (t) for all t, against its negation and H 02 ∶ S Y (t) ≥ S X (t) for all t, against its negation.

3
Tests of stochastic dominance with repeated measurements… et al. (2019b), we suggest four-decision testing procedures for repeated measurements data.In Sect.2, we introduce the testing procedures.Section 3 reports a simulation study.In Sect.4, the suggested procedures are applied to data from an experiment concerning the willingness to pay for a certain environmental improvement.Proofs and auxiliary results are given in the Appendix.

Testing procedures
Let us consider the following mutually exclusive hypotheses about the random variables X and Y: We explore a four-hypothesis testing problem with null hypothesis H 0 and three alternative hypotheses: H ≻ , H ≺ , and H cr .
The survival and distribution functions of X and Y are denoted Throughout the paper it is assumed that we have observations where x ij is the observed value of X for individual/subject i at occasion j and y ij is the observed value of Y for individual/subject i at occasion j, i.e., {x ij } are obser- vations from S X and {y ij } are observations from S Y .We will also use the notation 1 , … , n , where i = (x i1 , … , x ik , y i1 , … , y ik ) , i.e., i is the vector of observations for individual/subject i.For simplicity, the observations {x ij , y ij } denote random variables or values of random variables, depending on the context.Note that the vectors 1 , … , n are independent (and identically distributed) but the observations (x i1 , … , x ik , y i1 , … , y ik ) within each subject i can be correlated.
The empirical distribution function based on the observations {x ij } is and the empirical survival function is where  > 1 is some real number.We utilize the following test statistics: • Modified one-sided Cramér-von Mises statistics   Anderson 1962), we use modified versions which do not take the squares of the differences.Our statistics are in fact one-sided versions of the statistic considered by Schmid and Trede (1995), which has shown quite similar performance as the classical Cramér-von Mises statistic for two-sample tests against general alternatives (see also Schmid and Trede 1996).The classical Anderson-Darling statistic (see Pettitt 1976) is a weighted version of the Cramér-von Mises statistic with weight We consider some modifications in the weight (t l ) of the Anderson-Darling statistics (in Sect.3, we investigate the properties of the corresponding tests for = 2 and = 3 ).We define the statistics without any normal- izing factor because such factor is not needed for applying our tests.We will describe in detail the testing procedure with the statistics (W X≻Y , W X≺Y ) ; the procedures with the other test statistics are analogous.A four- hypothesis testing problem implies four decision regions defined by four critical values (see Bennett 2013;Heathcote et al. 2010).Let w 1, and w 2, be defined so , ; , ; 1 3 Tests of stochastic dominance with repeated measurements… where  ⋆ >  .We adopt the following decision rule (cf.The decision rule is illustrated in Fig. 1.Essentially, H ≻ is accepted if W X≻Y is large enough and W X≺Y is small enough; similarly, H ≺ is accepted if W X≺Y is large enough and W X≻Y is small enough.The value of  ⋆ controls the discrimination between H ≻ , H ≺ , and H cr .Increasing the value of  ⋆ results in larger acceptance region for H cr and smaller acceptance regions for H ≻ and H ≺ . To obtain the critical values or the corresponding p-values, we employ a per- mutation-based approach (sometimes called randomization test approach, see Hemerik and Goeman 2021).That is, we generate random permutations of the data 1 , … , n , calculate the value of the test statistic for each generated permutation, and then use the resulting empirical distribution of the test statistic as an approximation of the null distribution (see Hemerik and Goeman 2018;Lehmann and Romano 2005, Ch. 15;Romano 1989).A random permutation of the data is generated by randomly choosing (with probability 1/2) between (x i1 , … , x ik , y i1 , … , y ik ) and (y i1 , … , y ik , x i1 , … , x ik ) for each i.The algorithm is given below.Let (w 1 , w 2 ) = ( 1 , … , n ) denote the value of (W X≻Y , W X≺Y ) for the observed data 1 , … , n .Similarly, (w [r]  1 , w [r]  2 ) = ( [r]  1 , … , [r]  n ) is the value of (W X≻Y , W X≺Y ) for the dataset [r]  1 , … , [r]  n .
for r = 1, . . ., R Let us define which we call marginal p-values.They can be estimated as follows: Then, Decision rule 1 can be expressed in terms of p1 and p2 : Decision rule 1' It should be noted that borderline cases may occur when the test statistic is close to the border of the decision region (respectively, a marginal p-value is close to one of the thresholds and  ⋆ ).Therefore, it is advisable to report the conclusion of the test together with the marginal p-values p1 , p2 and the thresholds ,  ⋆ (see Angelov et al. 2019b).
In a testing problem involving just a null hypothesis (the hypothesis of no difference) and an alternative hypothesis (the hypothesis of interest), the event of wrongly accepting the alternative hypothesis is called Type I error, while the event of not accepting the alternative when it is true is called Type II error.In our setting, if H ≻ is the hypothesis of interest, false detection of H ≻ (wrongly accepting H ≻ ) and non- detection of H ≻ (not accepting the true H ≻ ) can be viewed as analogues of Type I error and Type II error, respectively.
1 3 Tests of stochastic dominance with repeated measurements… Let FDP be the probability of a false detection of dominance ( H ≻ ) and let NDP be the probability of a non-detection of dominance ( H ≻ ).These probabilities can be expressed as follows: The power to detect dominance ( Let U n be a generic notation for the test statistics defined above.
Assumption 1 There exist a nonrandom sequence n and a nondegenerate random variable U such that n ⟶ ∞ and under the null hypothesis n U n converges in dis- tribution to U as n ⟶ ∞.

Assumption 2
The distribution function of U is continuous and strictly increasing at u , where Assumption 3 The distribution functions F X and F Y are continuous.
Assumptions similar to Assumption 1 are common in the literature on subsampling (see, e.g., Politis et al. 1999).Assumption 3 is needed only for the Anderson-Darling test, where it is used for showing that m  (t l ), l < m , is asymptotically bounded away from zero (almost surely).
Some results concerning the error probabilities FDP and NDP are established in the following theorems.
Theorem 1 Suppose that Assumptions 1 and 2 are satisfied.Then the following are true for the proposed Cramér-von Mises test.
Theorem 2 Suppose that Assumptions 1, 2, and 3 are satisfied.Then the following are true for the proposed Anderson-Darling test.
Theorem 3 Suppose that Assumptions 1 and 2 are satisfied.Then the following are true for the proposed Kolmogorov-Smirnov test.
3 Simulation study

Setup
We conducted simulations to examine the behavior of the suggested tests in terms of false detection of dominance and power to detect dominance.
We generated data from the following distributions: (a) Multivariate normal distribution with mean vector and covariance matrix , (c) Multivariate Laplace distribution with mean vector and covariance matrix , ∼ MVLa ( , ) , see, e.g., Kotz et al. (2001).
Figure 2 depicts survival functions corresponding to some of the scenarios in the simulations.For generating random numbers from the multivariate normal and the multivariate lognormal, the R package MASS was used (see Venables and Ripley 2002), while for the multivariate Laplace distribution, we used the R package LaplacesDemon (see Statisticat 2018).All computations were performed with (see R Core Team 2019).The R code can be obtained from the corresponding author upon request.The results are based on 3000 simulated datasets under each setting; the number of generated random 1 3 Tests of stochastic dominance with repeated measurements… permutations for each dataset is R = 4000 , = 0.05 , and  ⋆ = 0.96 (cf.Angelov  et al. 2019b).
Simulation results concerning false detection of dominance when the truth is H 0 are presented in Fig. 3.For all tests and all sample sizes, the probability of a false detection ( FDP 1 ) is less than = 0.05 (in most cases, it is even not greater than ∕2 = 0.025 ).One should not forget that under H 0 three types of erroneous deci- sions may occur: accepting H ≻ , accepting H ≺ , and accepting H cr .The corresponding error probabilities add up to ℙ ( reject H 0 | H 0 ) , which is not greater than 2 .
Figure 4 depicts the probability of a false detection of dominance when the truth is H cr .The probability of a false detection ( FDP 2 ) tends to zero as the sample size increases.For smaller sample sizes, FDP 2 is smallest for the Anderson-Darling test with = 2 , followed by the Anderson-Darling test with = 3 , the Cramér-von Mises test, and the Kolmogorov-Smirnov test.
Power curves for n = 70 are shown in Fig. 5, where the power to detect domi- nance is plotted against = Y − X .We see that the power gets closer to one as increases.Overall, the Cramér-von Mises test is the most powerful.For XY = 0.8 , In Scenario (3e), all correlations are equal to 0.5, while in Scenario (3ar), 12 , 23 , and 13 are in accordance with an autoregressive process of order one.Scenarios (2e) and (2ar) are defined in analogy to (3e) and (3ar) but with k = 2 .We performed simulations to investigate the power to accept a fixed hypothesis of dominance for different sample sizes, under the scenarios specified above.The results are illustrated in Fig. 6.The power approaches one as the sample size increases.The Cramér-von Mises test has the highest power.The Anderson-Darling test with = 3 is slightly less powerful.For the normal and the lognormal The results for the Cramér-von Mises test under the four scenarios are presented in Fig. 7.We see that the power is higher for k = 3 than for k = 2 .Also, for each k, the scenario where all correlations are equal to 0.5 leads to higher power than the other scenario.In order to further investigate how the correlations between X 1 , … , X k (respectively, Y 1 , … , Y k ) affect power, we considered the fol- lowing scenarios: The correlations 12 , 23 , and 13 are 'weak' in Scenario (3w) and 'strong' in Scenario (3s).The results show that the power is higher when the correlations between X 1 , … , X k (respectively, Y 1 , … , Y k ) are weaker (see Figs. 8 and 9).In summary, the Cramér-von Mises test is the most powerful.The Anderson-Darling test with = 3 is slightly less powerful but has lower probability of a false detection of dominance for small sample sizes compared with the Cramér-von Mises test.Tests of stochastic dominance with repeated measurements…

Real data example
We apply the proposed tests to data from an experiment where participants were asked about their willingness to pay for an improved outdoor sound environment.The dataset is available at Mendeley Data (Angelov et al. 2019a).In a sound  laboratory, the participants listened to recordings of outdoor sound environments and had to imagine that each recording was the noise they hear while sitting on their balcony.They were asked how much they would be willing to pay for a noise reduction that would change a given sound environment with road-traffic noise to an environment without the road traffic noise.Each participant was requested to answer by means of: (i) a self-selected point (SSP), i.e., the amount in Swedish kronor he/ she would be willing to pay per month for the improvement, and (ii) a self-selected interval (SSI), i.e., the lowest and highest amounts he/she would be willing to pay.The experiment included five main scenarios (referred to as Scenario 1, 2, 3, 4, and 5) with systematically increasing noise levels: Scenario 1 corresponds to the smallest noise reduction, while Scenario 5 corresponds to the largest.Each participant gave answers for each scenario four times: at two SSP sessions and two SSI sessions.
The following variables are of main interest: • is the point answer at the first or the second SSP session.
• and are, respectively, the lower bound and the upper bound of the interval answered at the first or the second SSI session.
• is the midpoint of the interval answered at the first or the second SSI session.
Each variable was observed under the five scenarios and these are denoted, e.g., Our analysis is based on n = 59 participants (just as in Angelov et al. 2019b), = 0.05 ,  ⋆ = 0.96 , and R = 20000.We are interested in whether the survival function of SSP lies between the survival functions of the lower and the upper bounds of SSI.The conducted dominance tests confirm this in most cases (see Table 1).
We also want to find out whether the respondents are willing to pay more for higher levels of noise reduction.This implies that the willingness to pay under Scenario 2 stochastically dominates the willingness to pay under Scenario 1; similarly, the willingness to pay under Scenario 3 stochastically dominates the willingness to pay under Scenario 2, and so on.The empirical survival functions for each consecutive pair of scenarios are displayed in Fig. 10.We conducted dominance tests (see Table 2) and in 1 3 Tests of stochastic dominance with repeated measurements… most cases the tests conclude that the willingness to pay for the higher level of noise reduction dominates the willingness to pay for the lower level.The four tests in most cases led to the same conclusion.In Angelov et al. (2019b), analogous tests were performed but separately for the first and the second session, while here we apply the new tests for repeated measurements.If we compare the results, we see that the hypothesis of dominance is accepted slightly more often with the new testing procedures, but overall the conclusions are to a great extent similar.

Concluding remarks
We proposed permutation-based four-decision tests of stochastic dominance for repeated measurements data.We proved under certain regularity conditions that as the sample size increases, the probability to detect dominance tends to one and the probability of a false detection of dominance does not exceed a pre-specified level.Our simulations indicated good performance of the testing procedures for a range of sample sizes.

Appendix
Proof Follows directly from the result of Tucker (1959).Tests of stochastic dominance with repeated measurements… Proof The claim is equivalent to sup t ( FY (t) − FX (t)) a.s.
. Using some trivial inequalities and Lemma 1, we obtain ◻ Lemma 3 The following hold true.
(a) If H ≻ is true, then, with probability one, (b) If H cr is true, then, with probability one, and Proof Part (a).Let  > 0 and Thus, for every  > 0 , there exists a sufficiently large (random) N such that with probability one, That is, for small enough , for all n > N , and for each t l ∈ A we have, with prob- ability one, Recall that m = 2kn .Then, by the strong law of large numbers, with probability one, .
which was to be shown.
If H cr is true, A and B are nonempty for small enough and ℙ (y i1 ∈ A  ) > 0 , ℙ (x i1 ∈ B  ) > 0 .Using similar arguments as in Part (a), we get that for small enough , there exists a sufficiently large (random) N 1 such that for all n > N 1 and for each t l ∈ A we have, with probability one, Similarly, for small enough , there exists a sufficiently large (random) N 2 such that for all n > N 2 and for each t l ∈ B we have, with probability one, Then, by the strong law of large numbers, with probability one, and which completes the proof.◻ Lemma 4 Suppose that Assumption 3 is satisfied.Then the following hold true.
1 3 Tests of stochastic dominance with repeated measurements… (a) If H ≻ is true, then, with probability one, (b) If H cr is true, then, with probability one, and Proof Part (a).Using similar arguments as in the proof of Lemma 3, we get that for small enough , for all n > N , and for each t l ∈ A we have, with probability one, For  > 1 , we have, almost surely, where Γ(⋅) is the gamma function.Then, by the strong law of large numbers, with probability one, which was to be shown.Part (b).Using similar arguments as above, we get that for small enough , there exists a sufficiently large (random) N 1 such that for all n > N 1 and for each t l ∈ A we have, with probability one, Also, for small enough , there exists a sufficiently large (random) N 2 such that for all n > N 2 and for each t l ∈ B we have, with probability one, Then, by the strong law of large numbers, with probability one, and which completes the proof.Let a 1, , a 2, , a 1, ⋆ , and a 2, ⋆ be defined so that Let d 1, , d 2, , d 1, ⋆ , and d 2, ⋆ be defined so that
Let q be a generic notation for the estimated quantile of U n (under the null hypothesis) based on the permutation procedure.Recall that u is the quantile of U.
The following result will be of use.
Lemma 6 Suppose that Assumptions 1 and 2 are satisfied.Then under the null Proof Follows from Theorem 15.2.3 in Lehmann and Romano (2005).◻ Lemma 7 Let W n and wn be nonnegative random variables such that W n a.s.
Proof Boundedness in probability implies that for every  > 0 , there exists a posi- tive constant B and an integer N such that ℙ (w n ≤ B) ≥ 1 − for all n > N .We have where ℙ Because is arbitrary, the claim follows.
◻ Proposition 1 Suppose that Assumptions 1 and 2 are satisfied.Then the following hold true.
Proposition 2 Suppose that Assumptions 1, 2, and 3 are satisfied.Then the following hold true.

Proof (a)
Using Lemma 4, we get that if H ≻ is true, then  n A X≻Y a.s.

(d)
Using Lemma 4, we get that if H cr is true, then  n A X≻Y a.s.
⟶ ∞ and  n A X≺Y a.s.

Fig. 8
Fig. 8 Power to detect dominance.Results for different sample sizes under Scenarios (3w), (3s).The underlying survival functions are shown in Figure 2 (second row)

⟶ 0
as n ⟶ ∞.Proof Part (a) and Part (b) follow using similar reasoning as in the proofs of Lemma 3 and Lemma 4. Part (c) follows directly from Lemma 2. ◻

Table 1
Comparison of self-selected points and self-selected intervals.Conclusion of the test together with the marginal p-values p1 and p2

Table 2
Comparison of willingness to pay for different levels of noise reduction.Conclusion of the test together with the marginal p-values p1 and p2 d) Using Lemma 3, we get that if H cr is true, then  n W X≻Y a.s.∞ as n ⟶ ∞ .From Lemma 6,  n � w 1, ⋆ and  n � w 2, ⋆ are bounded in probability.Then, using Lemma 7, ℙ (W X≻Y ⟶