Testing for the presence of treatment effect under selection on observables

The evaluation of the possible effects of a treatment on an outcome plays a central role in both the theoretical and applied statistical and econometric literature. This paper focuses on nonparametric tests for a possible difference in the distribution of potential outcomes due to receiving or not receiving a treatment. The approach is based on weighting observed data on the basis of the estimated propensity score. Kolmogorov–Smirnov type and Wilcoxon–Mann–Whitney type tests are constructed, and their limiting distributions are studied. Rejection regions are obtained by inverting confidence intervals. This involves the study of appropriate estimators of the limiting variance of test statistics. Approximations of quantiles via subsampling are also considered. The merits of the different tests are studied by Monte Carlo simulation. An application to the construction of tests for stochastic dominance is provided.


General aspects
The evaluation of the possible effects of a treatment on an outcome plays a central role in both the theoretical and applied statistical and econometric literature; cfr. the excellent review papers by Athey and Imbens (2017) and Imbens and Wooldridge (2009). The main source of difficulty is that data are usually observational, so that the estimation of the treatment effect by simply comparing outcomes for treated vs. control subjects is prone to a relevant source of bias: receiving a treatment is not a "purely random" event, and there could be relevant differences between treated and control subjects. This motivates the need to account for confounding covariates.
As it appears from Sect. 3 of Imbens and Wooldridge (2009), the literature is mainly concerned with estimation of the difference between the expected values of outcomes for treated and control (untreated) subjects, i.e. the ATE (Average Treatment Effect). Another quantity of interest is the effect of treatment on outcome quantiles, which is summarized by the QTE (Quantile Treatment Effect). Several different techniques have been proposed to estimate the ATE, under various assumptions (see Athey and Imbens 2017, Imbens and Wooldridge 2009 and references therein). As far as the QTE is concerned, cfr. the paper by Firpo (2007).
Much less effort is devoted to testing hypotheses on the treatment effect, as stressed in Imbens and Wooldridge (2009). Using the symbols of Sect. 1.2, "one question of interest is whether there is any effect of the program, that is whether the distribution of $Y_{(1)}$ differs from that of $Y_{(0)}$. This is equivalent to the hypothesis that not just the mean, but all moments, are identical in the two treatment groups" (cfr. Imbens and Wooldridge 2009). Notable exceptions are in Abadie (2002), where tests are studied in settings with randomized experiments, and possibly with instrumental variables, and Crump et al. (2008), where tests for the hypothesis ATE = 0, as well as tests for the null hypothesis that there is no effect on average outcome conditional on the covariates, are proposed. In the present paper, we propose new nonparametric tests for the presence of a treatment effect. Such tests are essentially based on nonparametric estimates of the distribution functions of potential outcomes. In particular, nonparametric Wilcoxon-Mann-Whitney type and Kolmogorov-Smirnov type tests for two-group comparison are considered. Their main merit is to go beyond the simple difference in expectations of potential outcomes, i.e. beyond testing for the treatment effect on the basis of the ATE, in order to capture the possible difference between treated and untreated subjects due to a difference in the shape of their distributions.
Testing the hypotheses of treatment effect has received considerable attention mainly in the case of a complete randomization scheme for the assignment-to-treatment mechanism; cfr. Ding (2017) and Li et al. (2018), where permutation tests are proposed. Similarities and differences with the present paper are stressed in Sect. 3.

Problem description
Let $Y$ be an outcome of interest, observed on a sample of $n$ independent subjects. Some of the sample units are treated with an appropriate treatment (treated group); the other sample units are untreated (control group). If $T$ denotes the treatment indicator variable, then whenever $T = 1$, $Y_{(1)}$ is observed; otherwise, if $T = 0$, $Y_{(0)}$ is observed. Here, $Y_{(1)}$ and $Y_{(0)}$ are the potential outcomes due to receiving or not receiving the treatment, respectively. The observed outcome is then equal to $Y = T Y_{(1)} + (1 - T) Y_{(0)}$. In the sequel, $F_1(y) = P(Y_{(1)} \le y)$ will denote the distribution function (d.f.) of $Y_{(1)}$, and $F_0(y) = P(Y_{(0)} \le y)$ the d.f. of $Y_{(0)}$.
Since receiving a treatment is not a purely random event, as it would be in an experimental framework, there could be relevant differences between treated and untreated subjects, due to the presence of confounding covariates. In the sequel, we will denote by $X$ the (random) vector of relevant covariates, which is assumed to be observed.
In order to get consistent estimates, identification restrictions are necessary. The relevant restriction assumed in the sequel is that selection for treatment is based on observable variables: given a set of observed covariates, assignment either to the treatment group or to the control group is random. Formally speaking, let $p(x) = P(T = 1 | X = x)$ be the conditional probability of receiving the treatment given covariates $X$; it is the propensity score. The marginal probability of being treated is $P(T = 1) = E[p(X)]$. In the sequel, the main assumption is strong ignorability (cfr. Rosenbaum and Rubin 1983). In more detail, consider the joint distribution of $(Y_{(1)}, Y_{(0)}, T, X)$, and denote by $\mathcal{X}$ the support of $X$. The following assumptions are assumed to hold.
Assumption (i) is also known as Conditional Independence Assumption (CIA).
For the sake of simplicity, we will use in the sequel the notation

$$p_1(x) = p(x), \quad p_0(x) = 1 - p(x). \qquad (1)$$

From (i)-(iii), the basic relationships

$$F_j(y) = E\left[ \frac{I_{(T = j)} \, I_{(Y \le y)}}{p_j(X)} \right], \quad j = 1, 0 \qquad (2)$$

follow. The estimation of the ATE is a problem of primary importance in the literature, and several different approaches have been proposed (Athey and Imbens 2017 and references therein). Another parameter of interest is the Quantile Treatment Effect (QTE), which is the difference between the quantiles of order $p$ of $F_1$ and $F_0$, namely $F_1^{-1}(p) - F_0^{-1}(p)$, $0 < p < 1$ (Firpo 2007). In particular, when $p = 1/2$, it reduces to the Median Treatment Effect.

As already remarked, in the present paper, we focus on testing for treatment effect, where the null hypothesis is the equality of $F_0$ and $F_1$ (absence of treatment effect). Now, testing for a treatment effect has received considerable attention within the complete randomization scheme (Ding 2017, Li et al. 2018). Let $n_1 = T_1 + \cdots + T_n$ and $n_0 = n - n_1$ be the numbers of treated and untreated sample units, respectively. The basic assumption of the above mentioned papers is that the distribution of $(T_1, \ldots, T_n)$, given the covariates $X_i$s, is such that each value $(t_1, \ldots, t_n) \in \{0, 1\}^n$ with $t_1 + \cdots + t_n = n_1$ has probability $n_1! \, n_0! / n!$, which does not depend on the values of any observed (or unobserved) covariates. On the contrary, in the present paper, the "selection on observables" assumption is made.
A second important difference is that, if $Y_{i,(0)}$, $Y_{i,(1)}$ are the potential outcomes for sample unit $i$, in Ding (2017) and Li et al. (2018) $Y_{i,(0)}$ and $Y_{i,(1)}$ are considered as fixed, although unknown. The only involved probability distribution is that of $(T_1, \ldots, T_n)$. The main hypotheses of the treatment effect are essentially two: the sharp hypothesis $Y_{i,(0)} = Y_{i,(1)}$ for all $i$s, and the weak hypothesis $\sum_{i=1}^{n} Y_{i,(0)} / n = \sum_{i=1}^{n} Y_{i,(1)} / n$. In the present paper, an extra source of variability is considered, namely the probability distribution of $Y_{i,(0)}$ and $Y_{i,(1)}$, that can be viewed as a superpopulation model (cfr. Cassel et al. 1977). The hypothesis $F_0 = F_1$ is in a sense in between the sharp and the weak hypotheses, because it is equivalent to testing $Y_{i,(0)} \overset{d}{=} Y_{i,(1)}$, where $\overset{d}{=}$ denotes equality in distribution.

Basic limiting results
The basic approach to the estimation of $F_1$, $F_0$ is in Donald and Hsu (2014). A crucial point consists in estimating the propensity score $p(x) = P(T = 1 | X = x)$. A nonparametric estimator based on a logit series estimation is developed in Hirano et al. (2003). The essential idea consists in writing the propensity score $p(x)$ in the form $L(h_0(x))$, where $L(z) = e^z / (1 + e^z)$ is the logistic function. In the second place, $h_0(x)$ is approximated through a (linear) sieve $h_K(x) = H_K(x)^T \pi_K$ (with $K$ depending on the sample size), $H_K(x)$ being a vector of polynomials in the components of $x$. The $K$-dimensional vector $\pi_K$ is estimated by the maximum likelihood method:

$$\hat{\pi}_K = \arg\max_{\pi} \sum_{i=1}^{n} \left\{ T_i \log L(H_K(x_i)^T \pi) + (1 - T_i) \log \left( 1 - L(H_K(x_i)^T \pi) \right) \right\}.$$

In Kim (2013, 2019), a generalization including the case of splines is considered.
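The logit series idea can be sketched in a few lines of code. The following is only an illustration under simplifying assumptions that are not in the paper: a scalar covariate, a plain power basis for $H_K(x)$, a fixed $K$, and simulated data with true propensity score $L(0.5 + x)$. The names `power_basis`, `fit_logit_series` and `propensity` are hypothetical.

```python
import math
import random


def logistic(z):
    """L(z) = e^z / (1 + e^z), evaluated in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def power_basis(x, K):
    """H_K(x) = (1, x, ..., x^(K-1)) for a scalar covariate (an assumption)."""
    return [x ** k for k in range(K)]


def solve_linear(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][j] * u[j] for j in range(r + 1, n))
        u[r] = s / M[r][r]
    return u


def fit_logit_series(x, t, K, max_iter=50, tol=1e-10):
    """Newton-Raphson maximisation of the logit log-likelihood in pi_K."""
    H = [power_basis(xi, K) for xi in x]
    coef = [0.0] * K
    for _ in range(max_iter):
        probs = [logistic(sum(h * c for h, c in zip(Hi, coef))) for Hi in H]
        # Score vector: sum_i (T_i - p_i) H_K(x_i).
        score = [sum((ti - pr) * Hi[k] for ti, pr, Hi in zip(t, probs, H))
                 for k in range(K)]
        # Information matrix: sum_i p_i (1 - p_i) H_K(x_i) H_K(x_i)^T.
        info = [[sum(pr * (1.0 - pr) * Hi[k] * Hi[l] for pr, Hi in zip(probs, H))
                 for l in range(K)] for k in range(K)]
        step = solve_linear(info, score)
        coef = [c + s for c, s in zip(coef, step)]
        if max(abs(s) for s in step) < tol:
            break
    return coef


def propensity(x, coef):
    """Fitted propensity score L(H_K(x)^T pi_hat)."""
    basis = power_basis(x, len(coef))
    return logistic(sum(h * c for h, c in zip(basis, coef)))


# Illustrative data: true propensity score L(0.5 + x), x uniform on [-1, 1].
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
t = [1 if rng.random() < logistic(0.5 + xi) else 0 for xi in x]
coef_hat = fit_logit_series(x, t, K=2)
```

In practice $K$ would grow with the sample size, and a vector covariate would need a multivariate basis; neither is attempted in this sketch.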
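For notational simplicity, and similarly to (1), define:

$$\hat{p}_{1,n}(x) = \hat{p}_n(x), \quad \hat{p}_{0,n}(x) = 1 - \hat{p}_n(x),$$

where $\hat{p}_n(x)$ is the estimated propensity score. In order to estimate $F_1$ and $F_0$, in Donald and Hsu (2014), the following "Hájek-type" estimators are considered:

$$\hat{F}_{j,n}(y) = \sum_{i=1}^{n} w^{(j)}_{i,n} \, I_{(Y_i \le y)}, \quad j = 1, 0,$$

where

$$w^{(j)}_{i,n} = \frac{I_{(T_i = j)} / \hat{p}_{j,n}(x_i)}{\sum_{k=1}^{n} I_{(T_k = j)} / \hat{p}_{j,n}(x_k)}, \quad i = 1, \ldots, n. \qquad (5)$$

The large sample distribution of the above estimators is studied via the bivariate process:

$$W_n(y) = \begin{bmatrix} W_{1,n}(y) \\ W_{0,n}(y) \end{bmatrix} = \begin{bmatrix} \sqrt{n} \, ( \hat{F}_{1,n}(y) - F_1(y) ) \\ \sqrt{n} \, ( \hat{F}_{0,n}(y) - F_0(y) ) \end{bmatrix}, \quad y \in \mathbb{R},$$

that plays the same role as the empirical process in classical nonparametric statistics.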
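As a small numerical illustration of the Hájek-type construction, the sketch below computes normalised inverse-propensity weights and the resulting weighted empirical d.f. on toy data. The function names and the toy values (including the fitted propensity scores `p_hat`) are hypothetical and serve only to make the arithmetic visible.

```python
def hajek_weights(t, p_hat, j):
    """Weights proportional to I(T_i = j) / p_j(x_i), normalised to sum to 1."""
    raw = []
    for ti, pi in zip(t, p_hat):
        pj = pi if j == 1 else 1.0 - pi
        raw.append(1.0 / pj if ti == j else 0.0)
    total = sum(raw)
    return [r / total for r in raw]


def hajek_cdf(y_obs, weights, y):
    """Weighted empirical d.f. evaluated at the point y."""
    return sum(w for yi, w in zip(y_obs, weights) if yi <= y)


# Toy data: two treated and two untreated units, with fitted propensity scores.
t = [1, 1, 0, 0]
p_hat = [0.5, 0.25, 0.5, 0.2]
y_obs = [1.0, 2.0, 3.0, 4.0]

w1 = hajek_weights(t, p_hat, 1)   # treated units get weights 1/3 and 2/3
w0 = hajek_weights(t, p_hat, 0)   # untreated units get weights 8/13 and 5/13
F1_at_1 = hajek_cdf(y_obs, w1, 1.0)
F0_at_3 = hajek_cdf(y_obs, w0, 3.0)
```

Because the weights are normalised within each group, each estimated d.f. reaches exactly 1 at the largest observed outcome, which is the feature that distinguishes the Hájek-type form from an unnormalised inverse-probability-weighted one.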
The subsequent result is a minor generalization of Donald and Hsu (2014), based on Theorem 3.1 in Kim (2013).Its main interest is that it covers the case of propensity scores nonparametrically estimated through arbitrary link functions (for instance, Probit instead of Logit) and constructed through sieves not necessarily based on polynomials (for instance splines, as in Kim (2013)).
Due to the continuity of $F_1$, $F_0$, the weak convergence of Proposition 1 also holds in the space $D[-\infty, +\infty]^2$ of $\mathbb{R}^2$-valued càdlàg functions equipped with the Skorokhod topology.
The limiting process $W(\cdot)$ in Proposition 1 is a Gaussian process, possessing trajectories that are a.s. continuous. This result will be used in the next sections.

Proposition 2 If $F_0$ and $F_1$ are continuous, the limiting process $W(\cdot) = [W_1(\cdot), W_0(\cdot)]$ possesses trajectories that are continuous and bounded with probability 1.
If, in addition, the cross-covariance matrix C(y, t) = E W(y) ⊗ W(t) is such that C(y, y) is a positive-definite matrix, for every real y, then the functionals: have absolutely continuous distribution on (0, +∞).
Proof See "Appendix". ◻

The paper is organized as follows: In Sect. 1.2, the problem is described, and basic preliminary results in the literature are provided in Sect. 1.3. Section 2 is devoted to the construction of a Wilcoxon-Mann-Whitney type test for the treatment effect, and in Sect. 3, a Kolmogorov-Smirnov type test for the same problem is considered. In Sect. 4, a test for stochastic dominance of the treatment is introduced and studied. The finite sample performance of the proposed methodologies is studied via Monte Carlo simulation in Sect. 5, where comparisons are made with other commonly used tests. An empirical application is presented in Sect. 6. Finally, Sect. 7 is devoted to conclusions.
Testing for the presence of a treatment effect: two (sub)sample Wilcoxon test

Wilcoxon-type statistic
In nonparametric statistics, a problem of considerable relevance consists in testing for the possible difference between two samples. Among several proposals, the two-sample Wilcoxon (or Wilcoxon-Mann-Whitney) test plays a central role in applications, mainly because of its properties. The goal of the present section is to propose a Wilcoxon-type statistic to test for the possible difference between the (sub)sample of treated subjects and the (sub)sample of untreated subjects. In other terms, we aim at developing a Wilcoxon-type statistic to test for the possible presence of a treatment effect.
From now on, we will assume $F_0$ and $F_1$ are both continuous. As in the classical Wilcoxon two-sample test, in order to measure the difference between the distributions of $Y_{(1)}$ and $Y_{(0)}$, we consider:

$$\theta_{01} = \int_{\mathbb{R}} F_0(y) \, dF_1(y). \qquad (12)$$

The parameter $\theta_{01}$ (12) possesses a natural interpretation, because it is equal to the probability that a treated subject possesses a y-value greater than the y-value for an independent, untreated subject. A few properties of $\theta_{01}$ are listed as follows:
(1) $\theta_{01}$ depends only on the marginal d.f.s $F_0$, $F_1$ (not on the way $Y_{(0)}$, $Y_{(1)}$ are associated in the same subject).
(2) If $F_0 = F_1$ then $\theta_{01} = 1/2$.
(3) Using $\theta_{01}$ is equivalent to using $\theta_{10} = \int_{\mathbb{R}} F_1(y) \, dF_0(y)$, as it is seen by an integration by parts.
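For completeness, the equivalence in property (3) can be spelled out. The short derivation below assumes only the continuity of $F_0$ and $F_1$ stated above, and writes $\theta_{01} = \int F_0 \, dF_1$ for the parameter (12) and $\theta_{10} = \int F_1 \, dF_0$ for its counterpart.

```latex
\begin{aligned}
\theta_{01} + \theta_{10}
  &= \int_{\mathbb{R}} F_0(y)\, dF_1(y) + \int_{\mathbb{R}} F_1(y)\, dF_0(y) \\
  &= \Big[ F_0(y)\, F_1(y) \Big]_{-\infty}^{+\infty} = 1 ,
\end{aligned}
\qquad \text{so that} \qquad \theta_{10} = 1 - \theta_{01} .
```

In particular, $\theta_{01} = 1/2$ if and only if $\theta_{10} = 1/2$, so a test built on either parameter addresses the same null hypothesis.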
The Wilcoxon-type statistic considered here is obtained in two steps, essentially by a plug-in approach.
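Step 1. Estimation of the marginal d.f.s $F_1$, $F_0$:

$$\hat{F}_{j,n}(y) = \sum_{i=1}^{n} w^{(j)}_{i,n} \, I_{(Y_i \le y)}, \quad j = 1, 0,$$

where $w^{(j)}_{i,n}$ are the weights (5).

Step 2. Estimation of $\theta_{01}$:

$$\hat{\theta}_{01,n} = \int_{\mathbb{R}} \hat{F}_{0,n}(y) \, d\hat{F}_{1,n}(y) = \sum_{i=1}^{n} \sum_{k=1}^{n} w^{(1)}_{i,n} \, w^{(0)}_{k,n} \, I_{(Y_k \le Y_i)}. \qquad (14)$$

Note that $w^{(1)}_{i,n} w^{(0)}_{k,n} \neq 0$ iff $T_i = 1$ and $T_k = 0$, i.e. iff $i$ is treated and $k$ is untreated. This shows that $\hat{\theta}_{01,n}$ is based on the comparison treated/untreated.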
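The two steps above can be sketched as a weighted version of the Mann-Whitney count. The code below is a minimal illustration, assuming that fitted propensity scores are already available; the function names and the toy data are hypothetical.

```python
def hajek_weights(t, p_hat, j):
    """Weights proportional to I(T_i = j) / p_j(x_i), normalised to sum to 1."""
    raw = []
    for ti, pi in zip(t, p_hat):
        pj = pi if j == 1 else 1.0 - pi
        raw.append(1.0 / pj if ti == j else 0.0)
    total = sum(raw)
    return [r / total for r in raw]


def wilcoxon_type_statistic(y_obs, t, p_hat):
    """Plug-in estimate of theta_01: weighted share of pairs with Y_k <= Y_i,
    where i runs over treated units and k over untreated units."""
    w1 = hajek_weights(t, p_hat, 1)
    w0 = hajek_weights(t, p_hat, 0)
    theta = 0.0
    for yi, w1i in zip(y_obs, w1):
        if w1i == 0.0:
            continue  # unit i is untreated: it contributes nothing
        for yk, w0k in zip(y_obs, w0):
            if w0k > 0.0 and yk <= yi:
                theta += w1i * w0k
    return theta


# Toy data: treated outcomes 3 and 1, untreated outcomes 2 and 4.
t = [1, 1, 0, 0]
p_hat = [0.5, 0.25, 0.5, 0.2]
y_obs = [3.0, 1.0, 2.0, 4.0]

theta_hat = wilcoxon_type_statistic(y_obs, t, p_hat)  # (1/3) * (8/13) = 8/39
theta_flat = wilcoxon_type_statistic(y_obs, t, [0.5] * 4)  # classical count: 1/4
```

With a constant propensity score all weights within a group are equal, and the statistic reduces to the usual Mann-Whitney proportion of (treated, untreated) pairs in which the treated outcome is the larger one.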
The limiting distribution of the statistic (14) is obtained as a consequence of Proposition 1.
Proposition 3 Assume that the conditions of Proposition 1 are fulfilled. Then, as $n \to \infty$, $\sqrt{n} \, ( \hat{\theta}_{01,n} - \theta_{01} )$ converges in distribution to a Normal distribution with mean 0 and variance $V$, where $V$ is the asymptotic variance (16).

Proof See "Appendix". ◻

Before closing the present section, a few remarks.

Remark 1 We notice in passing that $F_0 \equiv F_1$ implies $\theta_{01} = 1/2$, but the converse is false. In other words, $\theta_{01}$ could take the value 1/2 even when $F_0$ and $F_1$ do not coincide. As a consequence, and similarly to what happens in "usual" nonparametric statistics, the Wilcoxon-type test developed here is not consistent for all departures from $F_0 \equiv F_1$.
Remark 2 From a practical point of view, rejecting the null hypothesis $\theta_{01} = 1/2$ in favor of $\theta_{01} > 1/2$ means that the outcome for treated subjects tends to be larger than the outcome for untreated subjects. The higher $\theta_{01}$, the larger the gap, in terms of outcomes, of untreated subjects when compared to treated subjects. The opposite occurs when the null hypothesis $\theta_{01} = 1/2$ is rejected in favor of $\theta_{01} < 1/2$.
Remark 3 A referee asked whether it is possible to extend the Wilcoxon-type test to the case when the treatment assignment is endogenous, but there is a binary Instrumental Variable available, as in Hsu et al. (2020). In principle, Theorem 3.1 in Hsu et al. (2020) could be used in place of Proposition 1 of the present paper, and the technique of Proposition 3 still applies provided that the trajectories of the limiting process are continuous. For the sake of brevity, we do not pursue this topic here.

Variance estimation
The asymptotic variance $V$ appearing in (16) contains unknown terms, that can be consistently estimated on the basis of sample data. In particular, $\gamma_{01}(x)$ can be estimated by a regression on $x_i$, $i = 1, \ldots, n$, estimating the regression function via a method ensuring consistency (e.g. local polynomials, Nadaraya-Watson kernel regression, splines). The resulting estimator $\hat{\gamma}_{01,n}(x)$ is uniformly consistent on compact sets of $x$s under few regularity conditions. In the same way, $\gamma_{10}(x)$ can be consistently estimated by $\hat{\gamma}_{10,n}(x)$, say. As a consequence, the term $V_x(\gamma_{10}(x) - \gamma_{01}(x))$ can be estimated by:

Note that as an alternative estimator, one could consider:

Next, we have to estimate:

The term $E_x[p(x)^{-1} \gamma_{01}(x)^2]$ can be estimated with:

The term:

can be estimated by means of a nonparametric regression of:

with respect to $x_i$s. The resulting estimator $\hat{M}_{01,n}(x)$ is consistent under few conditions. In the same way, an estimator $\hat{M}_{10,n}(x)$ of:

is obtained. The asymptotic variance of $\hat{\theta}_{10,n}$ can be finally estimated by:

Testing the equality of $F_1$ and $F_0$ via Wilcoxon-type statistic

A test for the equality of $F_1$ and $F_0$ can be constructed via the statistic $\hat{\theta}_{01,n}$ (14).
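As already seen, when $F_1$ and $F_0$ coincide, $\theta_{01}$ is equal to 1/2. Hence, the idea is to construct a test for the hypotheses problem

$$H_0: \theta_{01} = 1/2 \quad \text{vs.} \quad H_1: \theta_{01} \neq 1/2 .$$

On the basis of Proposition 3, and the variance estimator (20), denoted here by $\hat{V}_n$, the region:

$$\left| \hat{\theta}_{01,n} - \frac{1}{2} \right| \le z_{\alpha/2} \sqrt{\hat{V}_n / n}$$

(where $z_{\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard Normal distribution) is an acceptance region of asymptotic significance level $\alpha$.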

Subsampling approach
As an alternative to variance estimation, one could approximate the quantiles of the distribution of $\hat{\theta}_{01,n}$ using the subsampling technique. Generally speaking, subsampling possesses several important properties (cfr. Politis and Romano 1994). First of all, its computational burden is frequently less heavy than that of the bootstrap, because replications are taken for subsamples of size $m < n$. Secondly, and more importantly, it is asymptotically first-order correct (namely, it recovers the asymptotic distribution of the statistic under consideration) without imposing extra regularity conditions, unlike the bootstrap (cfr. van der Vaart (1998), p. 333). Define $A_i = (X_i, T_i, Y_i)$, $i = 1, \ldots, n$, and consider all the $\binom{n}{m}$ subsamples of size $m$ of $(A_1, \ldots, A_n)$. The subsampling procedure, in the present case, can be described as follows:

1. Select $M$ independent subsamples of size $m$ from the sample of $(X_i, T_i, Y_i)$s, $i = 1, \ldots, n$.
2. Denote by $\hat{F}_{1,m;l}(y)$, $\hat{F}_{0,m;l}(y)$ the estimates of $F_1$, $F_0$, respectively, from subsample $l$, and let $\hat{\theta}_{01,m;l}$ be equal to the Wilcoxon statistic (14) for the $l$th subsample.
3. Compute the subsample statistics:
4. Compute the corresponding empirical d.f.:
5. Compute the corresponding quantile:
Assuming that $m/n \to 0$ as $n \to \infty$, and using Th. 2.1 in Politis and Romano (1994), we have:

where $\Phi$ denotes the standard Normal d.f. The convergence in (22) is uniform in $z$. Moreover, from the continuity and strict monotonicity of $\Phi$, it follows that the empirical quantile $\hat{R}^{-1}_{n,m}(p) = \inf\{z : \hat{R}_{n,m}(z) \ge p\}$ converges in probability to the quantile of order $p$ of the standard Normal distribution:

From the above results, the asymptotically exact approximation:

is obtained. As a consequence, the interval:

is a confidence interval for $\theta_{01}$ of asymptotic level $1 - \alpha$. Hence, the test consisting in rejecting $H_0$ whenever the interval (24) does not contain 1/2 possesses asymptotic significance level $\alpha$.
Before ending the present section, we remark that an alternative to subsampling is the multiplier method by Donald and Hsu (2014).From a theoretical point of view, subsampling does not require Assumption 3.1-1 and requires a weaker version of Assumption 3.3-2 in Donald and Hsu (2014).
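A subsampling interval of this kind can be sketched as follows. The code is an illustration, not the paper's implementation: it assumes the standard Politis-Romano form $\sqrt{m} \, (\hat{\theta}_{01,m;l} - \hat{\theta}_{01,n})$ for the subsample statistics, a discrete covariate so that the propensity score can be re-estimated in each subsample by within-stratum treated shares (the hypothetical helper `stratified_propensity`), and simulated data. The choice $m = n^{0.8}$ mirrors the one used in the simulation study.

```python
import math
import random


def stratified_propensity(x, t):
    """Share of treated units within each covariate value (discrete x assumed)."""
    counts = {}
    for xi, ti in zip(x, t):
        n_x, n_treated = counts.get(xi, (0, 0))
        counts[xi] = (n_x + 1, n_treated + ti)
    # Clipping keeps the inverse weights finite in small strata.
    return [min(0.99, max(0.01, counts[xi][1] / counts[xi][0])) for xi in x]


def theta_hat(x, t, y):
    """Wilcoxon-type statistic with Hajek weights and re-estimated propensity."""
    p = stratified_propensity(x, t)
    treated = [(yi, 1.0 / pi) for yi, ti, pi in zip(y, t, p) if ti == 1]
    control = [(yi, 1.0 / (1.0 - pi)) for yi, ti, pi in zip(y, t, p) if ti == 0]
    s1 = sum(w for _, w in treated)
    s0 = sum(w for _, w in control)
    total = sum(w1 * w0 for y1, w1 in treated for y0, w0 in control if y0 <= y1)
    return total / (s1 * s0)


def empirical_quantile(values, p):
    """inf{z : R(z) >= p} for the empirical d.f. R of `values`."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) - 1e-12))
    return ordered[k - 1]


def subsampling_interval(x, t, y, m, M, alpha, rng):
    """Full-sample estimate and subsampling confidence limits for theta_01."""
    n = len(y)
    full = theta_hat(x, t, y)
    stats = []
    while len(stats) < M:
        idx = rng.sample(range(n), m)      # subsample drawn without replacement
        ts = [t[i] for i in idx]
        if sum(ts) in (0, m):              # both groups are needed: redraw
            continue
        sub = theta_hat([x[i] for i in idx], ts, [y[i] for i in idx])
        stats.append(math.sqrt(m) * (sub - full))   # assumed centring/scaling
    lower = full - empirical_quantile(stats, 1.0 - alpha / 2.0) / math.sqrt(n)
    upper = full - empirical_quantile(stats, alpha / 2.0) / math.sqrt(n)
    return full, lower, upper


# Simulated data: binary confounder, propensity 0.3 / 0.7, shift effect of 3.
rng = random.Random(1)
n = 300
x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
t = [1 if rng.random() < (0.7 if xi == 1 else 0.3) else 0 for xi in x]
y = [70.0 + 10.0 * xi + 3.0 * ti + rng.gauss(0.0, 5.0) for xi, ti in zip(x, t)]
m = int(n ** 0.8)
theta, lower, upper = subsampling_interval(x, t, y, m=m, M=200, alpha=0.05, rng=rng)
reject_h0 = not (lower <= 0.5 <= upper)   # reject when 1/2 is outside the interval
```

Re-estimating the propensity score inside every subsample is what keeps the subsample statistic a faithful small-sample copy of the full-sample one; reusing the full-sample fit would understate its variability.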

Testing for the presence of a treatment effect: two (sub)sample Kolmogorov-Smirnov test
In this section, we deal with the construction of a Kolmogorov-Smirnov test of (asymptotic) size $\alpha$ for the hypotheses problem:

$$H_0: \Delta(y) = 0 \;\; \forall \, y \in \mathbb{R} \quad \text{vs.} \quad H_1: \Delta(y) \neq 0 \text{ for at least a point } y \in \mathbb{R}, \qquad (25)$$

where $\Delta(y) = F_1(y) - F_0(y)$. The main merit of this test, as it will be clear in the sequel, is that it is consistent for all alternatives, i.e. for all departures from $F_0 \equiv F_1$.

Similarly to what was done at the end of the above section, a simple idea to construct a test for the hypotheses problem (25) is to invert a confidence region for $\Delta(\cdot)$. The null hypothesis $H_0$ is rejected whenever the confidence region has empty intersection with $H_0$. More formally, the test procedure we consider here is defined as follows:
(i) Compute a confidence region for $\Delta(\cdot)$ of (at least asymptotic) level $1 - \alpha$.
(ii) Reject $H_0$ if the confidence region for $\Delta(\cdot)$ and $H_0$ are disjoint, i.e. if for at least a real $y$ the region does not contain the value zero.

Define:

$$\hat{\Delta}_n(y) = \hat{F}_{1,n}(y) - \hat{F}_{0,n}(y).$$

From Proposition 1, $\sqrt{n} \, ( \hat{\Delta}_n(\cdot) - \Delta(\cdot) )$ converges weakly to a Gaussian process that can be represented as $W_1(\cdot) - W_0(\cdot)$. Define next:

$$D = \sup_{y} | W_1(y) - W_0(y) |. \qquad (26)$$

Assuming that both $F_0$, $F_1$ are continuous d.f.s, from Proposition 2, it follows that $\sup_y \sqrt{n} \, | \hat{\Delta}_n(y) - \Delta(y) |$ converges in distribution to $D$ as $n \to \infty$. Moreover, again assuming the continuity of $F_0$, $F_1$, as a further consequence of Proposition 2, the r.v. $D$ (26) is absolutely continuous with strictly positive density. Hence, for every $0 < \alpha < 1$, there exists a unique $d_{1-\alpha}$ such that $P(D \le d_{1-\alpha}) = 1 - \alpha$. The quantile $d_{1-\alpha}$ can be estimated by the subsampling technique (cfr. Politis and Romano 1994). Define again $A_i = (X_i, T_i, Y_i)$, $i = 1, \ldots, n$, and consider all the $\binom{n}{m}$ subsamples of size $m$ of $(A_1, \ldots, A_n)$. Similarly to Sect. 2.4, the subsampling procedure, in the present case, can be described as follows:

1. Select $M$ independent subsamples of size $m$ from the sample of $(X_i, T_i, Y_i)$s, $i = 1, \ldots, n$.
2. Denote by $\hat{F}_{1,m;l}(y)$, $\hat{F}_{0,m;l}(y)$ the estimates of $F_1$, $F_0$, respectively, from subsample $l$, and let $\hat{\Delta}_{m;l}(y)$ be equal to $\hat{F}_{1,m;l}(y) - \hat{F}_{0,m;l}(y)$.
3. Compute the subsample statistics:
4. Compute the corresponding empirical d.f.:
5. Compute the corresponding quantile:
Under the same regularity conditions as in Sect. 2.4, it is easy to see that:

where the convergence in (29) is uniform in $d$. In addition, from the continuity and strict monotonicity of $P(D \le d)$, it follows that the empirical quantile $R^{-1}_{n,m}(p) = \inf\{d : R_{n,m}(d) \ge p\}$ converges in probability to the $p$th quantile of the distribution of $D$:

From the above results, the asymptotically exact approximation:

holds. Hence, the region:

is a confidence band of (asymptotic) level $1 - \alpha$ for $\Delta(\cdot)$. The null hypothesis $H_0$ is rejected whenever the confidence band (31) does not contain 0 for some real $y$. It is immediate to see that the constructed test has (asymptotic) size $\alpha$.
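The band-inversion test can be illustrated with a short sketch. It rests on assumptions that are not taken from the paper: the subsample statistics are taken to be $\sqrt{m} \, \sup_y | \hat{\Delta}_{m;l}(y) - \hat{\Delta}_n(y) |$, the band is taken to be symmetric, $\hat{\Delta}_n(y) \pm \hat{d} / \sqrt{n}$, and the propensity score is re-estimated by within-stratum treated shares for a discrete covariate. All function names and the simulated data are hypothetical.

```python
import math
import random


def stratified_propensity(x, t):
    """Share of treated units within each covariate value (discrete x assumed)."""
    counts = {}
    for xi, ti in zip(x, t):
        n_x, n_treated = counts.get(xi, (0, 0))
        counts[xi] = (n_x + 1, n_treated + ti)
    return [min(0.99, max(0.01, counts[xi][1] / counts[xi][0])) for xi in x]


def delta_hat(x, t, y, grid):
    """F1_hat(g) - F0_hat(g) for every g in `grid`, using Hajek weights."""
    p = stratified_propensity(x, t)
    w1 = [ti / pi for ti, pi in zip(t, p)]
    w0 = [(1 - ti) / (1.0 - pi) for ti, pi in zip(t, p)]
    s1, s0 = sum(w1), sum(w0)
    out = []
    for g in grid:
        f1 = sum(w for yi, w in zip(y, w1) if yi <= g) / s1
        f0 = sum(w for yi, w in zip(y, w0) if yi <= g) / s0
        out.append(f1 - f0)
    return out


def empirical_quantile(values, p):
    """inf{d : R(d) >= p} for the empirical d.f. R of `values`."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) - 1e-12))
    return ordered[k - 1]


def ks_subsampling_test(x, t, y, m, M, alpha, rng):
    """Two-sided Kolmogorov-Smirnov type test by inversion of a uniform band."""
    n = len(y)
    grid = sorted(y)        # the step functions jump only at observed outcomes
    full = delta_hat(x, t, y, grid)
    stats = []
    while len(stats) < M:
        idx = rng.sample(range(n), m)      # subsample drawn without replacement
        ts = [t[i] for i in idx]
        if sum(ts) in (0, m):              # both groups are needed: redraw
            continue
        sub = delta_hat([x[i] for i in idx], ts, [y[i] for i in idx], grid)
        stats.append(math.sqrt(m) * max(abs(a - b) for a, b in zip(sub, full)))
    d_hat = empirical_quantile(stats, 1.0 - alpha)
    half_width = d_hat / math.sqrt(n)
    # The band excludes zero at some y iff sup_y |Delta_hat(y)| > half_width.
    reject = max(abs(v) for v in full) > half_width
    return reject, d_hat, half_width


# Simulated data with a large shift effect, so that H0 is clearly false.
rng = random.Random(2)
n = 200
x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
t = [1 if rng.random() < (0.7 if xi == 1 else 0.3) else 0 for xi in x]
y = [70.0 + 10.0 * xi + 20.0 * ti + rng.gauss(0.0, 5.0) for xi, ti in zip(x, t)]
reject, d_hat, half_width = ks_subsampling_test(
    x, t, y, m=int(n ** 0.8), M=100, alpha=0.05, rng=rng)
```

Evaluating the supremum on the sorted full-sample outcomes is exact here, since every estimated d.f. involved is a right-continuous step function whose jumps occur only at observed outcomes.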

The problem
In evaluating the effect of a treatment, it is sometimes of interest to test whether the treatment itself has an effect on the whole distribution function of $Y$, i.e. whether the treatment improves the behavior of the whole d.f. of $Y$. Various forms of stochastic dominance are discussed in McFadden (1989) and Anderson (1996). In particular, in the present section, we will focus on testing for first-order stochastic dominance. The d.f. $F_1$ first-order stochastically dominates $F_0$ if $F_1(y) \le F_0(y)$ $\forall \, y \in \mathbb{R}$. Our main goal is to construct a test for the (unidirectional) hypotheses:

$$H_0: \Delta(y) \le 0 \;\; \forall \, y \in \mathbb{R} \quad \text{vs.} \quad H_1: \Delta(y) > 0 \text{ for at least a point } y \in \mathbb{R},$$

where $\Delta(y) = F_1(y) - F_0(y)$. In econometrics and statistics, there is an extensive literature on testing for stochastic dominance, since the papers by Anderson (1996) and Davidson and Duclos (2000). In Linton et al. (2005), a Kolmogorov-Smirnov type test is proposed, together with a method to construct critical values based on subsampling. For further bibliographic references, and a deep analysis of contributions to testing for stochastic dominance, cfr. the recent paper by Donald and Hsu (2016).
In the present paper, we confine ourselves to a simple, intuitive procedure to test for unidirectional dominance.

Approach based on Kolmogorov-Smirnov statistic
A simple idea to construct a test for the hypotheses problem of Sect. 4.1 is to invert a confidence region for $\Delta(\cdot)$. The null hypothesis $H_0$ is rejected whenever the confidence region has empty intersection with $H_0$. More formally, the test procedure we consider here is defined as follows:
(i) Compute a confidence region for $\Delta(\cdot)$ of (at least asymptotic) level $1 - \alpha$;
(ii) Reject $H_0$ if the confidence region for $\Delta(\cdot)$ and $H_0$ are disjoint, that is if for at least a real $y$ the region has lower bound greater than zero.
From now on, we will assume that both $F_0$, $F_1$ are continuous d.f.s. Using the arguments in Sect. 3, it is possible to see that the r.v. $\sup_y ( W_1(y) - W_0(y) )$ has absolutely continuous distribution, with $P( \sup_y ( W_1(y) - W_0(y) ) \ge 0 ) = 1$. Hence, there exists a single $d_{1-\alpha}$ such that $P( \sup_y ( W_1(y) - W_0(y) ) \le d_{1-\alpha} ) = 1 - \alpha$. The quantile $d_{1-\alpha}$ can be estimated by subsampling, as outlined in Sect. 3. Define:

A subsampling procedure to estimate $d_{1-\alpha}$ is described as follows:

1. Select $M$ independent subsamples of size $m$ from the sample of $(X_i, T_i, Y_i)$s, $i = 1, \ldots, n$.
2. Compute the subsample statistics:
3. Compute the corresponding empirical d.f.:
4. Compute the corresponding quantile:

The arguments in Sect. 3 show that:

Hence, the asymptotically exact approximation

holds. As a consequence, the region:

is a confidence region for $\Delta(\cdot)$ with asymptotic level $1 - \alpha$. The null hypothesis $H_0$ is rejected whenever:

The main feature of the test developed here is that it is computationally simpler than the test(s) proposed in Donald and Hsu (2014). Its relative merits will be evaluated by simulation in Sect. 5.

Approach based on Wilcoxon statistic
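As remarked by a referee, the unidirectional Wilcoxon-type test proposed in Sect. 2 may be used to construct a simplified test for stochastic dominance, easier to implement if compared to that of Sect. 4.2. More precisely, if $F_1(y) \le F_0(y)$ $\forall \, y \in \mathbb{R}$, then $\theta_{01} \ge 1/2$, so that the stochastic dominance problem may be transformed into:

$$H_0: \theta_{01} \ge 1/2 \quad \text{vs.} \quad H_1: \theta_{01} < 1/2 .$$

Using the same reasoning of Sect. 2.4, a rejection region of asymptotic significance level $\alpha$ is as follows:

$$\hat{\theta}_{01,n} - \frac{1}{2} < - z_{\alpha} \sqrt{\hat{V}_n / n},$$

$\hat{V}_n$ denoting the variance estimator (20), and $z_{\alpha}$ being the $(1 - \alpha)$ quantile of the standard Normal distribution.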
Alternatively, we may resort to the subsampling approach of Sect. 2.4. In this case, the idea is to construct a unidirectional confidence region for $\theta_{01}$, and to reject $H_0$ whenever such a region is within the interval $[0, 1/2)$. With the usual symbols, at an asymptotic significance level $\alpha$, the stochastic dominance hypothesis is rejected whenever:

$M = 1000$ subsamples of size $m = n^{0.8}$ have been drawn by simple random sampling from each of the $N = 1000$ original samples.
The exact distribution function of $Y_{(j)}$ is as follows:

The d.f. $F_j$ (35), and the corresponding density functions $f_j$, are depicted in Fig. 1. The propensity score, in this case, is as follows:

This is clearly due to the confounding effect of $X$, and makes it difficult to detect the absence of treatment effect.
In scenario IV (presence of treatment effect), the potential outcome $Y_{(0)}$ is specified as in (34) with $j = 0$. The potential outcome $Y_{(1)}$ is specified as:

where $X$ has a Bernoulli distribution $X \sim Be(0.5)$ and $U_0$, $U_1$ have a Uniform distribution $U_1 \sim U[-10; 10]$. The r.v.s $X$, $U_0$, $U_1$ are mutually independent.
The exact distribution function of $Y_{(1)}$ is reported as follows:

and depicted in Fig. 2.
The propensity score is as follows:

so that $E[Y | T = 0] = 77.5$ and $E[Y | T = 1] = 77.5$ even if ATE $\neq 0$. As in scenario I, this is due to the confounding effect of $X$ that makes it difficult to detect a treatment effect through a naive analysis. Scenario III is similar to scenario IV, but with $E[Y_{(1)}] = 76$. Since the shift of $F_1$ w.r.t. $F_0$ in scenario IV is higher than in scenario III, detecting treatment effect in scenario III is more difficult than in scenario IV.
In scenario II, the treatment effect is due to a shape difference of $F_1$ w.r.t. $F_0$. Again, this makes it difficult to detect a treatment effect through the ATE.
Scenario V is generated as scenario IV with a shape effect in addition to the shift effect.
As an overall comment, in scenarios II-V ($H_0$ false), the propensity score is chosen to compensate the effect of shape and shift, giving rise to an apparent absence of treatment effect. In scenarios III and IV, the treatment effect is due to a shift of $F_1$ w.r.t. $F_0$, so that the ATE is non-null. In scenario II, detecting treatment effect is difficult, because it is only due to a difference in shape of $F_1$ w.r.t. $F_0$, with ATE = 0. Finally, scenario V mixes together shift and shape in the treatment effect.
Table 2 summarizes the rejection probabilities of the null hypothesis for different scenarios and sample sizes.
The results show that the Wilcoxon-type test and the Kolmogorov-Smirnov test are better than the test based on estimated ATE, in terms of both actual significance level (scenario I) and power (scenarios II-V).Wilcoxon-type test with quantiles estimated by subsampling seems to offer the best performance in terms of power, although its actual significance level seems to be slightly worse than in the case of estimated variance.Among the others, the test based on the estimator of ATE proposed in Hirano et al. (2003) and the conditional randomization tests in Branson et al. (2019) and in Rosenbaum (1984) do not exhibit performances as good as Wilcoxon-type test with quantiles estimated via subsampling.As an overall remark, Wilcoxon test seems to offer good performance in terms of both simplicity and power.
As far as the test for stochastic dominance is concerned, the test procedures of Sects. 4.2, 4.3 have been studied under scenarios I-III, and for sample sizes 50, 100, 200, 400, together with the test of stochastic dominance proposed by Donald and Hsu (2014). The corresponding rejection probabilities are shown in Table 3. Even if all tests do have an actual significance level larger than the nominal level 5%, the Wilcoxon-type test exhibits rejection rates under $H_0$ slightly better than other tests, especially for a sample size $n \le 200$. When the null hypothesis $H_0$ of stochastic dominance is false, all tests perform similarly for a sample size $n \ge 200$. However, for sample sizes $n = 50, 100$, the Wilcoxon-type test has slightly better rejection rates.

Empirical study
In the present section, the test of stochastic dominance developed in Sect. 4 is applied to data from National Supported Work Demonstration (NSW) job training program described in LaLonde (1986) and analyzed by Dehejia and Wahba (1999), Wooldridge (2001).The data set we use corresponds to the subsample termed "RE74 subset" in Dehejia and Wahba (1999).The treatment variable T is equal to 1 if the individual participates in the job training.The outcome variable is "Earnings in 1978".RE74 subset contains an experimental sample from a randomized evaluation of the NSW program, in which 185 individuals received the treatment and 260 did not.
As in Donald and Hsu (2014), our tests have been applied for the whole group to the RE74 subset, because the treatment is randomly assigned in this subset, which implies that the distribution functions of $Y_{(0)}$, $Y_{(1)}$ for the whole group are the same as the distribution functions for the treated group. As in Donald and Hsu (2014), the

Proof of Proposition 2 Let $Q_j(u) = F_j^{-1}(u) = \inf\{y : F_j(y) \ge u\}$, $j = 1, 0$. Then, $W_j(\cdot)$ possesses continuous trajectories almost surely if $B_j(u) = W_j(Q_j(u))$ possesses continuous trajectories almost surely. From the proof of Proposition 1, it is not difficult to see that the inequality

holds, $c$ being an appropriate constant. Hence, we may write

The continuity of the trajectories of $B_j(\cdot)$ follows from (38) and formula (6) in Leadbetter and Weissner (1969).
As far as boundedness is concerned, from the structure of the covariance kernel of $W(\cdot)$, it is now seen that

from which the almost sure boundedness of the trajectories of $W_j(\cdot)$ follows.
Assume now that the cross-covariance matrix $C(y, t) = E[W(y) \otimes W(t)]$ is such that $C(y, y)$ is positive-definite for every real $y$. Under this condition, it is possible to show (Lifshits 1982) that the functional

can only have an atom at the point

On the other hand, $V(W_j(y)) = 0$ only when $y \to \pm\infty$, and, from Th. 8.1 in Dudley (1973)

and hence

where $W_{j,n}(y) = \sqrt{n} \, ( \hat{F}_{j,n}(y) - F_j(y) )$, $j = 1, 0$. Now, if $F_0(y)$, $F_1(y)$ are continuous, the limiting process $W = [W_1, W_0]'$ possesses trajectories that are continuous (and bounded) with probability 1, so that it is concentrated on $C(\mathbb{R})^2$, that is separable and complete if equipped with the sup-norm. Using then the Skorokhod Representation Theorem (cfr. Billingsley 1999, p. 70), there exist processes $\tilde{W}_n = [\tilde{W}_{1,n}, \tilde{W}_{0,n}]'$, $n \ge 1$, and $\tilde{W} = [\tilde{W}_1, \tilde{W}_0]'$, defined on a probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$ such that

and

where the symbol $\overset{d}{=}$ denotes equality in distribution. From (40) and (39), the relationship

follows.

Table 3
Rejection probabilities (nominal significance level 0.95) n = 50 n = 100 n = 200 n = 400 n = 500

In the three cases, the hypothesis that the 1978 real earning under job training stochastically dominates the 1978 real earning without job training is accepted. The p values approximated by 1000 repetitions are equal to 1. The results are robust to different specifications of the propensity score. The results are coherent with Donald and Hsu (2014).