Properties of least squares estimator in estimation of average treatment effects
SERIEs (2023) 14:301-313

Treatment effects are often estimated by the least squares estimator controlling for some covariates. This paper investigates its properties. When the propensity score is constant, the least squares estimator is a consistent estimator of the average treatment effects if it is viewed as a semiparametric partially linear regression estimator, but it is not necessarily more efficient than the simple difference-of-means estimator. If it is literally viewed as a least squares estimator with a finite number of controls, it is equal to a weighted average of conditional average treatment effects with potentially negative weights, although the negative weight issue does not exist under the semiparametric interpretation. It is shown that the negative weight issue can be avoided by the use of a logit specification.


Introduction
The semiparametric efficiency bound for the average treatment effects in the standard case (where the treatment is randomly assigned conditional on some covariates) is well understood. The efficient estimator takes various forms, 1 but none takes the form of the least squares regression of the dependent variable on the binary treatment controlling for some linear function of covariates. Even then, the latter specification is commonly employed in the literature. It would therefore make sense to investigate the properties of the least squares estimator that controls for the covariates.
This paper makes contributions in this regard by examining the semiparametric efficiency properties of the least squares estimator. For this purpose, the least squares specification is given a semiparametric interpretation, where the researcher is assumed to have made a "nonparametric promise" to control for the covariates more and more flexibly as the sample size increases to infinity. Under such a promise, it can be shown that the least squares estimator can be interpreted as an estimator of some weighted average of the treatment effects. When the propensity score is constant, it can be shown that the least squares estimator consistently estimates the average treatment effects (ATE), as long as the nonparametric promise is kept. It is shown that even with such a nonparametric promise, the least squares estimator does not necessarily have an obvious advantage over the naive difference-of-means estimator from an efficiency perspective. Variants of these results in parametric frameworks have been known in the literature, but the current paper makes a contribution by deriving the results in an explicitly nonparametric framework. The paper also discusses the interpretation of the least squares estimator when the nonparametric promise is not kept. It is well known that the least squares estimator can be interpreted as a weighted average of the treatment effects, and the literature has recently begun to pay attention to the fact that the weights can be negative. The negative weight is due to the implicit linear probability specification of the treatment indicator on the covariates. It is shown that the problem can be eliminated by using a logit specification. The interpretation is related to a recent discussion by Blandhol et al. (2022, Proposition 1).
Throughout the paper, 2 we adopt the assumption that the treatments are independent of potential outcomes given covariates. To be more specific, we consider the model where (Y(0), Y(1)) is independent of the binary treatment indicator D given the covariates X. We do not impose the constant treatment effects assumption, under which Y(1) − Y(0) is a fixed constant. In this model, it is convenient to write

Y(0) = μ_0(X) + σ_0(X)u_0,  Y(1) = μ_1(X) + σ_1(X)u_1,

where the u_j have mean equal to 0 and variance equal to 1 conditional on X. We can then write the observed outcome as

Y = DY(1) + (1 − D)Y(0) = α(X) + β(X)D + Dσ_1(X)u_1 + (1 − D)σ_0(X)u_0,  (1)

where α(X) ≡ μ_0(X) and β(X) ≡ μ_1(X) − μ_0(X). We assume that the propensity score π(X) ≡ E[D|X] as well as α(X), β(X), σ_0(X), σ_1(X) are nonparametrically specified. We will also assume that (Y_i(0), Y_i(1), D_i, X_i), i = 1, 2, ..., are independent and identically distributed (IID), and that the researcher observes (Y_i, D_i, X_i), i = 1, ..., n.
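As a concrete illustration, the model above can be simulated as follows. The particular choices of μ_0, μ_1, σ_0, σ_1, and the (constant) propensity score below are hypothetical and purely illustrative; with a constant propensity score the difference of means recovers the ATE, here equal to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical choices of mu_0, mu_1, sigma_0, sigma_1, and pi for illustration.
X = rng.uniform(-1, 1, n)
mu0 = X                      # alpha(X) = mu_0(X) = X
mu1 = 1 + 2 * X              # so beta(X) = mu_1(X) - mu_0(X) = 1 + X
pi = 0.5                     # constant propensity score

D = rng.binomial(1, pi, n)
u0, u1 = rng.standard_normal(n), rng.standard_normal(n)
Y0 = mu0 + 0.5 * u0          # sigma_0(X) = 0.5
Y1 = mu1 + 1.0 * u1          # sigma_1(X) = 1.0
Y = D * Y1 + (1 - D) * Y0    # observed outcome

# With constant pi, beta_ATE = E[beta(X)] = E[1 + X] = 1.
print(Y[D == 1].mean() - Y[D == 0].mean())
```

The printed difference of means should be close to the ATE of 1 for a sample of this size.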

Interpretation of partially linear regression specification
We will consider the interpretation of the partially linear regression of Y on D using X as the control variable. To be more precise, we consider computing the estimate of β_LS by fitting a semiparametric model

Y = Dβ_LS + g(X) + ε,  (2)

where g(X) is nonparametrically specified. Let β̂_LS denote the estimated coefficient of D in such a regression. 3 We first examine the pseudo-parameter β_LS that β̂_LS estimates, i.e., its probability limit. It is straightforward to recognize that the pseudo-parameter is the population regression coefficient of Y on D − π(X), where π(X) is the propensity score. We present an interpretation of β_LS as a weighted average of the β(X), i.e., the average treatment effects (ATE) conditional on X. 4 Angrist (1998) derived such a representation for the case where X has a multinomial distribution, and the representation below is a nonparametric generalization when X has an arbitrary distribution:

Proposition 1 β_LS = E[π(X)(1 − π(X))β(X)] / E[π(X)(1 − π(X))].  (3)

The interpretation (3) is from the semiparametric perspective, based on the "nonparametric promise" that the covariates will be controlled for by a richer and richer specification as a function of the sample size. Note that (3) is a weighted average of β(X). Because π(X) denotes the true conditional probability of D given X, the weight is always nonnegative. Therefore, the negative weight problem discussed in Blandhol et al. (2022, Section 4) does not exist under the nonparametric specification/interpretation of g in (2).
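To see Proposition 1 at work, consider a hypothetical two-point distribution for X, for which both sides of (3) can be computed exactly; the numbers are invented for illustration. The example also shows that β_LS generally differs from the ATE once π(X) varies with X, even though all weights are nonnegative.

```python
import numpy as np

# Hypothetical two-point example: all quantities are exact, no simulation needed.
pX   = np.array([0.5, 0.5])    # P(X = x_k)
piX  = np.array([0.2, 0.5])    # propensity score pi(x_k)
beta = np.array([1.0, 3.0])    # conditional ATE beta(x_k)

w = piX * (1 - piX)            # weights pi(x)(1 - pi(x)) from (3), nonnegative
beta_LS  = np.sum(pX * w * beta) / np.sum(pX * w)
beta_ATE = np.sum(pX * beta)

# beta_LS overweights the support point whose pi(x) is closer to 1/2.
print(beta_LS, beta_ATE)
```

Here β_LS = 0.455/0.205 ≈ 2.22 while β_ATE = 2, illustrating that (3) is a weighted, not simple, average of β(X).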
Suppose that a practitioner does not make or keep such a "nonparametric promise," and that he/she adopts a literally parametric approach where g(X) is linear in X. We can show that the probability limit of β̂_LS is now equal to

E[(D − X′γ)Y] / E[(D − X′γ)D],  (4)

where X′γ is the linear projection of D on X. If the true α(X) is linear in X, we can show that the estimand (4) simplifies to

E[π(X)(1 − X′γ)β(X)] / E[π(X)(1 − X′γ)],  (5)

which implies that it is a weighted average of the conditional expectation β(X) of the treatment effects given X. Because X′γ can lie outside of the (0, 1) range, it raises the possibility of negative weights. See, e.g., Blandhol et al. (2022, Section 4). This results from the partitioned regression interpretation of multiple regression, and the implicit linear probability specification there: the estimate of the coefficient of D in the regression of Y on D and X is equal to the estimate when Y is regressed on the residual from the regression of D on X, and because the regression of D on X uses the linear specification, the fitted value can lie outside the (0, 1) interval, thereby leading to the possibility of negative weights in (5).
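The possibility of X′γ leaving (0, 1) is easy to exhibit numerically. In the hypothetical discrete example below (invented values), the linear projection of D on (1, X) produces a fitted value above 1 at one support point, so the corresponding weight π(x)(1 − X′γ) in (5) is negative.

```python
import numpy as np

# Hypothetical discrete example: X in {0, 1, 2}, equally likely.
x   = np.array([0.0, 1.0, 2.0])
pX  = np.array([1 / 3, 1 / 3, 1 / 3])
piX = np.array([0.10, 0.90, 0.95])    # true propensity scores pi(x)

# Linear projection of D on (1, X): a weighted least squares fit of pi(x) on (1, x).
A = np.column_stack([np.ones_like(x), x])
W = np.diag(pX)
gamma = np.linalg.solve(A.T @ W @ A, A.T @ W @ piX)
fitted = A @ gamma                    # X'gamma, the implicit linear-probability fit

weights = piX * (1 - fitted)          # weights pi(x)(1 - X'gamma) appearing in (5)
print(fitted)                         # fitted value at x = 2 exceeds 1
print(weights)                        # the corresponding weight is negative
```

The fitted values are approximately (0.225, 0.65, 1.075); since the last one exceeds 1, the last weight is negative, exactly the phenomenon discussed by Blandhol et al. (2022, Section 4).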
One way to avoid this problem is to use a logit specification instead of the linear probability specification. Suppose that we adopt a logit specification where Pr[D = 1|X] is specified as Λ(X′δ), where Λ(t) = e^t / (1 + e^t). If we use Λ(X′δ̂) as the fitted value (instead of X′γ̂, which would be used if the linear probability model were adopted) and regress Y on D − Λ(X′δ̂), we would get the estimator

Σ_{i=1}^n (D_i − Λ(X_i′δ̂))Y_i / Σ_{i=1}^n (D_i − Λ(X_i′δ̂))D_i,

which converges in probability to

E[(D − Λ(X′δ))Y] / E[(D − Λ(X′δ))D],

where δ denotes the probability limit of δ̂. Lee (2018) also considers using a nonlinear parametric specification for the propensity score, although he proposes a different estimator. Lee (2018) shows that his estimand can be interpreted to be a weighted average involving p(X), where p(X) denotes the probability limit of the parametric specification of E[Y|X], but the current paper makes a contribution by providing an interpretation of the estimand as a weighted average of the conditional treatment effects β(X). 5 For this purpose, write

(D − Λ(X′δ))Y = (D − Λ(X′δ))α(X) + (D − Λ(X′δ))Dβ(X) + (D − Λ(X′δ))Dσ_1(X)u_1 + (D − Λ(X′δ))(1 − D)σ_0(X)u_0.

By conditional independence, we conclude that the third term above has expectation equal to zero. The last term is also equal to zero in expectation by the same reasoning. We recall that the logit MLE implies the first-order condition (FOC) such that 6

Σ_{i=1}^n (D_i − Λ(X_i′δ̂))X_i = 0.  (6)

It follows that Σ_{i=1}^n (D_i − Λ(X_i′δ̂))g(X_i) = 0 for any linear function g(X) of X. In particular, it will be satisfied for α(X) if it were indeed linear, which implies that E[(D − Λ(X′δ))α(X)] = 0 as long as α(X) is linear in X. We would also have E[(D − Λ(X′δ))α(X)] = 0 if Λ(X′δ) is a correct specification of the true propensity score. Therefore, we can see that the pseudo-parameter that the new estimator estimates is

E[π(X)(1 − Λ(X′δ))β(X)] / E[π(X)(1 − Λ(X′δ))],  (7)

which will retain the "positive weight" feature, as long as α(X) is literally linear (or the propensity score indeed has a logit specification). 7 Because the "weight" in (7) does not necessarily add up to 1, this problem can be eliminated by considering instead an IV estimator, where the weights are nonnegative and add up to 1. On a related note, (Blandhol et al.
2022, Proposition 1) discussed the interpretation of 2SLS without any "nonparametric promise," under the assumption that the relevant conditional expectations are linear in X. It was shown that the pseudo-parameter estimated by 2SLS can be decomposed into two terms. The first term is a weighted average of the conditional treatment effects among compliers, and the weight ω(CP, X) can be negative. The second term is a weighted average of the conditional treatment effects among always takers, and the weight ω(AT, X) may not be equal to zero. Their analysis is based on the equivalence that the 2SLS estimator is equal to the IV estimator using the residual from the regression of the binary instrument Z on X, which implicitly adopts a linear probability model specification. If the nonparametric promise is made and kept, this is not an issue, but without such a promise, it would lead to the phenomenon that the fitted value from the regression of Z on X can exceed 1, which leads to the possibility of a negative weight ω(CP, X). This issue can be resolved by using a logit specification of Z on X, and replacing the residualized instrument in their equation (1) by Z − Λ(X′δ̂). 8 As for the nonzero ω(AT, X), it can be attributed to two reasons: (1) the nonparametric promise is not kept; or (2) the implicit linear probability specification is incorrect.
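The logit residualization above can be sketched as follows: fit the logit MLE of D on (1, X) by Newton–Raphson, and then regress Y on D − Λ(X′δ̂) in the ratio form given earlier. The DGP is hypothetical (a correct logit propensity score and a linear α(X)), so the "positive weight" interpretation (7) applies; the estimand is then a π(1 − Λ)-weighted average of β(X) = 1 + X, which is close to but not equal to the ATE of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP: the propensity score is exactly logit, and alpha(X) is linear.
X = rng.uniform(-1, 1, n)
A = np.column_stack([np.ones(n), X])
delta_true = np.array([0.3, 1.0])
pi = 1 / (1 + np.exp(-A @ delta_true))
D = rng.binomial(1, pi)
beta_X = 1 + X                                 # conditional ATE beta(X)
Y = X + beta_X * D + rng.standard_normal(n)    # alpha(X) = X, unit error variance

# Logit MLE via Newton-Raphson; the FOC is sum_i (D_i - Lambda(X_i'delta)) X_i = 0.
delta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-A @ delta))
    grad = A.T @ (D - p)                       # score
    hess = A.T @ (A * (p * (1 - p))[:, None])  # (negative) Hessian
    delta += np.linalg.solve(hess, grad)

resid = D - 1 / (1 + np.exp(-A @ delta))       # D - Lambda(X'delta_hat)
beta_hat = np.sum(resid * Y) / np.sum(resid * D)
print(beta_hat)
```

By the FOC, the residual is orthogonal to every linear function of X in the sample, which is what kills the α(X) term in the decomposition above.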

Efficiency of semiparametric regression adjustments in randomized experiments
In this section, we examine the semiparametric efficiency properties of β̂_LS in the semiparametric specification (2). In particular, we consider the case where the propensity score π(·) is constant, and try to understand the efficiency properties of the regression adjustments from a semiparametric perspective. Freedman (2008a, b) considered a comparison of a parametric version of β̂_LS with the difference-in-means estimator. More precisely, he considered the case where g(X) in (2) is specified as a linear function of X, and concluded that an efficiency ranking is impossible, adopting an asymptotic framework where the finite population with size n changes as a function of n. 9 Lin (2013) adopted an identical framework, and investigated how efficiency improvement is possible, 10 which is confirmed by Negi and Wooldridge (2021) under an IID framework.
Negi and Wooldridge (2021) also use the linear parametric specification of g(X), so the result in this section can be understood to be a fully semiparametric generalization of the earlier results in a familiar IID asymptotic framework.
In order to understand the efficiency properties of β̂_LS, it is useful to derive its asymptotic variance. Below is a characterization of the asymptotic properties of β̂_LS, obtained under the nonparametric specification of α(X), β(X), σ_0(X), σ_1(X), and π(X).

Proposition 2 The asymptotic variance of β̂_LS is

E[π(X)(1 − π(X)){(1 − π(X))^3 + π(X)^3}(β(X) − β_LS)^2 + π(X)(1 − π(X))^2 σ_1^2(X) + π(X)^2(1 − π(X))σ_0^2(X)] / (E[π(X)(1 − π(X))])^2.
8 Because X and Z − Λ(X′δ̂) are orthogonal due to the first-order condition of the logit MLE as in (6), their proof of Proposition 2 goes through without any modification. The only change is that the negative ω(CP, X) phenomenon disappears.
9 Therefore, in his framework, n is the size of the population.
10 He does so by using a parametric variant of the semiparametric efficient estimator developed in Hahn (1998).
So far, we allowed for the possibility that π(X) is not constant, which meant that β_LS may not be the ATE. Now, let us assume that π(X) = π. If so, we have β_LS = E[β(X)] = β_ATE by (3). Specializing the asymptotic variance formula in Proposition 2 to this situation, we find that the asymptotic variance of β̂_LS now simplifies to

{(1 − π)^3 + π^3} / {π(1 − π)} · E[(β(X) − β_ATE)^2] + E[σ_1^2(X)]/π + E[σ_0^2(X)]/(1 − π).  (8)

Relative to the efficiency bound for the ATE in Hahn (1998) specialized to the case where π(X) is constant, i.e.,

E[(β(X) − β_ATE)^2] + E[σ_1^2(X)]/π + E[σ_0^2(X)]/(1 − π),  (9)

it can be shown that (8) is greater than or equal to (9) in general. In other words, β̂_LS is not semiparametrically efficient, even under the special assumption of a constant propensity score. 11 We now consider the difference-in-means estimator for the same case where π(X) is constant and equal to π. It is trivial to show that its asymptotic variance is 12

Var[μ_1(X)]/π + Var[μ_0(X)]/(1 − π) + E[σ_1^2(X)]/π + E[σ_0^2(X)]/(1 − π).  (10)

Again making the same comparison with the efficiency bound (9), we can show that (10) is greater than or equal to (9) in general, so the difference-in-means estimator is not semiparametrically efficient either. 13 The efficiency ranking between the partially linear projection estimator and the difference-in-means estimator boils down to the comparison of (8) and (10). Because neither estimator achieves the semiparametric efficiency bound even in the case where the propensity score is constant, we can make an educated guess that an efficiency ranking between these two estimators is impossible in general. In the appendix, we present two numerical examples to confirm this educated guess, confirming the Freedman (2008a, b) critique in the nonparametric framework.
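The comparison of (8) with (9) turns on the elementary algebraic fact that (1 − π)^3 + π^3 ≥ π(1 − π) for all π in (0, 1), with equality only at π = 1/2, so the factor multiplying E[(β(X) − β_ATE)^2] in (8) is at least 1. A quick numerical check of this fact on a grid:

```python
# Check that ((1 - pi)^3 + pi^3) / (pi (1 - pi)) >= 1 on a fine grid over (0, 1),
# with equality exactly at pi = 1/2.
factors = {}
for k in range(1, 100):
    pi = k / 100
    factors[pi] = ((1 - pi) ** 3 + pi ** 3) / (pi * (1 - pi))
print(min(factors.values()), factors[0.5])
```

The minimum over the grid is 1, attained only at π = 0.5, which anticipates Proposition 3 below.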
On the other hand, we can find a reasonably interesting situation where the partially linear regression specification does lead to efficiency:

Proposition 3 Suppose that the propensity score is constant and equal to 1/2. Then the asymptotic variance (8) of β̂_LS is equal to the efficiency bound (9).

We can see that the asymptotic variance of the partially linear regression achieves the efficiency bound when π = 1/2. Therefore, if the treatment is assigned by flipping a fair coin, it would be sensible to adopt the partially linear regression model specification. The special role of π = 1/2 was discussed by Freedman (2008b) and Lin (2013), as well as by Negi and Wooldridge (2021), for the case where g(X) is given a linear parametric specification. As noted above, Freedman (2008b) and Lin (2013) adopt an asymptotic framework characterized by a sequence of finite populations, which makes their setting different from that of Negi and Wooldridge (2021). Therefore, Proposition 3 can also be understood to be a fully semiparametric generalization of the earlier results in an IID asymptotic framework.
along with the law of iterated expectations, we can see that the estimand is E[π(X)(1 − π(X))β(X)] / E[π(X)(1 − π(X))], i.e., (3).

Proof of (4) and (5) We have

β̂_LS →p E[(D − X′γ)Y] / E[(D − X′γ)D],

which is (4). If α(X) is linear in X, we have E[(D − X′γ)α(X)] = 0, because the linear projection residual D − X′γ is orthogonal to every linear function of X. Furthermore, we have

E[(D − X′γ)(Dσ_1(X)u_1 + (1 − D)σ_0(X)u_0)] = 0

by conditional independence. Therefore, we have

E[(D − X′γ)Y] = E[(D − X′γ)Dβ(X)] = E[π(X)(1 − X′γ)β(X)],

and similarly E[(D − X′γ)D] = E[π(X)(1 − X′γ)], from which (5) follows.
If α(X) is not linear in X, we have

E[(D − X′γ)α(X)] = E[(π(X) − X′γ)α(X)],

which is not equal to zero in general. Therefore, the interpretation in (5) is incorrect in general without the linearity of α(X), and the estimand (4) differs from the weighted average in (5) in general.

Proof of Proposition 2
Using a result in Newey and Robins (2018), we find that the influence function for β̂_LS is equal to

(D − π(X))ε / E[π(X)(1 − π(X))],

where ε = (D − π(X))(β(X) − β_LS) + Dσ_1(X)u_1 + (1 − D)σ_0(X)u_0, as in (11). We have

(D − π(X))ε = (D − π(X))^2(β(X) − β_LS) + (D − π(X))Dσ_1(X)u_1 + (D − π(X))(1 − D)σ_0(X)u_0.

The three terms on the right have mean zero, and their cross products also have mean zero. Moreover, we have

E[(D − π(X))^4 | X] = π(X)(1 − π(X)){(1 − π(X))^3 + π(X)^3},
E[(D − π(X))^2 D | X] = π(X)(1 − π(X))^2,
E[(D − π(X))^2(1 − D) | X] = π(X)^2(1 − π(X)),

which means that the asymptotic variance is equal to

E[π(X)(1 − π(X)){(1 − π(X))^3 + π(X)^3}(β(X) − β_LS)^2 + π(X)(1 − π(X))^2 σ_1^2(X) + π(X)^2(1 − π(X))σ_0^2(X)] / (E[π(X)(1 − π(X))])^2,

from which the conclusion follows.
Proposition 4 Suppose that the propensity score is constant. Then β̂_LS is not semiparametrically efficient.

Proof We show that (8) is greater than or equal to (9). We have that

π^3 + (1 − π)^3 ≥ (π^2 + (1 − π)^2)^2 ≥ (1/2)^2 ≥ π(1 − π)

by Cauchy–Schwarz, so the asymptotic variance (8) from the partially linear regression is larger than the asymptotic variance bound (9) in general.
Proposition 5 Suppose that the propensity score is constant. Then the difference-in-means estimator is not semiparametrically efficient.

Footnote 15 continued: "model." It is a special case of the partially linear "projection" considered by Newey and Robins (2018). Because the resultant estimators are identical, we continue to call β̂_LS an estimator for the partially linear regression model, reflecting the familiarity of the terminology. In other words, we may adopt a practical point of view, and interpret the underlying "model" to be Y = Dβ_LS + g(X) + ε, where ε is defined in (11) and g(X) = π(X)(β(X) − β_LS) + α(X).
Proof We show that (10) is greater than or equal to (9). When π(X) is constant and equal to π, we have

(10) − (9) = Var[μ_1(X)]/π + Var[μ_0(X)]/(1 − π) − E[(β(X) − β_ATE)^2] ≥ 0,

because E[(β(X) − β_ATE)^2] = Var[μ_1(X) − μ_0(X)] ≤ (Var[μ_1(X)]^{1/2} + Var[μ_0(X)]^{1/2})^2 ≤ Var[μ_1(X)]/π + Var[μ_0(X)]/(1 − π).

Proof of Proposition 3 If π = 1/2, we see that {(1 − π)^3 + π^3}/{π(1 − π)} = (1/4)/(1/4) = 1, so (8) becomes

E[(β(X) − β_ATE)^2] + E[σ_1^2(X)]/π + E[σ_0^2(X)]/(1 − π),

i.e., the efficiency bound (9).

B Numerical comparison for Sect. 3
From (8) and (10), we can see that the efficiency ranking between the two estimators boils down to a comparison between

{(1 − p)^3 + p^3}/{p(1 − p)} · Var[β(X)]

and

Var[μ_0(X)]/(1 − p) + Var[μ_1(X)]/p.
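A minimal numerical sketch of this comparison (with invented variance configurations, not the paper's two appendix examples) shows that either quantity can be the larger one, so no general ranking holds.

```python
# Compare the two terms that distinguish (8) from (10), for hypothetical values
# of Var[mu_0(X)], Var[mu_1(X)], Var[beta(X)], and the constant propensity p.
def terms(var_mu0, var_mu1, var_beta, p):
    a = ((1 - p) ** 3 + p ** 3) / (p * (1 - p)) * var_beta   # term from (8)
    b = var_mu0 / (1 - p) + var_mu1 / p                      # term from (10)
    return a, b

# mu_0 = mu_1: beta(X) is constant, so the partially linear term vanishes (a < b).
print(terms(var_mu0=1.0, var_mu1=1.0, var_beta=0.0, p=0.1))

# mu_1 = -mu_0 = X, so Var[beta(X)] = 4 Var[X]; with p far from 1/2, a > b.
print(terms(var_mu0=1.0, var_mu1=1.0, var_beta=4.0, p=0.1))
```

In the first configuration the difference-in-means estimator is the less efficient one; in the second the ordering reverses, consistent with the impossibility of a general efficiency ranking.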