## Abstract

This paper examines the widespread practice where data envelopment analysis (DEA) efficiency estimates are regressed on some environmental variables in a second-stage analysis. In the literature, only two statistical models have been proposed in which second-stage regressions are well-defined and meaningful. In the model considered by Simar and Wilson (J Econ 136:31–64, 2007), truncated regression provides consistent estimation in the second stage, whereas in the model proposed by Banker and Natarajan (Oper Res 56:48–58, 2008a), ordinary least squares (OLS) provides consistent estimation. This paper examines, compares, and contrasts the very different assumptions underlying these two models, and makes clear that second-stage OLS estimation is consistent only under very peculiar and unusual assumptions on the data-generating process that limit its applicability. In addition, we show that in either case, bootstrap methods provide the only feasible means for inference in the second stage. We also comment on ad hoc specifications of second-stage regression equations that ignore the part of the data-generating process that yields data used to obtain the initial DEA estimates.

## Notes

One could perhaps assume that the joint density of input-output vectors includes a probability mass along the frontier, but given the bias of the DEA frontier estimator and the resulting mass of observations for which the corresponding DEA efficiency estimate will equal unity, it is difficult to imagine how such a model could be identified from the model in Kneip et al. (2008). In addition, the properties of DEA estimators in such a model are unknown.

In the model considered by SW, inefficiency explicitly depends on the environmental variables which may account for heteroskedasticity in the inefficiency process. SW did not consider heteroskedasticity in the error term of the second stage regression, but this could be modeled using standard techniques; i.e., \(\sigma_{\varepsilon}^2\) appearing in Assumption A3 of SW could be parameterized in terms of additional covariates. See also Park et al. (2008).

The Meghalaya plateau in northeastern India is considered to be one of the rainiest places on earth (Murata et al. 2007).

On p. 50, in the fourth through seventh lines after equation (2), it is stated that

The contextual variables are measured such that the weights \(\beta_s,s=1,\ldots,\,S,\) are all nonnegative—i.e., the higher the value of the contextual variables, the higher is the inefficiency of the DMU.

This is false due to the structure in (8) and the independence of ***Z*** and *U*.

BN write (18) as \(\log\widehat{\widetilde{\theta}}=\widetilde{\beta}_0-\user2{Z}\widetilde{\varvec{\beta}}+\widetilde{\delta}\) in their equation (11), but substitution of the right-hand side of (17) for \(\widetilde{\theta}\) on the left-hand side of (16) does not change the parameters on the right-hand side of (16). Equation (17) appears as equation (A3) in BN2, where it is noted that η ≥ 0.

In addition, if *V*^{M} is not constant, it is equally unclear what is estimated in the first stage.

Erhemjamts and Leverty (2010) is not alone in taking statements in BN uncritically and without question. Both McDonald (2009, p. 797) and Ramalho et al. (2010, Sect. 2, eighth paragraph) state that the DGP proposed by BN is less restrictive than that considered by SW, without mentioning the various restrictions required by the BN model. This issue is revisited below in Sect. 5.

In the statement of their Proposition 1, BN correctly define ***Q*** as Plim(*n*^{−1}***Z***′***Z***), but in equation (A4) of the proof appearing in BN2, ***Q*** is implicitly defined as *n*^{−1}***Z***′***Z***. We use the definition ***Q*** = Plim(*n*^{−1}***Z***′***Z***) in all that follows.

In their proof appearing in BN2, BN ignore the role of the intercept β_{0}. Consequently, their expression for the variance of their OLS estimator would be wrong even if the rest of their derivations were correct, which they are not. In addition, in their Monte Carlo experiments, BN considered only the case where *p* = *q* = 1 with VRS, and consequently did not notice the errors in their proof of their Proposition 1.

Most, if not all, of the papers that have used OLS to regress DEA efficiency scores on environmental variables while citing BN for justification have numbers of dimensions greater than three in their first-stage estimation. To give just a few examples, Cummins et al. (2010) use *p* + *q* = 8 or 9; Banker et al. (2010a) use *p* + *q* = 6; Banker et al. (2010b) use *p* + *q* = 5. Each of these relies on the usual OLS standard error estimate to make inference in the second-stage regressions, and consequently the inference in these papers is invalid.

## References

Aly HY, Grabowski R, Pasurka C, Rangan N (1990) Technical, scale, and allocative efficiencies in US banking: an empirical investigation. Rev Econ Stat 72:211–218

Banker RD, Cao Z, Menon N, Natarajan R (2010a) Technological progress and productivity growth in the US mobile telecommunications industry. Ann Oper Res 173:77–87

Banker RD, Lee SY, Potter G, Srinivasan D (2010b) The impact of supervisory monitoring on high-end retail sales productivity. Ann Oper Res 173:25–37

Banker RD, Morey RC (1986) Efficiency analysis for exogenously fixed inputs and outputs. Oper Res 34:513–521

Banker RD, Natarajan R (2008a) Evaluating contextual variables affecting productivity using data envelopment analysis. Oper Res 56:48–58

Banker RD, Natarajan R (2008b) Online companion for “evaluating contextual variables affecting productivity using data envelopment analysis”—appendix: proofs of consistency of the second stage estimation. Oper Res online supplement, 1–6. Available at http://or.journal.informs.org/cgi/data/opre.1070.0460/DC1/1

Barkhi R, Kao YC (2010) Evaluating decision making performance in the GDSS environment using data envelopment analysis. Decis Support Syst 49:162–174

Bădin L, Daraio C, Simar L (2010) Optimal bandwidth selection for conditional efficiency measures: a data-driven approach. Eur J Oper Res 201:633–664

Chang H, Chang WJ, Das S, Li SH (2004) Health care regulation and the operating efficiency of hospitals: evidence from Taiwan. J Account Public Policy 23:483–510

Chang H, Choy JL, Cooper WW, Lin MH (2008) The Sarbanes-Oxley act and the production efficiency of public accounting firms in supplying accounting auditing and consulting services: an application of data envelopment analysis. Int J Serv Sci 1:3–20

Cummins JD, Weiss MA, Xie X, Zi H (2010) Economies of scope in financial services: a DEA efficiency analysis of the US insurance industry. J Banking Finance 34:1525–1539

Daraio C, Simar L (2005) Introducing environmental variables in nonparametric frontier models: a probabilistic approach. J Prod Anal 24:93–121

Daraio C, Simar L (2006) A robust nonparametric approach to evaluate and explain the performance of mutual funds. Eur J Oper Res 175:516–542

Daraio C, Simar L, Wilson PW (2010) Testing whether two-stage estimation is meaningful in non-parametric models of production. Discussion paper #1031. Institut de Statistique, Université Catholique de Louvain, Louvain-la-Neuve, Belgium

Davutyan N, Demir M, Polat S (2010) Assessing the efficiency of Turkish secondary education: heterogeneity, centralization, and scale diseconomies. Socio-Econ Plan Sci 44:3–44

Erhemjamts O, Leverty JT (2010) The demise of the mutual organizational form: an investigation of the life insurance industry. J Money Credit Banking 42:1011–1036

Farrell MJ (1957) The measurement of productive efficiency. J Royal Stat Soc A 120:253–281

Gstach D (1998) Another approach to data envelopment analysis in noisy environments. J Prod Anal 9:161–176

Hoff A (2007) Second stage DEA: comparison of approaches for modelling the DEA score. Eur J Oper Res 181:425–435

Jeong SO, Park BU, Simar L (2010) Nonparametric conditional efficiency measures: asymptotic properties. Ann Oper Res 173:105–122

Jondrow J, Lovell CAK, Materov IS, Schmidt P (1982) On the estimation of technical inefficiency in the stochastic frontier production model. J Econ 19:233–238

Kneip A, Park B, Simar L (1998) A note on the convergence of nonparametric DEA efficiency measures. Econ Theory 14:783–793

Kneip A, Simar L, Wilson PW (2008) Asymptotics and consistent bootstraps for DEA estimators in non-parametric frontier models. Econ Theory 24:1663–1697

Kneip A, Simar L, Wilson PW (2011a) Central limit theorems for DEA scores: when bias can kill the variance. Discussion paper, Institut de Statistique Biostatistique et Sciences Actuarielles, Université Catholique de Louvain, Louvain-la-Neuve, Belgium

Kneip A, Simar L, Wilson PW (2011b) A computationally efficient, consistent bootstrap for inference with non-parametric DEA estimators. Comput Econ. (Forthcoming)

Korostelev A, Simar L, Tsybakov AB (1995a) Efficient estimation of monotone boundaries. Ann Stat 23:476–489

Korostelev A, Simar L, Tsybakov AB (1995b) On estimation of monotone and convex boundaries. Publications de l’Institut de Statistique de l’Université de Paris XXXIX 1:3–18

McDonald J (2009) Using least squares and tobit in second stage DEA efficiency analyses. Eur J Oper Res 197:792–798

Murata F, Hayashi T, Matsumoto J, Asada H (2007) Rainfall on the Meghalaya plateau in northeastern India—one of the rainiest places in the world. Nat Hazards 42:391–399

Park BU, Jeong S-O, Simar L (2010) Asymptotic distribution of conical-hull estimators of directional edges. Ann Stat 38:1320–1340

Park BU, Simar L, Zelenyuk V (2008) Local likelihood estimation of truncated regression and its partial derivative: theory and application. J Econ 146:185–198

Ramalho EA, Ramalho JJS, Henriques PD (2010) Fractional regression models for second stage DEA efficiency analyses. J Prod Anal 34:239–255

Shephard RW (1970) Theory of cost and production functions. Princeton, Princeton University Press

Simar L, Wilson PW (2000) Statistical inference in nonparametric frontier models: the state of the art. J Prod Anal 13:49–78

Simar L, Wilson PW (2007) Estimation and inference in two-stage, semi-parametric models of productive efficiency. J Econ 136:31–64

Simar L, Wilson PW (2010) Estimation and inference in cross-sectional, stochastic frontier models. Econ Rev 29:62–98

Simar L, Wilson PW (2011) Inference by the *m* out of *n* bootstrap in nonparametric frontier models. J Prod Anal. (Forthcoming)

Sufian F, Habibullah MS (2009) Asian financial crisis and the evolution of Korean banks efficiency: a DEA approach. Glob Econ Rev 38:335–369

## Acknowledgments

Financial support from the “Inter-university Attraction Pole”, Phase VI (No. P6/03) from the Belgian Government (Belgian Science Policy) and from l'Institut National de la Recherche Agronomique (INRA) and Le Groupe de Recherche en Economie Mathématique et Quantitative (GREMAQ), Toulouse School of Economics, Toulouse, France is gratefully acknowledged. Part of this research was done while Wilson was a visiting professor at the Institut de Statistique Biostatistique et Sciences Actuarielles, Université Catholique de Louvain, Louvain-la-Neuve, Belgium. We have benefited from discussions with Valentin Zelenyuk; of course, any remaining errors are solely our responsibility.

## Appendix: OLS estimation in BN’s second stage

The first stage estimation in BN’s approach provides an estimator \(\widehat{\widetilde{\theta}}_i\le1\) of \(\widetilde{\theta}_i\) for \(i=1, \ldots,\,n\) where *i* indexes observations. The properties of DEA estimators have been developed by Korostelev et al. (1995a, b), Kneip et al. (1998), Kneip et al. (2008, 2011b), Park et al. (2010) and Simar and Wilson (2011), and depend on assumptions about returns to scale. In particular, if variable returns to scale (VRS) are assumed, then the DEA estimator converges at rate *n*^{2/(*p*+*q*+1)}, which is slower than the usual parametric rate *n*^{1/2} for *p* + *q* > 3. BN ignore this in the proof (appearing in BN2) of their Proposition 1, and this leads to important errors and false statements.
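The comparison of rates can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names are hypothetical and chosen only for illustration, and the CRS exponent 2/(*p* + *q*) is the one stated later in this appendix.

```python
from fractions import Fraction


def dea_rate_exponent(p: int, q: int, returns_to_scale: str = "VRS") -> Fraction:
    """Return gamma, where the DEA estimator converges at rate n**gamma.

    VRS: gamma = 2 / (p + q + 1); CRS: gamma = 2 / (p + q).
    """
    if returns_to_scale == "VRS":
        return Fraction(2, p + q + 1)
    if returns_to_scale == "CRS":
        return Fraction(2, p + q)
    raise ValueError("returns_to_scale must be 'VRS' or 'CRS'")


def slower_than_root_n(p: int, q: int, returns_to_scale: str = "VRS") -> bool:
    """True when n**gamma is slower than the parametric rate n**(1/2)."""
    return dea_rate_exponent(p, q, returns_to_scale) < Fraction(1, 2)


if __name__ == "__main__":
    # Illustrative dimensions only; q is fixed at 1 for simplicity.
    for p, q in [(1, 1), (2, 1), (3, 1), (5, 1), (8, 1)]:
        gamma = dea_rate_exponent(p, q)
        print(f"p+q={p + q}: gamma={gamma}, "
              f"slower than root-n: {slower_than_root_n(p, q)}")
```

Exact fractions are used so that the boundary case *p* + *q* = 3 under VRS, where the exponent equals 1/2, is not affected by floating-point rounding.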

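BN suggest re-writing (17) as

\[\log\widetilde{\theta}=\log\widehat{\widetilde{\theta}}-\eta\]

and using the right-hand side of this to replace \(\log\widetilde{\theta}\) in (16) to obtain (18). Then the error term \(\widetilde{\delta}\) appearing in (18) is equal to δ + η. BN propose estimating (18) by OLS, and claim in their proof of their Proposition 1 that

\[\sqrt{n}\left(\widehat{\varvec{\beta}}-\varvec{\beta}\right)\mathop{\longrightarrow}\limits^{d}N\left(0,\sigma^2\user2{Q}^{-1}\right) \tag{22}\]

where ***Q*** = Plim(*n*^{−1}***Z***′***Z***).^{Footnote 8} As shown below, these claims are false.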

Recall that η_{i} ≥ 0 for all \(i=1, \ldots,\,n,\) with *i* indexing the sample observations. Simar and Wilson (2011) and Kneip et al. (2011a) prove, under mild regularity conditions,

\[n^{\gamma}\eta_i\mathop{\longrightarrow}\limits^{d}G(\cdot)\]

where \(G(\cdot)\) is an unknown, non-degenerate distribution with mean μ_{0} > 0 and variance σ^{2}_{0} > 0 (both finite and unknown), and γ = 2/(*p* + *q* + 1) for the VRS case (or γ = 2/(*p* + *q*) for the constant returns to scale (CRS) case). In addition, as shown in Kneip et al. (2008, 2011a, b), the covariance between η_{i} and η_{j} is asymptotically non-zero for a number of observations \(j=1,\ldots,\,n, j\ne i,\) which is of order *O*(*n*^{γ}). To summarize, as \(n\to\infty,\)

and

for some bounded but unknown constant α.

Recall that the error term \(\widetilde{\delta}\) in (18), i.e., the equation that BN estimate by OLS, equals δ + η as shown above. Consequently, the properties of η play an important role in determining the properties of the OLS estimator \(\widehat{\varvec{\beta}}\) of \(\varvec{\beta}.\) Let \({\varvec{\fancyscript{Z}}}\) be an *n* × (*r* + 1) matrix with *i*th row given by \(\left[\begin{array}{ll} 1&-\user2{Z}_i\end{array}\right], \) and let \({\varvec{\fancyscript{Y}}=\left[\begin{array}{lll} \log\widehat{\widetilde\theta}_1&\cdots& \log\widehat{\widetilde\theta}_n\end{array}\right]^{\prime}.}\) In addition, let \(\varvec{\beta}^*=\left[\begin{array}{ll}\beta_0&\varvec{\beta}^{\prime}\end{array}\right]^{\prime}\) and \(\widehat{\varvec{\beta}}^*=\left[\begin{array}{ll}\widehat{\beta}_0&\widehat{\varvec{\beta}}^{\prime}\end{array}\right]^{\prime}. \) Then OLS estimation on (18) yields

\[\widehat{\varvec{\beta}}^*=\left({\varvec{\fancyscript{Z}}}^{\prime}{\varvec{\fancyscript{Z}}}\right)^{-1}{\varvec{\fancyscript{Z}}}^{\prime}{\varvec{\fancyscript{Y}}}=\varvec{\beta}^*+\left({\varvec{\fancyscript{Z}}}^{\prime}{\varvec{\fancyscript{Z}}}\right)^{-1}{\varvec{\fancyscript{Z}}}^{\prime}\widetilde{\varvec{\delta}}\]

where \(\widetilde{\varvec{\delta}}= \left[\begin{array}{lll}\widetilde{\delta}_1&\ldots&\widetilde{\delta}_n\end{array}\right]^{\prime}. \) Taking expectations,

as \(n\to\infty, \) where *c*_{1} is a non-zero, bounded constant, due to the result in (24) and since (by BN’s assumptions) \({E(\varvec{\eta}\mid\varvec{\fancyscript{Z}})=E(\varvec{\eta}), E(\varvec{\delta}\mid\varvec{\fancyscript{Z}})=0,}\) and where \(\varvec{\delta}=\left[\begin{array}{lll}\delta_1&\ldots&\delta_n\end{array}\right]^{\prime}\) and \(\varvec{\eta}=\left[\begin{array}{lll}\eta_1&\ldots&\eta_n\end{array}\right]^{\prime}.\)
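The effect of a non-negative first-stage error η on second-stage OLS can be illustrated with a small simulation. The sketch below is not the DGP of BN or of this paper; it rests on several simplifying assumptions made only for illustration: a single environmental variable, δ drawn as *N*(0, σ²), and η_{i} drawn independently as *n*^{−γ} times an exponential variate with mean μ_{0} (which ignores the correlation among the η_{i} discussed above). All function and parameter names are hypothetical.

```python
import random


def ols_simple(x, y):
    """OLS of y on a constant and a single regressor x: (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_ols_bias(n, gamma, beta0=-0.5, beta=0.3, sigma=0.1, mu0=1.0,
                  reps=200, seed=0):
    """Average (intercept bias, slope bias) over `reps` simulated samples."""
    rng = random.Random(seed)
    scale = n ** (-gamma)            # assumed order of eta, mirroring n**(-gamma)
    total_b0 = total_b = 0.0
    for _ in range(reps):
        # The regressor is -Z, matching the row [1, -Z_i] of the design matrix.
        x = [-rng.random() for _ in range(n)]
        # y = beta0 - Z*beta + delta + eta, with eta >= 0 (assumed exponential).
        y = [beta0 + beta * xi + rng.gauss(0.0, sigma)
             + scale * rng.expovariate(1.0 / mu0) for xi in x]
        b0_hat, b_hat = ols_simple(x, y)
        total_b0 += b0_hat
        total_b += b_hat
    return total_b0 / reps - beta0, total_b / reps - beta


if __name__ == "__main__":
    gamma = 2 / 5                    # VRS with p + q = 4
    for n in (100, 1000, 4000):
        b0_bias, b_bias = mean_ols_bias(n, gamma, reps=100)
        print(f"n={n}: intercept bias={b0_bias:.4f}, "
              f"n**-gamma={n ** -gamma:.4f}, slope bias={b_bias:.4f}")
```

Under these assumptions the average intercept bias tracks *n*^{−γ}μ_{0} while the slope bias stays near zero, because η has the same mean for every observation and is independent of the regressor. With dependence among the η_{i}, as in the actual DEA setting, the sampling variability would differ from what this sketch produces.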

From the last line in (28) it is clear that as \(n\to\infty, \)

which is rather different from what is claimed in the proof of Proposition 1 of BN (as noted earlier, BN claim that (19) holds). Recall that γ = 2/(*p* + *q* + 1) for the VRS case. As shown below, the left-hand side of (29) converges to a non-degenerate random variable with constant variance for *p* + *q* ≤ 3, and to a random variable with variance approaching infinity as \(n\to\infty\) for *p* + *q* > 3. In the CRS case, γ = 2/(*p* + *q*), and hence \(\sqrt{n}\left(\widehat{\varvec{\beta}}^*-\varvec{\beta}^*\right)\) converges to a non-degenerate random variable with constant variance for *p* + *q* ≤ 4, and to a random variable with variance approaching infinity as \(n\to\infty\) for *p* + *q* > 4. For *p* + *q* = 3 in the VRS case and for *p* + *q* = 4 in the CRS case, the left-hand side of (29) converges to a random variable with constant variance, but which is not normally distributed as shown below. In their Monte Carlo experiments, BN considered only the case where *p* = *q* = 1 with VRS, and consequently did not notice the errors in their proof of their Proposition 1.

Combining the results in (24–26), and using standard central-limit theorem arguments (see Kneip et al. 2011a for mathematical details), we have

where \(\zeta_n\) is a random variable such that \(\sqrt{n}\zeta_n=o_p(1)\) if γ > 1/2 or \(\sqrt{n}\zeta_n=O_p\left(n^{1/2-\gamma}\right)\) otherwise, and σ^{2} = VAR(δ) = VAR(*V*) + VAR(*U*) (as in BN).^{Footnote 9} The result in (30) is very different from (22), which is the result claimed at the end of the proof appearing in BN2 of BN’s Proposition 1. Although the OLS estimator \(\widehat{\varvec{\beta}}^*\) of \(\varvec{\beta}^*\) is consistent, (22) cannot be used for valid (asymptotic) inference. Moreover, even if γ = 1/2, (30) contains unknown constants and an unknown, bounded random variable. The left-hand side of (30) does not converge to anything that is bounded if γ < 1/2. Bootstrap methods appear to provide the only feasible avenue toward valid inference or hypothesis testing in the second-stage regression.^{Footnote 10}
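The factor *n*^{1/2−γ} that governs the behavior of \(\sqrt{n}\zeta_n\) can be tabulated directly. The following minimal sketch (hypothetical function name) evaluates it for a given sample size and number of dimensions; it only illustrates the order of magnitude and says nothing about the unknown constants involved.

```python
def root_n_scaled_order(n: float, p_plus_q: int,
                        returns_to_scale: str = "VRS") -> float:
    """Return n**(1/2 - gamma), the order of sqrt(n) times an O(n**-gamma) term."""
    if returns_to_scale == "VRS":
        gamma = 2.0 / (p_plus_q + 1)
    elif returns_to_scale == "CRS":
        gamma = 2.0 / p_plus_q
    else:
        raise ValueError("returns_to_scale must be 'VRS' or 'CRS'")
    return n ** (0.5 - gamma)


if __name__ == "__main__":
    for dims in (2, 3, 5, 8):
        row = ", ".join(f"n={n:g}: {root_n_scaled_order(n, dims):.3f}"
                        for n in (1e2, 1e4, 1e6))
        print(f"VRS, p+q={dims}: {row}")
```

For γ > 1/2 the factor shrinks as *n* grows, for γ = 1/2 it equals one for every *n*, and for γ < 1/2 it grows without bound, which matches the three cases distinguished in the text.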

The preceding discussion also illustrates how the numerous restrictive assumptions imposed on the BN model are crucial for consistency of OLS estimation in the second-stage regression. For example, if ***Z*** and *U*, which determines inefficiency, are correlated, then the error terms δ and \(\widetilde{\delta}\) must be correlated with ***Z***, in which case OLS estimation in (18) would yield inconsistent estimates. As another example, if *V*^{M}, the bound on the noise process, is not constant, then OLS estimation may be problematic. If \(V^M=\overline{V^M}+\zeta,\) where \(\overline{V^M}\) is constant and \(\zeta\) is random with \(E(\zeta)=0,\) then β_{0} can be written as \(\beta_0=E(V-U)-\overline{V^M}, \) but δ would have to be written as \(\delta=V-U-E(V-U)-\zeta.\) If \(E(\zeta)\ne0,\) then OLS estimation of β_{0} will be biased and inconsistent. Worse, regardless of whether \(E(\zeta)=0,\) if \(\zeta\) is not independent of ***Z***, then OLS estimation in (18) would yield inconsistent estimates of both β_{0} and \(\varvec{\beta}. \) If the environmental variables are related to the size of firms, and if the error bounds vary with firm size, then ***Z*** and \(\zeta\) would clearly be correlated; this is likely to be the case in some applications.
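The inconsistency that arises when \(\zeta\) depends on the environmental variable can be seen in a stylized simulation. The sketch below makes strong assumptions purely for illustration: a single environmental variable *Z* uniform on (0, 1), \(\zeta = a(Z - 0.5)\) so that \(E(\zeta)=0\) while \(\zeta\) is a function of *Z*, normally distributed noise in place of *V* − *U* − *E*(*V* − *U*), and no first-stage estimation error. The function name and parameter values are hypothetical.

```python
import random


def second_stage_with_varying_bound(n, a, beta0=-0.5, beta=0.3,
                                    sigma=0.1, seed=0):
    """OLS (intercept, slope) when the error is delta - zeta, zeta = a*(Z - 0.5)."""
    rng = random.Random(seed)
    z = [rng.random() for _ in range(n)]
    # Left-hand side: beta0 - Z*beta + noise - zeta, with zeta tied to Z.
    y = [beta0 - beta * zi + rng.gauss(0.0, sigma) - a * (zi - 0.5) for zi in z]
    x = [-zi for zi in z]                     # regressor enters as -Z
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


if __name__ == "__main__":
    for a in (0.0, 0.2):
        b0_hat, b_hat = second_stage_with_varying_bound(5000, a)
        print(f"a={a}: intercept={b0_hat:.3f} (true -0.5), "
              f"slope={b_hat:.3f} (true 0.3)")
```

In this construction the OLS slope settles near β + *a* and the intercept near β_{0} + *a*/2 rather than at the true values, and increasing *n* does not remove the discrepancy.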

Even more troubling is the assumption that *V*^{M} is finite, which implies that the noise term *V* is symmetrically truncated at −*V*^{M} and *V*^{M}. Suppose, for example, that *V* ∼ *N*(0, σ^{2}_{V}), and suppose the researcher has a sample of *n* iid draws \(\{V_1,\ldots,\,V_n\}\) from the *N*(0, σ^{2}_{V}) distribution. Of course, one can easily find the sample maximum, and the maximum value in a normal sample of finite size will certainly be less than infinity. But, it is necessarily difficult, and maybe impossible, to test whether the distribution is truncated at a finite value. In situations in econometrics where truncated regression is used, the truncation typically arises from features of the sampling mechanism (e.g., survey design) or model structure (e.g., in SW, truncation arises from the fact that inefficiency has a one-sided distribution; it would make little sense to assume otherwise). Imposing finite bounds on a two-sided noise process, however, is a far more uncertain prospect.
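The difficulty of detecting truncation from a finite sample can be made concrete with a short simulation. The sketch below draws one sample from an untruncated normal distribution and one from a normal distribution truncated symmetrically at ±*V*^{M}; the value *V*^{M} = 4σ_{V} and the function names are assumptions chosen only for illustration.

```python
import math
import random


def normal_sample_max(n, sigma=1.0, seed=0):
    """Largest of n iid N(0, sigma^2) draws; always a finite number."""
    rng = random.Random(seed)
    return max(rng.gauss(0.0, sigma) for _ in range(n))


def truncated_normal_sample(n, bound, sigma=1.0, seed=0):
    """n draws from N(0, sigma^2) truncated to [-bound, bound] by rejection."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        v = rng.gauss(0.0, sigma)
        if abs(v) <= bound:          # keep only draws inside the bounds
            draws.append(v)
    return draws


if __name__ == "__main__":
    n, v_max = 1000, 4.0
    print("untruncated sample maximum:", normal_sample_max(n))
    print("truncated sample maximum:  ", max(truncated_normal_sample(n, v_max)))
```

Both sample maxima are finite, and when the bound lies far in the tails the two samples are typically indistinguishable, since very few untruncated draws would fall outside the bounds in a sample of this size.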

If *V*^{M} is infinite, then the first-stage estimation using DEA estimators is inconsistent. From (13), it is clear that if *V*^{M} is infinite, then \(\widetilde{\phi}(X)\) must be infinite. Re-arranging terms in (15) indicates that \(\widetilde{\theta}=Y/\widetilde{\phi}(X)\) for the case of a univariate output considered by BN; hence if *V*^{M} is infinite, then \(\widetilde{\theta}\) is undefined, in which case BN’s second-stage regression is an ill-posed problem without meaning.

## About this article

### Cite this article

Simar, L., Wilson, P.W. Two-stage DEA: *caveat emptor*. *J Prod Anal* **36**, 205–218 (2011). https://doi.org/10.1007/s11123-011-0230-6

DOI: https://doi.org/10.1007/s11123-011-0230-6