1 Introduction

Frontier analysis is concerned with the measurement of efficiency relative to an estimated production or cost frontier. The presence of noise in the sample is potentially problematic in two ways: it affects the position of each decision making unit (DMU) relative to the frontier, and it affects the precision of the estimates of the shape of the frontier itself. The magnitudes of these effects vary from one method to another: deterministic methods, such as data envelopment analysis (DEA) (Charnes et al. 1978) and corrected ordinary least squares (COLS), are particularly sensitive, given that they make no allowance for noise. In contrast, stochastic frontier (SF) analysis explicitly controls for noise, mitigating its impact on the estimated frontier and on individual efficiency scores.

The range of efficiency scores may still be very large in the presence of data with outlying observations. The empirical motivation for this paper is the authors’ finding of an implausibly wide range of efficiency scores in our work studying cost drivers and cost efficiency in the highways maintenance operations of local authorities in England, which utilises detailed data on operating and capital expenditure provided by each authority. In that application, the problem can be traced to a combination of under-reporting and over-reporting of expenditure, unobserved investment cycle effects, and extreme weather events. However, we have also encountered this issue in a number of other datasets, across numerous sectors including hotels in Taiwan (Chen 2007), container ports (Cullinane et al. 2006), regional iron and steel production in China (Lin and Wang 2014), crop farms in Poland (Latruffe et al. 2004), French insurance companies (Fecher et al. 1993), Belgian municipalities (De Borger and Kerstens 1996), US banks (Bauer et al. 1998), and police forces in England and Wales (Drake and Simper 2003). In general, as observed by Gupta and Nguyen (2010), financial data are often heavy tailed. For example, Berger and Humphrey (1991) observe a heavy tailed distribution of costs in banking, an industry to which efficiency analysis is often applied. As such, the issue addressed in this paper has broad applicability across the performance literature.

In this paper we outline and discuss the merits and empirical application of a new stochastic frontier model which accommodates the influence of outlying observations: the stochastic frontier model with a Student’s t distribution for the noise term. The advantage of this model over previous proposals lies in its flexibility, since the degree of kurtosis is no longer fixed but allowed to vary with the degrees of freedom parameter. In fact, the Student’s t distribution nests (or more precisely, contains as a limiting case) the normal distribution as the degrees of freedom parameter approaches infinity. For any given distribution of ui, our model therefore encompasses the standard SF model. This enables testing down against the standard model, in contrast to previous proposals which utilise non-nested specifications, and lets the data determine the extent to which outlying observations influence the kurtosis of the noise term. Our model is thus an original and significant contribution to the literature, not only because it better accommodates outlying observations in efficiency analysis relative to the standard SF model, but also because it is the first to contain the standard SF model as a testable limiting case. As such our model provides a natural extension to the tools of practitioners in the field.

The structure of this paper is as follows: Section 2 reviews existing methods available in the literature to handle a large number of outliers in frontier analysis. Section 3 introduces a Student’s t-truncated normal stochastic frontier model for dealing with heavy-tailed noise and discusses efficiency prediction. Testing of the normal distribution hypothesis is considered in Section 4. Section 5 presents Monte Carlo evidence on the performance of the maximum simulated likelihood estimator of the model. Section 6 applies the Student’s t-half normal model to data on highways maintenance costs in England and compares the results to those obtained from the normal-half normal model, and Section 7 gives our summary and conclusions. Some technical results appear in the Appendices.

2 Approaches to outliers in stochastic frontier analysis

The standard SF model, as developed by Aigner et al. (1977) and Meeusen and van Den Broeck (1977), is as follows:

$$y_i \,=\, f\left( {x_i,\beta } \right) + \varepsilon _i,\quad \varepsilon _i \,=\, v_i - su_i$$
(1)

where the subscript i denotes the observation, yi is the dependent variable, xi is a vector of independent variables, and εi is an error term with two components. The noise component, vi, follows a symmetric distribution centred at zero, and the inefficiency component, ui, is drawn from a one-sided distribution. In the production case s = 1, while in the cost case s = −1. Many alternative distributions have been proposed for u, for example the half normal or exponential (Aigner et al. 1977), truncated normal (Stevenson 1980), or gamma (Greene 2003) distributions. In comparison, vi is almost always assumed to follow a normal distribution, although both distributional assumptions are crucial with regard to both the robustness of the parameter estimates to outliers and the decomposition of the estimated residual into noise and inefficiency.

Concerning robustness to outliers, it is worth pointing out that one of the original motivations behind SF analysis as an alternative to ordinary least squares (OLS)—which after all yields unbiased estimates of the frontier parameters apart from the intercept—was to obtain estimates that are more robust to the skewness of the residuals implied by inefficiency. It was not until Jondrow et al. (1982) that a method of obtaining observation-specific predictions of efficiency was introduced. On the decomposition of the residuals, obtaining these observation-specific efficiency scores proceeds by making some prediction of ui based on the conditional distribution of ui given εi. Dropping the subscript i from here on, this is given by

$$f_{\left. u \right|\varepsilon }\left( {u{\mathrm{|}}\varepsilon } \right) \,=\, \frac{{f_v\left( {\varepsilon + su} \right)f_u\left( u \right)}}{{f_\varepsilon \left( \varepsilon \right)}}$$
(2)

where fv and fu are the probability density functions of v and u respectively, and fε is the probability density function of the composed error derived as the convolution of the two error components. The usual approach to efficiency prediction is to use the mean of this distribution to construct either exp[−E(u|ε)] (Jondrow et al. 1982) or E[exp(−u|ε)] (Battese and Coelli 1988). Clearly, any efficiency prediction derived in this way will depend upon the distribution not only of u but also of v: Wang and Schmidt (2009) derive the distribution of E(u|ε) in the normal-half normal case, and show that E(u|ε)→u only as VAR(v) → 0, while E(u|ε)→E(u) as VAR(v) → ∞. The conditional mean predictor is therefore a shrinkage of u towards its mean with the degree of shrinkage depending upon the distribution of v.
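
To make this shrinkage concrete, the following sketch evaluates the closed-form conditional mean predictor for the normal-half normal case (Jondrow et al. 1982). It is purely illustrative: the parameter values are arbitrary and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def jlms(eps, sigma_u, sigma_v, s=1):
    """E(u|eps) in the normal-half normal model (Jondrow et al. 1982),
    where eps = v - s*u; s = 1 for production and s = -1 for cost frontiers."""
    sigma2 = sigma_u**2 + sigma_v**2
    mu_star = -s * eps * sigma_u**2 / sigma2           # conditional location
    sigma_star = sigma_u * sigma_v / np.sqrt(sigma2)   # conditional scale
    z = mu_star / sigma_star
    return mu_star + sigma_star * norm.pdf(z) / norm.cdf(z)

# Shrinkage in action: fixing u = 1 and v = 0 (so eps = -1 with s = 1), the
# prediction approaches u as sigma_v -> 0 and E(u) = sqrt(2/pi) as sigma_v grows.
for sv in (0.1, 1.0, 10.0):
    print(sv, jlms(eps=-1.0, sigma_u=1.0, sigma_v=sv))
```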

Given that the presence of outliers, where these are attributed to noise rather than inefficiency, implies a leptokurtic noise distribution, the normal distribution usually assumed for v is inadequate since it is mesokurtic for any given parameter values. Intuitively, we may therefore expect outliers in the data to result in an exaggerated spread of efficiency scores for two reasons: first, because of an inflated estimate of the scale of the distribution of u, and second because of insufficient shrinkage of u towards its mean, especially at the extremes. That is to say, if leptokurtosis in the noise term due to outliers is not taken into account, residuals will be attributed disproportionately to inefficiency rather than noise, particularly in outlying observations. This motivates the development of alternative SF models that can accommodate outliers.

A review of the literature in this area provided by Stead et al. (2018) covers several different approaches to outliers: the use of alternative efficiency predictors, thick frontier analysis, heteroskedastic SF models, and the use of alternative noise distributions. Other possible approaches include right truncation of the inefficiency distribution to restrict the range of possible efficiency predictions, and detecting and removing outliers.

The detection and removal (or reweighting) of outliers is a common approach to dealing with outliers in regression analysis, with outliers identified on the basis of the extent to which an observation has a disproportionate effect on parameter estimates, captured by Cook’s distance (Cook 1977), the Mahalanobis distance (Mahalanobis 1936), and similar measures. However, existing methods may not be appropriate in the case of skewness of the composed error. For example, Cook’s distance explicitly assumes normally distributed errors, while in the normal-half normal SF model the composed error has a skew normal distribution (Azzalini 1985), and in the normal-exponential case it follows what is known as an exponentially modified Gaussian distribution (Grushka 1972). One approach to outliers could be to derive measures of influence and leverage appropriate for SF models, although these would depend upon the particular specification used.

Finally, we can account for outliers by assuming an appropriate distribution for v. In principle, any distribution that is symmetric, centred around zero and unimodal is an appropriate candidate for the distribution of v. Although far more attention has been paid in the literature to the distribution of u, several such alternatives have been suggested. Outside of the SF literature, Lange et al. (1989) suggest the use of the Student’s t distribution for the error terms as a robust alternative to OLS. Lange and Sinsheimer (1993) discuss estimation when errors are drawn from the logistic, slash, t, and contaminated normal distributions. Note that the latter is simply the mixture of two normal distributions with the same mean but differing variances. All of these distributions have heavier tails than the normal distribution, and therefore offer greater robustness to outliers. Reviews of the SF literature by Greene (2008) and Parmeter and Kumbhakar (2014) include some discussion of alternative noise distributions.

In the context of SF analysis, Tancredi (2002) proposes a model in which v is drawn from a Student’s t distribution and u from a half t distribution. Tchumtchoua and Dey (2007) study the Student’s t-half t model in a Bayesian setting. Griffin and Steel (2007) and Hajargasht and Griffiths (2018) also discuss estimation of Bayesian SF models with Student’s t noise using Markov Chain Monte Carlo (MCMC) and variational Bayes methods, respectively. Nguyen (2010) introduces two additional heavy tailed distributions for v: the Laplace distribution and the Cauchy distribution – note that the latter is a Student’s t distribution with degrees of freedom equal to one – and derives formulae for Cauchy-truncated Cauchy and Cauchy-half Cauchy SF models. Applications are shown in Gupta and Nguyen (2010). Horrace and Parmeter (2018) study Laplace-truncated Laplace and Laplace-exponential SF models. Noting that the Laplace distribution is ordinary smooth, in contrast to the normal distribution, which is supersmooth, they conjecture that the Laplace distribution may be advantageous with respect to the deconvolution of ε into v and u, since optimal convergence rates for deconvolution problems decrease with the smoothness of the noise distribution, being in particular much slower when the noise distribution is supersmooth rather than ordinary smooth (Fan 1991; 1992).

The performance of these models in the presence of outliers is interesting, but under-explored. Following the previous discussion, given the use of heavy-tailed distributions for v, we might expect these models to be more robust to outliers in terms of parameter estimation, and in terms of yielding less extreme efficiency predictions for outlying observations. Indeed, Tancredi (2002) shows that as |ε| → ∞, fu|ε(u|ε) becomes completely flat in the Student’s t-half t model, in contrast to how fu|ε(u|ε) becomes a degenerate distribution at zero as ε → ∞ in the normal-half normal model. Similarly, Horrace and Parmeter (2018) show that fu|ε(u|ε) – and hence E(u|ε) – is constant for positive values of ε in the Laplace-exponential case. Focusing more explicitly on outliers, Stead et al. (2018) propose a model in which v follows a logistic distribution and u follows a half normal distribution, and show that the model results in a smaller estimate for VAR(u) and yields a considerably narrower spread of efficiency scores than the normal-half normal model, with little change in exp[−E(u|ε)] for large |ε|. Each of these cases suggests that the choice of a heavy tailed distribution for v significantly affects the prediction of efficiency, and the uncertainty in prediction, at one or both tails, producing less extreme efficiency predictions.

3 The Student’s t-truncated normal SF model

In this section, we propose a robust SF model in which v follows a Student’s t distribution and u follows a truncated normal distribution. The Student’s t distribution for v has a number of attractive properties in the present context. First, it is a heavy tailed distribution, and thus more robust to the presence of outliers. Second, it avoids the arbitrary assumptions on the degree of kurtosis in v embedded in existing models—the normal, logistic, and Laplace distributions have excess kurtosis of 0, 1.2, and 3, respectively, regardless of parameter values. Ultimately, the kurtosis of v is an empirical question, and therefore we should ideally use a distribution for v whose kurtosis is flexible; the kurtosis of the Student’s t distribution depends upon the degrees of freedom parameter. Third, the Student’s t distribution nests the Cauchy distribution when the degrees of freedom parameter is equal to one and contains the normal distribution as the limiting case as this parameter approaches infinity. Therefore an SF model with a Student’s t distribution for v encompasses models in which v follows a Cauchy or normal distribution for any given distribution of u, enabling testing against these alternatives. In the latter case, we are testing against the standard SF model, which could be interpreted as a test of robustness to outliers.

Below, we derive simulated log likelihood functions and efficiency predictors for the Student’s t-truncated normal SF model, and discuss estimation and hypothesis testing. Results for the Student’s t-half normal and Student’s t-exponential models can be obtained via some simple modifications. Extensions to other distributions of u are straightforward if the quantile function of that distribution has a closed form, while in many other cases—as with the Student’s t-gamma—the simulated log likelihood function becomes slightly more complex.

3.1 Formulation and estimation

In SF models, the error ε (noting again that we drop the subscript i for notational simplicity) is composed of a symmetric noise component v and an inefficiency component u which is drawn from a one-sided distribution:

$$\varepsilon \,=\, v - su$$
(3)

In this study, we assume that v is drawn from a non-standardized Student’s t distribution—which includes a scale parameter σv—and that u follows a truncated normal distribution, so that the probability density functions fv and fu are given by

$$f_v\left( v \right) \,=\, \frac{{{\mathrm{\Gamma }}\left( {\frac{{a \,+\, 1}}{2}} \right)}}{{{\mathrm{\Gamma }}\left( {\frac{a}{2}} \right)\sqrt {\pi a} \sigma _v}}\left[ {1 \,+\, \frac{1}{a}\left( {\frac{v}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a \,+\, 1}}{2}}$$
(4)
$$f_u\left( u \right) \,=\, \left\{ {\begin{array}{*{20}{c}} {\frac{{\frac{1}{{\sigma _u}}\phi \left( {\frac{{u \,-\, \mu }}{{\sigma _u}}} \right)}}{{{\mathrm{\Phi }}\left( {\frac{\mu }{{\sigma _u}}} \right)}},u \,{>}\, 0} \\ {0,\quad \quad \quad \quad \quad u \le 0} \end{array}} \right.$$
(5)

where μ and σu are the mean and standard deviation of the pre-truncation distribution of u, respectively, a is a shape parameter that determines the kurtosis of the Student’s t distribution, and Γ is the gamma function. If μ is set to zero, then the model reduces to a Student’s t-half normal model as applied to our data in Section 6. As noted previously, as a → ∞ the Student’s t distribution approaches the normal distribution. Thus, our model encompasses as a limiting case the normal-truncated normal model. Similarly, when a = 1 we have a Cauchy distributed noise component.
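
As a minimal sketch (ours, not the authors’ implementation), the densities (4) and (5) translate directly into code; `f_v` can be checked against `scipy.stats.t.pdf(v / sigma_v, a) / sigma_v`.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import norm

def f_v(v, a, sigma_v):
    """Non-standardized Student's t density of eq. (4)."""
    log_c = (gammaln((a + 1) / 2) - gammaln(a / 2)
             - 0.5 * np.log(np.pi * a) - np.log(sigma_v))
    return np.exp(log_c) * (1 + (v / sigma_v) ** 2 / a) ** (-(a + 1) / 2)

def f_u(u, mu, sigma_u):
    """Truncated normal density of eq. (5); zero for u <= 0."""
    dens = norm.pdf((u - mu) / sigma_u) / (sigma_u * norm.cdf(mu / sigma_u))
    return np.where(u > 0, dens, 0.0)
```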

The joint density of ε and u is given by

$$f_{u,\,\varepsilon }\left( {u,\,\varepsilon } \right) = \left\{ {\begin{array}{*{20}{c}} {\frac{{{\mathrm{\Gamma }}\left( {\frac{{a + 1}}{2}} \right)}}{{{\mathrm{\Gamma }}\left( {\frac{a}{2}} \right)\sqrt {\pi a} \sigma _v}}\left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}\frac{{\frac{1}{{\sigma _u}}\phi \left( {\frac{{u - \mu }}{{\sigma _u}}} \right)}}{{{\mathrm{\Phi }}\left( {\frac{\mu }{{\sigma _u}}} \right)}},} & {u \,{> }\, 0} \\ {0,} & {u \le 0} \end{array}} \right.$$
(6)

and the marginal density of ε is given by the convolution

$$f_\varepsilon \left( \varepsilon \right) \,=\, {\int}_0^\infty {\frac{{{\mathrm{\Gamma }}\left( {\frac{{a + 1}}{2}} \right)}}{{{\mathrm{\Gamma }}\left( {\frac{a}{2}} \right)\sqrt {\pi a} \sigma _v}}\left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}\frac{{\frac{1}{{\sigma _u}}\phi \left( {\frac{{u - \mu }}{{\sigma _u}}} \right)}}{{{\mathrm{\Phi }}\left( {\frac{\mu }{{\sigma _u}}} \right)}}du}$$
(7)

which lacks a convenient closed form, hindering ML estimation. As an alternative, we may use simulation to approximate the integral and arrive at a simulated log likelihood function—see Train (2009) for an introduction to maximum simulated likelihood—as proposed by Greene (2003) for the normal-gamma model and Stead et al. (2018) for the logistic-half normal model. We begin by noting that (7) is the expectation of fv(ε + su), where u is drawn from a truncated-normal distribution,

$$h\left( u \right) \,=\, E\left[ {\left. {f_v\left( {\varepsilon + su} \right)} \right|u \ge 0} \right],\quad u\,{\mathrm{\sim }}\,N\left( {\mu ,\sigma _u} \right)$$
(8)

that can be estimated by

$$\widehat h \,=\, \frac{1}{Q}\mathop {\sum}\nolimits_{q = 1}^Q {f_v\left( {\varepsilon + su_q} \right)}$$
(9)

where uq is a draw from a truncated normal distribution. This gives us the simulated probability density function for ε

$$\widehat {f_\varepsilon \left( \varepsilon \right)} \,=\, \frac{1}{Q}\frac{{{\mathrm{\Gamma }}\left( {\frac{{a + 1}}{2}} \right)}}{{{\mathrm{\Gamma }}\left( {\frac{a}{2}} \right)\sqrt {\pi a} \sigma _v}}\mathop {\sum }\limits_{q = 1}^Q \left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su_q}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}$$
(10)

and the simulated log-likelihood function is

$$\begin{array}{*{20}{c}} {\ln SL \,=\, - N\ln Q + N\ln \left[ {{\mathrm{\Gamma }}\left( {\frac{{a + 1}}{2}} \right)} \right] - N\ln \left[ {{\mathrm{\Gamma }}\left( {\frac{a}{2}} \right)} \right]} \\ { - \frac{N}{2}\left( {\ln \pi + \ln a} \right) - N\ln \sigma _v + \mathop {\sum}\nolimits_{i = 1}^N {\ln \mathop {\sum}\nolimits_{q = 1}^Q {\left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su_q}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}} } } \end{array}$$
(11)

This may be maximised like any conventional log-likelihood function, provided we have our draws from the truncated normal distribution. From Geweke et al. (1997) and Greene (2003), we have the quantile function of a truncated normal random variable:

$$u_q \,=\, \mu + \sigma _u{\mathrm{\Phi }}^{ - 1}\left\{ {\left[ {{\mathrm{\Phi }}\left( {\frac{{t_r - \mu }}{{\sigma _u}}} \right) - {\mathrm{\Phi }}\left( {\frac{{t_l - \mu }}{{\sigma _u}}} \right)} \right]F_q + {\mathrm{\Phi }}\left( {\frac{{t_l - \mu }}{{\sigma _u}}} \right)} \right\}$$
(12)

where tl and tr are the left and right truncation points, respectively, and Fq is draw q from the standard uniform distribution. Since tl = 0 and tr = ∞, this simplifies to

$$u_q \,=\, \mu + \sigma _u{\mathrm{\Phi }}^{ - 1}\left[ {1 + {\mathrm{\Phi }}\left( {\frac{\mu }{{\sigma _u}}} \right)\left( {F_q - 1} \right)} \right]$$
(13)

In order to modify the model so that the one-sided error follows some other distribution, we need only change uq such that we instead obtain draws from the chosen distribution. The most obvious choices are the exponential and gamma distributions. For the t-exponential case, we have the quantile function

$$u_q \,=\, - \sigma _u\ln F_q$$
(14)

By substituting (13) or (14) into (11), we obtain the simulated log-likelihood functions for the Student’s t-truncated normal and Student’s t-exponential models, respectively. Other proposed distributions for u, such as the Weibull (Tsionas 2007) and Rayleigh (Hajargasht 2015) distributions, also have closed form quantile functions. However, even in cases in which the quantile function has no analytical expression, such as the gamma distribution, forming the simulated log-likelihood is possible—see Greene (2003).

Proceeding with the Student’s t-truncated normal variant of the model, (11) and (13) give the simulated log-likelihood function. First order conditions for maximisation are given in Appendix B. One remaining issue is the method of taking random draws: we prefer to use Halton draws, which aim for good coverage of the unit interval rather than randomness; this significantly reduces the number of draws needed to approximate the integral (see Greene (2003) for a fuller discussion). A sketch of this estimation approach is given below.
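
The sketch combines a base-2 Halton sequence, the quantile function (13), and the simulated log-likelihood (11), maximised numerically. It is illustrative only, not the authors’ LIMDEP implementation; in particular, the exponential reparameterisation used to keep σv, σu, and a positive is a choice of this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import norm

def halton(n, base=2):
    """First n points of the Halton sequence, skipping the zero element."""
    out = np.empty(n)
    for i in range(1, n + 1):
        f, idx, h = 1.0, i, 0.0
        while idx > 0:
            f /= base
            h += f * (idx % base)
            idx //= base
        out[i - 1] = h
    return out

def neg_sim_loglik(theta, y, X, s=-1, Q=100):
    """Negative simulated log-likelihood (11) for the Student's t-truncated
    normal model, with draws u_q obtained from the quantile function (13)."""
    k = X.shape[1]
    beta, mu = theta[:k], theta[k]
    sigma_v, sigma_u, a = np.exp(theta[k + 1:k + 4])   # enforce positivity
    eps = y - X @ beta
    Fq = halton(Q)
    u_q = mu + sigma_u * norm.ppf(1 + norm.cdf(mu / sigma_u) * (Fq - 1))
    z = (eps[:, None] + s * u_q[None, :]) / sigma_v
    kernel = (1 + z**2 / a) ** (-(a + 1) / 2)
    log_c = (gammaln((a + 1) / 2) - gammaln(a / 2)
             - 0.5 * np.log(np.pi * a) - np.log(sigma_v))
    return -np.sum(log_c + np.log(kernel.mean(axis=1)))

# usage sketch, with theta0 a vector of starting values chosen by the user:
# res = minimize(neg_sim_loglik, theta0, args=(y, X), method="BFGS")
```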

Recalling the discussion of Horrace and Parmeter (2018) around convergence rates and the smoothness property of the noise distribution in deconvolution problems, a further attraction of the Student’s t distribution is the potential advantage in terms of estimating fu. Fan (1991) discusses the smoothness properties of several distributions. The normal distribution is supersmooth of order 2, while the Cauchy distribution is supersmooth of order 1. The smoothness of the Student’s t distribution is not explicitly discussed, but may be determined as follows. From Fan (1991), the distribution of a random variable is said to be supersmooth of order β if its characteristic function, φ(t), satisfies

$$d_0\left| t \right|^{\beta _0}\exp \left( { - \frac{{\left| t \right|^\beta }}{\gamma }} \right) \le \varphi \left( t \right) \le d_1\left| t \right|^{\beta _1}\exp \left( { - \frac{{\left| t \right|^\beta }}{\gamma }} \right),\quad t \to \infty$$
(15)

where d0, d1, β, and γ are positive constants, and β0 and β1 are constants. Expression 4.8 in Hurst (1995) gives the following result

$$\varphi _v\left( t \right) \,\cong\, \frac{{\sqrt \pi }}{{{\mathrm{\Gamma }}\left( {\frac{a}{2}} \right)2^{\frac{{a - 1}}{2}}}}\left( {\sqrt a \left| t \right|} \right)^{\frac{{a - 1}}{2}}\exp \left( { - \sqrt a \left| t \right|} \right),\quad t \to \infty$$
(16)

where φv(t) denotes the characteristic function of a Student’s t distribution with degrees of freedom parameter a. Substituting (16) into (15), we see that the Student’s t distribution is supersmooth of order 1 when a is finite—which encompasses the Cauchy case where a = 1. From the results of Fan (1991), this implies that optimal convergence rates are faster when noise is Student’s t distributed than when noise is normally distributed. It should be noted, however, that ordinary smooth distributions, such as the Laplace distribution, have convergence rates that are faster still, being polynomial functions of N, as opposed to logarithmic functions of N as in the case of supersmooth distributions (Fan 1992). For more discussion of the differences between smooth and supersmooth noise distributions in stochastic frontier analysis, see Horrace and Parmeter (2011; 2018).

3.2 Parameter identification

To demonstrate that the model parameters are identified, we show that consistent method of moments estimators exist. For the sake of tractability, we restrict attention to the case where μ = 0. As in the standard SF model, the OLS estimator \(\widehat \beta\) is a consistent estimator of β, with the exception that \(\widehat \beta _0\) is instead a consistent estimator of β0 − sE(u). Since v and u are independent, the first four moments of ε are given by

$$E\left( \varepsilon \right) \,=\, E\left( v \right) - sE\left( u \right) \,=\, - s\sigma _u\sqrt {2/\pi } ,\quad a \,{> }\, 1$$
(17)
$$E\left( {\varepsilon ^2} \right)\,=\, E\left( {v^2} \right) + E\left( {u^2} \right) = \sigma _v^2\frac{a}{{a - 2}} + \sigma _u^2\frac{{\pi - 2}}{\pi },\quad a \,{> }\, 2$$
(18)
$$E\left( {\varepsilon ^3} \right) \,=\, E\left( {v^3} \right) - sE\left( {u^3} \right) \,=\, - s\sqrt {2{\mathrm{/}}\pi } \frac{{\left( {4 - \pi } \right)}}{\pi }\sigma _u^3,\quad a > 3$$
(19)
$$\begin{array}{l}E\left( {\varepsilon ^4} \right) \,=\, E\left( {v^4} \right) + E\left( {u^4} \right) + 6E\left( {v^2} \right)E\left( {u^2} \right) \,=\, \\ \frac{6}{{a - 4}} + 3 + \frac{{3\pi ^2 - 4\pi - 12}}{{\pi ^2}}\sigma _u^4 + 6\sigma _v^2\frac{a}{{a - 2}}\sigma _u^2\frac{{\pi - 2}}{\pi },\quad a \,{> }\, 4\end{array}$$
(20)

Setting (18)–(20) equal to the corresponding central sample moments of the OLS residuals, μ2, μ3, and μ4, and rearranging gives the following method of moments estimators—note the correction to the estimated intercept in (24).

$$\widetilde \sigma _u \,=\, \left[ { - s\mu _3\sqrt {\pi /2} \frac{\pi }{{\left( {4 - \pi } \right)}}} \right]^{\frac{1}{3}},\quad a > 3$$
(21)
$$\widetilde a \,=\, \frac{6}{{\mu _4 - 3 - \widetilde \sigma _u^4\frac{{20\pi - 3\pi ^2 - 36}}{{\pi ^2}} - 6\mu _2\widetilde \sigma _u^2\frac{{\pi - 2}}{\pi }}} + 4,\quad a \,{> }\, 4$$
(22)
$$\widetilde \sigma _v \,=\, \sqrt {\left( {\mu _2 - \widetilde \sigma _u^2\frac{{\pi - 2}}{\pi }} \right)\frac{{\widetilde a - 2}}{{\widetilde a}}} ,\quad a \,{> }\, 2$$
(23)
$$\widetilde \beta _0 \,=\, \widehat \beta _0 + s\widetilde \sigma _u\sqrt {2/\pi } ,\quad \widetilde \beta _j \,=\, \widehat \beta _j,\quad j \,=\, 1, \ldots ,k,\quad a \,{> }\, 1$$
(24)

Note that since the moments of the Student’s t distribution are variously undefined or infinite for small a, this is also the case for moments of ε, so these estimators exist only when a is sufficiently large. Further, just as the method of moments approach to estimation of the standard SF model breaks down when μ3 ≤ 0, μ4 must be sufficiently large that the denominator in (22) is positive for method of moments to be used in this case. In Section 5, we undertake Monte Carlo simulations to analyse the performance of the model for a variety of values of a: a = 2, a = 4, a = 5, a = 10, and a → ∞, the latter corresponding to v ~ N(0,1).
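
For concreteness, the estimators (21)–(24) and the existence checks just described can be coded as follows; this is an illustrative sketch, with the function name and error messages ours.

```python
import numpy as np

def mom_estimates(e_ols, s=-1):
    """Method of moments estimators (21)-(23) from OLS residuals, plus the
    intercept correction of (24) (to be added to the OLS intercept)."""
    e = e_ols - e_ols.mean()
    m2, m3, m4 = (e**2).mean(), (e**3).mean(), (e**4).mean()
    su3 = -s * m3 * np.sqrt(np.pi / 2) * np.pi / (4 - np.pi)        # eq. (21)
    if su3 <= 0:
        raise ValueError("'wrong skew': sigma_u is not identified")
    sigma_u = su3 ** (1 / 3)
    denom = (m4 - 3 - sigma_u**4 * (20 * np.pi - 3 * np.pi**2 - 36) / np.pi**2
             - 6 * m2 * sigma_u**2 * (np.pi - 2) / np.pi)           # eq. (22)
    if denom <= 0:
        raise ValueError("'wrong kurtosis': a is not identified")
    a = 6 / denom + 4
    sigma_v = np.sqrt((m2 - sigma_u**2 * (np.pi - 2) / np.pi) * (a - 2) / a)  # (23)
    beta0_shift = s * sigma_u * np.sqrt(2 / np.pi)                  # eq. (24)
    return sigma_u, a, sigma_v, beta0_shift
```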

Well-known results from Waldman (1982) show that the OLS estimator is a stable stationary point of the log-likelihood function for the standard model, and that the identification of the σu parameter hinges on the skewness of the OLS residuals. Horrace and Parmeter (2018) show that, while these results do not apply to the SF model with Laplace noise since OLS is not the limiting estimator as Var(u) → 0, an analogous result applies: the least absolute deviations (LAD) estimator is a stationary point of the log-likelihood. It is trivial to show that a similar result applies to our model, i.e. that the ML estimator of a regression model with Student’s t errors is a stationary point in our log-likelihood function. Horrace and Wright (forthcoming) show that such a stationary point exists under very weak distributional assumptions about v and u. This suggests that identification of σu in our model depends upon the skewness of the residuals from the Student’s t regression model. However, while Horrace and Parmeter (2018) point out that in the Laplace case the LAD stationary point is not stable due to non-differentiability of the log-likelihood function in the limiting case, this does not appear to apply in the Student’s t case, in which the log-likelihood function is everywhere differentiable.

3.3 Efficiency prediction

As discussed in previous sections, the usual approach to generating observation-specific efficiency scores is to predict values based on the distribution of u|ε, which is given by:

$$f_{\left. u \right|\varepsilon }\left( {u{\mathrm{|}}\varepsilon } \right) \,=\, \frac{{f_v\left( {\varepsilon + su} \right)f_u\left( u \right)}}{{f_\varepsilon \left( \varepsilon \right)}}$$
(25)

The most widely used predictors are based on the mean of this conditional distribution, either exp[−E(u|ε)]:

$$\exp \left[ { - E\left( {u{\mathrm{|}}\varepsilon } \right)} \right] \,=\, \exp \left[ { - {\int}_0^\infty {u\frac{{f_v\left( {\varepsilon + su} \right)f_u\left( u \right)}}{{f_\varepsilon \left( \varepsilon \right)}}du} } \right]$$
(26)

or E[exp(−u|ε)]:

$$E\left[ {\exp \left( { - u{\mathrm{|}}\varepsilon } \right)} \right] \,=\, {\int}_0^\infty {\exp \left( { - u} \right)\frac{{f_v\left( {\varepsilon + su} \right)f_u\left( u \right)}}{{f_\varepsilon \left( \varepsilon \right)}}du}$$
(27)

Note that in both cases, since fε(ε) is not a function of u, it can be moved outside the integral. Bearing in mind that fε(ε) is the convolution of fv(ε+su) and fu(u), we therefore have

$$\exp \left[ { - E\left( {u{\mathrm{|}}\varepsilon } \right)} \right] \,=\, \exp \left[ { - \frac{{\mathop {\int }\nolimits_0^\infty uf_v\left( {\varepsilon + su} \right)f_u\left( u \right)du}}{{\mathop {\int }\nolimits_0^\infty f_v\left( {\varepsilon + su} \right)f_u\left( u \right)du}}} \right]$$
(28)

and

$$E\left[ {\exp \left( { - u{\mathrm{|}}\varepsilon } \right)} \right] \,=\, \frac{{\mathop {\int }\nolimits_0^\infty \exp \left( { - u} \right)f_v\left( {\varepsilon + su} \right)f_u\left( u \right)du}}{{\mathop {\int }\nolimits_0^\infty f_v\left( {\varepsilon + su} \right)f_u\left( u \right)du}}$$
(29)

For the model we are considering, none of these integrals has a closed form solution, so we approximate them via simulation. The simulated integrals in the denominators of both formulae are given by (10), and the integrals in the numerators are the expectations of ufv(ε+su) and exp(−u)fv(ε+su), where u is a random variable with probability density function fu(u). We therefore have, with some simplifying and rearranging:

$$\exp \left[ { - \widehat E\left( {u{\mathrm{|}}\varepsilon } \right)} \right] \,=\, \exp \left\{ { - \frac{{\mathop {\sum }\nolimits_{q = 1}^Q u_q\left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su_q}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}}}{{\mathop {\sum }\nolimits_{q = 1}^Q \left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su_q}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}}}} \right\}$$
(30)
$$\widehat E\left[ {\exp \left( { - u{\mathrm{|}}\varepsilon } \right)} \right] \,=\, \frac{{\mathop {\sum }\nolimits_{q = 1}^Q \exp \left( { - u_q} \right)\left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su_q}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}}}{{\mathop {\sum }\nolimits_{q = 1}^Q \left[ {1 + \frac{1}{a}\left( {\frac{{\varepsilon + su_q}}{{\sigma _v}}} \right)^2} \right]^{ - \frac{{a + 1}}{2}}}}$$
(31)

where uq is given by (13) in the Student’s t-truncated normal model and by (14) in the Student’s t-exponential model, for example.
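
In practice, (30) and (31) are weighted averages of uq and exp(−uq), with weights given by the t kernel evaluated at each draw. The following sketch, which reuses the `halton` helper from the estimation sketch above, is again illustrative:

```python
import numpy as np
from scipy.stats import norm

def efficiency_predictors(eps, a, sigma_v, sigma_u, mu=0.0, s=-1, Q=100):
    """Simulated predictors (30) and (31) for a single residual eps; setting
    mu = 0 gives the Student's t-half normal special case."""
    Fq = halton(Q)                        # defined in the earlier sketch
    u_q = mu + sigma_u * norm.ppf(1 + norm.cdf(mu / sigma_u) * (Fq - 1))  # eq. (13)
    w = (1 + ((eps + s * u_q) / sigma_v) ** 2 / a) ** (-(a + 1) / 2)      # t kernel
    w = w / w.sum()                       # shared denominator of (30) and (31)
    jlms = np.exp(-np.dot(w, u_q))        # exp[-E(u|eps)], eq. (30)
    bc = np.dot(w, np.exp(-u_q))          # E[exp(-u)|eps], eq. (31)
    return jlms, bc
```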

4 Hypothesis testing concerning the noise distribution

As discussed previously, an attraction of the Student’s t distribution in the current context is that the normal distribution is the limiting case as a → ∞. Therefore a stochastic frontier model in which v follows a Student’s t distribution encompasses a model in which v follows a normal distribution. This allows us to test down from a model with Student’s t noise to a standard SF model, which could be interpreted as a test for heavy tails—or for the significance of outliers in the data—and used for model selection. For this purpose, the likelihood ratio test statistic is an obvious choice. This is defined as

$${\mathrm{LR}} \,=\, - 2\left( {\ln L_0 - \ln L_A} \right)$$
(32)

where lnLA is the simulated log-likelihood from the Student’s t model and lnL0 is the log-likelihood from the null model with normally distributed v. The standard result that this statistic has a limiting χ2 distribution with degrees of freedom equal to the difference in dimensionality between the alternative and null models does not apply, since under the null hypothesis the degrees of freedom parameter a lies on the boundary of the parameter space. Case 5 in Self and Liang (1987)—see also Case 2 in Chen and Liang (2010)—shows that the likelihood ratio statistic follows an asymptotic 50:50 mixture of \(\chi _0^2\) and \(\chi _1^2\) distributions, denoted \(\chi _{1:0}^2\). Economou (2011) applies this result to an analogous problem in survival analysis of testing down from a three parameter Burr XII distribution to a two parameter Weibull distribution. A further analogue in stochastic frontier analysis is testing for the presence of an upper bound on u, since an SF model with a tail truncated distribution for u nests the standard SF model as the tail truncation point B → ∞.

We present simulation evidence regarding the distribution of LR under the null hypothesis that v is normally distributed in Appendix C. We do this to verify that the results of Chen and Liang (2010) do apply in this case. Note that we could reparameterise the model to include an inverse degrees of freedom parameter, 1/a, in which case the standard model is the limiting case as 1/a → 0; however, this boundary remains an open boundary. In particular, we note that in this setting, and in the two other examples discussed above (the Weibull case and the upper truncation point in the inefficiency distribution), the null hypothesis corresponds to a parameter value at an open (rather than closed) boundary. Chen and Liang (2010) state that their result still holds when the boundary is open, but we wish to verify this. The evidence presented in Appendix C lends support to the idea that \(LR\sim \chi _{1:0}^2\) under the null hypothesis. We therefore consider this test to be appropriate for this purpose.
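
Computing a p-value against the \(\chi _{1:0}^2\) mixture is straightforward, since the \(\chi _0^2\) component is a point mass at zero and contributes nothing for a positive observed statistic. A minimal sketch:

```python
from scipy.stats import chi2

def mixture_pvalue(lr):
    """p-value under the 50:50 mixture of chi2_0 and chi2_1 (Self and Liang
    1987); only the chi2_1 half contributes when lr > 0."""
    return 0.5 * chi2.sf(lr, df=1)

# e.g. mixture_pvalue(4.155) ≈ 0.021; the 5% critical value is
# chi2.isf(0.10, df=1) ≈ 2.706 rather than the usual 3.841
```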

5 Monte Carlo simulations

In this section, we examine the performance of the Student’s t-half normal model under various data generating processes (DGPs). We are primarily interested in how the model performs in two scenarios: when v is Student’s t distributed, and when v is normally distributed. In the former case, we wish to see whether the model can estimate the parameters of the DGP with a reasonable degree of accuracy for a range of values of the degrees of freedom parameter a; in the latter, whether the model can approximate the case where v is normally distributed. The DGP in each case is as follows:

$$\begin{array}{l}y \,=\, 1 + x + v + u,\\ x\sim N\left( {0,1} \right),\quad u = \left| w \right|,\quad w\sim N\left( {0,1} \right)\end{array}$$
(33)

and v is a standard Student’s t random variable with degrees of freedom parameter a = 2, a = 4, a = 5, a = 10, and a → ∞, respectively. Note that in the latter case, v is normally distributed, so we draw from the standard normal distribution. For each of these DGPs, we then vary the number of observations used in each replication, considering N = 100, N = 200, and N = 1000. This gives some insight into the small sample performance of the model.
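
For illustration, one replication of this DGP can be generated as follows (a sketch; the seed and function name are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dgp(n, a=None):
    """One replication of the DGP in (33): y = 1 + x + v + u, where v is
    standard Student's t with a degrees of freedom, or standard normal if
    a is None (the a -> infinity case)."""
    x = rng.standard_normal(n)
    u = np.abs(rng.standard_normal(n))      # half normal inefficiency, u = |w|
    v = rng.standard_t(a, n) if a is not None else rng.standard_normal(n)
    return 1.0 + x + v + u, x

y, x = simulate_dgp(200, a=5)               # e.g. N = 200, a = 5
```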

We estimate a Student’s t-half normal stochastic cost frontier model using the simulated data generated for each of one thousand replications per DGP. As a point of comparison, we also estimate a normal-half normal stochastic cost frontier model using the data from each replication. The comparison between the results from the two models helps to put the results from our model into context, allowing us to judge its performance relative to that of the standard model. When comparing the results from the two models, it should be kept in mind that the interpretation of the σv parameter differs. In the normal-half normal model, \(\sigma _v^2\) is the variance of the noise term v, while in the Student’s t-half normal model, the variance of v also depends on the degrees of freedom parameter, a. Specifically, the variance of v in our model is given by

$${\mathrm{Var}}\left( v \right) \,=\, \sigma _v^2\frac{a}{{a - 2}},\quad a \,{> }\, 2$$
(34)

and is undefined for a ≤ 2. It is therefore to be expected that, when the true DGP involves a Student’s t distributed v, the estimated σv parameter from the normal-half normal model will tend to be greater than the value used in the DGP. Table 1 summarises the results from each simulation. The mean, median, and standard deviation of the estimates for each parameter across the 1000 replications are shown for both the normal-half normal and Student’s t-half normal models. The first two columns show the parameters and corresponding values in each DGP.

Table 1 Normal-half normal vs. Student's t-half normal model under various data generating processes—1000 replications

Across each of the DGPs, and regardless of the number of observations, the mean parameter estimates from the Student’s t-half normal model are similar to those from the normal-half normal model, and both tend to be reasonably close to the true values. An exception is that, in the N = 100 and N = 200 cases for the DGPs with finite a, the mean estimates of the degrees of freedom parameter are many times greater than the true values from the DGP.

However, the median estimates of a are much closer to the true values. The reason for the large difference between the mean and median estimates of a is that the distributions of the estimates are skewed by a small number of extremely large values. We find that these very large values occur in replications in which the kurtosis of the OLS residuals is low, and specifically where the excess kurtosis— i.e. kurtosis over and above that of the normal distribution—is near to or less than zero. This is significant since, as with the discussion in Section 3.2, it further suggests a link between excess kurtosis and identification of the degrees of freedom parameter a, analogous to the link between skewness and identification of the σu parameter in the standard model.

In the N = 1000 repetitions, this ‘wrong kurtosis’ issue does not arise in the a = 4 or a = 5 cases, but does in the a = 10 case. In the a = 2 case, ‘wrong kurtosis’ does not arise in either the N = 200 or N = 1000 repetitions. Intuitively, this is because the Student’s t distribution with a = 10 already closely resembles the normal distribution, and further increases in a yield relatively small changes to the shape of the distribution. Just as the probability of ‘wrong skew’ arising in the standard model decreases as the number of observations or the signal to noise ratio, or both, increase (Simar and Wilson 2010), it appears that the probability of ‘wrong kurtosis’ is negatively related to sample size and positively related to the degrees of freedom parameter.

Finally, on the subject of ‘wrong kurtosis’, it is important to note that whilst this phenomenon is similar to the ‘wrong skew’ problem, in that it can arise through unlucky small sample draws from a ‘correct kurtosis’ distribution, its practical implications are not as severe. In the wrong kurtosis case, estimation of the Student’s t-half normal stochastic frontier model recovers the normal-half normal model estimates, whilst under ‘wrong skew’ the estimate of the variance of the inefficiency distribution approaches zero, indicating no evidence of inefficiency.

Estimates of σv from the normal-half normal model for DGPs 1–4 tend to be greater than 1, as expected. For DGPs 2–4, in which VAR(v) is defined, the standard deviations of v are 1.414, 1.291, and 1.118, respectively. Interestingly, the mean estimates of σv from the normal-half normal model correspond closely to these values. This suggests that, although the normal-half normal model cannot capture the kurtosis of v, it approximates the variance of the noise term relatively well.

Looking at the standard deviations of the parameter estimates from our model, we can see that the sampling variation is generally less than that from the normal-half normal for the a = 2, a = 4 and a = 5 cases. This reflects the greater sensitivity of the standard model to outlying observations, and suggests that our model is indeed more robust to the presence of heavy tails. For the a = 10 and v ~ N(0,1) cases, the estimates from the normal-half normal model generally have lower standard deviations than those from our model.

The final DGP corresponds to a model in which v ~ N(0,1). Our model performs well when the true DGP is the normal-half normal model, as evidenced by the very large mean and median estimates of a across all sample sizes, and by the similarity between the frontier and scale parameter estimates produced by the model and those produced by the normal-half normal model. Here we are also interested in how the results typically differ from those of the normal-half normal model. Table 2 summarises the within-replication differences between the estimates of β0, β1, σv, and σu from the Student’s t-half normal and normal-half normal models, respectively, showing the mean difference and the standard deviation of the differences. A positive (negative) mean indicates that the Student’s t-half normal model tended to produce a higher (lower) estimate of a given parameter than the normal-half normal model.

Table 2 Summary of within-replication differences in parameter estimates

From the above, we can see that, as expected following the discussion in Section 2, the Student’s t-half normal model tends to produce lower estimates of σu for a given replication. This holds for DGPs 1, 2, and 3, in which the kurtosis of v is greatest. Also, for DGPs 1–3, we see that the normal-half normal model yields higher estimates of σv; however, given the discussion around the differing interpretations of σv in the two models, we have also looked at the differences in \(\sqrt {{\mathrm{Var}}\left( v \right)}\) directly. Table 2 shows that, while the Student’s t-half normal model tends to yield lower estimates of σv when a is small, its estimates of the standard deviation of v in fact tend to be higher. This supports the expectation, following the discussion in Section 2 and in Stead et al. (2018), that a model with a heavy-tailed distribution in v should attribute more of the overall error variance to noise than to inefficiency.

6 Application to highways maintenance costs in England

In this section, we apply a Student’s t-half normal model to the dataset on highway maintenance costs in England used by Stead et al. (2018). In England, responsibility for road maintenance is divided between Highways England—until 2015 the Highways Agency—a publicly-owned company responsible for the strategic ‘trunk road network’, and the county councils and unitary authorities, who maintain the non-trunk roads within their boundaries. Our data are from the CQC Efficiency Network and consist of costs and cost drivers associated with local authorities’ highway maintenance activities.

The majority of previous studies of road maintenance costs have focussed on the issue of marginal costs, and their implications for road pricing, rather than relative efficiency. These use data on motorways and canton roads in Switzerland (Schreyer et al. 2002), motorways in Austria (Sedlacek and Herry 2002), roads in Sweden (Haraldsson 2006; Jonsson and Haraldsson 2008), trunk roads in Poland (Bak et al. 2006; Bak and Borkowski 2009), and motorways and federal roads in Germany (Link 2006; 2009; 2014).

With respect to efficiency studies, Wheat (2017) undertakes the first study of local road maintenance costs in Britain, utilising a forerunner of the dataset under consideration in this paper; the author considers the optimal scale of operation as well as evidence on the cost efficiency of local highway authorities, using a normal-half normal stochastic frontier model. A further study relating to efficiency in road maintenance is that of Fallah-Fini et al. (2009), which applies DEA to data on eight counties of the US state of Virginia, with expenditure, traffic and equivalent single axle loads as inputs, road area and quality indicators as outputs, and climate factors as non-discretionary variables.

We use an unbalanced panel consisting of data on the 70 English unitary authorities and county councils that were members of the CQC Efficiency Network in 2015–16 and supplied data for at least one year from 2009–10 to that year; this gives us 327 observations in total. Cost data were supplied to the network by each authority according to definitions agreed by a working group of network members, and relate to carriageway maintenance activities only, i.e. they exclude costs associated with related activities such as winter service and footway maintenance. The dataset is updated annually for a new round of analysis; in this study we use the dataset from the 2015–16 round, the first year that the network ran. We observe large differences in unit costs, with a large number of extreme outliers in both directions, which are clearly the result of reporting errors. As a result, standard SF models yield a wide range of efficiency predictions, motivating the development of robust SF methods.

In line with the previous literature mentioned above—see Link (2014) for a summary—we use road length and traffic variables as output variables. Detailed breakdowns of local authorities’ total road lengths into urban and rural and by classification are publicly available from the Department for Transport (DfT). Roads in the UK are classified as: M (motorways), A, B, classified unnumbered, or unclassified; we hereafter refer to the latter two as C and U, respectively. The trunk road network, maintained in England by Highways England, consists of motorways and trunk A roads, leaving non-trunk A roads and all B, C, and U roads as the responsibility of local authorities. Our road length data therefore exclude motorways and trunk A roads, and likewise we use traffic data supplied by DfT which relate only to local authority maintained roads.

We separate total network length into urban and rural road lengths, URL and RRL, and also include a set of proportion variables relating to each road classification. To control for network quality, we include three road condition indicators, also available from the DfT: RDCA, RDCBC, and RDCU, relating to the proportion of A roads, B and C roads, and U roads respectively where maintenance should be considered; we weight these by the corresponding share of their road classifications in total network length. Our input price variables are WAGE, the median hourly wage in civil engineering by NUTS1 region from the Annual Survey of Hours and Earnings (ASHE) published by the Office for National Statistics (ONS), and ROCOSM, a national index of materials prices in road construction published by the Department for Business, Innovation and Skills (BIS). We use a modified Cobb-Douglas functional form including second-order terms relating to urban and rural road length. We estimate the following cost frontier:

$$\begin{array}{*{20}{l}}\ln \,TOTEX \hfill & = & \beta _0 + \beta _1\ln URL + \beta _2\ln RRL + \beta _3\ln URL^2 \\ & & + \beta _4\ln RRL^2 + \beta _5\ln URL\,\ln RRL\hfill\\ \hfill & \hfill & + \beta _6\ln TRAFFIC + \beta _7RDCA\hfill \\ \hfill & \hfill & + \beta _8RDCBC + \beta _9RDCU + \beta _{10}PROP_{UA}\hfill\\ \hfill & \hfill & + \beta _{11}PROP_{UB} + \beta _{12}PROP_{UC} + \beta _{13}PROP_{UU}\hfill\\ \hfill & \hfill & + \beta _{14}PROP_{RA} + \beta _{15}PROP_{RB} + \beta _{16}PROP_{RC}\hfill\\ \hfill & \hfill & + \beta _{17}YEAR + \beta _{18}\ln WAGE + \beta _{19}\ln ROCOSM + \varepsilon\hfill \end{array}$$
(35)

where TOTEX is total expenditure, PROPUA through to PROPRC are urban A roads, urban B roads, etc. as proportions of total network length, with the proportion of rural unclassified roads omitted, and YEAR is a time trend. All variables are mean-centred, and linear homogeneity in input prices is imposed.

Table 3 compares results from the Student’s t-half normal and normal-half normal models, showing parameter estimates, standard errors and significance levels. The Student’s t-half normal model was estimated using a developer’s version of LIMDEP, implementing the efficiency predictors discussed in the previous section, with 100 Halton draws.

Table 3 Outputs from the Student’s t-half normal and normal-half normal models

We can see from Table 3 that both models produce generally similar parameter estimates; the main difference is in the standard errors of these estimates, which are approximately a third smaller in the t-half normal model than in the normal-half normal model. While the σu parameters are comparable between the two models, as discussed in Section 5 the σv parameters are not. The variance of u is given in both cases by:

$${\mathrm{VAR}}\left( u \right) \,=\, \frac{{\pi - 2}}{\pi }\sigma _u^2$$
(36)

while VAR(v) is given by \(\sigma _v^2\) in the normal-half normal case, and by (34) in the Student’s t-half normal case. The estimates of these are compared in Table 4.

Table 4 Estimated error variances

Table 4 shows, in line with our expectations and the findings of Stead et al. (2018) relating to the logistic-half normal model, that the Student’s t-half normal model results in a lower estimated variance of inefficiency than the normal-half normal model, and that more of the total error variance is attributed to v. The overall error variance is also slightly lower, likewise mirroring the results from the logistic-half normal model. Following from this, we expect a considerably narrower distribution of efficiency predictions from the Student’s t-half normal model, owing both to the lower estimated VAR(u) and to the greater shrinkage towards the mean resulting from the higher estimated VAR(v).

Table 5 compares the mean, minimum and maximum efficiency scores from the Student’s t-half normal and normal-half normal models obtained via exp[−E(u|ε)]. As expected, the minimum efficiency estimate is considerably higher, and the maximum significantly lower, in the t-half normal model; the range of efficiency scores is therefore markedly smaller, in this case less than half that from the normal-half normal model. A more complete description is given by Fig. 1, which compares the kernel density estimates for the two sets of efficiency scores.

Table 5 Summary of efficiency scores

Fig. 1 Kernel densities of cost efficiency scores

As well as the distribution of efficiency predictions, we are also particularly interested in predictions at the tails. Figure 2 compares the relationships between the estimated residuals and efficiency predictions from the Student’s t-half normal and normal-half normal models.

Fig. 2 Scatterplot of cost efficiency scores vs residuals

Given the similarity of the frontier parameter estimates, as shown in Table 3, the residuals from the Student’s t-half normal and normal-half normal models are highly correlated. However, Fig. 2 shows that the relationship between the residuals and the efficiency predictions differs substantially between the two models: for residuals between around −0.5 and 1.5, the slope of the function is considerably flatter in the Student’s t-half normal case, so that a change in ε results in a much smaller change in exp[−E(u|ε)]. The most striking difference, however, is that the relationship is non-monotonic in the Student’s t-half normal case, with exp[−E(u|ε)] beginning to decrease slightly for the smallest values of ε and to increase very considerably for the largest values of ε. This is in contrast to the standard SF model, and also the logistic-half normal model as shown by Stead et al. (2018), in which the relationship is monotonic.

The explanation for the large reduction in the range of the efficiency predictions relative to those from the standard model is therefore twofold. First, the use of a heavy-tailed distribution for v results in greater shrinkage of efficiency predictions at the tails, as is the case in the logistic-half normal model (Stead et al. 2018). Second, the non-monotonicity in E(u|ε) means that the highest and lowest efficiency scores belong not to the most outlying observations in either direction, as in the standard model, but to observations with less extreme estimated residuals.

The intuitive explanation for this non-monotonicity in the Student’s t-half normal case is that for outlying observations, the uncertainty associated with exp[−E(u|ε)] increases, and further increases in |ε| are attributed to v to such an extent that the prediction turns back towards the unconditional mean of u. More formally, we know that the monotonicity of E(u|ε) is linked to the log-concavity of the distribution of v: specifically, E(u|ε) is a weakly (strictly) monotonic function of ε for any weakly (strictly) log-concave fv (Ondrich and Ruggiero 2001). Goldberger (1983) discusses the log-concavity of the (standard) normal, logistic, Laplace and Student’s t distributions, and these results are easily extended to the nonstandardised cases with scale σv. It is notable that each of these distributions is log-concave everywhere, with the sole exception of the Student’s t distribution, which is log-concave where \(\left| v \right| \le \sqrt a \sigma _v\) but log-convex where \(\left| v \right| \ge \sqrt a \sigma _v\), i.e. at the tails. This explains why E(u|ε) can be non-monotonic, changing direction at the tails, when v follows a Student’s t distribution, whereas E(u|ε) is weakly or strictly monotonic everywhere when v follows a normal, logistic, or Laplace distribution.
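
This tail behaviour is easy to verify: differentiating the log of (4) twice gives \(-\left( {a + 1} \right)\left( {a\sigma _v^2 - v^2} \right)/\left( {a\sigma _v^2 + v^2} \right)^2\), which changes sign exactly at \(\left| v \right| = \sqrt a \sigma _v\). A quick numerical sketch:

```python
import numpy as np

a, sigma_v = 5.0, 1.0
v = np.array([1.0, 2.0, np.sqrt(a) * sigma_v, 3.0, 4.0])
# second derivative of log f_v(v), computed analytically from eq. (4)
d2 = -(a + 1) * (a * sigma_v**2 - v**2) / (a * sigma_v**2 + v**2) ** 2
print(d2)  # negative (log-concave) below sqrt(a)*sigma_v, zero at it, positive beyond
```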

Generalising slightly beyond this particular empirical application, the finding of non-monotonicity in efficiency predictions with respect to residuals is useful for applications in economic regulation (RPI-X regulation). The model has interesting incentive properties which may help overcome informational asymmetries between regulated firms and the regulator. In particular, the non-monotonicity property should help discourage firms from trying to game the process by submitting over-favourable data, e.g. under-reporting costs in the case of cost benchmarking.

As discussed in Section 4, we are interested in testing whether v is normally distributed, in which case the Student’s t-half normal model reduces to the normal-half normal model as a limiting case. In addition to this hypothesis, we are also interested in testing the null hypothesis of no inefficiency. The likelihood ratio statistic follows a \(\chi _{1:0}^2\) distribution in both cases. Log-likelihoods are given in Table 3, and are used to calculate the likelihood ratio statistic as shown in (32). For the first null hypothesis, this gives a likelihood ratio of 4.155 and a corresponding p-value of 0.021. Under our second null hypothesis, the Student’s t-half normal model reduces to a regression model with Student’s t errors: we do not report the results of this regression, except the log-likelihood, which is −189.140. The corresponding likelihood ratio statistic is therefore 4.150 and the p-value is 0.021. We therefore reject the null hypotheses of normally distributed v and of zero inefficiency, indicating that this model performs better than either the standard SF model or the Student’s t regression model for these data.

Overall our empirical application demonstrates that the t-half normal SF model is indeed supported by our data relative to the more standard normal-half normal SF model. Further the frontier parameter estimates are plausible and very similar in both models. In terms of the efficiency predictions, the Student’s t-half normal SF model yields a much more plausible spread of efficiency predictions at the tails.

7 Summary and Conclusions

This paper proposes a new stochastic frontier model as a means to account for outlying observations in stochastic frontier analysis. A failure to account for outliers in the standard stochastic frontier model can lead to an exaggeratedly wide range of efficiency predictions, with the predictions relating to the least efficient firms being implausibly low. This problem is apparent in our application to highway maintenance departments in England, and also in several other applications we identify in the literature, across a range of countries and industries.

We propose a model combining a Student’s t distribution for noise, v, with a half normal distribution for inefficiency, u. Our model is an original and significant contribution to the literature, not only because it better accommodates outlying observations in efficiency analysis relative to the standard normal-half normal stochastic frontier (SF) model, but also because it is the first to contain the standard SF model as a testable limiting case. As such our model provides a natural extension to the tools of practitioners in the field. The advantages of this distribution are that the kurtosis of v is determined by a degrees of freedom parameter which is freely estimated, and that it encompasses the normal distribution as a limiting case as this parameter approaches infinity. This means that the heaviness of the tails of v, reflecting the prevalence of outliers in the data, is flexible, and that testing down to the standard SF model is possible. We derive the log-likelihood and efficiency predictors for the t-truncated normal SF model, and discuss extensions to other distributions for u, which are straightforward.

We consider how to test the null hypothesis of normally distributed noise against the alternative of a heavier tailed distribution. We show that the associated LR test statistic is distributed as a mixture of chi-squared distributions by appealing to results in Chen and Liang (2010), and provide simulation evidence that the test statistic does follow the proposed distribution under the null in large samples.

Simulation evidence is provided for the maximum simulated likelihood estimator of our model. As well as showing that the estimator performs well in recovering the true parameters of our DGP for small values of the Student’s t shape (degrees of freedom) parameter, the simulations indicate that when the true DGP is the standard SF (normal-half normal) model, the Student’s t SF model recovers very similar estimates to the standard SF model. This finding, coupled with the ability to test the hypothesis of normally distributed noise against a heavier tailed alternative using the LR test statistic, provides reassurance as to the robustness of a modelling approach based on starting with the Student’s t SF model and testing whether the standard SF model is an appropriate simplifying restriction.

The simulation results also highlight the possibility of ‘wrong kurtosis’, arising specifically where the excess kurtosis of the OLS residual distribution is less than zero. In this case the maximum simulated likelihood estimates approach those from the normal-half normal model, and as such ‘wrong kurtosis’ is similar to the ‘wrong skew’ problem previously identified in the literature. Just as the probability of ‘wrong skew’ arising in the standard model decreases as the number of observations or the signal to noise ratio, or both, increase (Simar and Wilson 2010), it appears that the probability of ‘wrong kurtosis’ is negatively related to sample size and positively related to the degrees of freedom parameter. It is important to note that the practical implications of ‘wrong kurtosis’ are not as severe as in the wrong skew case: under ‘wrong kurtosis’, estimation of the Student’s t-half normal stochastic frontier model recovers the normal-half normal model estimates, whilst under ‘wrong skew’ the estimate of the variance of the inefficiency distribution approaches zero, indicating no evidence of inefficiency.

We apply a Student’s t-half normal model to estimate a cost frontier using a dataset on English local authorities’ highway maintenance costs, and compare the model’s outputs with those from the normal-half normal model. We find similar frontier parameter estimates from the two models, though with reduced standard errors in the t-half normal model. We implement testing against the standard SF model, and find that we are able to reject the null hypothesis of normally distributed v. This implies that it is important to account for heavy tails in our data.

The main empirical differences between the two models are, firstly, a reduced estimate of VAR(u) and an increased estimate of VAR(v), and secondly, a reduced range of efficiency predictions according to exp[−E(u|ε)]. Thirdly, we find a non-monotonic relationship between residuals and exp[−E(u|ε)], in contrast to the standard SF model. This could be an important feature of the model in practical applications, for example in regulatory settings, where it could incentivise correct reporting of data by regulated firms, as the alternative ‘gaming’ behaviour of under-reporting, say, costs may actually reduce the firm’s efficiency prediction.

Finally, we have integrated our model into the LIMDEP/NLOGIT econometric software (Greene 2016), and have also written a Stata package enabling estimation of the model, which means that it is ready for use by practitioners. An interesting avenue for future research would be to consider the impact of Student’s t noise in the context of the various panel data SF specifications, for example models accounting for unobserved heterogeneity.