Robust stochastic frontier analysis: a Student’s t-half normal model with application to highway maintenance costs in England

The presence of outliers in the data has implications for stochastic frontier analysis, and indeed any performance analysis methodology, because they may lead to imprecise parameter estimates and, crucially, lead to an exaggerated spread of efficiency predictions. In this paper we replace the normal distribution for the noise term in the standard stochastic frontier model with a Student’s t distribution, which generalises the normal distribution by adding a shape parameter governing the degree of kurtosis. This has the advantages of introducing flexibility in the heaviness of the tails, which can be determined by the data, as well as containing the normal distribution as a limiting case, and we outline how to test against the standard model. Monte Carlo simulation results for the maximum simulated likelihood estimator confirm that the model recovers appropriate frontier and distributional parameter estimates under various values of the true shape parameter. The simulation results also indicate the influence of a phenomenon we term ‘wrong kurtosis’ in the case of small samples, which is analogous to the issue of ‘wrong skewness’ previously identified in the literature. We apply a Student’s t-half normal cost frontier to data for highways authorities in England, and this formulation is found to be preferred by statistical testing to the comparator normal-half normal cost frontier model. The model yields a significantly narrower range of efficiency predictions, which are non-monotonic at the tails of the residual distribution.


Introduction
Frontier analysis is concerned with the measurement of efficiency relative to an estimated production or cost frontier. The presence of noise in the sample is potentially problematic in two ways: it affects the position of each decision making unit (DMU) relative to the frontier, and it affects the precision of the estimates of the shape of the frontier itself. The magnitudes of these effects vary from one method to another: deterministic methods, such as data envelopment analysis (DEA) (Charnes et al. 1978) and corrected ordinary least squares (COLS) are particularly sensitive, given that they make no allowance for noise. In contrast, Stochastic Frontier (SF) Analysis explicitly controls for noise, mitigating its impact on the estimated frontier and on individual efficiency scores.
The range of efficiency scores may still be very large in the presence of data with outlying observations. In terms of empirical motivation for this paper, this paper is in response to the authors' finding of an implausibly wide range of efficiency scores in our work studying cost drivers and cost efficiency in the highways maintenance operations of local authorities in England, which utilises detailed data on operating and capital expenditure provided by each authority. This can be narrowly explained as being due to a combination of under-reporting and over-reporting of expenditure, unobserved investment cycle effects, and extreme weather events. However we also came across this issue in a number of other datasets, across numerous sectors including hotels in Taiwan (Chen 2007), container ports (Cullinane et al. 2006), regional iron and steel production in China (Lin and Wang 2014), crop farms in Poland (Latruffe et al. 2004), French insurance companies (Fecher et al. 1993), Belgian municipalities (De Borger and Kerstens 1996), US banks (Bauer et al. 1998), and police forces in England and Wales (Drake and Simper 2003). In general, as observed by Gupta and Nguyen (2010), financial data are often heavy tailed. For example Berger and Humphrey (1991) observe heavy tailed distribution of costs in banking, an industry to which efficiency analysis is often applied. As such the issue in this paper has broad applicability across the performance literature.
In this paper we outline and discuss the merits and empirical application of a new stochastic frontier model which accommodates the influence of outlying observations. This is the stochastic frontier model with a Student's t distribution for the noise term. The advantages of this model over previous proposals lie in its flexibility, since the degree of kurtosis is no longer fixed but allowed to vary with the degrees of freedom parameter. In fact, the Student's t distribution nests (or more precisely, contains as a limiting case) the normal distribution as the degrees of freedom parameter approaches infinity. For any given distribution of u i , our model encompasses the standard SF model. This enables testing down against the standard model, in contrast to previous proposals which utilise non-nested specifications, and lets the data determine the extent to which outlying observations influence the kurtosis of the noise error term. Thus our model is an original and significant contribution to the literature, not just in being able to better accommodate outlying observations in efficiency analysis relative to the standard SF model, but it is the first contribution to contain as a testable limiting case the standard SF model. As such our model provides a natural extension to the tools of practitioners in the field.
The structure of this paper is as follows: Section 2 reviews existing methods available in the literature to handle a large number of outliers in frontier analysis. Section 3 introduces a t-truncated normal stochastic frontier model for dealing with heavy-tailed noise and discusses efficiency prediction. Testing of the normal distribution hypothesis is considered in Section 4. Section 5 presents Monte Carlo evidence on the performance of the maximum simulated likelihood estimator of the model. Section 6 applies the Student's t-half normal model to data on highways maintenance costs in England and compares the results to those obtained from normal-half normal model, and Section 7 gives our summary and conclusions. Some technical results appear in the Appendices.

Approaches to outliers in stochastic frontier analysis
The standard SF model, as developed by Aigner et al. (1977) and Meeusen and van Den Broeck (1977), is as follows: Where the subscript i denotes the observation, y i is the dependent variable, x i is a vector of independent variables, and ε i is an error term with two components. The noise component, v i , follows a symmetric distribution centred at zero, and the inefficiency component, u i , is drawn from a one-sided distribution. In the production case s = 1, while in the cost case s = −1. Many alternative distributions have been proposed for u, for example the half normal or exponential (Aigner et al. 1977), truncated normal (Stevenson 1980), or gamma (Greene 2003) distributions. In comparison, v i is almost always assumed to follow a normal distribution, although both distributional assumptions are crucial with regards to the both the robustness of the parameter estimates to outliers and the decomposition of the estimated residual into noise and inefficiency. Concerning robustness to outliers, it is worth pointing out that one of the original motivations behind SF analysis as an alternative to ordinary least squares (OLS)-which after all yields unbiased estimates of the frontier parameters apart from the intercept-was to obtain estimates that are more robust to skewness of the residuals implied by inefficiency. It was not until the Jondrow et al. (1982) paper that a method of obtaining observation-specific predictions of efficiency was introduced. On the decomposition of the residuals, obtaining these observation-specific efficiency scores proceeds by making some prediction of u i based on the conditional distribution of u i |ε i Dropping the subscript i from here on, this is given by where f v and f u are the probability density functions of v and u respectively, and f ε is the probability density function of the composed error derived as the convolution of the two error components. The usual approach to efficiency prediction is to use the mean of this distribution according to exp [−E(u|ε)] (Jondrow et al. 1982) or exp[E(−u|ε)] (Battese and Coelli 1988). Clearly, any efficiency prediction derived in this way will depend upon the distribution not only of u but also of v: Wang and Schmidt (2009) derive the distribution of E(u|ε) in the normal-half normal case, and show that E(u|ε)→u only as VAR(v) → 0, while E(u|ε)→E(u) as VAR(v) → ∞. The conditional mean predictor is therefore a shrinkage of u towards its mean with the degree of shrinkage depending upon the distribution of v.
Given that the presence of outliers, where these are attributed to noise rather than inefficiency, implies a leptokurtic 1 noise distribution, the normal distribution usually assumed for v is inadequate since it is mesokurtic for any given parameter values. Intuitively, we may therefore expect outliers in the data to result in an exaggerated spread of efficiency scores for two reasons: first, because of an inflated estimate of the scale of the distribution of u, and second because of insufficient shrinkage of u towards its mean, especially at the extremes. That is to say, if leptokurtosis in the noise term due to outliers is not taken into account, residuals will be attributed disproportionately to inefficiency rather than noise, particularly in outlying observations. This motivates the development of alternative SF models that can accommodate outliers.
A review of the literature in this area provided by Stead et al. (2018) covers several different approaches to outliers: the use of alternative efficiency predictors, thick frontier analysis, heteroskedastic SF models, and the use of alternative noise distributions. Other possible approaches include right truncation of the inefficiency distribution to restrict the range of possible efficiency predictions, and detecting and removing outliers.
The detection and removal (or reweighting) of outliers is a common approach to dealing with outliers in regression analysis, with outliers identified on the basis of the extent to which an observation has a disproportionate effect on parameter estimates, captured by Cook's distance (Cook 1977), Mahalanobis distance (Mahalanobis 1936), and similar measures. However, existing methods may not be appropriate in the case of skewness of the composed error. For example, Cook's Distance explicitly assumes normally distributed errors, while in SF analysis the composed error has a skew normal distribution (Azzalini 1985), and in the normal-exponential case it follows what is known as an exponentially modified Gaussian distribution (Grushka 1972). One approach to outliers could be to derive measures of influence and leverage appropriate for SF models, although these would depend upon the particular specification used.
Finally, we can account for outliers by assuming an appropriate distribution for v. In principle, any distribution that is symmetric, centred around zero and unimodal is an appropriate candidate for the distribution of v. Although far more attention has been paid in the literature to the distribution of u, several such alternatives have been suggested. Outside of the SF literature, Lange et al. (1989) suggest the use of the Student's t distribution for the error terms as a robust alternative to OLS. Lange and Sinsheimer (1993) discuss estimation when errors are drawn from the logistic, slash, t, and contaminated normal distributions. Note that the latter is simply the mixture of two normal distributions with the same mean but differing variances. All of these distributions have heavier tails than the normal distribution, and therefore offer greater robustness to outliers. Reviews of the SF literature by Greene (2008) and Parmeter and Kumbhakar (2014) include some discussion of alternative noise distributions.
In the context of SF analysis, Tancredi (2002) proposes a model in which v is drawn from a Student's t distribution and u from a half t distribution. Tchumtchoua and Dey (2007) study the Student's t-half t model in a Bayesian setting. Griffin and Steel (2007) and Hajargasht and Griffiths (2018) also discuss estimation of Bayesian SF models with Student's t noise using Markov Chain Monte Carlo (MCMC) and variational Bayes methods, respectively. Nguyen (2010) introduces two additional alternative heavy tailed distributions for v: the Laplace distribution and the Cauchy distributionnote that the latter is a Student's t distribution with degrees of freedom equal to oneand derive formulae for Cauchy-truncated Cauchy and Cauchyhalf Cauchy SF models. Applications are shown in Gupta and Nguyen (2010). Horrace and Parmeter (2018) study Laplace-truncated Laplace and Laplace-exponential 2 SF models. Noting that the Laplace distribution is ordinary smooth, in contrast to the normal distribution, which is supersmooth, they conjecture that the Laplace distribution may be advantageous with respect to the deconvolution of ε into v and u, since optimal convergence rates for deconvolution problems decrease with the smoothness of the noise distribution, being in particular much slower when the noise distribution is supersmooth rather than ordinary smooth (Fan 1991;1992).
The performance of these models in the presence of outliers is interesting, but under-explored. Following the previous discussion, given the use of heavy-tailed distributions for v, we might expect these models to be more robust to outliers in terms of parameter estimation, and in terms of yielding less extreme efficiency predictions for outlying observations. Indeed, Tancredi (2002) shows that as sε i →∞, f u|ε (u|ε) becomes completely flat in the Student's t-half t model, in contrast to how f u|ε (u|ε) becomes a degenerate distribution at zero as sε → ∞ in the normal-half normal model. Similarly, Horrace and Parmeter (2018) show that f u|ε (u|ε)and hence E(u|ε)is constant for positive values of sε in the Laplace-exponential case. Focusing more explicitly on outliers, Stead et al. (2018) propose a model in which v follows a logistic distribution and u follows a half normal distribution, and show that the model results in a smaller estimate for VAR(u) and yields a considerably narrower spread of efficiency scores than the normal-half normal model, with little change in exp[E(−u|ε)] for large |sε|. Each of these cases suggest that the choice of a heavy tailed distribution for v significantly affects the prediction of efficiency, and the uncertainty in prediction, at one or both tails, producing less extreme efficiency predictions.
3 The Student's t-truncated normal SF model In this section, we propose a robust SF model in which v follows a Student's t distribution and u follows a truncated normal distribution. The Student's t distribution is for v has a number of attractive properties in the present context. First, it is a heavy tailed distribution, and thus more robust to the presence of outliers. Second, it avoids the arbitrary assumptions on the degree of kurtosis in v embedded in existing models-the normal, logistic, and Laplace distributions have excess kurtosis of 0, 1.2, and 3, respectively, regardless of parameter values. Ultimately, the kurtosis of v is an empirical question, and therefore we should ideally use a distribution for v for which kurtosis is flexible. The kurtosis of the Student's t distribution depends upon the degrees of freedom parameter. Third, the Student's t distribution nests the Cauchy distribution when the degrees of freedom parameter is equal to one and the normal distribution as it approaches infinity. Therefore an SF model with a Student's t distribution for v encompasses a model in which v follows a Cauchy or normal distribution for any given distribution of u. This enables testing against these alternatives. In the latter case, we are testing against the standard SF model, which could be interpreted as a test of robustness to outliers.
Below, we derive simulated log likelihood functions and efficiency predictors for the Student's t-truncated normal SF model, and discuss estimation and hypothesis testing. Results for the Student's t-half normal and Student's texponential models can be obtained via some simple modifications. Extensions to other distributions of u are straightforward if the quantile function of that distribution has a closed form, while in many other cases-as with the Student's t-gamma-the simulated log likelihood function becomes slightly more complex.

Formulation and estimation
In SF models, the error ε (noting again that we drop the subscript i for notational simplicity) is composed of a symmetric noise component v and an inefficiency component u which is drawn from a one-sided distribution: In this study, we assume that v is drawn from a nonstandardized Student's t distribution-which includes a scale parameter σ v -and that u follows a truncated normal distribution, so that the probability density functions f v and f u are given by where μ and σ u are the mean and standard deviation of the pre-truncation distribution of u, respectively, a is a shape parameter that determines the kurtosis of the Student's t distribution, and Γ is the gamma function. If μ is set to zero, then the model reduces to a Student's t-half normal model as applied to our data in Section 6. As noted previously, as a → ∞ the Student's t distribution approaches the normal distribution. Thus, our model encompasses as a limiting case the normal-truncated normal model. Similarly, when a = 1 we have a Cauchy distributed noise component. The joint density of ε and u is given by and the marginal density of ε is given by the convolution which lacks a convenient closed form, hindering ML estimation. As an alternative, we may use simulation to approximate the integral and arrive at a simulated log likelihood function-see Train (2009) for an introduction to maximum simulated likelihood-as proposed by Greene (2003) for the normal-gamma model and Stead et al. (2018) for the logistic-half normal model. We begin by noting that (7) is the expectation of f v (ε + su), where u is drawn from a truncated-normal distribution, that can be estimated by where u q is a draw from a truncated normal distribution 3 . This gives us the simulated probability density function for ε and the simulated log-likelihood function is This may be maximised like any conventional loglikelihood function, provided we have our draws from the truncated normal distribution. From Geweke et al. (1997) and Greene (2003), we have the quantile function of a truncated normal random variable: Where t l and t r are the left and right truncation points, respectively, and F q is draw q from the standard uniform distribution. Since we know that t l = 0, t r = ∞, this simplifies to In order to modify the model so that the one-sided error follows some other distribution, we need only change u q such that we instead obtain draws from the chosen distribution. The most obvious choices are the exponential and gamma distributions. For the t-exponential case, we have the quantile function And by substituting (13) or (14) into (11), we have the simulated log-likelihood function for the Student's t-truncated-normal and Student's t-exponential models respectively. Other proposed distributions for u, such as the Weibull (Tsionas 2007) and Rayleigh (Hajargasht 2015) distributions also have closed form quantile functions. However, even in cases in which the quantile distribution has no analytical expression, such as the gamma distribution, forming the simulated log-likelihood is possible-see Greene (2003).
Proceeding with the Student's t-truncated normal variant of the model, (11) and (13) give the simulated loglikelihood function. First order conditions for maximisation are given in Appendix B. One remaining issue is the method of taking random draws: we prefer to use Halton draws, which aim for good coverage of the unit interval rather than randomness: this significantly reduces the number of draws needed to approximate the integral (see Greene (2003) for a fuller discussion).
Recalling the discussion of Horrace and Parmeter (2018) around convergence rates and the smoothness property of the noise distribution in deconvolution problems, a further attraction of the Student's t distribution is the potential advantage in terms of estimating f u . Fan (1991) discusses the smoothness properties of several distributions. The normal distribution is supersmooth of order 2, while the Cauchy distribution is supersmooth of order 1. The smoothness of the Student's t distribution is not explicitly discussed, but may be determined as follows. From Fan (1991), the distribution of a random variable is said to be supersmooth of order β if its characteristic function, φ(t), satisfies where d 0 , d 1 , β, and γ are positive constants, and β 0 and β 1 are constants. Expression 4.8 in Hurst (1995) gives the following result where φ v (t) denotes the characteristic function of a Student's t distribution with degrees of freedom parameter a. Substituting (16) into (15), we see that the Student's t distribution is supersmooth of order 1 when a is finitewhich encompasses the Cauchy case where a = 1. From the results of Fan (1991), this implies that optimal convergence rates are faster when noise is Student's t distributed than when noise is normally distributed. It should be noted however that ordinary smooth distributions, such as the Laplace distribution, have convergence rates that are faster still, being polynomial functions of N, as opposed to logarithmic functions of N as in the case supersmooth distributions (Fan 1992). For more discussion of the differences between smooth and supersmooth noise distributions in stochastic frontier analysis, see Horrace and Parmeter (2011;.

Parameter identification
To demonstrate that the model parameters are identified, we show that consistent method of moments estimators exist. For the sake of tractability, we restrict attention to the case where μ = 0. As in the standard SF model, the OLS estimator b β is a consistent estimate of β, with the exception that b β 0 yields a consistent estimator of β 0 −sE(u).
Since v and u are independent, the first four moments of ε are given by Setting these equal to the corresponding raw sample moments μ 2 , μ 3 , and μ 4 then rearranging gives the following method of moments estimators-note the correction to the estimated intercept.
Note that since the moments of the Student's t distribution are variously undefined or infinite for small a, this is also the case for moments of ε, so these estimators exist only when a is sufficiently large. Further, just as the method of moments approach to estimation of the standard SF model breaks down when μ 3 ≤ 0, μ 4 must be sufficiently large that the denominator in (22) is positive for method of moments to be used in this case. 4 In Section 5, we undertake Monte Carlo simulations to analyse the performance of the model for a variety of values of α = 2, a = 4, a = 5, a = 10, and a → ∞, the latter corresponding to v~N(0,1).
Well-known results from Waldman (1982) show that the OLS estimator is a stable stationary point of the loglikelihood function for the standard model, and that the identification of the σ u parameter hinges on the skewness of the OLS residuals. Horrace and Parmeter (2018) show that, while these results do not apply to the SF model with Laplace noise since OLS is not the limiting estimator as Var (u) → 0, an analogous result applies: the least absolute deviations (LAD) estimator is a stationary point of the loglikelihood 5 . It is trivial to show that a similar result applies to our model, i.e. that the ML estimator of a regression model with Student's t errors is a stationary point in our loglikelihood function. Horrace and Wright (forthcoming) show that such a stationary point exists under very weak distributional assumptions about v and u. This suggests that identification of σ u in our model depends upon the skewness of the residuals from the Student's t regression model. However, while Horrace and Parmeter (2018) point out that in the Laplace case the LAD stationary point is not stable due to non-differentiability of the log-likelihood function in the limiting case, this does not appear to apply in the Student's t case, in which the log-likelihood function is everywhere differentiable.

Efficiency prediction
As discussed in previous sections, the usual approach to generating observation-specific efficiency scores is to predict values based on the distribution of u|ε, which is given by: The most widely used predictors are the mean of this conditional distribution according to exp[−E(u|ε)]: Or, E[exp(−u|ε)]: Note that in both cases, since f ε (ε) is not a function of u, it can be moved outside the integral. Bearing in mind that f ε (ε) is the convolution of f v (ε+su) and f u (u), we therefore have And For the model we are considering, none of these integrals have closed form solutions, so we approximate them via simulation. The simulated integral in the denominators of both formulae are given by (10), and the integrals in the numerators are the expectations of uf v (ε+su) and exp(−u)f v (ε+su) given that u is a random variable with the probability density function f u (u). We therefore have, with some simplifying and rearranging: Where u q is given by (13) in the Student's t-truncated normal and (14) in the Student's t-exponential models, for example.

Hypothesis testing concerning the noise distribution
As discussed previously, an attraction of the Student's t distribution in the current context is that the normal distribution is the limiting case as a → ∞. Therefore a stochastic frontier model in which v follows a Student's t distribution encompasses a model in which v follows a normal distribution. This allows us to test down from a model with Student's t noise to a standard SF model, which could be interpreted as a testing for heavy tails-or the significance of outliers in the data-and used for model selection. For this purpose, the likelihood ratio test statistic is an obvious choice. This is defined as Where lnL A is the simulated log-likelihood from the Student's t model and lnL 0 is the log-likelihood from the null model with normally distributed v. The standard result that this statistic has a limiting χ 2 distribution with degrees of freedom equal to the difference in dimensionality between the alternative and null models does not apply, since under the null hypothesis the degrees of freedom parameter a lies on the boundary of the parameter space. Case 5 in Self and Liang (1987)-see also Case 2 in Chen and Liang (2010)-shows that the likelihood ratio statistic follows an asymptotic 50:50 mixture of χ 2 0 and χ 2 1 distributions, denoted χ 2 1:0 . Economou (2011) applies this result to an analogous problem in survival analysis of testing down from a three parameter Burr XII distribution to a two parameter Weibull distribution. A further analogue in stochastic frontier analysis is testing for the presence of an upper bound on u, since an SF model with a tail truncated distribution for u nests the standard SF model as the tail truncation point B → ∞.
We present simulation evidence regarding the distribution of the LR, under the null hypothesis that v is normally distributed, in Appendix C. We do this to verify that the results of Chen and Liang (2010) do apply in this case. Note that we could reparameterise the model to include an inverse degrees of freedom parameter, 1/a, in which case the standard model is the limiting case as 1/a → 0; however, this boundary remains an open boundary. In particular, we note that in this setting, and the two other examples appearing in the literature discussed above (Weibull and upper truncation point in the inefficiency distribution), the null hypothesis corresponds to a parameter value at an open (and not closed) boundary. Chen and Liang (2010) state that their result still holds when the boundary is an open boundary, but we wish to verify this. The evidence presented in Appendix B lends support to the idea that LR $ χ 2 1:0 under the null hypothesis. We therefore consider this test to be appropriate for this purpose.

Monte Carlo simulations
In this section, we examine the performance of the Student's t-half normal model under various data generating processes (DGPs). We are primarily interested in how the model performs in two different scenarios: when v is Student's t distributed, and when v is normally distributed. In And v is a standard Student's t random variable with degrees of freedom parameter a = 2, a = 5, a = 10, and a → ∞, respectively. Note that in the latter case, v is normally distributed, so we draw from the standard normal distribution. For each of these DGPs, we then vary the number of observations used in each replication. We consider the cases N = 100, N = 200, and N = 1000. This will give some insight into the small sample performance of the model.
We estimate a Student's t-half normal stochastic cost frontier model using the simulated data generated for each of one thousand replications per DGP. As a point of comparison, we also estimate a normal-half normal stochastic cost frontier model using the data from each replication. The comparison between the results from the two models helps to put the results from our model into context, allowing us to judge its performance relative to that of the standard model. When comparing the results from the two models, it should be kept in mind that the interpretation of the σ v parameter differs. In the normal-half normal model, σ 2 v is the variance of the noise term v, while in the Student's t-half normal model, the variance of v also depends on the degrees of freedom parameter, a. Specifically, the variance of v in our model is given by And is undefined for a ≤ 2. It is therefore to be expected that, when the true DGP involves a Student's t distributed v, the estimated σ v parameter from the normal-half normal model will tend to be greater than the value used in the DGP. Table 1 summarises the results from each simulation. The mean, median, and standard deviation of the estimates for each parameter across the 1000 replications are shown from both the normal-half normal and Student's t-half normal models. The first two columns show the parameters and corresponding values in each DGP.
Across each of the four DGPs, and regardless of the number of observations, the mean parameter estimates from the Student's t-half normal model are similar to those from the normal-half normal model, and both tend to be reasonably close to the true values. An exception to this is that in the N = 100 and N = 200 cases for the DGPs with finite a, the mean estimates of the degrees of freedom parameter are many times greater than the true values from the DGP. However, the median estimates of a are much closer to the true values. The reason for the large difference between the mean and median estimates of a is that the distributions of the estimates are skewed by a small number extremely large values. We find that these very large values occur in repetitions in which the kurtosis of the OLS residuals is low, and specifically where the excess kurtosis-i.e. kurtosis over and above that of the normal distribution-is near to or less than zero. This is significant since, as with the discussion in Section 3.23.2, it further suggests a link between excess kurtosis and identification of the degrees of freedom parameter a, analogous to the link between skewness and identification of the σ u parameter in the standard model.
In the N = 1000 repetitions, this 'wrong kurtosis' issue does not arise in the a = 4 or a = 5 cases, but does in the a = 10 case. In the α = 2 case, 'wrong kurtosis' does not arise in either the N = 200 or N = 1000 repetitions. Intuitively, this is due to the fact that the Student's t distribution with a = 10 already closely resembles the normal distribution, and further increases in a yield relatively small changes to the shape of the distribution. Just as the probability of 'wrong skew' arising in the standard model decreases as the number of observations or the signal to noise ratio, or both, increase (Simar and Wilson 2010), it appears that the probability of 'wrong kurtosis' is negatively related to sample size and positively related to the degrees of freedom parameter.
Finally, on the subject of 'wrong kurtosis', it is important to note that whilst this phenomenon is similar to the 'wrong skew' problem in terms of potentially arising due to the luck of small sample draws from a 'correct kurtosis' distribution, the practical implications of this occurrence are not as severe as in the wrong skew case. In the wrong kurtosis case, the estimation of the Student's t-half normal stochastic frontier models recovers the normal-half normal model estimates, whilst under 'wrong skew' the estimate of the variance of the inefficiency distribution approaches zero, indicating no evidence of inefficiency.
Estimates of σ v from the normal-half normal model for DGPs 1-4 tend to be greater than 1, as expected. For DGPs 2-4, in which VAR(v) is defined, the standard deviations of v are 1.414, 1.291, and 1.118, respectively. Interestingly, the mean estimates of σ v from the normal-half normal model correspond closely to these values. This suggests that, although the normal-half normal model cannot capture the kurtosis of v, it approximates the variance of the noise term relatively well. Looking at the standard deviations of the parameter estimates from our model, we can see that the sampling variation is generally less than that from the normal-half normal for the a = 2, a = 4 and a = 5 cases. This reflects the greater sensitivity of the standard model to outlying observations, and suggests that our model is indeed more robust to the presence of heavy tails. For the a = 10 and vÑ (0,1) cases, the estimates from the normal-half normal model generally have lower standard deviations than those from our model.
The final DGP corresponds to a model in which v~N(0,1). In this case, our model performs well when the true DGP is the normal-half normal model, as evidenced by the very large mean and median estimates of a across all sample sizes, and by the similarity between the frontier and scale parameter estimates produced by the model and those produced by the normal-half normal model. In this case, we are also interested in how the results typically differ from those of the normal-half normal model. Table 2 summarises the within-replication differences between the estimates of β 0 , β 1 , σ v , and σ u from the Student's t-half normal and normal-half normal models, respectively, showing the mean difference and the standard deviation of the differences. A positive (negative) mean indicates that the Student's t-half normal model tended to produce a higher (lower) estimate of a given parameter than the normal-half normal model.
From the above, we can see that, as expected following the discussion in Section 2, the Student's t-half normal model tends to produce lower estimates of σ u for a given replication. This holds for DGPs 1, 2, and 3, in which the kurtosis of v is greatest. Also, for DGPs 1-3, we see that the normal-half normal model yields higher estimates of σ v ; however, given the discussion around the differing interpretations of σ v in the two models, we have also looked at the differences in ffiffiffiffiffiffiffiffiffiffiffiffiffi Var v ð Þ p directly. Table 2 shows that, while the Student's t-half normal model tends to yield lower estimates of σ v when a is small, its estimates of the standard deviation of v in fact tend to be higher. This supports the expectation, following the discussion in Section 2 and in Stead et al. (2018), that a model with a heavy-tailed distribution in v should attribute more of the overall error variance to noise than to inefficiency.

Application to highways maintenance costs in England
In this section, we apply a Student's t-half normal model to the dataset on highway maintenance costs in England used by Stead et al. (2018). In England, responsibility for road maintenance is divided between Highways England-until 2015 the Highways Agency-a publicly-owned company responsible for the strategic 'trunk road network', and the county councils and unitary authorities, who maintain the non-trunk roads within their boundaries. Our data are from the CQC Efficiency Network 6 and consist of costs and cost drivers associated with local authorities' highway maintenance activities.
The majority of previous studies of road maintenance costs have focussed on the issue of marginal costs, and their implications for road pricing, rather than relative efficiency. These use data on motorways and canton roads in Switzerland (Schreyer et al. 2002), motorways in Austria (Sedlacek and Herry 2002), roads in Sweden (Haraldsson 2006;Jonsson and Haraldsson 2008), trunk roads in Poland (Bak et al. 2006;Bak and Borkowski 2009), and motorways and federal roads in Germany (Link 2006;2014).
With respect to efficiency studies, Wheat (2017) undertakes the first study of local road maintenance costs in Britain and utilised a forerunner of the dataset under consideration in this paper. The author considered the optimal scale of operation as well as evidence for the cost efficiency of local highway authorities. A normal-half normal stochastic frontier model was used. A further study relating to efficiency in road maintenance is that of Fallah-Fini et al. (2009), which applies DEA to data on eight counties of the US state of Virginia, with expenditure, traffic and equivalent single axle loads as inputs, road area and quality indicators as outputs, and climate factors as nondiscretionary variables.
We used an unbalanced panel consisting of data on the 70 English unitary authorities and county councils that were members of the CQC Efficiency Network in 2015-16 and supplied data for at least one year from 2009-10 to that year; this gives us 327 observations in total. Cost data were supplied to the network by each authority according to definitions agreed by a working group of network members, and relate to carriageway maintenance activities only, i.e. they exclude costs associated with related activities such as winter service and footway maintenance. The dataset is updated annually for a new round of analysis. In this study we use the dataset from the 2015-16 round, which was the first year that the network ran. We observe large differences in unit costs, with a large number of extreme outliers in both directions, which are clearly the result of reporting errors. As a result, standard SF models yield a wide range of efficiency predictions, motivating the development of robust SF methods.
In line with the previous literature mentioned above-see Link (2014) for a summary-we use road length and traffic variables as output variables. Detailed breakdowns of local authorities' total road lengths into urban and rural and by classification are publicly available from the Department for Transport (DfT). Roads in the UK are classified as: M (motorways), A, B, classified unnumbered, or unclassified; we hereafter refer to the latter two as C and U, respectively. The trunk road network, maintained in England by Highways England, consists of motorways and trunk A roads, leaving non-trunk A roads and all B, C, and U roads as the responsibility of local authorities. Our road length data therefore exclude motorways and trunk A roads, and likewise we use traffic data supplied by DfT which relate only to local authority maintained roads.
We separate total network length into urban and rural road lengths, URL and RRL, and also include a set of proportion variables relating to each road classification. To control for network quality, we include three road condition indicators, also available from the DfT: RDCA, RDCBC, and RDCU relating to the proportion of A roads, B and C roads, and U roads respectively where maintenance should be considered; we weight these by the corresponding share of their road classifications in total network length. Our input price variables are: WAGE, the median hourly wage in civil engineering by NUTS1 region from the Annual Survey of Hours and Earnings (ASHE), published by the Office for National Statistics (ONS), and a national index of materials prices in road construction that were published by the Department of Business, Innovation and Skills (BIS), ROCOSM. We use a modified Cobb-Douglas functional form including second-order terms relating to urban and rural road length. We estimate the following cost frontier: Where TOTEX is total expenditure, PROP UA through to PROP RC are urban A roads, urban B roads, etc. as proportions of the total network length, with the proportion of rural unclassified roads omitted, and YEAR is a time trend. All variables are mean-centred, and linear homogeneity in input prices is imposed. Table 3 compares results from the Student's t-half normal and normal-half normal models, showing parameter estimates, standard errors and significance levels. The Student's t-half normal model was estimated using a developer's version of LIMDEP, along with formulae for efficiency predictions as discussed in the previous section, and the model was estimated based upon 100 Halton draws.
We can see from Table 3 that both models result in generally similar parameter estimates; the main difference is in the standard errors of these estimates, which are approximately a third smaller in the t-half normal model than in the normal-half normal model. While the σ u parameters are comparable between the two models, as discussed in Section 5 the σ v parameters are not. The variances of u is given in both cases by:    192 0.194 while VAR(v) is given by σ 2 v in the normal-half normal case, and by (34) in the Student's t-half normal case. The estimates of these are compared in Table 4. Table 4 shows that, in line with our expectations, and the findings of Stead et al. (2018) relating to the logistic-half normal model, that the Student's t-half normal model results in a lower estimated variance in inefficiency than the normal-half normal model, and that more of the total error variance is attributed to v. The overall error variance is also slightly lower, likewise mirroring the results from the logistic-half normal model. Following from this we expect to find a considerably narrower distribution of efficiency predictions from the Student's t-half normal model owing to both the lower estimated VAR(u) and the greater shrinkage towards the mean resulting from the higher estimated VAR(v). Table 5 compares the mean, minimum and maximum efficiency scores from the Student's t-half normal and normal-half normal models obtained via exp [−E(u|ε)]. As expected, the minimum efficiency estimate is considerably higher, and the maximum also significantly lower, in the t-half normal model, and therefore the range of efficiency scores is remarkably smaller. In this case less than half. A more complete description is given by Fig. 1, which compares the kernel density estimates for the two sets of efficiency scores.
As well as the distribution of efficiency predictions, we are also particularly interested in predictions at the tails. Figure 2 compares the relationships between the estimated residuals and efficiency predictions from the Student's t-half normal and normal-half normal models.
Given the similarity of the frontier parameter estimates, as shown in Table 3, the residuals from the Student's t-half normal and normal-half normal models are highly correlated. However the relationships between the residuals and efficiency predictions is shown by Fig. 2 to be substantially different between the two models: for values of residuals between around −0.5 and 1.5, the slope of the function is considerably flatter in the Student's t-half normal case, so that a change in ε results in a much smaller change in exp [−E(u|ε)]. However, the most striking difference is that the relationship is non-monotonic in the Student's t-half normal case, with exp[−E(u|ε)] beginning to decrease slightly for the smallest values of ε and increase very considerably for the largest values of ε. This is in contrast to the standard SF model, and also the logistic-half normal model as shown by Stead et al. (2018), in which the relationship is monotonic.
The explanation for the large reduction in the range of the efficiency predictions relative to those from the standard model is therefore twofold. First, the use of a heavy-tailed distribution for v results in greater shrinkage of efficiency predictions at the tails, as is the case in the normal-logistic model (Stead et al. 2018). Second, the non-monotonicity in E(u|ε) means that the highest and lowest efficiency scores belong not to the most outlying observations in either direction, as in the standard model, but to observations with less extreme estimated residuals.
The intuitive explanation for this non-monotonicity in the Student's t-half normal case is that for outlying observations, the uncertainty associated with exp[−E(u|ε)] increases, and further increases in |ε| are attributed to v to such an extent that there is a reduction in exp [−E(u|ε)]. More formally, we know that the monotonicity of E(u|ε) is linked to the log-concavity property of the distribution of v: specifically, E(u|ε) is a weakly (strictly) monotonic function of ε for any weakly (strictly) log-concave f v (Ondrich and  Ruggiero 2001). Goldberger (1983) discusses the logconcavity of the (standard) normal, logistic, Laplace and Student's t distributions, and these results are easily extended to the nonstandardised cases with scaling σ v . It is notable that each of these distributions are log-concave everywhere, with the sole exception of the Student's t distribution, which is log-concave where v ffiffiffi α p σ v but logconvex where v ! ffiffiffi α p σ v , i.e. at the tails 7 . This explains why E(u|ε) can be non-monotonic, changing direction at the tails, when v follows a Student's t distribution, whereas E(u|ε) has been noted to be either weakly or strictly monotonic everywhere when v follows a normal, logistic, or Laplace distribution. Generalising slightly outside this particular empirical application, the finding of non-monotonicity in efficiency predictions with respect to residuals is useful for applications in economic regulation (RPI-X regulation). This model has interesting incentive properties which may help overcome informational asymmetries in economic regulation between the regulated firms and the regulator. In particular the non-monotonicity property should help discourage firms from trying to game the process by submitting over-favourable data e.g. under reporting costs in the case of cost benchmarking.
As discussed in Section 4, we are interested in testing whether v is normally distributed, in which case the Student's t-half normal contains as a limiting case the normalhalf normal model. In addition to this hypothesis we are also interested in testing the null hypothesis of no inefficiency. The likelihood ratio follows a χ 2 1:0 distribution in both cases. Log-likelihoods are given in Table 3, and are used to calculate the likelihood ratio statistic as shown in (32). For the first null hypothesis, this gives a likelihood ratio of 4.155 and a corresponding p-value of 0.021. For our second null hypothesis, the Student's t-half normal model reduces to a regression model with Student's t errors: we do not report the results of this regression, except the loglikelihood, which is −189.140. Therefore the corresponding likelihood ratio statistic is 4.150 and the p-value is 0.021. We therefore reject the null hypotheses of normally distributed v and zero inefficiency, indicating that this model performs better than either the standard SF model or the Student's t regression model for this data.
Overall our empirical application demonstrates that the thalf normal SF model is indeed supported by our data relative to the more standard normal-half normal SF model.
Further the frontier parameter estimates are plausible and very similar in both models. In terms of the efficiency predictions, the Student's t-half normal SF model yields a much more plausible spread of efficiency predictions at the tails.

Summary and Conclusions
This paper proposes a new stochastic frontier model as a means to account for outlying observations in the context of stochastic frontier analysis. A failure to account for outliers in the standard stochastic frontier model can lead to an exaggerated wide range of efficiency predictions, with the efficiency predictions relating to the least efficient firms being implausibly low. This problem is apparent in our application to highway maintenance departments in England, and also in several other applications we identify in the literature, across a range of countries and industries. We propose a model combining a Student's t distribution for noise, v, with a half normal distribution for inefficiency, u. Our model is an original and significant contribution to the literature, not just in being able to better accommodate outlying observations in efficiency analysis relative to the standard normal-half normal stochastic frontier (SF) model, but it is the first contribution to contain as a testable limiting case the standard SF model. As such our model provides a natural extension to the tools of practitioners in the field. The advantages of this distribution are that the kurtosis of v is determined by a degrees of freedom parameter which is freely estimated, and that it encompasses as a limiting case the normal distribution as this parameter approaches infinity. This means that the heaviness of the tails of v, reflecting the prevalence of outliers in the data, is flexible, and that testing down to the standard SF model is possible. We derive the log-likelihood and efficiency predictors for the t-truncated normal SF model, and discuss extension to other distributions for u, which is straightforward.
We consider how to test the null hypothesis of normally distributed noise against the alternative of a heavier tailed distribution. We show that the associated LR test statistic is distributed as a mixture chi-squared through appealing to results in Chen and Liang (2010), and provide simulation evidence that the test statistic does follow the proposed distribution under the null in large samples.
Simulation evidence is provided for the maximum simulated likelihood estimator of our model. As well as showing that the estimator performs well in recovering the true parameters of our DGP for small values of the Student's t shape (degrees of freedom) parameter, the simulations indicate that in the case of the true DGP being the standard SF (normal-half normal) model, the Student's t SF model recovers very similar estimates to the standard SF model. The combination of this finding, coupled with the 7 Note that as α → ∞, i.e. the distribution of v approaches normality, we would therefore expect the non-monotonicity seen in Fig. 2 to become less prominent. Simulation evidence, not included here for brevity's sake, indicates that this is the case: the larger a is, the more closely E(u|ε) corresponds to the normal-half normal case. For sufficiently large a, the relationship may be monotonic within the range of the estimated residuals. ability to test the hypothesis of a normally distributed noise error against a heavier tailed distribution using the LR test statistic, provides reassurance as to the robustness of a modelling approach based on starting with the Student's t SF model and testing to see if a standard SF model is an appropriate simplify restriction.
The simulation results also highlight the possibility of 'wrong kurtosis', specifically where the excess kurtosis of the OLS residual distribution is less than zero. In this case the maximum simulated likelihood estimates approach those from the normal/half-normal model and as such 'wrong kurtosis' is similar to 'wrong skew' previously identified in the literature. Just as the probability of 'wrong skew' arising in the standard model decreases as the number of observations or the signal to noise ratio, or both, increase (Simar and Wilson 2010), it appears that the probability of 'wrong kurtosis' is negatively related to sample size and positively related to the degrees of freedom parameter. It is important to note that the practical implications of this 'wrong kurtosis' are not as severe as in the wrong skew case. In the wrong kurtosis case, the estimation of the Student's t-half normal stochastic frontier models recovers the normal-half normal model estimates, whilst under 'wrong skew' the estimate of the variance of the inefficiency distribution approaches zero, indicating no evidence of inefficiency.
We apply a Student's t-half normal model to estimate a cost frontier using a dataset on English local authorities' highway maintenance costs, and compare the model's outputs with those from the normal-half normal model. We find similar frontier parameter estimates from the two models, though with reduced standard errors in the t-half normal model. We implement testing against the standard SF model, and find that we are able to reject the null hypothesis of normally distributed v. This implies that it is important to account for heavy tails in our data.
The main empirical differences between the two models are firstly, a reduced estimate of VAR(u) and an increased estimate of VAR(v) and secondly a reduced range of efficiency predictions according to exp [−E(u|ε)]. Thirdly, we find a non-monotonic relationship between residuals and exp [−E(u|ε)], in contrast to the standard SF model. This could be a very an important feature of the model in practical applications, for example in regulatory settings, where this could incentivise correct reporting of data by regulated firms, as the alternative 'gaming' behaviour of under reporting, say, costs, many actually reduce the firm's efficiency prediction.
Finally, we have integrated our model into the LIMDEP/ NLOGIT econometric software (Greene 2016), and have also written a Stata package enabling estimation of the model 8 , which means that it is ready for use by practitioners. An interesting avenue for future research would be to consider the impact of Student's t noise in the context of the various panel data SF specifications, for example models accounting for unobserved heterogeneity.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix A
First order conditions (reintroducing i subscripts for completeness) for maximisation in the t-truncated normal case are The expressions needed to complete the first order conditions above are as follows 9 Appendix B In order to avoid the complications presented by 'wrong skew' and simulation chatter, not to mention the large amount of time that would be needed to estimate the full model via MSL for thousands of repetitions, we restrict our attention to the case where σ u = 0, so that the model reduces to a regression with Student's t errors. That is, true data generating process is where v i is drawn from a standard normal distribution. We then estimate two regression models: one assuming a normally distributed error term and estimated via OLS, and the other assuming a Student's t error term and estimated via ML. The LR statistic is then calculated as in (32), where lnL 0 and lnL A are the resulting log-likelihoods from the former and latter models, respectively. We use a sample size of 5000 for each of 50,000 repetitions. Table 6 below compares the mean, variance, and percentiles of the χ 2 1:0 distribution to those of the sampling distribution of the LR statistic from our Monte Carlo replications. Comparing the two columns, the evidence from our Monte Carlo simulations seems to support the idea that Case 5 of Self and Liang (1987) applies in this case, i.e. that the LR statistic follows a χ 2 1:0 distribution.