1 Introduction

The main goal of Stochastic Frontier Analysis (SFA) is to estimate (in)efficiencies of decision-making units using a stochastic model of a frontier delimited by ideal, fully efficient units; see Kumbhakar et al. (2020a, b) for an overview. In order to achieve this, the usual (two-sided) observation error is augmented with an additional, one-sided component representing some form of inefficiency. Nonparametric SFA alternatives that rely on Data Envelopment Analysis (DEA; Charnes et al. 1978) and its derivatives have been criticized for ignoring the observation error. Since they are fundamentally non-stochastic, it is problematic, e.g., to assess parameter or model uncertainty or to conduct any kind of formal statistical testing. Some of these issues have been addressed in a number of ways; e.g., Simar and Wilson (2015) use nonparametric methods in DEA-like approaches, while Tsionas (2020) demonstrates a Bayesian interpretation of DEA. A particularly appealing approach is that of Andor et al. (2019), who strive to combine results from various DEA and SFA methods (see also Parmeter and Zelenyuk 2019; Tsionas 2021).

Our key contribution is two-fold. First, we formulate a parametric SFA framework that offers a considerable generalization in terms of distributional assumptions, compared to existing alternatives. Second, we demonstrate the applicability of exact and approximate inference techniques, including those for model selection or averaging, which is nontrivial for non-Gaussian models with latent variables. Consequently, we make the generalized formulation manageable for empirical work, avoid potential overparameterization issues, and account for parameter and specification uncertainty.

Our results are, in a way, parallel to developments in nonparametric inefficiency analysis that consider nonparametric specifications of the error terms, especially the inefficiency-related ones (e.g., Assaf et al. 2021; Florens et al. 2020; Sun et al. 2015; Griffin and Steel 2004). We show that a similar gain in flexibility can be achieved fully parametrically by generalizing the distributional assumptions for the symmetric observation error and the inefficiency term. The parametric approach offers direct control over model properties like monotonicity or tail thickness. Moreover, it is possible to obtain (explicit) mixture distributions reflecting model averaging.

Assaf et al. (2021) list three SFA-related research gaps: (i) endogeneity, (ii) functional form specification and (iii) distributional assumptions on the error terms. Our aim is to fill some of these gaps within the parametric approach. Leaving the endogeneity issues aside, we believe that (ii) and (iii), which relate to model specification uncertainty, can be dealt with using model comparison and model averaging over a sufficiently broad model class. Consequently, we first develop a very flexible stochastic frontier (SF) model, labeled GT-GB2, which is based on the generalized t (GT) and generalized beta of the second kind (GB2) distributions (see Harvey and Lange 2017). Second, we make use of model averaging techniques, including Bayesian Model Averaging (BMA), which is a general tool for inference pooling.

There are two ways of using model comparison techniques to address specification uncertainty: model averaging and model selection. Model selection amounts to assuming a “point estimate” in model space. Selection can be practically equivalent to averaging if one model is clearly superior to the others or if all specifications imply very similar empirical conclusions. Otherwise, model averaging seems to be the preferred technique. However, model selection based on empirical criteria is clearly better than arbitrary model choice, which is not rare in practical applications of SFA. We use BMA as our benchmark for model averaging due to its sound theoretical background. Although BMA is predominantly applied to covariate selection problems (Steel 2020), its applicability is not limited to such cases. We focus on averaging over competing stochastic assumptions, though the simpler option of model selection is obviously available. The methodology presented here also allows for averaging or selection over a broad class of frontier functional form specifications, though we do not pursue this avenue. Our model is also formulated in terms of a nonlinear parametric frontier; hence, though we estimate (log)linear frontiers in the empirical part of the paper, we make no particular use of the linearity assumption. It is also possible to generalize our framework towards, e.g., a semiparametric frontier formulation. Finally, although we demonstrate results from a basic generalized SF model, we indicate that it can be extended towards a panel data framework, varying efficiency distributions, or zero-inefficiency cases (Kumbhakar et al. 2013).

The Bayesian approach has made its way into SFA because it allows for inference on latent variables (inefficiencies) and parameters while accounting for estimation uncertainty. Early works on Bayesian SFA also addressed the issue of specification uncertainty to some extent; see van den Broeck et al. (1994). Although Bayesian SF models have gained popularity due to fast and reliable numerical methods, e.g., Gibbs sampling, these numerical methods impose limitations on model comparison, which leaves the issue of model uncertainty unresolved in most practical applications. Our approach to estimation is somewhat different, since it relies on direct numerical approximation of the integrated likelihood function. Hence, our framework can be used by both Bayesian and non-Bayesian researchers. We find it particularly attractive because it avoids serious limitations on the error term distribution, frontier functional form, or prior class that are implied by the widely used Gibbs-sampling-driven Bayesian SF models. We emphasize that our generalization involves broader distributional assumptions and extends the scope of admissible priors and functional forms. The generalization comes at a considerable cost in terms of computational effort. However, our approach is parallelization-friendly and thus in line with recent hardware and software developments. Furthermore, we demonstrate not only exact, computationally demanding methods, but also outline simplified, approximate strategies that are likely to be more practical while retaining the essential advantages of the exact methods to a satisfactory degree.

As far as model averaging and comparison are concerned, we overcome three important barriers that exist in the abovementioned Bayesian SFA. First, due to the availability of the integrated likelihood, it is possible to effectively use, e.g., Laplace-type approximations in order to compute the marginal data density (and Bayes Factors). These quantities are crucial for model comparison, and their estimation is nontrivial in existing SF models. Second, as our approach provides a general parametric class and allows for an unrestricted choice of priors, so-called coherent priors can be used for the comparison of nested cases, avoiding distortions induced by inherently contradictory prior beliefs; see Makieła and Mazur (2020). Third, the unrestricted choice of priors allows for Bayesian sensitivity analysis. The competing Gibbs sampling-based models used in Bayesian SFA restrict priors to certain parametric classes, so sensitivity checks are feasible only within such classes. Our approach extends the available model space and allows for broad use of model selection or averaging, which in turn allows model uncertainty to be fully addressed.

The general and flexible model, GT-GB2, may raise concerns regarding its potential overparameterization. Hence, we emphasize its use as a vehicle for model comparison or specification search. In other words, we avoid the overparameterization issue by using model averaging or selection techniques over a set of nested or limiting special cases. The same is true about potential problems associated with identification of parameters. Our general model is in principle identified. However, in the case of weak sample information, the difference between the prior and posterior information might be limited (the likelihood function might be almost flat in certain directions). Such members of our model class should be ruled out by model selection criteria or down-weighted by averaging procedures. This is in line with our empirical and simulation-based results presented in the paper.

To sum up, existing methods in parametric SFA may implicitly introduce strong, possibly irrelevant assumptions without sufficient empirical verification. Consequences such as distorted inference about structural properties of the analyzed processes or spurious detection of non-existing phenomena (e.g., spurious inefficiency) are therefore likely. Thus, the goal is two-fold: (i) to develop a sufficiently general SFA framework, and (ii) to use formal model averaging (or selection) techniques to deal with model specification uncertainty. We obtain density estimates of efficiency scores (for managerial or regulatory decision-making) that adequately reflect the uncertainty about parameters and model specification.

The paper is structured as follows. Section “Existing SFA specifications: an overview of distributional assumptions” provides an overview of the existing SFA literature with respect to distributional assumptions. Section “Formulation of the GT-GB2 model” presents the main assumptions and outlines the model, its nested or limiting cases, and possible extensions. Section “Inference: likelihood-based methods for the GT-GB2 model” describes the statistical methods. Finally, Section “Simulations” provides simulation-based evidence in favor of the new SFA framework and Section “Empirical applications” presents two empirical applications using well-researched datasets, which demonstrate the empirical feasibility of the proposed methods. Section “Conclusions” concludes with a discussion.

2 Existing SFA specifications: an overview of distributional assumptions

The oldest and probably the most intensely researched field in SFA concerns the specification of the inefficiency distribution. The first extension of the basic SF models of Aigner et al. (1977) and Meeusen and van den Broeck (1977) is probably the normal-truncated-normal model of Stevenson (1980). It is a natural generalization of the normal-half-normal model in which the truncation point is not necessarily at the mode of the underlying two-sided distribution. It should be noted, however, that, as van den Broeck et al. (1994) point out, this model can be troublesome because of weak identification. That is, multiple combinations of the values of the two parameters that govern this distribution (location and scale) can lead to very similar shapes after truncation (at zero). Also, if the location parameter is less than zero, the resulting shape after truncation becomes similar to the exponential case. Another popular extension is the normal-gamma model, proposed by Beckers and Hammond (1987) and Greene (1990), which generalizes the normal-exponential model. Although numerically challenging to estimate (see, e.g., Ritter and Simar 1997), it provides a good middle ground between the half-normal and exponential cases, which can be viewed as its extremes (Greene 2008). An interesting generalization based on the half-Student’s t distribution is introduced by Tancredi (2002). Tsionas (2006), on the other hand, suggests the normal-lognormal model, which allows inefficiency to be modelled over time as a stationary AR(1) process (see also Emvalomatis 2012). This, however, comes at a cost: the lognormal distribution has virtually no probability mass near zero, which implies that the case of full efficiency is practically excluded by construction. Tsionas (2007) also studies models in which inefficiency follows a Weibull distribution which, he argues, is better at handling outliers. The most general inefficiency distribution to date, which nests most of the distributions proposed so far, is probably that of Griffin and Steel (2008), who develop an SF model based on the generalized gamma distribution. Further research has turned its attention to the Rayleigh distribution, a special case of the Weibull (Hajargasht 2015), and to the uniform and half-Cauchy distributions (Nguyen 2010). Horrace and Parmeter (2018) also consider the truncated-Laplace distribution for inefficiency, though the exponential case is used in their application. Finally, Parmeter et al. (2019) and Isaksson et al. (2021) discuss some averaging techniques for weighting conditional inefficiency; these estimators, however, are applicable to a limited class of SF models. The large volume of proposals stems from the fact that theory provides very limited information as to how the inefficiency distribution should look. Oikawa (2016) provides some evidence in favor of the gamma distribution. This, however, is dismissed by Tsionas (2017), who argues that inefficiency does not follow any of the distributions used so far. Hence, further research is required to pursue more flexible distributions that can better reflect inefficiency.

The random disturbance (i.e., the observation error) is traditionally assumed to be normal. This, however, may be too restrictive in practice, especially in the presence of outliers or sample heterogeneity (not only in SFA; see, e.g., Lange et al. 1989). That is why more attention has recently been given to relaxing assumptions about the observation error. Tancredi (2002) proposes to use Student’s t distribution and develops the Student’s t-half-Student’s t model, which directly generalizes the normal-half-normal case. Later, Tchumtchoua and Dey (2007) estimate this model using Bayesian inference, while Griffin and Steel (2007) briefly discuss Bayesian estimation of Student’s t-half-normal, Student’s t-exponential, and Student’s t-gamma SF models using the WinBUGS software package. A good overview of contemporary methods for handling outliers is provided by Stead et al. (2018), who propose a logistic-half-normal model, and Wheat et al. (2019), who use Student’s t distribution. Horrace and Wang (2022) propose nonparametric tests to check whether the observation error term (or the inefficiency term) has heavy tails.

Apart from the abovementioned, there is one more strand of the SFA literature that deserves attention due to its proximity to the proposed framework. For some time now, there has been an ongoing effort to make SFA more “flexible” by introducing nonparametric elements, especially in terms of nonparametric treatment of the errors. Probably the first attempt was made by Park and Simar (1994), who introduce an SF model in which the density of the individual firm-specific effects is unknown. Griffin and Steel (2004) introduce a Bayesian semiparametric model, in which the distribution of inefficiency is modelled nonparametrically through a Dirichlet process prior (with an exponential centering distribution). Sun et al. (2015) capture inefficiency through firm and time effects in the intercept. Finally, Assaf et al. (2021) model inefficiency in terms of firm-specific and time-specific effects. It is worth mentioning that even though the abovementioned specifications provide flexible efficiency modelling, they still impose strong assumptions on the symmetric error component (see Footnote 1).

To sum up, existing inference methods provide results that may be misinterpreted due to their failure to account for alternative formulations that have practically the same explanatory power. The problem is closely related to that of weak identification. For example, Das and Bandyopadhyay (2008) note that once the independence assumption for the error components is removed, an SF model suffers from “near-non-identification” under specific parametric configurations of the distribution. Hence, identifiability in SFA may not so much pertain to the generalization of the error components but rather to the underlying assumptions under which the distinction between the two components (and their interaction) is made. We shall pay special attention to this issue and discuss it further in subsequent sections.

3 Formulation of the GT-GB2 model

Consider a frontier of the form Yt = g(xt; β) and the corresponding observation equation:

$$y_t = \ln g\left(\mathbf{x}_t;\beta\right) - \omega u_t + v_t,\qquad t = 1,\ldots,T$$
(1)

with vt representing i.i.d. observation errors with a symmetric and unimodal probability density function (pdf), ut being i.i.d. latent variables, taking strictly positive values, that represent the inefficiency terms, and with known \(\omega \in \{-1,1\}\); thus the compound error is \(\varepsilon_t = v_t - \omega u_t\). Basically, ut and vt are assumed independent of the regressors xt, though much effort has been made recently to handle the potential endogeneity problem in SF models; we return to this issue in Section 4. Here, g(x; β) is a known function (of unknown parameters β and exogenous variables x) representing the frontier under consideration; it corresponds to, e.g., a cost function if \(\omega = -1\), or a production function if \(\omega = 1\). Additionally, yt denotes the log of the observable cost or output (Yt). We make the following assumptions about vt’s and ut’s:

A1. vt’s are i.i.d. variables with zero median and a symmetric, unimodal, and continuous pdf \(p_v(v;\theta^{(v)})\), \(v \in \mathbb{R}\);

A2. ut’s are nonnegative i.i.d. variables with a continuous pdf \(p_u(u;\theta^{(u)})\), \(u \in \mathbb{R}_+\);

A3. all vt’s are stochastically independent of all ut’s (see Footnote 2).

In our view, assumptions A1–A3 are minimalistic and take into account considerations regarding the statistical identification of parameters. In particular, we argue that, given a sufficiently general form of \(p_u(.)\) and \(p_v(.)\), the potential gain in terms of statistical fit arising from the relaxation of A1–A3 (for example, relaxing the independence of the v’s and u’s) would be very limited. It could also weaken overall model identification, a problem from which SFA already suffers to a varying degree: the resulting likelihood function would have very small curvature in certain directions of the parameter space. Hence, although the resulting model would be locally identified almost everywhere, the parameterization would be less convenient from the statistical inference viewpoint. Though a formal investigation of the issue is left for further research, there is an informal motivation for assumptions A1–A3: crucial gains in terms of explanatory power are likely to stem from a generalization of \(p_u(.)\) and \(p_v(.)\) rather than from relaxing A1–A3.

Once the parametric form of \(p_u(.)\) and \(p_v(.)\) is set, the vector of all statistical parameters is \(\theta' = (\beta\;\; \theta^{(v)\prime}\;\; \theta^{(u)\prime})'\). Statistical inference regarding θ relies on properties of the compound error term εt. In the general case, its density \(p_\varepsilon(.)\) is defined by the convolution of the respective densities of vt and ut:

$$p_\varepsilon\left(y - \ln g(\boldsymbol{x};\beta);\theta^{(\varepsilon)}\right) = \int_{\mathbb{R}_+} p_v\left(y - \ln g(\boldsymbol{x};\beta) + \omega u;\theta^{(v)}\right)\,p_u\left(u;\theta^{(u)}\right)du$$
(2)

Note that in the majority of practical applications vt is assumed to be Gaussian, while \(p_u(.)\) is half-normal or exponential. Moreover, for now, we assume that there exists a one-to-one relationship between the parameters of the structural form, i.e., β, θ(v), θ(u), and the parameters of the reduced form, β, θ(ε), so no identification issues arise (possible identification problems are discussed later). However, in the case of stochastic frontier models statistical inference is not restricted to the θ’s, since the latent variables (the ut’s) are also within the scope of interest, as they represent object-specific inefficiency terms.
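
To make (2) concrete, the following sketch (our own illustration, not code from the paper) evaluates the convolution numerically for the classic normal-half-normal case with ω = 1 and compares it with the well-known closed-form density of the composed error (Aigner et al. 1977); the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

sigma_v, sigma_u = 0.3, 0.5          # illustrative scales for v and u

def p_eps_numeric(eps):
    # Eq. (2) with omega = 1: integrate p_v(eps + u) * p_u(u) over u > 0
    integrand = lambda u: stats.norm.pdf(eps + u, scale=sigma_v) * \
                          stats.halfnorm.pdf(u, scale=sigma_u)
    val, _ = quad(integrand, 0.0, np.inf)
    return val

def p_eps_closed(eps):
    # Known normal-half-normal result: (2/sigma) * phi(eps/sigma) * Phi(-eps*lambda/sigma)
    sigma = np.hypot(sigma_v, sigma_u)
    lam = sigma_u / sigma_v
    return 2.0 / sigma * stats.norm.pdf(eps / sigma) * stats.norm.cdf(-eps * lam / sigma)

for e in (-0.5, 0.0, 0.5):
    print(e, p_eps_numeric(e), p_eps_closed(e))   # the two columns agree
```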

We assume a general parametric distributional form for \(p_v(.)\) and \(p_u(.)\), making use of two distributions: the generalized t distribution (GT) and the generalized beta distribution of the second kind (GB2; see references in Harvey and Lange 2017), respectively:

$$f_{GT}\left(v;\sigma_v,\nu_v,\psi_v\right) = \frac{1}{\sigma_v}\,\frac{\psi_v}{{\rm B}\left(1/\psi_v,\nu_v/\psi_v\right)\,2\,\nu_v^{1/\psi_v}}\left[\frac{1}{\nu_v}\left(\frac{|v|}{\sigma_v}\right)^{\psi_v} + 1\right]^{-\left(1+\nu_v\right)/\psi_v},$$
(3)
$$f_{GB2}\left(u;\sigma_u,\nu_u,\psi_u,\tau\right) = \frac{1}{\sigma_u}\,\frac{\psi_u}{{\rm B}\left(\tau/\psi_u,\nu_u/\psi_u\right)\,\nu_u^{\tau/\psi_u}}\left(\frac{u}{\sigma_u}\right)^{\tau-1}\left[\frac{1}{\nu_u}\left(\frac{u}{\sigma_u}\right)^{\psi_u} + 1\right]^{-\left(\tau+\nu_u\right)/\psi_u},$$
(4)

with B(.,.) denoting the beta function. The above formulation of the GT distribution assumes that the mode, median, and mean (if the latter exists) are zero, in line with the usual formulation of SF models. The GB2 density is often formulated as \(f_{GB2}(x;a,b,p,q) = \frac{|a|\,x^{ap-1}}{{\rm B}(p,q)\,b^{ap}}\left[\left(\frac{x}{b}\right)^a + 1\right]^{-(p+q)}\). The parameterization in (4), used throughout this paper, is equivalent and implies \(p = \tau/\psi_u\), \(q = \nu_u/\psi_u\), \(a = \psi_u\), \(b = \sigma_u\nu_u^{1/\psi_u}\). The reason for using the alternative parameterization is to emphasize the relationship between the GT and GB2 distributions. Note that with τ = 1, the \(f_{GB2}(.;\sigma_u,\nu_u,\psi_u,1)\) distribution is equivalent to the half-GT distribution with parameters \((\sigma_u,\nu_u,\psi_u)\). In other words, the absolute value of a GT variable distributed as \(f_{GT}(.;\sigma_u,\nu_u,\psi_u)\) follows the GB2 distribution \(f_{GB2}(.;\sigma_u,\nu_u,\psi_u,1)\). Additionally, the reason for not using a double GB2 (instead of the GT) for \(p_v(.)\) is that it would violate A1 (by double GB2 we mean a symmetric distribution over the real line such that the absolute value of the corresponding variable follows the GB2 distribution). The double GB2 distribution with τ > 1 would be bimodal, whereas τ < 1 implies a lack of continuity at the mode (with the density function approaching infinity from both sides of zero). We find such properties undesirable given the interpretation of the symmetric observation error term vt that is broadly accepted in SFA. However, the framework described here could in principle be extended to encompass such cases, as the resulting density of the observables, \(p_\varepsilon(.)\), would nevertheless be continuous.
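
As a sanity check on the parameterizations (3)–(4), the sketch below (our own; the function names are not from the paper) implements both densities and verifies numerically that the GB2 with τ = 1 coincides with the half-GT density, i.e., twice the GT density on the positive half-line.

```python
import numpy as np
from scipy.special import beta as B

def f_gt(v, sigma, nu, psi):
    # GT density as in Eq. (3)
    const = psi / (B(1.0 / psi, nu / psi) * 2.0 * nu ** (1.0 / psi))
    kern = (1.0 / nu * (np.abs(v) / sigma) ** psi + 1.0) ** (-(1.0 + nu) / psi)
    return const * kern / sigma

def f_gb2(u, sigma, nu, psi, tau):
    # GB2 density as in Eq. (4)
    const = psi / (B(tau / psi, nu / psi) * nu ** (tau / psi))
    kern = (u / sigma) ** (tau - 1.0) * \
           (1.0 / nu * (u / sigma) ** psi + 1.0) ** (-(tau + nu) / psi)
    return const * kern / sigma

u = np.linspace(0.01, 3.0, 5)
print(f_gb2(u, 1.0, 5.0, 2.0, 1.0))   # GB2 with tau = 1 ...
print(2.0 * f_gt(u, 1.0, 5.0, 2.0))   # ... equals the half-GT density on u > 0
```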

It is clear that σv in (3) and σu in (4) are scale parameters (with σv being analogous to the inverse of the precision in Student’s t distribution), while the \(\psi\)’s and \(\nu\)’s are shape parameters. In particular, \(\nu_u\) and \(\nu_v\) control tail thickness (and hence the existence of finite moments, analogously to the degrees-of-freedom parameter in Student’s t distribution). Moreover, for the parameters \(\nu_u\) and \(\nu_v\) we also consider the limiting cases \(\nu_u \to \infty\) and \(\nu_v \to \infty\). In order to analyze the limiting behavior of the normalizing constants and the kernels of (3) and (4), note that (following Harvey and Lange 2017, who cite Davis 1964):

$$\frac{\psi}{{\rm B}\left(\tau/\psi,\nu/\psi\right)\nu^{\tau/\psi}}\;\xrightarrow[\;\nu\to\infty\;]{}\;\frac{\psi}{\Gamma\left(\tau/\psi\right)\psi^{\tau/\psi}},$$
(5)
$$\left[\frac{1}{\nu}\left(\frac{|x|}{\sigma}\right)^{\psi} + 1\right]^{-\left(\tau+\nu\right)/\psi}\;\xrightarrow[\;\nu\to\infty\;]{}\;\exp\left[-\frac{1}{\psi}\left(\frac{|x|}{\sigma}\right)^{\psi}\right],$$
(6)

where \(\Gamma(.)\) denotes the gamma function. The above formulations indicate the relationship with the exponential family of densities, nested as limiting cases in the general model used here. In principle, it is also possible to consider the limiting behavior with respect to \(\psi_u\) or \(\psi_v\), though these cases are less interesting from the empirical point of view given the usual interpretation in SFA, so this option is not considered here (e.g., \(\psi_v \to \infty\) implies convergence towards a uniform distribution on an interval). Note that each of the abovementioned parameters governs a specific characteristic of the compound error (i.e., different combinations of the parameters’ values do not lead to similar shapes of \(p_\varepsilon(.)\)), and thus under assumptions A1–A3 the GT-GB2 SF model is identified.
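
The limits (5)–(6) are easy to verify numerically: for a very large νv the GT density (3) becomes practically indistinguishable from its exponential-type limit (the GED case discussed below). The snippet is our own illustration with arbitrary values; betaln keeps the normalizing constant stable for large ν.

```python
import numpy as np
from scipy.special import betaln, gammaln

def f_gt(v, sigma, nu, psi):
    # GT density (3), with betaln used for numerical stability at large nu
    log_const = np.log(psi) - betaln(1.0 / psi, nu / psi) - np.log(2.0) - np.log(nu) / psi
    kern = (1.0 / nu * (np.abs(v) / sigma) ** psi + 1.0) ** (-(1.0 + nu) / psi)
    return np.exp(log_const) * kern / sigma

def f_ged_limit(v, sigma, psi):
    # the nu -> infinity limit implied by (5)-(6) with tau = 1
    log_const = np.log(psi) - np.log(2.0) - gammaln(1.0 / psi) - np.log(psi) / psi
    return np.exp(log_const) * np.exp(-(np.abs(v) / sigma) ** psi / psi) / sigma

z = np.linspace(-3.0, 3.0, 7)
print(np.max(np.abs(f_gt(z, 1.0, 1e6, 1.5) - f_ged_limit(z, 1.0, 1.5))))   # close to zero
```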

We emphasize the following special cases:

  1. With \(\psi_v = 2\), the GT distribution (3) reduces to the Student’s t distribution:

     $$f_{ST}\left(z;\sigma,\nu\right) = \frac{1}{\sigma}\,\frac{1}{{\rm B}\left(1/2,\nu/2\right)\nu^{1/2}}\left[\frac{1}{\nu}\left(\frac{z}{\sigma}\right)^2 + 1\right]^{-\left(1+\nu\right)/2},$$
     (7)

     with σ equivalent to σv and ν equivalent to νv in (3); consequently, the GB2 distribution in (4) with τ = 1 and \(\psi_u = 2\) reduces to the half-Student’s t case.

  2. As \(\nu_v \to \infty\), the GT distribution (3) reduces to the Generalized Error Distribution (GED):

     $$f_{GED}\left(z;\sigma,\psi\right) = \frac{1}{\sigma}\,\frac{\psi}{2\,\Gamma\left(1/\psi\right)\psi^{1/\psi}}\exp\left[-\frac{1}{\psi}\left(\frac{|z|}{\sigma}\right)^{\psi}\right],$$
     (8)

     consequently, the GB2 distribution with τ = 1 and \(\nu_u \to \infty\) reduces to the half-GED.

  3. The GB2 distribution (4) reduces to the generalized gamma distribution (GG) as \(\nu_u \to \infty\):

     $$f_{GG}\left(z;\sigma,\tau,\psi\right) = \frac{1}{\sigma}\,\frac{\psi}{\Gamma\left(\tau/\psi\right)\psi^{\tau/\psi}}\left(\frac{z}{\sigma}\right)^{\tau-1}\exp\left[-\frac{1}{\psi}\left(\frac{z}{\sigma}\right)^{\psi}\right].$$
     (9)

The original parameterization of the generalized gamma distribution, according to Stacy (1962, Eq. 1), is \(f_{GG}(z;a,d,p) = \frac{p}{\Gamma(d/p)\,a^d}\,z^{d-1}\exp\left[-\left(\frac{z}{a}\right)^p\right]\). The relationship between the original parameterization and the one used in (9) is: \(d = \tau\), \(p = \psi\), \(a = \sigma\psi^{1/\psi}\). The Gaussian case is obtained as the conjunction of the Student’s t and GED sub-cases, since it requires \(\nu_v \to \infty\) and \(\psi_v = 2\) (analogously, the GB2 becomes half-normal with τ = 1, \(\nu_u \to \infty\), and \(\psi_u = 2\)). Moreover, setting \(\psi_v = 1\) instead of \(\psi_v = 2\) leads to the Laplace distribution, while the GB2 with \(\psi_u = \tau = 1\) and \(\nu_u \to \infty\) yields the exponential case. The general SF model considered here is labeled GT-GB2, while its special case with τ = 1 is labeled GT-HGT (as the GB2 distribution reduces to the half-GT case). Other special cases, with literature references, are listed in Table 1. Also, the density of \(\varepsilon_t\) implied by (2), (3), and (4) in the GT-GB2 model is in general asymmetric and includes the skew-t and skew-normal distributions as special cases.

Table 1 Selected models nested in GT-GB2 SF specification

Note that in the empirical analysis we make an additional assumption: (A4) ut’s follow a non-increasing pdf, and we impose the resulting restrictions on the parameters of the GT-GB2 model, excluding τ > 1. We are aware that there is considerable SFA literature on non-monotonic distributions of ut’s, which would indicate their usefulness (see, e.g., Stevenson 1980; Greene 1990; van den Broeck et al. 1994; Griffin and Steel 2004, 2007, 2008). Such applications, however, have assumed restrictive distributional forms for vt’s, which could be the reason why non-monotonic distributions of ut’s have been found relevant in the first place (e.g., due to outliers; see Stead et al. 2018, for a discussion of outlier detection and treatment). In our view, a generalization of the distributional assumptions about vt is sufficient to maintain a similar statistical fit despite the use of a strictly non-increasing distribution of ut. Thus, A4 is an additional assumption (or restriction) that we find plausible, and which may also help with identification in SF models in general, but it is not a requirement of the GT-GB2 model. We think that A4 is reasonable for typical SF models with a single source of inefficiency. However, if analysis of the economic process at hand suggests the existence of many sources of inefficiency, e.g., as in Oh and Shin (2021), the resulting compound error representing total inefficiency (as a sum of source-specific inefficiencies) might indeed violate A4 and follow, e.g., a unimodal distribution with nonzero mode. Nonetheless, as \(p_u(.)\) gets closer to a symmetric distribution, the decomposition of errors in SFA becomes more problematic. This issue is similar in nature to what is witnessed in the normal-truncated-normal and normal-gamma SF models, in which different parametric configurations of the ut and vt distributions can lead to a practically identical distribution of εt, especially when the mode of u is above zero (Ritter and Simar 1997). Hence, although the GT-GB2 model is identified also without A4 (allowing τ > 1), in our view A4 is economically reasonable in standard cases and results in more regular shapes of the posterior or likelihood.

Finally, we note that some of the forms listed in Table 1 are obtained not through parametric restrictions but as limiting cases. Although this may preclude simple hypothesis testing in such cases, model comparison is still very much feasible, e.g., (i) through likelihood ratios in case of ML-based estimation or (ii) using Bayes Factors and posterior probabilities in case of Bayesian inference. Both approaches are discussed in the next section.

3.1 Possible model extensions: varying efficiency distribution

The basic SFA formulation, labelled Common Efficiency Distribution (CED-SFA), does not allow for heterogeneity (across observations) in the inefficiency process. This assumption is relaxed within the Varying Efficiency Distribution model class (VED-SFA; see, e.g., Koop et al. 1997). VED-SF models allow exogenous variables to influence the inefficiency process. They are characterized by the fact that the inefficiency distribution depends on certain covariates (hence, u’s are independent but no longer identically distributed). Within the GT-GB2 framework a natural option is to replace σu in (4) with:

$$\sigma_{u,t} = \exp\left(\gamma + \delta\boldsymbol{w}_t\right),$$
(10)

where wt represents a vector of exogenous covariates driving differences across objects with respect to the inefficiency distribution, and γ and δ are model parameters replacing σu. Importantly, such a general formulation carries over to VED-type versions of all the nested special cases listed in Table 1, which results in a whole new class of coherent VED-SFA formulations.

In principle, it is possible to extend the idea to individual effects in the shape parameters as well. However, it is not obvious whether such a formulation would turn out to be empirically relevant. From an empirical viewpoint, an important distinction is between covariate-driven σu,t (individual effects in the inefficiency terms) and covariate-driven σv,t (heteroskedasticity in the observation error). Such formulations are fully feasible extensions of the GT-GB2 framework presented here.
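
For illustration, (10) amounts to a one-line transformation of the covariates; the covariate matrix W and the values of γ and δ below are placeholders of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(700, 2))           # exogenous covariates w_t driving inefficiency
gamma, delta = -1.0, np.array([0.3, -0.2])

sigma_u_t = np.exp(gamma + W @ delta)   # observation-specific scale replacing sigma_u in (4)
print(sigma_u_t[:5])
```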

3.2 Possible model extensions: panel data modelling

There are a number of ways to account for panel data structure in the proposed framework. First, we can consider u’s as time-invariant (i.e., object-specific) effects. This is the most traditional setting, in which inefficiency is meant to capture persistent effects that differentiate objects’ performances. In other words, all time-invariant differences in performance are attributed to inefficiency, and all transient effects are treated as part of the random disturbance (vit). The convolution of the densities of vit and ui is

$$f(\varepsilon) = \prod_{i=1}^{n}\prod_{t=1}^{T} p_\varepsilon\left(y_{it} - \ln g(x_{it};\beta);\theta^{(\varepsilon_{it})}\right) = \prod_{i=1}^{n}\left\{\int_{\mathbb{R}_+}\left\{\prod_{t=1}^{T} p_v\left(y_{it} - \ln g(x_{it};\beta) + \omega u_i;\theta^{(v)}\right)\right\} p_u\left(u_i;\theta^{(u_i)}\right)du_i\right\}.$$
(11)

where i is the object index \((i = 1,\ldots,n)\) and t is the time index \((t = 1,\ldots,T)\). This results in a likelihood function similar to that in Eqs. (2)–(4). The difference is that we now evaluate only n integrals; on the other hand, each integral must be calculated over a product of T densities pv. This is more challenging in terms of fine-tuning the integration procedure (specifying relevant waypoints, etc.) and becomes more complex as T increases. We reckon, however, that it is still less time consuming than evaluating nT integrals in the baseline (pooled) model, so the net effect is likely an increase in computational speed.
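
A minimal sketch of one such per-object integral in (11) is given below; to keep it self-contained, simple stand-in densities (Laplace for v, exponential for u) are used instead of the GT and GB2 pdfs, a production frontier (ω = 1) is assumed, and the residuals are taken as given.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def object_loglik(resid_i, sigma_v, sigma_u, omega=1.0):
    # resid_i: length-T vector of y_it - ln g(x_it; beta) for a single object i
    def integrand(u):
        # product over t of p_v(resid + omega*u), times p_u(u); for large T one
        # would rather exponentiate a sum of log-densities to avoid underflow
        return np.prod(stats.laplace.pdf(resid_i + omega * u, scale=sigma_v)) * \
               stats.expon.pdf(u, scale=sigma_u)
    val, _ = quad(integrand, 0.0, np.inf)
    return np.log(val)

resid_i = np.array([0.10, -0.20, 0.05, 0.00, -0.10])   # T = 5 residuals for one object
print(object_loglik(resid_i, sigma_v=0.2, sigma_u=0.3))
```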

Another popular way is to add another latent variable (αi) representing an object-specific effect, much like in the True Random Effects SF model (Greene 2004). In this setting inefficiency is meant to capture transient effects that differentiate objects’ performances over time, while the random effect αi takes on the persistent part. Thus, time-invariant differences are attributed to heterogeneity of the frontier: \(y_{it} = \ln g(\boldsymbol{x}_{it};\beta) + v_{it} - \omega u_{it} + \alpha_i\). Assuming that αi and \(\sigma_\alpha^2\) have the traditional natural-conjugate normal-inverted gamma priors, we can employ hybrid MCMC procedures which involve “wrapping” traditional data augmentation techniques (e.g., Gibbs sampling) around the procedure discussed in Section “Inference: likelihood-based methods for the GT-GB2 model”.

The third option, probably the easiest to implement, is to follow an approach based on the True Fixed Effects SF model (Greene 2004). From a Bayesian perspective, modelling fixed effects amounts to giving each prior on αi a different scale parameter (e.g., \(\sigma_{\alpha_i}\)). Assuming that \(\alpha_i \sim N(0,\sigma_{\alpha_i}^2)\), we can simplify the problem by considering the intercept, e.g., \(\beta_0 \sim N(0,\sigma_{\beta_0}^2)\), and the fixed effect jointly as \(\delta_i = \beta_0 + \alpha_i\). Hence, in practice this approach amounts to estimating an object-specific intercept δi of the frontier under the following prior: \(\delta_i \sim N(0,\sigma_{\alpha_i}^2 + \sigma_{\beta_0}^2 + 2\rho_i\sigma_{\alpha_i}\sigma_{\beta_0})\), where ρi is the correlation between αi and β0. If we assume independence between αi and β0, the prior on δi simplifies to \(N(0,\sigma_{\alpha_i}^2 + \sigma_{\beta_0}^2)\).

There are many other panel data modelling techniques proposed within SFA. However, the abovementioned ones can be incorporated relatively easily within the proposed framework.

4 Inference: likelihood-based methods for the GT-GB2 model

In general, Bayesian SFA estimation amounts to sampling from the joint posterior of parameters and latent variables, \(p(\mathbf{u},\theta|y)\). Popular methods are based on Gibbs sampling (see, e.g., Griffin and Steel 2007), interchangeably drawing from two conditionals: \(p(\mathbf{u}|\theta,y)\) and \(p(\theta|\mathbf{u},y)\). Such samplers are very efficient (in terms of computational power); however, their successful application usually relies on certain simplifying assumptions as to the form of the priors (being quasi-conjugate in order to form standard full conditional densities) as well as the frontier. We follow an alternative path and split the problem into two parts. First, we develop a satisfactory numerical representation of the integrated likelihood (i.e., we integrate out the latent variables using non-stochastic integration methods). We then use the integrated likelihood, augmented with a prior \(p(\theta)\), to sample parameters from \(p(\theta|y)\), i.e., the posterior for the parameters marginalized with respect to the latent inefficiencies u. Second, we sample u’s conditionally on the parameters, using \(p(\mathbf{u}|\theta,y)\), which is analogous to one of the steps in traditional Gibbs sampling. However, we never sample parameters conditionally on latent variables, so in our approach it is possible to draw parameters only, e.g., for the purpose of model comparison (see Footnote 3).

It is possible to use general-purpose Bayesian algorithms for each of the two steps; our choice is the Metropolis-Hastings (MH) sampler. Note that the outcome of our algorithm is fully equivalent to that of existing Gibbs sampling schemes, though those samplers are readily available only for a very specific subset of our general model space. In order to compare actual mixing properties we can, e.g., reproduce results obtained using a standard Gibbs sampler for the N-HN model case; see Appendix B (Fig. B3). Moreover, as we make use of the integrated likelihood, it is possible to apply non-Bayesian methods within our framework, so it can be used by classically oriented researchers as well. In order to obtain the integrated log-likelihood function implied by (2)–(4), under our baseline assumptions (A1–A3), note that the observations are i.i.d. variables, so the full log-likelihood for all the data can be decomposed into a sum of observation-specific log-likelihoods of the general form:

$$\ln\mathcal{L}_t = A + C_t$$
(12)

where

$$A = \ln\left(\frac{1}{\sigma_v}\,\frac{1}{\sigma_u}\,\frac{\psi_v}{{\rm B}\left(1/\psi_v,\nu_v/\psi_v\right)\,2\,\nu_v^{1/\psi_v}}\,\frac{\psi_u}{{\rm B}\left(\tau/\psi_u,\nu_u/\psi_u\right)\,\nu_u^{\tau/\psi_u}}\right),$$
$$C_t = \ln\left(\int_0^{+\infty}\left[\frac{1}{\nu_v}\left(\frac{\left|y_t - \ln g\left(\boldsymbol{x}_t;\beta\right) + u\right|}{\sigma_v}\right)^{\psi_v} + 1\right]^{-\frac{1+\nu_v}{\psi_v}}\left(\frac{u}{\sigma_u}\right)^{\tau-1}\left[\frac{1}{\nu_u}\left(\frac{u}{\sigma_u}\right)^{\psi_u} + 1\right]^{-\frac{\tau+\nu_u}{\psi_u}}du\right).$$

This is not given in closed form, as the integral in Ct cannot be solved analytically in the general case. However, it involves only one-dimensional integrals, which can be evaluated (e.g., in parallel) with arbitrary precision. Furthermore, in Bayesian estimation the integrated likelihood is multiplied by the priors to obtain the kernel of \(p(\theta|y)\). As noted, the choice of priors is essentially unrestricted. In the empirical part we assume proper priors with independence across parameters. We leave the issue of more advanced prior elicitation for further research, since some form of prior dependence across parameters might be reasonable. We also assume proper priors to allow for a broad menu of model-reducing restrictions. Further numerical aspects are discussed in Appendix A.
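
For concreteness, the sketch below (our own code, with illustrative parameter values) evaluates a single observation-specific term ln L_t = A + C_t by adaptive quadrature, following the expressions above for the production case (ω = 1); in practice the T such terms are evaluated, possibly in parallel, and summed.

```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad

def loglik_t(eps_t, sigma_v, nu_v, psi_v, sigma_u, nu_u, psi_u, tau):
    # eps_t = y_t - ln g(x_t; beta)
    A = (-np.log(sigma_v) - np.log(sigma_u)
         + np.log(psi_v) - betaln(1 / psi_v, nu_v / psi_v) - np.log(2.0) - np.log(nu_v) / psi_v
         + np.log(psi_u) - betaln(tau / psi_u, nu_u / psi_u) - tau * np.log(nu_u) / psi_u)

    def kernel(u):
        kv = (1 / nu_v * (np.abs(eps_t + u) / sigma_v) ** psi_v + 1.0) ** (-(1 + nu_v) / psi_v)
        ku = (u / sigma_u) ** (tau - 1) * \
             (1 / nu_u * (u / sigma_u) ** psi_u + 1.0) ** (-(tau + nu_u) / psi_u)
        return kv * ku

    C_t, _ = quad(kernel, 0.0, np.inf)
    return A + np.log(C_t)

print(loglik_t(0.1, sigma_v=0.2, nu_v=10.0, psi_v=2.0,
               sigma_u=0.3, nu_u=20.0, psi_u=2.0, tau=1.0))
```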

Due to the direct use of the integrated likelihood (12), point estimation requires the usual optimization over the parameter space (maximization of the integrated log-likelihood or of the log-posterior \(\ln p(\theta|y)\)). This in turn allows for computation of the Bayesian Information Criterion (BIC), or of BIC-based approximations of Bayes Factors for model comparison, as well as the corresponding model weights (approximate posterior probabilities of models) for model averaging. One may also prefer to use an approximate LR-like model reduction testing procedure for model selection instead. If an estimate of the covariance matrix (e.g., via the numerical Hessian of the log-posterior) is available, it is possible to compute Bayes Factors via the Laplace approximation, which is simple and quite reliable. Hence, it is possible to conduct some form of formal model comparison (or selection) and inference without the use of computationally intensive MC procedures. This is rather unusual when dealing with non-Gaussian models with latent structures.
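
As an illustration of the BIC-based shortcut (a sketch with made-up numbers, not results from the paper): since −BIC/2 approximates the log marginal data density up to a common constant, approximate posterior model probabilities follow directly from the maximized log-likelihoods and parameter counts.

```python
import numpy as np

logL = np.array([512.3, 515.8, 516.1])   # maximized log-likelihoods of candidate models
k    = np.array([8, 10, 13])             # numbers of free parameters
T    = 700                               # sample size

bic = k * np.log(T) - 2.0 * logL
w = np.exp(-0.5 * (bic - bic.min()))     # exp(-BIC/2) up to a common factor
w /= w.sum()
print(bic, w)                            # approximate posterior model probabilities
```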

As for interval estimation (or parameter estimation uncertainty in general), one option is to use approximations based on multivariate normality in the parameter space. Properties of Maximum Likelihood (ML) estimation of models of a similar degree of complexity, though with closed-form likelihoods, are discussed by Harvey and Lange (2017). However, our empirical results indicate that the multivariate normal approximation is not necessarily adequate for the shape parameters (\(\psi_u\), \(\psi_v\), \(\nu_u\), \(\nu_v\), τ). Therefore, in this paper we make use of more elaborate Bayesian inference techniques to deal with parameter estimation error. We use informative priors (to ensure the existence of the posterior), which informally can be considered close to uninformative (see Appendix A). However, the same goal could also be achieved using, e.g., bootstrapping.

To summarize, the estimation strategy advocated here, though computationally demanding, has three important advantages over the standard Gibbs sampling-based Bayesian estimation in SFA:

  • it does not require specific, “convenient” forms of \(p_v(.)\) or \(p_u(.)\) (efficient Gibbs samplers have been developed only for some of the special cases listed in Table 1);

  • it does not require specific classes of priors, so it is possible to conduct an in-depth analysis of prior sensitivity and prior coherence;

  • it does not rely on any specific (e.g., linear or loglinear) form of the frontier \(g(\boldsymbol{x}_t;\beta)\); although our empirical applications in Section “Empirical applications” follow previous studies and make use of the traditional loglinear forms of g, this aspect is worth noting as it further adds to the generality and potential attractiveness of the proposed framework. The algorithm only requires the computation of the residuals \(g(\boldsymbol{x}_t;\beta) - y_t\). Hence, numerous nonlinear alternatives, like the Constant Elasticity of Substitution (CES) form, or semiparametric forms can be used and their utility assessed using formal Bayesian model comparison.

Note, however, that it would be possible to construct a non-standard Gibbs sampler for the GT-GB2 model in the spirit of traditional Bayesian SFA. It would require so-called Metropolis steps for drawing θ’s and u’s, while omitting the need to compute the integrated likelihood (and therefore saving the computational power used for numerical integration). Though this option might be interesting for certain practical purposes, we have decided not to pursue it for a number of reasons. First, the issue of model comparison would be difficult to solve in practice, since it is not easy to obtain reliable Bayes Factors from Gibbs sampler output. Second, such a sampler would be considerably slower (compared to those of traditional Bayesian SFA) due to the non-standard Metropolis steps, so the gain in computation time over our approach would not be impressive. Third, our algorithm draws parameters from \(p(\theta|y)\), while its Gibbs-like counterpart would draw from \(p(\theta|\mathbf{u},y)\). In our view the marginalized distribution \(p(\theta|y)\) is more convenient to sample from (e.g., due to being flatter and independent of the u’s) than the conditional one, \(p(\theta|\mathbf{u},y)\). Consequently, it is easier to develop and calibrate a proposal for sampling from \(p(\theta|y)\); also, in our case potential imperfections in drawing u’s do not deteriorate the sampling quality for the parameters, since we omit drawing from \(p(\theta|\mathbf{u},y)\). Finally, relying upon the computation of the integrated likelihood makes it possible for non-Bayesians to use this framework. Details regarding inference on u’s are provided in Appendix A, both in terms of the exact Bayesian approach presented here and its feasible ML-based approximation.
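
As a simple illustration of the second step (inference on the u’s given θ), note that p(u_t | θ, y_t) is proportional to p_v(ε_t + ωu_t) p_u(u_t), so any posterior summary of u_t reduces to a ratio of one-dimensional integrals; the sketch below computes the conditional posterior mean with normal/half-normal stand-ins and is our illustration rather than the exact procedure of Appendix A.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def cond_mean_u(eps_t, sigma_v, sigma_u, omega=1.0):
    # unnormalized conditional density of u_t given theta and y_t
    kern = lambda u: stats.norm.pdf(eps_t + omega * u, scale=sigma_v) * \
                     stats.halfnorm.pdf(u, scale=sigma_u)
    num, _ = quad(lambda u: u * kern(u), 0.0, np.inf)
    den, _ = quad(kern, 0.0, np.inf)
    return num / den                       # E[u_t | theta, y_t]

print(cond_mean_u(-0.4, sigma_v=0.2, sigma_u=0.3))
```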

To mitigate the risk of overparameterization, the general GT-GB2 specification described here is intended to serve as a platform for specification search. One may estimate the GT-GB2 model and conduct “traditional” hypothesis testing, also known as Lindley-type testing in Bayesian inference. Alternatively, it is possible to run a sequence of maximum-posterior (or ML) estimates over a number of the nested special cases listed in Table 1, at relatively little computational cost. Such a sequence of point estimates can be used either for model averaging or selection (BMA/S; based on quasi-LR tests, BIC values, or BIC/Laplace-approximation-based Bayes Factors and model posterior probabilities). If the results indicate one superior model, full MCMC inference can be conducted for this case only. If there is no clear winner, it is possible to restrict explicit averaging to a smaller subset of models, running MCMC over a couple of models with nonnegligible posterior probabilities and combining the results using BMA. Such approximate strategies might be practical and sufficient to deal with most model uncertainty problems. Our preference in the empirical section ultimately goes to model averaging over model selection or hypothesis testing for the following reasons.

  • Though more computationally demanding, BMA/S procedures are more general and do not suffer from some pitfalls inherent to traditional hypothesis testing (e.g., reliance on asymptotic properties, low test power).

  • If one specification clearly dominates over others, then BMA produces similar results to BMS (because one model will dominate the averaging process), but not otherwise.

  • BMA makes it possible to promote parsimony via model priors which penalize larger models, therefore further mitigating the risk of overparameterization.

  • SF models can suffer from weak identification, which can be particularly troublesome in some specifications (e.g., normal-truncated-normal, normal-gamma). In general, however, every SF model can suffer from weak identification, e.g., as a consequence of insufficient sample information. In such cases models with comparable fit may offer different estimates of inefficiency, and formal Bayesian model averaging (though not ideal) seems the best solution from the decision-making perspective.

Using BMA also brings us conceptually closer to a nonparametric treatment of the composed error in SFA, as we deal with mixture modelling where the weights are assigned using formal Bayesian techniques (posterior model probabilities). However, unlike with semi- or nonparametric treatment of the error in SFA, (i) we have mixture modelling for both components, i.e., the observation error and the inefficiency, (ii) we retain all the benefits of a fully parametric approach, and (iii) we know the exact mixture composition for v and u. Moreover, if a researcher is using an existing SF model, our approach provides a relatively simple check of whether, for the dataset at hand, that specification is empirically adequate (by examining, e.g., the BIC difference or an LR-type ratio versus the GT-GB2 model).
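
To see what such an explicit mixture looks like, the sketch below averages two hypothetical model-specific inefficiency densities using (approximate) posterior model probabilities; both the densities and the weights are placeholders of ours.

```python
import numpy as np
from scipy import stats

u_grid = np.linspace(0.0, 1.5, 301)
weights = np.array([0.7, 0.3])                    # posterior model probabilities
dens_m1 = stats.expon.pdf(u_grid, scale=0.20)     # stand-in for p(u | y, M_1)
dens_m2 = stats.halfnorm.pdf(u_grid, scale=0.25)  # stand-in for p(u | y, M_2)

bma_density = weights @ np.vstack([dens_m1, dens_m2])   # model-averaged density of u
print(bma_density.sum() * (u_grid[1] - u_grid[0]))      # integrates to approximately 1
```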

Last but not least, there is the question of possible endogeneity treatment within such a framework. We do not focus on this issue, as the most reliable way to deal with it is to relax the exogeneity assumption, thus creating a model with an explicit form of endogeneity. Nonetheless, we believe that a number of existing proposals can be adopted rather easily. For example, with panel data Assaf et al. (2021) propose to use lagged values as instruments. Also, Kumbhakar (2011) shows that if the producers’ objective is to maximize return to the outlay (or can be framed as such), then one can remove endogeneity by re-parameterizing the technology frontier based on an input distance function transformation. A more challenging, though still feasible, approach would be to model the dependence between inputs and error components using copula functions (Amsler et al. 2016) or a vector of control functions, the latter likely being less computationally intensive to obtain (Centorrino and Pérez-Urdiales 2021). Finally, Osiewalski et al. (2020) demonstrate a way to deal with potential endogeneity in dynamic SF models using a Vector Error Correction (VEC) framework.

5 Simulations

Our simulation exercise is based on two data generating processes (DGPs). The first corresponds to traditional SF models, and the second represents one of the new specifications developed in this paper. In order to avoid unrealistic assumptions regarding the non-standard DGP, the simulation framework used here is closely related to one of our empirical applications discussed later in Section “Empirical applications”. The first DGP follows the standard normal-half-normal specification; we refer to it as the “N-HN” or “classic” DGP, as it is based on the most commonly used SF model. The second follows the Laplace-generalized gamma specification, labelled the “LAP-GG” DGP; we pick this particular model since it achieved the best empirical fit in one of our real-data applications.

The simulation design is as follows. We choose fixed regressors from a well-researched dataset, in our case the WHO (2000) study (we use the balanced-panel version of the dataset: 140 countries observed over 5 years, i.e., 700 data points). For the “true” parameter values we take the point estimates obtained for this dataset; see Table B1 in Appendix B (rows labelled LAP-GG and N-HN). Given the fixed regressors and parameters, we repeatedly draw u’s and v’s and thus obtain the y’s used for model estimation in each run. Our focus is two-fold. First, we investigate whether our framework allows us to detect the correct specification (hence providing a useful tool for model selection). Second, we analyze the results of inefficiency estimation. Note that the latter is more demanding in terms of computation, and thus we use fewer simulations in that case.
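
As an illustration of the data-generating step (the values below are placeholders, not the Table B1 estimates, and a constant stands in for the frontier built on the WHO regressors), the LAP-GG DGP can be simulated by drawing v from a Laplace distribution and u from the generalized gamma in (9) via the Stacy form, i.e., u = a·G^{1/ψ} with G ~ Gamma(τ/ψ, 1) and a = σ_u ψ^{1/ψ}.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 700                                    # matches the balanced WHO panel size

sigma_v = 0.05                             # Laplace scale of v (placeholder)
sigma_u, tau, psi = 0.15, 1.0, 1.2         # GG parameters in the paper's notation (placeholders)

a = sigma_u * psi ** (1.0 / psi)           # Stacy scale parameter
u = a * rng.gamma(shape=tau / psi, size=n) ** (1.0 / psi)   # generalized gamma draws
v = rng.laplace(scale=sigma_v, size=n)                      # Laplace observation errors

frontier = 4.0                             # stands in for ln g(x_t; beta)
y = frontier + v - u                       # production case, omega = 1
print(y[:5], u.mean(), v.std())
```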

As previously mentioned, the GT-GB2 model is likely to be overparameterized for most applications and should be viewed primarily as a model search platform. We investigate its performance using simple BICs, which are the most likely to be used in practical applications (also, the rescaled BIC difference approximates the log of the Bayes Factor). Table 2 summarizes the results of our model comparison exercise based on 100 datasets generated from the N-HN and LAP-GG DGPs. In only two cases out of 200 iterations (100 generated datasets times two DGPs) does the procedure fail to select the “true” model, and in one of those cases the difference is quite small (−0.12). More generally, the simulation results indicate that our framework can indeed be used for specification search (as there is no tendency towards choosing overparameterized models).

Table 2 Simulation results: quantiles and min/max of BIC differences vs. the model corresponding to the DGP (negative values are in favor of a model in column label)

Nonetheless, should the GT-GB2 model be used purely for estimation, Fig. 1 provides some evidence in its favor, especially under the non-classic, complex DGP. We plot the pdfs of the error components (v, u) based on the “true” DGP and compare them against pdfs based on estimated values from the GT-GB2 model as well as the classic ones, i.e., N-HN and N-EXP. As one can notice, the GT-GB2 model significantly outperforms the other two (its pdfs almost overlap with the true ones) whenever the DGP is complex, while maintaining a satisfactory fit (better than N-EXP) when the DGP is simple.

Fig. 1 Simulation results: comparison of \(p_v(.)\) or \(p_u(.)\) at “true” values of parameters vs. those using averages of point estimates from 10 simulations; results for GT-GB2, N-HN and N-EXP

Finally, since the key interest in SFA is (in)efficiency, Table 3 provides summary statistics comparing the “true” inefficiencies with their point estimates (posterior means) from GT-GB2, N-HN and N-EXP. Our primary measure of quality of fit is a simple statistic: the sum of squared deviations (SSD) of the point estimates from the true inefficiency values; the smaller the SSD, the better the fit. Interestingly, we find that GT-GB2 maintains a comparable, rather small SSD regardless of the DGP, while the classic models can change their fit quite dramatically and can be significantly off when the DGP is complex. Other characteristics that should be of interest to practitioners, like mean inefficiency or inefficiency spread, are also in favor of the GT-GB2 model, which seems the “safest” option. In particular, the model appears to be tentatively “conservative” in inefficiency measurement; i.e., over-estimation of inefficiency seems far less likely than with classic SF models. Given the results from the complex DGP, one could even argue that classic SF models tend to show substantially more inefficiency than there really is (spurious inefficiency).

Table 3 Simulation results: true vs. estimated inefficiencies from GT-GB2, N-HN and N-EXP models

6 Empirical applications

Empirical applications are based on the following well-researched datasets, extensively covered in the (in)efficiency analysis literature: (i) the World Health Report by WHO (see, e.g., WHO 2000; Tandon et al. 2000; Evans et al. 2000; Hollingsworth and Wildman 2003; Greene 2004, 2005, 2017) and (ii) Spanish dairy farms (see, e.g., Cuesta 2000; Alvarez et al. 2006). For space considerations, more detailed results are provided for the first dataset (WHO); for the latter, we report only basic results for model specification search and averaging (detailed results are in Appendix B). We note that BMA-type model averaging is usually applied to ensure an adequate choice of exogenous variables (within the same stochastic structure; see, e.g., Makieła and Osiewalski 2018). Our use of BMA here is different: we fix the set of exogenous variables and average results over competing stochastic specifications, as this reflects the actual problems of model uncertainty in SFA.

6.1 Analysis of World Health Report data

The dataset used in this section is based on the WHO (2000) study, which contains annual information on healthcare attainments in 191 countries over 1993–1997. The dataset has been extensively described in the aforementioned papers. We use the countries with complete five-year records, which gives 140 countries and 700 observations in total. Following the aforementioned studies, we use the translog function:

$$y_{it} = \beta_0 + \beta_1 x_{it,1} + \beta_2 x_{it,2} + \beta_3 x_{it,1}^2 + \beta_4 x_{it,2}^2 + \beta_5 x_{it,1} x_{it,2} + \mathrm{v}_{it} - \mathrm{u}_{it}$$
(13)

where y is the log of COMP (a composite measure of success in achieving five health goals: (1) health, (2) health distribution, (3) responsiveness, (4) responsiveness in distribution and (5) fairness in financing); x1 is the log of health expenditure per capita and x2 is the log of educational attainment (average years of schooling), while i = 1, ..., 140 and t = 1, …, 5 are country and year indices, respectively. Some previous studies drop parts of the full translog model; e.g., Tandon et al. (2000) and Evans et al. (2000) set \(\beta_3 = \beta_5 = 0\), while Greene (2005) uses a Cobb-Douglas form \((\beta_3 = \beta_4 = \beta_5 = 0)\). Evidently, research on the form of the frontier is also of much interest in SFA and could be pursued within this framework. In particular, we can consider complex nonlinear forms of the frontier in BMA because our inference method does not require any specific form of \(g(.)\); see Section “Inference: likelihood-based methods for the GT-GB2 model”. Nonetheless, as mentioned earlier, we leave the frontier parameterization fixed in order to focus our investigation on SF model specification uncertainty (with respect to the stochastic components v and u).
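For concreteness, a minimal sketch (in Python; the function name is ours and the numbers are placeholders, not the WHO data) of how the regressors in Eq. (13) can be assembled:

```python
import numpy as np

def translog_design(x1, x2):
    """Regressors of the two-input translog frontier in Eq. (13):
    intercept, x1, x2, x1^2, x2^2 and the cross term x1 * x2."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.column_stack(
        [np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2]
    )

# x1: log health expenditure per capita, x2: log average years of
# schooling (placeholder values below).
X = translog_design([1.2, 0.8, 1.5], [2.1, 1.9, 2.3])
```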

Table 4 shows the results of model comparison based on GT-GB2 and 33 other distinct SF models, all of which are special cases of GT-GB2. We compare approximate results based on ML and the popular Bayesian Information Criterion (column label w3), labelled approximate ML (see Footnote 4), with exact Bayesian results under two prior model probabilities: (i) a uniform prior, which gives equal prior probability (1/34) to every model (column label w1), and (ii) a prior that follows Ockham’s razor and prefers parsimonious models (column label w2). That is, we take \(p(M_i) \propto 2^{-k}\), where k is the number of parameters. Hence, a model with one additional parameter is a priori two times less likely than its more parsimonious counterpart. This prior was first used by Osiewalski and Steel (1993), so we refer to it as the “OS prior” hereafter. The model ranking is sorted according to the exact Bayesian results under the uniform prior (w1).
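The two Bayesian weighting schemes can be written compactly; the sketch below (Python, with hypothetical log marginal data densities and parameter counts, not the entries of Table 4) shows how posterior model probabilities follow from ln p(y|M_i) under the uniform and the OS prior:

```python
import numpy as np

def posterior_model_probs(log_ml, k, os_prior=False):
    """Posterior model probabilities from log marginal data densities.

    log_ml   : ln p(y|M_i) for each model
    k        : number of parameters in each model
    os_prior : if True, use p(M_i) proportional to 2^(-k_i) (OS prior);
               otherwise a uniform prior over models.
    """
    log_ml = np.asarray(log_ml, dtype=float)
    k = np.asarray(k, dtype=float)
    log_prior = -k * np.log(2.0) if os_prior else np.zeros_like(log_ml)
    log_post = log_ml + log_prior
    log_post -= log_post.max()  # rescale for numerical stability
    w = np.exp(log_post)
    return w / w.sum()

# Hypothetical inputs for three candidate specifications.
print(posterior_model_probs([810.5, 810.5, 805.0], k=[13, 9, 8], os_prior=True))
```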

Table 4 WHO data: model comparison results

We find that the approximate ML results inflate the significance of the best model, assigning it a weight of roughly 0.957–0.979, compared to 0.426–0.640 under exact Bayesian inference. Model rankings between the two methods are largely similar, especially if we consider the Bayesian results under the OS prior (the correlation is about 0.964).

The best specification is one in which inefficiency follows a distribution similar to the generalized gamma (GG), with GB2 as its possible extension given the Bayesian results. The observation error is likely Laplace-shaped. Thus, LAP-GG is the favored specification given the data. Even though the Bayesian results also point towards five other nonnegligible specifications, we note that these are constructs very similar to LAP-GG. Furthermore, regardless of the method used (ML or Bayes), models with a half-normal distribution of inefficiency are by far the worst specifications. The most popular normal-half-normal SF model is roughly nine orders of magnitude (\(10^9\)) less likely given the data than its somewhat less popular normal-exponential counterpart. Similar differences are found for the other models that employ a half-normal distribution of inefficiency.

Table 5 presents model averaging results for the stochastic parameters based on full Bayesian inference (i.e., based on exact calculation of \(p(y|M_i)\) within the fully Bayesian framework; results for the frontier parameters are in Appendix B). The estimates of the pooled model deviate only slightly from the aforementioned LAP-GG specification. Also, given the results in Table 6, we note that \(P(\tau = 1|y)\), \(P(\psi_{\mathrm{u}} = 1|y)\) and \(P(\psi_{\mathrm{u}} = 2|y)\) are virtually zero, while \(P(\nu_{\mathrm{u}} \to \infty |y)\) is about 0.6–0.77. This indicates that a GG-type distribution of inefficiency is clearly preferred by the data, though the more general GB2 distribution has some merit.

Table 5 WHO data: posterior means of stochastic parameters across models and model averaging
Table 6 WHO data: posterior probabilities of reducing GT-GB2 based on two prior probabilities

It is worth noting that in models similar to LAP-GG (in which \(\psi_{\mathrm{u}}\) is estimated), the parameter \(\sigma_{\mathrm{u}}\) has a potentially interesting interpretation given particular values of the shape parameters \(\psi_{\mathrm{u}}\) and \(\nu_{\mathrm{u}}\). That is, the GB2 density with high values of \(\psi_{\mathrm{u}}\) and \(\nu_{\mathrm{u}}\) (or \(\nu_{\mathrm{u}} \to \infty\)) displays an approximate inefficiency truncation at \(\sigma_{\mathrm{u}}\), which implies an approximate lower bound of (theoretical) efficiency at around \(\exp(-\sigma_{\mathrm{u}})\). For example, in LAP-GG this lower bound is about 0.77.
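For illustration (the value of \(\sigma_{\mathrm{u}}\) below is backed out from the reported bound rather than quoted from the tables), an efficiency lower bound of about 0.77 corresponds to

$$\sigma_{\mathrm{u}} \approx -\ln(0.77) \approx 0.26,$$

i.e., the approximate truncation point of the inefficiency distribution implied by the LAP-GG estimates.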

6.2 Analysis of Spanish dairy farms data

The dataset contains information on 247 farms observed over 1993–1998. Again, following the literature, we use a translog production function:

$$y_{it} = \beta_0 + \beta_1 t + \sum_{j=2}^{J+1} \beta_j x_{it,j} + \sum_{j=2}^{J+1} \sum_{g \ge j}^{J+1} \beta_{j,g} x_{it,j} x_{it,g} + \mathrm{v}_{it} - \mathrm{u}_{it}$$
(14)

where y is the log of milk output, x is a vector of logs of inputs (number of milk cows, land, labor, and feed), J denotes the number of inputs (J = 4), while i = 1, ..., 247 and t = 1, …, 6 are object and time indices, respectively. Hence, vector β has 16 elements (15 slope parameters and the intercept) and we have 1482 observations.

Table 7 summarizes the results of our model specification search. Given the exact Bayesian results, there are two relevant specifications: GT-GB2 and Laplace-half-normal (LAP-HN). The two models have roughly equal explanatory power as measured by the marginal data density (ln p(y): 810.518 vs. 810.495), so their posterior probabilities largely depend on the prior. Under equal prior odds (uniform prior), the two models are equally relevant, but under the OS prior the simpler model clearly dominates the ranking. This effect is further strengthened in the approximate ML results. Again, model probability rankings between ML and Bayes are highly correlated (especially with the OS prior: 0.956 correlation), though the approximate ML results assign much higher model weights to the simpler LAP-HN model. Evidently, the penalization of larger models in the ML-based approximation is substantial, even compared to the weights based on the OS prior.
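For intuition, under equal prior odds the Bayes factor between the two leading models follows directly from the reported log marginal data densities:

$$\mathrm{BF} = \exp(810.518 - 810.495) = \exp(0.023) \approx 1.02,$$

so the posterior odds are essentially 1:1 before any parsimony penalty (via the prior model probabilities or the BIC) is applied.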

Table 7 Spanish dairy farms data: model comparison results

Similarly to the WHO case, the data prefer a Laplace-shaped observation error. This time, however, the inefficiency term is likely to be close to half-normal, with the exponential case being significantly less probable. Given the results in Table 8, the notion that τ = 1 cannot be rejected. We do note, however, a distinctive difference between the two model classes with respect to the scale parameters.

Table 8 Spanish dairy farms data: posterior probabilities of reducing GT-GB2 based on two sets of prior model probabilities

Compared to the WHO example, the model ranking is much more even. This is somewhat to be expected if we look at the differences between the highest and the lowest recorded maximum log-likelihood values for the two datasets: for WHO the difference is about 49.2, while for the Spanish dairy farms it is “only” 14.8. Similar evidence emerges when comparing the logs of the marginal data densities (ln p(y)). Hence, it is no surprise that the posterior model ranking is more even for the Spanish dairy farms and that the impact of the prior model probabilities is much more substantial.

6.3 Efficiency analysis

The key aspect of almost every applied study that uses SFA is to produce credible (in)efficiency scores. From a practitioner’s point of view, an argument can be made that, since the efficiency rankings from all the above-mentioned specifications are comparable, there is no good reason to bother with the best yet very complex specification. After all, though models like normal-half-normal or normal-exponential are far inferior (as measured by their posterior probabilities), they are simpler to use, more popular, and provide a roughly similar (in)efficiency ranking.

First, the reason for “bothering” is the very same reason that led to the introduction of SFA in the first place. Even the simplest procedures, e.g., those based on corrected OLS (COLS), can produce comparable efficiency rankings under mildly favorable conditions (i.e., when the scale parameter of the inefficiency term is much larger than that of the symmetric disturbance, which is not that rare). However, such simple procedures fail drastically in delivering credible interpretations of individual (in)efficiency levels and their differences. Quite similarly, based on our simulations and empirical examples, we find that efficiency rankings do not vary much across specifications, but we do notice substantial differences in average levels and in the spread of efficiencies (i.e., the difference between the most and the least efficient observation). The best models in our study are the ones that indicate relatively high average efficiency with a relatively narrow spread (similar observations were made by Makieła and Osiewalski 2018, within a different model averaging context). Conversely, the less adequate models, such as the popular normal-half-normal or normal-exponential, produce spurious inefficiency; i.e., they indicate more inefficiency than is warranted by the data (given the best model). In applied research, exact efficiency scores and their differences are often directly interpreted by practitioners and sometimes even linked to performance-oriented policies. Hence, such spurious inefficiency can translate into considerable losses or gains for individual agents.

Second, Fig. 2 compares the posterior densities of efficiency from the best model against the two classic cases: normal-half-normal and normal-exponential. Indeed, these plots are substantially different. In particular, we can see that the posterior density from the best model stochastically dominates the other two distributions. Such differences should not be neglected in a credible efficiency benchmark. We also note that the stochastic dominance of the (in)efficiency distribution in the most adequate SF specifications is closely related to the aforementioned spurious inefficiency observed in the least adequate models.

Fig. 2 Comparison of average-unit efficiency distribution in the best model with the normal-half-normal and normal-exponential cases: WHO data (left), Spanish dairy farms data (right)

Finally, we should point out that the presented framework allows for inference pooling with respect to (in)efficiency measurement across all models. We do not have to make any particular discrete model choice and can instead pool inference about (in)efficiency from all of them. Fig. 3 shows the posterior density of efficiency under pooled inference for both priors (the OS prior and the uniform prior). For the WHO data, we notice that the shape remains unchanged regardless of the prior model probabilities used. Also, the distribution is clearly bi-modal, with one of the modes at one (indicating full efficiency). This may provide some evidence in favor of zero-inefficiency stochastic frontier specifications (see, e.g., Kumbhakar et al. 2013; Tran and Tsionas 2016). In the case of the Spanish dairy farms, the pooled posterior distribution of efficiency is indeed driven by the prior model probabilities to some extent. Hence, whether prior model probabilities (i.e., our subjective prior beliefs about each model’s adequacy) matter for inference pooling about (in)efficiency seems to depend on the data.
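As a sketch of the pooling step (Python; the grid, the two placeholder densities and the weights are illustrative assumptions, not our estimates), the pooled posterior density of efficiency is simply a mixture of the per-model densities weighted by posterior model probabilities:

```python
import numpy as np

def pooled_density(model_densities, model_probs):
    """BMA-pooled density on a common grid:
    p(eff | y) = sum_i p(M_i | y) * p(eff | y, M_i)."""
    dens = np.asarray(model_densities, dtype=float)  # (n_models, n_grid)
    w = np.asarray(model_probs, dtype=float)         # (n_models,)
    return w @ dens

# Placeholder per-model densities on an efficiency grid in (0, 1].
grid = np.linspace(0.01, 1.0, 200)
step = grid[1] - grid[0]
d1 = np.exp(-((grid - 0.85) / 0.05) ** 2)
d2 = np.exp(-((grid - 0.70) / 0.10) ** 2)
d1, d2 = d1 / (d1.sum() * step), d2 / (d2.sum() * step)
pooled = pooled_density([d1, d2], [0.6, 0.4])
```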

Fig. 3 Posterior density of average-unit efficiency under uniform and OS priors: WHO (left) and Spanish dairy farms (right); pooled/averaged efficiency densities under the uniform and OS priors overlap

7 Conclusions

The main contribution of this paper is the development of a new stochastic frontier model, labeled GT-GB2, which generalizes almost all parametric specifications used in the field. It is based on flexible assumptions regarding the compound error, with the sampling density defined as a convolution of two densities: the generalized t (for the observation error) and the generalized beta of the second kind (for inefficiency). Consequently, it allows for a broad range of deviations from the classic N-HN or N-EXP SF models while maintaining the crucial regularity conditions described in the paper. The model admits nonstandard forms of the frontier and can be generalized towards the VED-like family or a panel data treatment. It is also possible to generalize the approach along the lines of ZISF-type models (see Kumbhakar et al. 2013), for example using the results of Harvey and Ito (2020), which is a subject for further research.

We develop exact and approximate inference methods that allow for model selection and/or averaging. This is important for addressing model specification uncertainty with respect to inference on latent variables. These variables represent individual (in)efficiency terms, which are the key quantities of interest in SFA. In order to provide an adequate description of all aspects of their estimation uncertainty (model risk) and to avoid overparameterization issues, we use full Bayesian inference with BMA and its simplified, approximate alternative. Crucially, we are able to obtain density estimates of efficiency scores that account for model uncertainty.

We have shown that with T = 1482 observations and a heavily parameterized frontier (16 parameters in vector β), it is possible to estimate the general GT-GB2 model along with its nested sub-cases and to effectively conduct model averaging procedures. Thus, it is possible to overcome two limitations of the Bayesian approach often encountered in practice. First, we address the problem of computing marginal data densities (crucial for BMA/S) by developing a simple yet reliable method via Laplace approximation. In non-Gaussian models with latent variables, this solution is usually considered infeasible, while reliable alternatives are difficult to obtain. Second, it is now possible to develop coherent priors or conduct an in-depth analysis of prior sensitivity, because our approach can be used with any choice of (proper) priors (Tsionas 2018; Makieła and Mazur 2020). We leave the derivation of optimal (or reference) priors within the GT-GB2 class for further research.
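To fix ideas, a generic sketch of this building block (Python; the function and argument names are ours, and the routine below is the textbook Laplace approximation rather than the exact implementation used in the paper):

```python
import numpy as np

def laplace_log_ml(log_post, theta_hat, neg_hessian):
    """Laplace approximation to ln p(y|M).

    log_post    : function returning the unnormalized log-posterior,
                  ln p(y|theta, M) + ln p(theta|M)
    theta_hat   : posterior mode (e.g., from a numerical optimizer)
    neg_hessian : negative Hessian of the log-posterior at theta_hat
    """
    k = len(theta_hat)
    sign, logdet = np.linalg.slogdet(neg_hessian)
    if sign <= 0:
        raise ValueError("log-posterior is not locally concave at the mode")
    return log_post(theta_hat) + 0.5 * k * np.log(2.0 * np.pi) - 0.5 * logdet
```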

In the approximate approach, point estimates can be obtained by maximizing the log-likelihood or the log-posterior using the available integrated likelihood. These results are less computationally demanding and are sufficient for model comparison (i.e., for the computation of model weights for model averaging). This amounts to the use of BICs or the aforementioned Laplace approximations (the latter require slightly more effort when the numerical Hessian of the log-posterior is unavailable). The simulation results indicate that the simplest approximate approach, based on BIC, can be effectively used for model specification search. Moreover, the results suggest that the use of model averaging or model selection indeed mitigates the risk of overparameterization, as parsimonious DGPs are correctly identified.

Our empirical analysis is based on combined inference from 34 model specifications, many of which are novel in SFA. We find it important to take the Laplace distribution into account when considering heavy tails of the symmetric term (see, e.g., Horrace and Parmeter 2018); i.e., the Gaussian vs. Student’s t alternative may be too restrictive. However, there is no particular indication about the distribution of inefficiency. So far, the inefficiency distribution appears to be data-specific, and thus GB2 is a convenient base for inefficiency specification search. Also, the simulation results indicate that the proposed framework provides more credible (in)efficiency detection. We find that inadequate models, among which are the classic normal-half-normal and normal-exponential, can show a substantial amount of spurious inefficiency. This feature has been identified both in the simulation results and in the empirical examples. That is, a significant amount of noise can be wrongly attributed to inefficiency in commonly used SF models. This leads to higher average inefficiency and a larger spread between individual inefficiency scores in comparison to the adequate specifications. This can, in a sense, be viewed in terms of stochastic dominance of the inefficiency distribution in one model over the other. The proposed framework may thus be particularly attractive for applications in which obtaining credible measures of (in)efficiency location and spread is highly relevant.

Model probability rankings based on Bayesian inference and the simplified ML-like approach are highly correlated, especially under model prior probabilities that favor parsimony. Moreover, there is a growing tendency in SFA to use heavy-tailed distributions (see, e.g., Horrace and Wang 2022). Our findings suggest that, in order to determine a given model’s adequacy, it is more important to model how the distributions of the stochastic components behave around the mode than in the tails (or to generalize beyond the t distribution for the symmetric term). Heavy tails can easily be tested within this framework, along with other features of the components’ distributions.

Since one of the key advantages of the GT-GB2 SF model is its flexibility, one might be tempted to consider nonparametric alternatives, as in Florens et al. (2020). However, despite some obvious limitations of the parametric approach, it is relatively straightforward to impose assumptions A1-A3 as well as A4 on the GT-GB2 model, which is not trivial within the nonparametric framework (see Preciado Arreola et al. 2020). Moreover, the model can also be extended to account for covariate-dependent scale/shape characteristics as well as semi- or nonparametric forms of the frontier technology. To sum up, the parametric framework considered in this paper provides a research strategy that is complementary to nonparametric SFA or stochastic DEA, with the perspective of combined use, as in Andor et al. (2019).