1 Introduction

Peter Schmidt has made many seminal contributions to statistical inference methods and their applications in time series, cross section, and panel data econometrics in general (Schmidt 1976a) and, in particular, in the areas of dynamic econometric models, estimation and testing of cross-sectional and panel data models, crime and justice models (Schmidt and Witte 1984), and survival models (Schmidt and Witte 1988). His fundamental and innovative contributions to the econometrics of stochastic frontier production/cost models have had a significant impact on generations of econometricians (e.g., Schmidt 1976b, Aigner et al. 1977, Amsler et al. 2017, Amsler et al. 2019). He has also contributed many influential papers on developing efficient procedures involving the generalized least squares (GLS) method (see Guilkey and Schmidt 1973, Schmidt 1977, Arabmazar and Schmidt 1981, Ahn and Schmidt 1995), among others. These were for parametric models, whereas here we consider nonparametric models.

Nonparametric regression function estimators are useful econometric tools. Common kernel-based methods for estimating a regression function include Kernel Regularized Least Squares (KRLS), Support Vector Machines (SVM), and Local Polynomial Regression. However, in order to avoid overfitting the data, some type of regularization, such as lasso or ridge, is generally used. In this paper, we focus on KRLS; this method is also known as Kernel Ridge Regression (KRR) in the machine learning literature and is the kernelized version of simple ridge regression that allows for nonlinearities in the model.

In this paper, we establish a procedure for fitting a nonparametric regression function via KRLS under a general parametric error covariance. Some theoretical results on KRLS, including pointwise marginal effects, unbiasedness, consistency, and asymptotic normality, are found in Hainmueller and Hazlett (2014). However, Hainmueller and Hazlett (2014) only consider homoskedastic errors, and their estimator is unbiased for the post-penalization function, not for the true underlying function. Confidence interval estimates for the Least Squares Support Vector Machine (LSSVM) are discussed in De Brabanter et al. (2011), allowing for heteroskedastic errors. Although not directly stated, the LSSVM estimator in De Brabanter et al. (2011) is equivalent to KRR/KRLS when an intercept term is included in the model. Following Hainmueller and Hazlett (2014), we will use KRLS without an intercept. Although De Brabanter et al. (2011) allow for heteroskedastic errors, none of the papers mentioned thus far discuss incorporating the error covariance in estimating the regression function itself, which makes these types of estimators inefficient. In this paper, we focus on making KRLS more efficient by incorporating a parametric error covariance, allowing for both heteroskedasticity and autocorrelation, in estimating the regression function. We use a two step procedure: in the first step, we estimate the parametric error covariance from the residuals obtained by KRLS, and in the second step, we estimate the model by KRLS based on variables transformed using the error covariance. We also provide derivative estimators based on the two step procedure, allowing us to determine the partial effects of the regressors on the dependent variable.

The structure of this paper is as follows: Sect. 2 discusses the model framework and the GKRLS estimator; Sects. 3, 4, and 5 present the finite sample properties, asymptotic properties, and partial effects and derivatives of the GKRLS estimator, respectively; Sect. 6 presents a simulation example; Sect. 7 illustrates an empirical example for a random effects model with heteroskedastic and correlated errors; and Sect. 8 concludes the paper.

2 Generalized KRLS estimator

Consider the nonparametric regression model:

$$\begin{aligned} Y_i = m(X_i) + U_i, \quad i=1,\ldots ,n, \end{aligned}$$
(1)

where \(X_i\) is a \(q\times 1\) vector of exogenous regressors, and \(U_i\) is the error term such that \(\mathbb {E}[U_i|X_{1},\ldots ,X_{n}] = \mathbb {E}[U_i|\textbf{X}]=0\), where \(\textbf{X}=(X_1,\ldots ,X_n)^\top \) and

$$\begin{aligned} \mathbb {E}[U_iU_j|\textbf{X}] = \omega _{ij}(\theta _0) \text { for some }\theta _0 \in \mathbb {R}^p, i, j = 1,\ldots ,n. \end{aligned}$$
(2)

In this framework, we allow the error covariance to be parametric, where the errors can be autocorrelated or non-identically distributed across observations.

2.1 KRLS estimator

For KRLS, the function \(m(\cdot )\) can be approximated by some function in the space of functions constituted by

$$\begin{aligned} m(\textbf{x}_0) = \sum _{i=1}^n {c}_i K_{\sigma }(\textbf{x}_i,\textbf{x}_0), \end{aligned}$$
(3)

for some test observation \(\textbf{x}_0\) and where \({c}_i,\; i=1,\ldots ,n\) are the parameters of interest, which can be thought of as the weights of the kernel functions \(K_{\sigma }(\cdot )\). The subscript of the kernel function, \(K_{\sigma }(\cdot )\), indicates that the kernel depends on the bandwidth parameter, \(\sigma \).

We will use the Radial Basis Function (RBF) kernel,

$$\begin{aligned} K_\sigma (\textbf{x}_i,\textbf{x}_0) = {\text {e}}^{-\frac{1}{\sigma ^2} || \textbf{x}_i - \textbf{x}_0||^2}. \end{aligned}$$
(4)

Notice that the RBF kernel is very similar to the Gaussian kernel, except that it does not have the normalizing constant in front, and \(\sigma \) is proportional to the bandwidth h in the Gaussian kernel often used in nonparametric local polynomial regression. This functional form is justified by a regularized least squares problem with a feature mapping function that maps \(\textbf{x}\) into a higher dimension (Hainmueller and Hazlett 2014); this derivation of KRLS is also known as Kernel Ridge Regression (KRR). Overall, KRLS uses a quadratic loss with a weighted \(L_2\)-regularization. Then, in matrix notation, the minimization problem is

$$\begin{aligned} \underset{\textbf{c}}{\arg \min } \; (\textbf{y} - \textbf{K}_{\sigma } \textbf{c})^\top (\textbf{y} - \textbf{K}_{\sigma } \textbf{c}) + \lambda \textbf{c}^\top \textbf{K}_{\sigma }\textbf{c}, \end{aligned}$$
(5)

where \(\textbf{y}\) is the vector of training data corresponding to the dependent variable, \(\textbf{K}_{\sigma }\) is the kernel matrix, with \(K_{\sigma ,i,j} = K_{\sigma }(\textbf{x}_i,\textbf{x}_j)\) for \(i,j=1,\ldots ,n\), and \(\textbf{c}\) is the vector of coefficients that is optimized over. The solution to this minimization problem is

$$\begin{aligned} \widehat{\textbf{c}}_1 = (\textbf{K}_{\sigma _1}+\lambda _1 \textbf{I})^{-1}\textbf{y}. \end{aligned}$$
(6)

The kernel function can be user specified, but in this paper we only consider the RBF kernel in Eq. (4). The kernel function’s hyperparameter \(\sigma \) and the regularization parameter \(\lambda \) can also be user specified or can be found via cross validation. The subscript 1 denotes the KRLS estimator, i.e., the first stage estimation. Finally, predictions for KRLS can be made by

$$\begin{aligned} \widehat{m}_{1}(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{1,i} K_{\sigma _1}(\textbf{x}_i,\textbf{x}_0). \end{aligned}$$
(7)
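
To make the first stage concrete, the KRLS fit in Eqs. (4)–(7) amounts to building the kernel matrix and solving one linear system. The following Python/NumPy sketch (the function names and the simulated usage data are our own illustrations, not part of any existing package) computes \(\widehat{\textbf{c}}_1\) and the predictions in Eq. (7):

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """RBF kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / sigma2), Eq. (4)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma2)

def krls_fit(X, y, sigma2, lam):
    """First stage KRLS coefficients c_hat1 = (K + lam I)^{-1} y, Eq. (6)."""
    K = rbf_kernel(X, X, sigma2)
    c_hat = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return c_hat, K

def krls_predict(X_train, c_hat, X0, sigma2):
    """Predictions m_hat1(x0) = sum_i c_hat1_i K(x_i, x0), Eq. (7)."""
    return rbf_kernel(X0, X_train, sigma2) @ c_hat

# Example usage with simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(size=200)
c1, K1 = krls_fit(X, y, sigma2=X.shape[1], lam=1.0)
m1_hat = krls_predict(X, c1, X, sigma2=X.shape[1])
```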

2.2 An efficient KRLS estimator

The KRLS estimator, \(\widehat{m}_{1}(\cdot )\), does not take into consideration any information in the error covariance structure and is therefore inefficient. As a result, consider the \(n\times n\) error covariance matrix, \(\Omega (\theta )\), where \(\omega _{ij}(\theta )\) denotes the (ij)th element. Assume that \(\Omega (\theta )=P(\theta )P(\theta )'\) for some square matrix \(P(\theta )\) and let \(p_{ij}(\theta )\) and \(v_{ij}(\theta )\) denote the (ij)th elements of \(P(\theta )\) and \(P(\theta )^{-1}\), respectively. Let \(\textbf{m}\equiv (m(X_1), \ldots , m(X_n))^\prime \) and \(\textbf{U} \equiv (U_1, \ldots , U_n)^\prime \). Now, premultiply the model in Eq. (1) by \(P^{-1}\), where we condense the notation so that \(P^{-1}=P^{-1}(\theta )\) and the dependence on \(\theta \) is implied:

$$\begin{aligned} P^{-1}\textbf{y} = P^{-1}\textbf{m}+P^{-1}\textbf{U}. \end{aligned}$$
(8)

The transformed error term, \(P^{-1}\textbf{U}\), has mean \(\varvec{0}\) and the identity matrix as its covariance matrix. Therefore, we consider a regression of \(P^{-1}\textbf{y}\) on \(P^{-1}\textbf{m}\). In the purely heteroskedastic case, this simply re-scales the variables by the inverse of the square roots of their variances. Since \(\textbf{m}=\textbf{K}_{\sigma }\textbf{c}\), the quadratic loss function with \(L_2\) regularization under the transformed variables is

$$\begin{aligned} \underset{\textbf{c}}{\arg \min } (\textbf{y}-\textbf{K}_{\sigma }\textbf{c})^{\top }\Omega ^{-1} (\textbf{y}-\textbf{K}_{\sigma }\textbf{c}) + \lambda \textbf{c}^\top \textbf{K}_{\sigma }\textbf{c}. \end{aligned}$$
(9)

The solution for the coefficient vector is

$$\begin{aligned} \hat{\textbf{c}}_2 =(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\textbf{y} \end{aligned}$$
(10)

Note that the solution obtained depends on the bandwidth parameter \(\sigma _2\) and ridge parameter \(\lambda _2\), which can be different from the hyperparameters used in the KRLS estimator. In practice, cross validation can be used to obtain estimates for both hyperparameters. Here, it is assumed that \(\Omega \) is known if \(\theta \) is known. However, if \(\theta \) is unknown, it can be estimated consistently and \(\Omega \) can be replaced by \(\widehat{\Omega }=\widehat{\Omega }(\hat{\theta })\).Footnote 1

Furthermore, predictions for the generalized KRLS estimator can be made by

$$\begin{aligned} \widehat{m}_2(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{2,i} K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0) \end{aligned}$$
(11)

The two step procedure is outlined below; a code sketch follows the list.

  1. Estimate Eq. (1) by KRLS from Eq. (7) with bandwidth parameter, \(\sigma _1\) and ridge parameter, \(\lambda _1\). Obtain the residuals which can then be used to get a consistent estimate for \(\Omega \).

  2. Estimate Eq. (8) by KRLS under the transformed variables as in Eqs. (9) and (11). Denote these estimates as GKRLS.
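
A minimal sketch of this two step procedure, reusing `rbf_kernel` and `krls_fit` from the sketch in Sect. 2.1 and assuming, purely for illustration, a heteroskedastic (diagonal) \(\widehat{\Omega }\) obtained by regressing the log squared residuals on the regressors:

```python
def gkrls_fit(X, y, sigma2_1, lam1, sigma2_2, lam2):
    """Two step GKRLS under an illustrative heteroskedastic error covariance."""
    # Step 1: KRLS fit and residuals, Eqs. (6)-(7)
    c1, K1 = krls_fit(X, y, sigma2_1, lam1)
    resid = y - K1 @ c1

    # Estimate a diagonal Omega from a simple parametric variance model
    # (other covariance specifications, e.g. AR(1), would replace this step)
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Z, np.log(resid**2 + 1e-12), rcond=None)[0]
    Omega_hat = np.diag(np.exp(Z @ beta))
    Omega_inv = np.diag(1.0 / np.exp(Z @ beta))

    # Step 2: GKRLS coefficients c_hat2 = (Omega^{-1} K + lam2 I)^{-1} Omega^{-1} y, Eq. (10)
    K2 = rbf_kernel(X, X, sigma2_2)
    c2 = np.linalg.solve(Omega_inv @ K2 + lam2 * np.eye(len(y)), Omega_inv @ y)
    return c2, K2, Omega_hat, Omega_inv
```

Predictions as in Eq. (11) are then obtained exactly as in `krls_predict`, with `c2` and \(\sigma _2^2\) in place of the first stage quantities.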

2.3 Selection of hyperparameters

Throughout this paper, we focus on the RBF kernel in Eq. (4), which contains the hyperparameter \(\sigma _1\) (and \(\sigma _2\)). Since these parameters are squared in the RBF kernel in Eq. (4), we can instead search for the hyperparameters \(\sigma _1^2\) and \(\sigma _2^2\). The hyperparameters \(\lambda _1,\lambda _2,\sigma _1^2\), and \(\sigma _2^2\) are selected via leave one out cross validation (LOOCV). However, prior to cross validation, it is common in penalized methods to scale the data to have mean 0 and standard deviation 1. This way, the penalty parameters \(\lambda _1\) and \(\lambda _2\) do not depend on the scale of the data or the magnitude of the coefficients. Note that the scaling of the data does not affect the interpretation of predictions and marginal effects since the estimates can be translated back to their original scale and location.

For the hyperparameters \(\sigma _1^2\) and \(\sigma _2^2\), Hainmueller and Hazlett (2014) suggest setting \(\sigma ^2=q\), the number of regressors. Therefore, in items 1 and 2 of the two step procedure, \(\sigma _1^2=q\) and \(\sigma _2^2=q\). Then, only the penalty hyperparameters \(\lambda _1\) and \(\lambda _2\) need to be chosen. \(\lambda _1\) is chosen via LOOCV in item 1 of the two step procedure using Eq. (5), and \(\lambda _2\) is then chosen via LOOCV in item 2 using Eq. (9). If one wishes to also search for \(\sigma _1^2\) and \(\sigma _2^2\), one would perform LOOCV to find \(\lambda _1\) and \(\sigma _1^2\) simultaneously in item 1 using Eq. (5) and then perform another LOOCV to find \(\lambda _2\) and \(\sigma _2^2\) simultaneously in item 2 using Eq. (9).
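
One way to implement the LOOCV search for the penalty parameter is the standard leave-one-out shortcut for ridge-type linear smoothers, which avoids refitting the model \(n\) times. A sketch using the helper `rbf_kernel` and the data from our earlier example (the grid is our own choice); the same idea applies to the second stage with the transformed objective in Eq. (9):

```python
def loocv_lambda(X, y, sigma2, lam_grid):
    """Pick lambda by the leave-one-out shortcut for linear smoothers:
    CV(lam) = mean(((y - L y) / (1 - diag(L)))^2), with L = K (K + lam I)^{-1}."""
    K = rbf_kernel(X, X, sigma2)
    n = len(y)
    best_lam, best_cv = None, np.inf
    for lam in lam_grid:
        L = K @ np.linalg.inv(K + lam * np.eye(n))   # smoother (hat) matrix
        resid = (y - L @ y) / (1.0 - np.diag(L))     # exact leave-one-out residuals
        cv = np.mean(resid**2)
        if cv < best_cv:
            best_lam, best_cv = lam, cv
    return best_lam

lam1 = loocv_lambda(X, y, sigma2=X.shape[1], lam_grid=np.logspace(-3, 2, 20))
```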

3 Finite sample properties

In this section, finite sample properties of both KRLS and GKRLS estimators, including the estimation procedures of bias and variance, are discussed in detail.

3.1 Estimation of bias and variance

In this subsection, we estimate the bias and variance of the two step estimator. Following De Brabanter et al. (2011), notice that the GKRLS estimator is a linear smoother.

Definition 1

An estimator \(\widehat{m}\) of m is a linear smoother if, for each \(\textbf{x}_0\in \mathbb {R}^q\), there exists a vector \(L(\textbf{x}_0)=(l_1(\textbf{x}_0),\ldots ,l_n(\textbf{x}_0))^\top \in \mathbb {R}^n\) such that

$$\begin{aligned} \widehat{m}(\textbf{x}_0) = \sum _{i=1}^n l_i(\textbf{x}_0)Y_i, \end{aligned}$$
(12)

where \(\widehat{m}(\cdot ):\mathbb {R}^{q}\rightarrow \mathbb {R}\).

For in sample data, Eq. (12) can be written in matrix form as \(\widehat{\textbf{m}}=\textbf{Ly}\), where \(\widehat{\textbf{m}}=(\widehat{m}(X_1),\ldots ,\widehat{m}(X_n))^\top \in \mathbb {R}^n\) and \(\textbf{L} = (l({X_1})^\top ,\ldots ,l({X_n})^\top )^\top \in \mathbb {R}^{n\times n}\), with \(\textbf{L}_{ij}=l_j(X_i)\). The ith row of \(\textbf{L}\) shows the weights given to each \(Y_j\) in estimating \(\widehat{m}(X_i)\). For the rest of the paper, we will denote \(\widehat{m}_2(\cdot )\) as the prediction made by GKRLS for a single observation and \(\widehat{\textbf{m}}_2\) as the \(n\times 1\) vector of predictions made for the training data.

To obtain the bias and variance of the GKRLS estimator, we assume the following:

Assumption 1

The regression function \(m(\cdot )\) to be estimated falls in the space of functions represented by \(m(\textbf{x}_0) = \sum _{i=1}^n {c}_i K_\sigma (\textbf{x}_i,\textbf{x}_0)\) and assume the model in Eq. (1).

Assumption 2

\(\mathbb {E}[U_i| \textbf{X}] = 0\) and \(\mathbb {E}[U_iU_j|\textbf{X}] = \omega _{ij}(\theta ) \text { for some }\theta \in \mathbb {R}^p, i, j = 1,\ldots ,n \)

Using Definition 1, Assumption 1, and Assumption 2, the conditional mean and variance can be obtained by the following theorem.

Theorem 1

The GKRLS estimator in Eq. (11) is

$$\begin{aligned} \begin{aligned} \widehat{m}_{2}(\textbf{x}_0)&= \sum _{i=1}^n l_i(\textbf{x}_0)Y_i\\&= {L(\textbf{x}_0)}^\top \textbf{y}, \end{aligned} \end{aligned}$$
(13)

and \(L(\textbf{x}_0)=(l_1(\textbf{x}_0),\ldots , l_n(\textbf{x}_0))^\top \) is the smoother vector,

$$\begin{aligned} L(\textbf{x}_0) = \left[ K_{\sigma _2,\textbf{x}_0}^{*\top } (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top , \end{aligned}$$
(14)

with \(K_{\sigma _2,\textbf{x}_0}^{*}= (K_{\sigma _2}(\textbf{x}_1,\textbf{x}_0),\ldots ,K_{\sigma _2}(\textbf{x}_n,\textbf{x}_0))^\top \) the kernel vector evaluated at point \(\textbf{x}_0\).

Then, the estimator, under model Eq. (1), has conditional mean

$$\begin{aligned} \mathbb {E}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]=L(\textbf{x}_0)^\top \textbf{m} \end{aligned}$$
(15)

and conditional variance

$$\begin{aligned} {\text {Var}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \Omega L(\textbf{x}_0). \end{aligned}$$
(16)

Proof

See Appendix A. \(\square \)

From Theorem 1, the conditional bias can be written as

$$\begin{aligned} \begin{aligned} {\text {Bias}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]&=\mathbb {E}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]-m(\textbf{x}_0)\\&=L(\textbf{x}_0)^\top \textbf{m} - m(\textbf{x}_0) \end{aligned} \end{aligned}$$
(17)

Following De Brabanter et al. (2011), we will estimate the conditional bias and variance by the following:

Theorem 2

Let \(L(\textbf{x}_0)\) be the smoother vector evaluated at \(\textbf{x}_0\) and let \(\widehat{\textbf{m}}_2 = (\widehat{m}_2(\textbf{x}_1), \ldots , \widehat{m}_2(\textbf{x}_n))^\top \) be the in sample GKRLS predictions. For a consistent estimator of the covariance matrix such that \(\widehat{\Omega }\rightarrow \Omega \), the estimated conditional bias and variance for GKRLS are obtained by

$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}_{2}(\textbf{x}_0) \end{aligned}$$
(18)

and

$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \widehat{\Omega } L(\textbf{x}_0). \end{aligned}$$
(19)

Proof

See Appendix B. \(\square \)
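
In code, Theorems 1 and 2 reduce to a few matrix products. A sketch using the quantities `K2`, `Omega_hat`, `Omega_inv`, and `c2` returned by the earlier `gkrls_fit` sketch (all of these names are our own illustrations):

```python
def gkrls_smoother_vector(x0, X, K2, Omega_inv, lam2, sigma2_2):
    """Smoother vector L(x0) from Eq. (14)."""
    k_star = rbf_kernel(x0[None, :], X, sigma2_2).ravel()          # kernel vector at x0
    A = np.linalg.solve((Omega_inv @ K2 + lam2 * np.eye(len(X))).T, k_star)
    return Omega_inv.T @ A                                          # [(Omega^{-1}K + lam I)^{-1} Omega^{-1}]^T k*

def gkrls_bias_var(x0, X, K2, Omega_inv, Omega_hat, c2, lam2, sigma2_2):
    """Estimated conditional bias and variance of m_hat2(x0), Eqs. (18)-(19)."""
    L0 = gkrls_smoother_vector(x0, X, K2, Omega_inv, lam2, sigma2_2)
    m2_hat_in = K2 @ c2                                             # in sample GKRLS fits
    m2_hat_x0 = (rbf_kernel(x0[None, :], X, sigma2_2) @ c2)[0]      # Eq. (11)
    bias_hat = L0 @ m2_hat_in - m2_hat_x0                           # Eq. (18)
    var_hat = L0 @ Omega_hat @ L0                                   # Eq. (19)
    return bias_hat, var_hat
```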

3.2 Bias and variance of KRLS

First, note that the KRLS estimator is also a linear smoother, so the bias and the variance take the same form as in Eqs. (18) and (19), except that the linear smoother vector \(L(\textbf{x}_0)\) will be different. Let

$$\begin{aligned} L_{1}(\textbf{x}_0)&=\left[ K_{\sigma _1,\textbf{x}_0}^{*\top } (\textbf{K}_{\sigma _1}+\lambda _1 \textbf{I})^{-1} \right] ^\top \end{aligned}$$
(20)

be the smoother vector for KRLS. Then, Eq. (7) can be rewritten as

$$\begin{aligned} \widehat{m}_{1}(\textbf{x}_0) = L_{1}(\textbf{x}_0)^\top \textbf{y}. \end{aligned}$$
(21)

Using Theorem 1 and Theorem 2 and applying them to the KRLS estimator, the estimated conditional bias and variance of KRLS are

$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{1}(\textbf{x}_0)|X=\textbf{x}_0]&= L_{1}(\textbf{x}_0)^\top \widehat{\textbf{m}}_{1} - \widehat{m}_{1}(\textbf{x}_0) \end{aligned}$$
(22)
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{1}(\textbf{x}_0)|X=\textbf{x}_0]&= L_{1}(\textbf{x}_0)^\top \widehat{\Omega } L_{1}(\textbf{x}_0), \end{aligned}$$
(23)

where \(\widehat{\textbf{m}}_{1}\) is the \(n\times 1\) vector of fitted values for KRLS. Note that the estimate of the covariance matrix, \(\Omega \), will be the same for both KRLS and GKRLS.

4 Asymptotic properties

The asymptotic properties of GKRLS, including consistency, asymptotic normality, and bias corrected confidence intervals are covered in this section. To obtain consistency of the GKRLS estimator, we also assume:

Assumption 3

Let \(\lambda _1,\lambda _2,\sigma _1,\sigma _2>0\) and as \(n\rightarrow \infty \), for singular values of \(\textbf{L}P\) given by \(d_i\), \(\sum _{i=1}^n d_i^2\) grows slower than n once \(n>M\) for some \(M<\infty \).

Theorem 3

Under Assumptions 1–3, let the bias corrected fitted values be denoted by

$$\begin{aligned} \widehat{\textbf{m}}_{2,c}=\widehat{\textbf{m}}_{2}-{\text {Bias}}[\widehat{\textbf{m}}_{2}|\textbf{X}], \end{aligned}$$
(24)

then

$$\begin{aligned} \underset{n\rightarrow \infty }{{\text {lim}}} {\text {Var}}[\widehat{\textbf{m}}_{2,c}|\textbf{X}]=0 \end{aligned}$$
(25)

and the bias corrected GKRLS estimator is \(\sqrt{n}\)-consistent with \(\underset{n\rightarrow \infty }{{\text {plim}}} \; \widehat{m}_{c,n}(\textbf{x}_{i})=m(\textbf{x}_i)\) for all i.

Proof

See Appendix C. \(\square \)

The estimated conditional bias from Eq. (18) and conditional variance from Eq. (19) can be used to construct pointwise confidence intervals. Asymptotic normality of the proposed estimator is given via the central limit theorem.

Theorem 4

Under Assumptions 1–3, \(\widehat{\textbf{m}}_2\) is asymptotically normal by the central limit theorem:

$$\begin{aligned} \sqrt{n}(\widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m})\overset{d}{\rightarrow } N(\varvec{0},{\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]), \end{aligned}$$
(26)

where \({\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}] = \textbf{Lm}-\textbf{m}\) and \({\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}] = \textbf{L}\Omega \textbf{L}^\top \).

Proof

See Appendix D. \(\square \)

Since GKRLS is a biased estimator for m, we need to adjust the pointwise confidence intervals to allow for bias. Since the exact conditional bias and variance are unknown, we can use Eqs. (18) and (19) as estimates and construct approximate bias corrected \(100(1-\alpha )\%\) pointwise confidence intervals from Theorem 4 as

$$\begin{aligned} \widehat{{m}}_{2}(\textbf{x}_i)-\widehat{{\text {Bias}}}[\widehat{{m}}_{2}(\textbf{x}_i)|X =\textbf{x}_i] \pm z_{1-\alpha /2}\sqrt{\widehat{{\text {Var}}}[\widehat{{m}}_{2}(\textbf{x}_i)|X=\textbf{x}_i]} \end{aligned}$$
(27)

for all i. Furthermore, to test the significance of the estimated regression function at an observation point, we can use the bias corrected confidence interval to see if 0 is in the interval.
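
The interval in Eq. (27) then combines the estimated bias and variance from the sketch above; a brief sketch, where `norm.ppf` supplies the normal quantile \(z_{1-\alpha /2}\):

```python
from scipy.stats import norm

def gkrls_pointwise_ci(x0, X, K2, Omega_inv, Omega_hat, c2, lam2, sigma2_2, alpha=0.05):
    """Approximate bias corrected 100(1 - alpha)% pointwise CI, Eq. (27)."""
    bias_hat, var_hat = gkrls_bias_var(x0, X, K2, Omega_inv, Omega_hat, c2, lam2, sigma2_2)
    m2_hat_x0 = (rbf_kernel(x0[None, :], X, sigma2_2) @ c2)[0]
    center = m2_hat_x0 - bias_hat                 # bias corrected point estimate
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_hat)
    return center - half, center + half
```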

5 Partial effects and derivatives

We also derive an estimator for pointwise partial derivatives with respect to a certain variable \(\textbf{x}^{(r)}\). The partial derivative of the GKRLS estimator, \(\widehat{m}_{2}(\textbf{x}_0)\) with respect to the rth variable is

$$\begin{aligned} \begin{aligned} \widehat{m}_{2,r}^{(1)}(\textbf{x}_0)&= \sum _{i=1}^n\frac{\partial K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0)}{\partial \textbf{x}_0^{(r)}} \widehat{c}_{2,i}\\&=\frac{2}{\sigma _2^2} \sum _{i=1}^n {\text {e}}^{-\frac{1}{\sigma _2^2} || \textbf{x}_i - \textbf{x}_0||^2} \big (\textbf{x}_i^{(r)}-\textbf{x}_0^{(r)}\big ) \widehat{c}_{2,i}, \end{aligned} \end{aligned}$$
(28)

using the RBF kernel in Eq. (4) and where \(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)\equiv \frac{\partial \widehat{m}_{2}(\textbf{x}_0)}{\partial \textbf{x}^{(r)}}\). To find the conditional bias and variance of the derivative estimator, we use the following:

Theorem 5

The GKRLS derivative estimator in Eq. (28) with the RBF kernel in Eq. (4) can be rewritten as

$$\begin{aligned} \widehat{m}_{2,r}^{(1)}(\textbf{x}_0)&=S_r(\textbf{x}_0)^\top \textbf{y}, \end{aligned}$$
(29)

where \(\Delta _r \equiv \frac{2}{\sigma _2^2}{\text {diag}} (\textbf{x}_1^{(r)}-\textbf{x}_0^{(r)},\ldots ,\textbf{x}_n^{(r)}-\textbf{x}_0^{(r)})\) is an \(n\times n\) diagonal matrix, and

$$\begin{aligned} S_r(\textbf{x}_0)=\left[ K_{\sigma _2,\textbf{x}_0}^{*\top } \Delta _r (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top \end{aligned}$$
(30)

is the smoother vector for the first partial derivative with respect to the rth variable. Then, the conditional mean of the GKRLS derivative estimator is

$$\begin{aligned} \mathbb {E}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0]=S_r(\textbf{x}_0)^\top \textbf{m} \end{aligned}$$
(31)

and conditional variance is

$$\begin{aligned} {\text {Var}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0] = S_r(\textbf{x}_0)^\top \Omega S_r(\textbf{x}_0). \end{aligned}$$
(32)

Proof

See Appendix E. \(\square \)

Using Theorem 5, the conditional bias and variance can be estimated as follows:

Theorem 6

Let \(S_r(\textbf{x}_0)\) be the smoother vector for the partial derivative evaluated at \(\textbf{x}_0\) and let \(\widehat{\textbf{m}}_2 = (\widehat{m}_2(\textbf{x}_1), \ldots , \widehat{m}_2(\textbf{x}_n))^\top \) be the in sample GKRLS predictions. For a consistent estimator of the covariance matrix such that \(\widehat{\Omega }\rightarrow \Omega \), the estimated conditional bias and variance for GKRLS derivative estimator in Eq. (28) are obtained by

$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0] = S_r(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}^{(1)}_{2,r}(\textbf{x}_0) \end{aligned}$$
(33)

and

$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0] = S_r(\textbf{x}_0)^\top \widehat{\Omega } S_r(\textbf{x}_0). \end{aligned}$$
(34)

Proof

See Appendix F. \(\square \)
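
The derivative smoother in Eq. (30) differs from \(L(\textbf{x}_0)\) only through the diagonal matrix \(\Delta _r\); a sketch under the same assumptions and naming conventions as the earlier code:

```python
def gkrls_derivative(x0, r, X, K2, Omega_inv, Omega_hat, c2, lam2, sigma2_2):
    """Pointwise partial derivative, Eq. (28), and its estimated bias/variance, Eqs. (33)-(34)."""
    k_star = rbf_kernel(x0[None, :], X, sigma2_2).ravel()
    delta_r = (2.0 / sigma2_2) * (X[:, r] - x0[r])                  # diagonal of Delta_r
    deriv_hat = (k_star * delta_r) @ c2                             # Eq. (28)

    # Smoother vector S_r(x0) from Eq. (30)
    A = np.linalg.solve((Omega_inv @ K2 + lam2 * np.eye(len(X))).T, k_star * delta_r)
    S_r = Omega_inv.T @ A
    m2_hat_in = K2 @ c2
    bias_hat = S_r @ m2_hat_in - deriv_hat                          # Eq. (33)
    var_hat = S_r @ Omega_hat @ S_r                                 # Eq. (34)
    return deriv_hat, bias_hat, var_hat
```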

The average partial derivative with respect to the rth variable is

$$\begin{aligned} \widehat{m}_{avg,r}^{(1)}=\frac{1}{n^\prime } \sum _{j=1}^{n^\prime } \widehat{m}^{(1)}_{2,r}(\textbf{x}_{0,j}) \end{aligned}$$
(35)

The bias and variance of the average partial derivative estimator are given by

$$\begin{aligned} {\text {Bias}}[ \widehat{m}_{avg,r}^{(1)}|X]=\frac{1}{n^\prime } \varvec{\iota }_{n^\prime }^\top \textbf{S}_{0,r} \textbf{m}- \frac{1}{n^\prime }\varvec{\iota }_{n^\prime }^\top \textbf{m}_{0,r}^{(1)} \end{aligned}$$
(36)

and

$$\begin{aligned} {\text {Var}}[\widehat{m}_{avg,r}^{(1)}|X] = \frac{1}{n^{\prime ^2}} \varvec{\iota }^\top _{n^\prime } \textbf{S}_{0,r} \Omega \textbf{S}_{0,r}^\top \varvec{\iota }_{n^\prime } , \end{aligned}$$
(37)

where \(n^\prime \) is the number of observations in the testing set, \(\varvec{\iota }_{n^\prime }\) is an \(n^\prime \times 1\) vector of ones, \(\textbf{S}_{0,r}\) is the \(n^\prime \times n\) smoother matrix with the jth row given by \(S_r(\textbf{x}_{0,j}), j=1,\ldots ,n^\prime \), and \(\textbf{m}_{0,r}^{(1)}\) is the \(n^\prime \times 1\) vector of derivatives evaluated at each \(\textbf{x}_{0,j},j=1,\ldots ,n^\prime \).

5.1 First differences for binary independent variables

Unlike for the continuous case, partial effects for binary independent variables should be interpreted as and estimated by first differences. That is, the estimated effect of going from \(x^{(b)}=0\) to \(x^{(b)}=1\) can be determined by

$$\begin{aligned} \begin{aligned} \widehat{m}_{FD_b}(\textbf{x}_0)&=\widehat{m}(x^{(b)}=1,\textbf{x}_0) - \widehat{m}(x^{(b)}=0,\textbf{x}_0)\\&=L_{FD_b}(\textbf{x}_0)^\top \textbf{y} \end{aligned} \end{aligned}$$
(38)

where \(\widehat{m}_{FD_b}(\cdot )\) is the first difference estimator for the bth binary independent variable, \(x^{(b)}\) is a binary variable that takes the values 0 or 1, \(\textbf{x}_0\) is the \((q-1)\times 1\) vector of the other independent variables evaluated at some test observation, and \(L_{FD_b}(\textbf{x}_0) \equiv L(x^{(b)}=1,\textbf{x}_0)-L(x^{(b)}=0,\textbf{x}_0)\) is the first difference smoother vector. The conditional bias and variance of the first difference GKRLS estimator in Eq. (38) are shown in the following theorem.

Theorem 7

Using Theorems 1 and 2, the conditional bias and variance for the GKRLS first difference estimator in Eq. (38) are obtained by

$$\begin{aligned} {{\text {Bias}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0] = L_{FD_b}(\textbf{x}_0)^\top {\textbf{m}} - m_{FD_b}(\textbf{x}_0) \end{aligned}$$
(39)

and

$$\begin{aligned} {{\text {Var}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0] = L_{FD_b}(\textbf{x}_0)^\top {\Omega } L_{FD_b}(\textbf{x}_0), \end{aligned}$$
(40)

where \(m_{FD_b}(\textbf{x}_0)={m}(x^{(b)}=1,\textbf{x}_0) - {m}(x^{(b)}=0,\textbf{x}_0)\).

Proof

See Appendix G. \(\square \)

Then, the conditional bias and variance can be estimated as follows:

$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0]= & {} L_{FD_b}(\textbf{x}_0)^\top \widehat{\textbf{m}} - \widehat{m}_{FD_b}(\textbf{x}_0) \end{aligned}$$
(41)
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0]= & {} L_{FD_b}(\textbf{x}_0)^\top \widehat{\Omega } L_{FD_b}(\textbf{x}_0). \end{aligned}$$
(42)

Note that Eq. (38) provides the pointwise first difference estimates. If one is interested in the average partial effect of going from \(x^{(b)}=0\) to \(x^{(b)}=1\), the following average first difference GKRLS estimator would be used.

$$\begin{aligned} \widehat{m}_{\overline{FD},b} = \frac{1}{n^\prime } \sum _{j=1}^{n^\prime } \widehat{m}_{FD_b}(\textbf{x}_{0,j}). \end{aligned}$$
(43)

This average partial effect of a discrete variable is similar to the continuous case and can be compared to traditional parametric partial effects as in the case of least squares coefficients. The conditional bias and variance of the average first difference GKRLS estimator in Eq. (43) are:

$$\begin{aligned} {{\text {Bias}}}[\widehat{m}_{\overline{FD}_b}(\textbf{x}_0)|X=\textbf{x}_0]= & {} \frac{1}{n^\prime }\varvec{\iota }^{\top }_{n^\prime } \textbf{L}_{FD_{0,b}} {\textbf{m}} - \frac{1}{n^\prime }\varvec{\iota }^{\top }_{n^\prime } \textbf{m}_{FD_{0,b}} \end{aligned}$$
(44)
$$\begin{aligned} {{\text {Var}}}[\widehat{m}_{\overline{FD}_b}|X=\textbf{x}_0]= & {} \frac{1}{n^{\prime ^2}}\varvec{\iota }^{\top }_{n^\prime } \textbf{L}_{FD_{0,b}} {\Omega } \textbf{L}_{FD_{0,b}}^\top , \end{aligned}$$
(45)

where \(\textbf{L}_{FD_{0,b}}\) is the \(n^\prime \times n\) smoother matrix with the jth row as \(L_{FD_b}(\textbf{x}_{0,j}), j=1,\ldots ,n^\prime \), and \(\textbf{m}_{FD_{0,b}}\) is the \(n^\prime \times 1\) vector of first differences evaluated at each \(\textbf{x}_{0,j},j=1,\ldots ,n^\prime \). The conditional bias and variance of the average first difference estimator can be estimated using Eqs. (41) and (42).
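
A brief sketch of the pointwise first difference in Eq. (38), obtained by evaluating the GKRLS prediction at the two values of the binary regressor (names reused from the earlier sketches; for simplicity the code treats the test point as the full q-vector and overwrites its bth coordinate):

```python
def gkrls_first_difference(x0, b, X, c2, sigma2_2):
    """Pointwise first difference, Eq. (38): m_hat2(x_b = 1, x0) - m_hat2(x_b = 0, x0)."""
    x_one, x_zero = x0.copy(), x0.copy()
    x_one[b], x_zero[b] = 1.0, 0.0
    preds = rbf_kernel(np.vstack([x_one, x_zero]), X, sigma2_2) @ c2
    return preds[0] - preds[1]
```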

6 Simulations

We conduct simulations to show the efficiency gains of the proposed generalized KRLS estimator. Consider the data generating process from Eq. (1):

$$\begin{aligned} Y_i = m(X_i)+U_i, \quad i=1,\ldots ,n. \end{aligned}$$
(1)

We consider a sample size of \(n=200\) and three independent variables generated from

$$\begin{aligned} \begin{aligned} X_1&\sim {Bern}(0.5)\\ X_2&\sim N(0,1)\\ X_3&\sim U(-1,1). \end{aligned} \end{aligned}$$
(46)

The specification for m is:

$$\begin{aligned} m(X_i)=5-2X_{i,1}+\sin (X_{i,2})+3X_{i,3} \end{aligned}$$
(47)

and the partial derivatives with respect to each independent variable are given by

$$\begin{aligned} \begin{aligned} m^{(1)}_1(X_i)&=-2\\ m^{(1)}_2(X_i)&= \cos (X_{i,2})\\ m^{(1)}_3(X_i)&= 3 \end{aligned} \end{aligned}$$
(48)

For the error terms, we consider two cases.

$$\begin{aligned} \begin{gathered} U_i=0.7U_{i-1}+V_i\\ V_i\sim N(0,5^2)\\ \end{gathered} \end{aligned}$$
(49)

and

$$\begin{aligned} U_i\sim N\left( 0,{\text {exp}}(X_{i,1}+0.2X_{i,2}-0.3X_{i,3})\right) \end{aligned}$$
(50)

First, in Eq. (49), \(U_i\) is generated by an AR(1) process. Second, in Eq. (50), the \(U_i\) are heteroskedastic but mutually independent with \({\text {Var}}[U_i|\textbf{X}]={\text {exp}}(X_{i,1}+0.2X_{i,2}-0.3X_{i,3})\).
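
For concreteness, the data generating process in Eqs. (46)–(50) can be simulated as follows (a sketch; the random seed and the AR(1) initialization are our own choices):

```python
def simulate_dgp(n=200, errors="ar1", seed=0):
    """Simulate Y = m(X) + U with m from Eq. (47) and errors from Eq. (49) or (50)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.binomial(1, 0.5, n),      # X1 ~ Bern(0.5)
        rng.normal(0, 1, n),          # X2 ~ N(0, 1)
        rng.uniform(-1, 1, n),        # X3 ~ U(-1, 1)
    ])
    m = 5 - 2 * X[:, 0] + np.sin(X[:, 1]) + 3 * X[:, 2]     # Eq. (47)
    if errors == "ar1":                                      # Eq. (49): U_i = 0.7 U_{i-1} + V_i
        V = rng.normal(0, 5, n)
        U = np.zeros(n)
        U[0] = V[0]
        for t in range(1, n):
            U[t] = 0.7 * U[t - 1] + V[t]
    else:                                                    # Eq. (50): heteroskedastic errors
        sd = np.sqrt(np.exp(X[:, 0] + 0.2 * X[:, 1] - 0.3 * X[:, 2]))
        U = rng.normal(0, 1, n) * sd
    return X, m + U, m
```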

In addition to the proposed estimator, we compare four other nonparametric estimators: the KRLS estimator (KRLS), the Local Polynomial (LP) estimator with degree zero, Random Forest (RF), and the Support Vector Machine (SVM). The KRLS estimator is used as a comparison to GKRLS to show the magnitude of the efficiency loss from ignoring the information in the error covariance matrix. In addition, the KRLS, LP, RF, and SVM estimators do not utilize the covariance matrix in estimating the regression function and thus ignore heteroskedasticity and autocorrelation in the errors. For the GKRLS and KRLS estimators, we set \(\sigma _1^2=\sigma _2^2=3\), the number of independent variables in this example, and implement leave one out cross validation to select the hyperparameters \(\lambda _1\) and \(\lambda _2\).Footnote 2 The variance function under the heteroskedastic case is estimated by least squares from the regression of the log residuals on X; taking the exponential gives the predicted variance estimates. Under the case of AR(1) errors, the covariance function is estimated from an AR(1) model. We run 200 simulations for each of the two cases and the bias corrected results are reported in Table 1.Footnote 3 To evaluate the estimators, mean squared error is used as the main criterion, and we also investigate the bias and variance. To compare results, all estimators are evaluated at 300 data points generated from Eqs. (46) and (47).

Table 1 The table reports the bias, variance, and MSE of GKRLS, KRLS, LP, RF, and SVM estimators for the regression function \(m(\textbf{x}_0)\) under the cases of heteroskedastic and AR(1) errors generated from Eqs. (46),(47),(49) and (50). The GKRLS and KRLS estimates are bias corrected. All estimates are averaged across all simulations

Table 1 displays the evaluations, including bias, variance, and MSE of the estimators for the regression function under both error cases. Note that the GKRLS and KRLS estimates in Table 1 are bias corrected. All estimates are averaged across all simulations. Estimates based on GKRLS seem to exhibit similar finite sample bias as KRLS, and there is an obvious reduction in the variability with smaller variance of the proposed estimator relative to KRLS. Note that GKRLS estimation provides a 31.6% and a 3.6% decrease in the variance for estimating the regression function for the autocorrelated and heteroskedastic errors, relative to KRLS. With smaller variance, GKRLS also has a smaller MSE, making GKRLS superior to KRLS. Compared to the other nonparametric estimators, LP, RF, and SVM, the GKRLS estimator outperforms the others in terms of MSE and is the preferred method in the presence of heteroskedasticity or autocorrelation.

Table 2 The table reports the bias, variance, and MSE of the bias corrected GKRLS and KRLS estimators and the cases of heteroskedastic and AR(1) errors for the derivative of the regression function \(m^{(1)}_r(\textbf{x}_0)\) generated from Eqs. (46)–(50). Each row represents the MSE, variance, and bias of the partial derivative estimates with respect to \(X_r\), \(r=1,2,3\). All estimates are averaged across all simulations
Table 3 The table reports the bias, variance, and MSE of the GKRLS estimator for both the regression function and the partial derivatives and for the cases of heteroskedastic and AR(1) errors generated from Eqs. (46)–(50) for different sample sizes, \(n=100,200,400\). All reported estimates are bias corrected and are averaged across all simulations. The kernel hyperparameters are set as \(\sigma _1^2=\sigma _2^2=3\) and the hyperparameters \(\lambda _1\) and \(\lambda _2\) are found by LOOCV

Table 2 displays the evaluations, including bias, variance, and MSE of the bias corrected GKRLS and KRLS estimators for the partial derivatives of the regression function with respect to each of the independent variables under both error cases.Footnote 4 Since \(X_1\) is discrete, the partial derivative is estimated by the first differences discussed in Sect. 5.1. Similar to the regression estimates, for both heteroskedastic and AR(1) errors, the variability from estimating the derivative is reduced by GKRLS estimation relative to KRLS estimation. In addition, the efficiency gain in estimating both the regression function and the derivative seems to be more evident in the AR(1) case compared to the heteroskedastic case. A possible explanation is that the covariance matrix contains more information in its off-diagonal elements compared to the diagonal covariance matrix in the heteroskedastic case. Overall, when estimating the regression function and its derivative in this simulation example, the reduction in variance and therefore MSE is clearly evident in Tables 1 and 2, making GKRLS the preferred estimator.

Table 3 shows the simulation results for the consistency of GKRLS. The bias, variance, and MSE are reported for sample sizes of \(n=100,200,400\). In this example, we set \(\sigma _1^2=\sigma _2^2=3\) and the hyperparameters \(\lambda _1\) and \(\lambda _2\) are found by LOOCV. For the regression function and the derivative and for both error covariance structures, the squared bias, variance, and MSE all decrease as the sample size increases, which implies that the GKRLS estimator is consistent in this simulation exercise.

7 Application

We implement an empirical application from the U.S. airline industry with heteroskedastic and autocorrelated errors using a panel of 6 firms over 15 years.Footnote 5 For the data set, we set aside a portion of the data for training and the other for testing. We estimate the model with four methods, GKRLS, KRLS, LP, and Generalized Least Squares (GLS), and compare their results in terms of mean squared error (MSE). To evaluate the out of sample performance of each method, the predicted out of sample MSEs are computed as follows

$$\begin{aligned} MSE_e=\frac{1}{n^\prime T}\sum _{i=1}^{n^\prime }\sum _{t=1}^T \big (y_{0,it}-\widehat{m}_e(\textbf{x}_{0,it})\big )^2 \end{aligned}$$
(51)

where \(MSE_e\) is the mean squared error for the \(e^{th}\) estimator and \(n^\prime \) is the number of firms in the testing data set, indexed by \(i=1,\ldots ,n^\prime \). In this empirical exercise, \(n^\prime =1\) and \(T=15\) since we leave out the first firm as a test set. To assess the estimated average derivatives, we use the bootstrap to calculate the MSEs for the average partial effects. We report the bootstrapped MSEs for the average derivative as follows.Footnote 6

$$\begin{aligned} MSE_{e,r}=\frac{1}{B} \sum _{b=1}^B \left( \widehat{m}^{(1)}_{avg,e,r,b} - \frac{1}{4}\sum _{e} \widehat{m}^{(1)}_{avg,e,r}\right) ^2 \end{aligned}$$
(52)

where B is the number of bootstraps with \(b=1,\ldots ,B\), \(\widehat{m}_{avg,e,r,b}^{(1)}(\cdot )\) is the \(b^{th}\) bootstrapped average partial first derivative with respect to the \(r^{th}\) variable for the \(e^{th}\) estimator, and \(\frac{1}{4}\sum _e\widehat{m}^{(1)}_{avg,e,r}\) is the simple average of the average partial first derivatives with respect to the \(r^{th}\) variable from the four estimators (GLS, GKRLS, KRLS, and LP):

$$\begin{aligned} \begin{aligned} \widehat{m}_{avg,e,r}^{(1)} =&\,\frac{1}{nT}\sum _{i = 1}^{n} \sum _{t = 1}^{T} \widehat{m}_{e,r}^{(1)} (\textbf{x}_{it}), \\ e =&\,\left\{ {\text {GLS}},{\text {GKRLS}},{\text {KRLS}},{\text {LP}} \right\} \\ \end{aligned} \end{aligned}$$
(53)
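
A compact sketch of the evaluation criteria in Eqs. (51) and (52); the arrays of bootstrapped averages and the four point estimates are assumed to have been produced by whatever routine fits each estimator:

```python
def out_of_sample_mse(y_test, m_hat_test):
    """Out of sample MSE for one estimator, Eq. (51)."""
    return np.mean((np.asarray(y_test) - np.asarray(m_hat_test)) ** 2)

def bootstrap_mse_avg_derivative(boot_avg_derivs, avg_derivs_all_estimators):
    """Bootstrapped MSE for the average partial derivative, Eq. (52).
    boot_avg_derivs: length-B array of bootstrapped averages for one estimator.
    avg_derivs_all_estimators: the four point estimates entering the simple average."""
    benchmark = np.mean(avg_derivs_all_estimators)       # (1/4) sum over estimators
    return np.mean((np.asarray(boot_avg_derivs) - benchmark) ** 2)
```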

7.1 U.S. airline industry

We obtain the data on the efficiency in production of airline services from Greene (2018). Since the data are a panel of 6 firms for 15 years, we consider the one way random effects model:

$$\begin{aligned} \log C_{it}&=m(\log Q_{it},\log P_{it})+\alpha _i +\varepsilon _{it}, \end{aligned}$$
(54)

where the dependent variable \(Y_{it} = \log C_{it}\) is the logarithm of total cost, the independent variables \(X_{it} = (\log Q_{it}, \log P_{it})^{\top }\) are the logarithms of output and the price of fuel, respectively, \(\alpha _i\) is the firm specific effect, and \(\varepsilon _{it}\) is the idiosyncratic error term. In this empirical setting, we assume \(\mathbb {E}[\varepsilon _{it}|\textbf{X}]=0,\; \mathbb {E}[\varepsilon _{it}^2|\textbf{X}]=\sigma ^2_{\varepsilon _{i}},\; \mathbb {E}[\alpha _i|\textbf{X}]=0,\; \mathbb {E}[\alpha _i^2|\textbf{X}]=\sigma ^2_{\alpha _i},\; \mathbb {E}[\varepsilon _{it}\alpha _j|\textbf{X}]=0\) for all \(i, t, j\), \(\mathbb {E}[\varepsilon _{it}\varepsilon _{js}|\textbf{X}]=0\) if \(t\ne s\) or \(i\ne j\), and \(\mathbb {E}[\alpha _i\alpha _j|\textbf{X}]=0\) if \(i\ne j\). Consider the composite error term \(U_{it}\equiv \alpha _i+\varepsilon _{it}\). Then, the model in Eq. (54) can be rewritten as

$$\begin{aligned} \log C_{it}=m(\log Q_{it},\log P_{it})+U_{it}, \end{aligned}$$
(55)

In Eq. (55), the independent variables are strictly exogenous to the composite error term, \(\mathbb {E}[U_{it}|\textbf{X}]=0\). The variance of the composite error term is \(\mathbb {E}[U_{it}^2|\textbf{X}]=\sigma ^2_{\alpha _i}+\sigma ^2_{\varepsilon _{i}}\). Therefore, in this empirical example, we allow for firm specific heteroskedasticity. In other words, the variance of the error terms are not constant across firms, but are constant over time for each firm. Since there is a time component, we allow an individual firm to be correlated across time but not with other firms, that is, \(\mathbb {E}[U_{it}U_{is}|\textbf{X}]=\sigma ^2_{\alpha _i}, \; t\ne s\) and \(\mathbb {E}[U_{it}U_{js}|\textbf{X}]=0\) for all t and s if \(i\ne j\). Note that the correlation across time can be different for every firm. Therefore, in this empirical framework, we allow the error terms to be heteroskedastic across firms and correlated across time.

To estimate Eq. (55) by GKRLS and KRLS in the framework set up in this paper, we can write the model in matrix notation. Consider

$$\begin{aligned} \textbf{y} = \textbf{m}+\textbf{U}, \end{aligned}$$
(56)

where \(\textbf{y}\) is the \(nT\times 1\) vector of \(\log C_{it}\), \(\textbf{m}\) is the \(nT\times 1\) vector of the regression function \(m(X_{it})\), and \(\textbf{U}\) is the \(nT\times 1\) vector of \(U_{it}\), \(i=1,\ldots ,n\) and \(t=1,\ldots ,T\). Then, the \(nT\times nT\) error covariance matrix \(\Omega \) is

$$\begin{aligned} \Omega ={\text {Var}}[\textbf{U}|\textbf{X}] = {\text {diag}}(\Sigma _1, \ldots ,\Sigma _n), \end{aligned}$$
(57)

where \(\Sigma _i=\sigma ^2_{\varepsilon _{i}}\textbf{I}_T +\sigma ^2_{\alpha _i} \varvec{\iota }_T\varvec{\iota }^\top _T, i=1,\ldots ,n\) has dimension \(T\times T\), \(\textbf{I}_T\) is a \(T\times T\) identity matrix and \(\varvec{\iota }_T\) is a \(T\times 1\) vector of ones. To use the GKRLS estimator in this empirical framework, we first estimate Eqs. (55) or (56) by KRLS and obtain the residuals, denoted by \(\widehat{u}_{it}\). To estimate the error covariance matrix \(\Omega \), the variances of the firm specific error and the idiosyncratic error, \(\sigma ^2_{\alpha _i}\) and \(\sigma ^2_{\varepsilon _{i}}\) need to be estimated. Consider the following consistent estimators using time averages,

$$\begin{aligned} \widehat{\sigma }^2_{U_i}= & {} \frac{1}{T} \widehat{\textbf{u}}_i^\top \widehat{\textbf{u}}_i \end{aligned}$$
(58)
$$\begin{aligned} \widehat{\sigma }^2_{\alpha _i}= & {} \frac{1}{T(T-1)/2} \sum _{t=1}^{T-1} \sum _{s=t+1}^{T} \widehat{u}_{it}\widehat{u}_{is} \end{aligned}$$
(59)
$$\begin{aligned} \widehat{\sigma }^2_{\varepsilon _{i}}= & {} \widehat{\sigma }^2_{U_i} - \widehat{\sigma }^2_{\alpha _i}, \end{aligned}$$
(60)

where \(\widehat{\textbf{u}}_i\) is the \(T\times 1\) vector of residuals for the ith firm. Now, plugging these estimates in for \(\Omega \), the GKRLS estimator can be estimated as in the previous sections. For further details, please see Appendix H.
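
A sketch of the covariance estimation in Eqs. (57)–(60), assuming the first stage residuals are stacked firm by firm (the reshaping convention is our own assumption):

```python
from scipy.linalg import block_diag

def random_effects_omega(resid, n_firms, T):
    """Estimate Omega = diag(Sigma_1, ..., Sigma_n) with Sigma_i from Eqs. (57)-(60)."""
    u = np.asarray(resid).reshape(n_firms, T)            # residuals stacked firm by firm
    blocks = []
    for i in range(n_firms):
        sig2_U = u[i] @ u[i] / T                                          # Eq. (58)
        # Eq. (59): average of the cross products u_it * u_is over t < s
        cross_sum = (u[i].sum() ** 2 - (u[i] ** 2).sum()) / 2.0
        sig2_alpha = cross_sum / (T * (T - 1) / 2.0)
        sig2_eps = sig2_U - sig2_alpha                                    # Eq. (60)
        blocks.append(sig2_eps * np.eye(T) + sig2_alpha * np.ones((T, T)))  # Sigma_i
    return block_diag(*blocks)                                            # Eq. (57)
```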

With regards to the other comparable estimators, the KRLS and LP estimators are used to estimate Eqs. (55) or (56) ignoring the heteroskedasticity and correlation in the composite error, \(\textbf{U}\). Note that the KRLS estimator uses the error covariance matrix in the variances and standard errors but does not use the error covariance in estimating the regression function. Lastly, the GLS estimator is used as a parametric benchmark to compare to the standard random effects panel data model.Footnote 7

The data contain 90 observations on 6 firms over 15 years, 1970–1984. We split the data into two parts: the first 15 observations, which correspond to the first firm, are used as testing data, and the remaining 75 observations, which correspond to the last five firms, are used as training data to evaluate out of sample performance. Thus, the training data, \(i=1,\ldots ,5\) and \(t=1,\ldots ,15\), contain a total of 75 observations. For the GKRLS and KRLS estimators, all hyperparameters are chosen via LOOCV.Footnote 8

Table 4 Bias corrected average partial derivatives and their standard errors in parentheses are reported for GLS, GKRLS, KRLS, and LP estimators. The columns represent the estimates of the average partial derivative with respect to each regressor

The bias corrected average partial derivatives and corresponding standard errors are reported in Table 4. These averages are calculated by training each estimator on the five firms with 75 observations in the training data set. The estimates are bias corrected and the results from Sect. 5 are used in our calculations. All estimators display positive and significant relationships between cost and each of the regressors, output and price, with their average partial derivatives being positive. The elasticity with respect to output ranges from 0.5885 to 0.8436 and the elasticity with respect to price ranges from 0.2260 to 0.4581. More specifically, for the GKRLS estimator, a 10% increase in output would increase the total cost by an average of 8.13% and a 10% increase in fuel price would increase the total cost by an average of 4.25%, holding all else fixed. Comparing the GKRLS and KRLS methods, the estimates of the average partial derivatives are similar but the standard errors are significantly reduced for GKRLS for both output and fuel price, implying a gain in efficiency. Therefore, using the information and structure of the error covariance in Eq. (57) in estimating the regression function allows GKRLS to provide more efficient estimates of the average partial effects of each independent variable compared to KRLS.

Table 4 shows that the GLS estimator slightly overestimates the elasticity with respect to output and underestimates the elasticity with respect to fuel price compared to those of GKRLS. The LP estimator appears to provide different average partial effect estimates compared to the rest of the estimators. One possible explanation is that the bandwidths may not be optimal, since data-driven bandwidth selection methods (e.g., cross validation) fail when there is correlation in the errors (De Brabanter et al. 2018). Since the data are panel structured, there is correlation across time, making bandwidth selection for LP estimators difficult. The LP estimates are from the local constant estimator; however, the local linear estimator provides similar estimates of the average partial effects. Nevertheless, the LP average partial effects of each variable are positive and significant, which is consistent with the other methods. Furthermore, GKRLS provides similar average partial effects with respect to output and price but is more efficient, in terms of smaller standard errors, relative to the other considered estimators.

Table 5 The MSEs are reported for the GLS, GKRLS, KRLS, and LP estimators. The first column reports the out of sample MSEs calculated by Eq. (51) and the second and third columns report the bootstrapped MSEs for the average partial derivatives calculated by Eq. (52). The GKRLS and KRLS estimates are bias corrected

To assess the estimators in terms of out of sample performance, we calculate the MSEs using the 15 observations in the testing data set. Table 5 reports the MSEs for the four considered estimators. The first column reports the out of sample MSEs using the 15 observations from the first firm. Out of all the considered estimators, the GKRLS estimator outperforms the others in terms of MSE. In other words, the GKRLS estimator can be seen as the superior method for estimating the regression function in this empirical example. The bootstrapped MSEs for the average partial derivatives, calculated by Eq. (52), are reported in the second and third columns of Table 5. For both the average partial derivatives with respect to output and price, GKRLS produces the lowest MSE, outperforming the other estimators. In addition, since GKRLS incorporates the error covariance structure, efficiency is gained and therefore reductions in MSE are made relative to KRLS. Overall, GKRLS is considered the best method in terms of MSE for estimating both the airline cost function and the average partial effects with respect to output and price.

8 Conclusion

Overall, this paper proposes a nonparametric regression function estimator via KRLS under a general parametric error covariance. The two step procedure allows for heteroskedastic and serially correlated errors: in the first step, KRLS is used to estimate the regression function and the parametric error covariance, and in the second step, KRLS is used to estimate the regression function using the information in the error covariance. The method improves efficiency in the regression estimates as well as the partial effects estimates compared to standard KRLS. The conditional bias and variance, pointwise marginal effects, consistency, and asymptotic normality of GKRLS are provided. Simulations show that there are improvements in variance and MSE reduction when considering GKRLS relative to KRLS. An empirical example illustrates the estimation of an airline cost function under a random effects model with heteroskedastic and correlated errors. The average derivatives are evaluated, and the average partial effects of the inputs are determined in the application. In the empirical exercise, GKRLS is more efficient than KRLS and is the preferred method, in terms of MSE, for estimating the airline cost function and its average partial derivatives.