Abstract
A two-step estimator of a nonparametric regression function via kernel regularized least squares (KRLS) with a parametric error covariance is proposed. KRLS ignores any information in the error covariance; we improve upon it by incorporating a parametric error covariance, allowing for both heteroskedasticity and autocorrelation, when estimating the regression function. A two-step procedure is used: in the first step, a parametric error covariance is estimated from the KRLS residuals, and in the second step, a model transformed by the error covariance is estimated by KRLS. Theoretical results, including bias, variance, and asymptotics, are derived. Simulation results show that the proposed estimator outperforms KRLS under both heteroskedastic and autocorrelated errors. An empirical example illustrates the estimation of an airline cost function under a random effects model with heteroskedastic and correlated errors. The derivatives are evaluated, and the average partial effects of the inputs are determined in the application.
1 Introduction
Peter Schmidt has made many seminal contributions advancing statistical inference methods and their applications in time series, cross section, and panel data econometrics in general (Schmidt 1976a) and, in particular, in the areas of dynamic econometric models, estimation and testing of cross-sectional and panel data models, crime and justice models (Schmidt and Witte 1984), and survival models (Schmidt and Witte 1988). His fundamental and innovative contributions on the econometrics of stochastic frontier production/cost models have had a significant impact on generations of econometricians (e.g., Schmidt 1976b, Aigner et al. 1977, Amsler et al. 2017, Amsler et al. 2019). He has also contributed many influential papers developing efficient procedures involving the generalized least squares (GLS) method (see Guilkey and Schmidt 1973, Schmidt 1977, Arabmazar and Schmidt 1981, Ahn and Schmidt 1995), among others. These were for parametric models, whereas here we consider nonparametric models.
Nonparametric regression function estimators are useful econometric tools. Common methods to estimate a regression function are kernel based, such as Kernel Regularized Least Squares (KRLS), Support Vector Machines (SVM), and local polynomial regression. However, in order to avoid overfitting the data, some type of regularization, such as lasso or ridge, is generally used. In this paper, we focus on KRLS; this method is also known as Kernel Ridge Regression (KRR) in the machine learning literature and is the kernelized version of simple ridge regression that allows for nonlinearities in the model.
In this paper, we study fitting a nonparametric regression function via KRLS under a general parametric error covariance. Some theoretical results on KRLS, including pointwise marginal effects, unbiasedness, consistency, and asymptotic normality, are found in Hainmueller and Hazlett (2014). However, Hainmueller and Hazlett (2014) only consider homoskedastic errors, and their estimator is unbiased for the postpenalization function, not for the true underlying function. Confidence interval estimates for the Least Squares Support Vector Machine (LSSVM), allowing for heteroskedastic errors, are discussed in De Brabanter et al. (2011). Although not directly stated, the LSSVM estimator in De Brabanter et al. (2011) is equivalent to KRR/KRLS when an intercept term is included in the model. Following Hainmueller and Hazlett (2014), we will use KRLS without an intercept. Although De Brabanter et al. (2011) allow for heteroskedastic errors, none of the papers mentioned thus far discuss incorporating the error covariance in estimating the regression function itself, making these types of estimators inefficient. In this paper, we focus on making KRLS more efficient by incorporating a parametric error covariance, allowing for both heteroskedasticity and autocorrelation, in estimating the regression function. We use a two step procedure: in the first step, we estimate the parametric error covariance from the residuals obtained by KRLS, and in the second step, we estimate a model by KRLS based on variables transformed using the error covariance. We also provide derivative estimators based on the two step procedure, allowing us to determine the partial effects of the regressors on the dependent variable.
The structure of this paper is as follows: Sect. 2 discusses the model framework and the GKRLS estimator; Sects. 3, 4, and 5 show the finite sample properties, asymptotic properties, and partial effects and derivatives of the GKRLS estimator, respectively; Sect. 6 runs through a simulation example; Sect. 7 illustrates an empirical example for a random effects model with heteroskedastic and correlated errors; and Sect. 8 concludes the paper.
2 Generalized KRLS estimator
Consider the nonparametric regression model:
\(Y_i = m(X_i) + U_i, \quad i=1,\ldots ,n, \qquad (1)\)
where \(X_i\) is a \(q\times 1\) vector of exogenous regressors, and \(U_i\) is the error term such that \(\mathbb {E}[U_i|X_{1},\ldots ,X_{n}] = \mathbb {E}[U_i|\textbf{X}]=0\), where \(\textbf{X}=(X_1,\ldots ,X_n)^\top \) and
\(\mathbb {E}[U_iU_j|\textbf{X}] = \omega _{ij}(\theta )\; \text { for some }\theta \in \mathbb {R}^p,\; i,j=1,\ldots ,n. \qquad (2)\)
In this framework, we allow the error covariance to be parametric, where the errors can be autocorrelated or non-identically distributed across observations.
2.1 KRLS estimator
For KRLS, the function \(m(\cdot )\) can be approximated by some function in the space of functions constituted by
\(m(\textbf{x}_0) = \sum _{i=1}^n {c}_i K_{\sigma }(\textbf{x}_i,\textbf{x}_0) \qquad (3)\)
for some test observation \(\textbf{x}_0\) and where \({c}_i,\; i=1,\ldots ,n\) are the parameters of interest, which can be thought of as the weights of the kernel functions \(K_{\sigma }(\cdot )\). The subscript of the kernel function, \(K_{\sigma }(\cdot )\), indicates that the kernel depends on the bandwidth parameter, \(\sigma \).
We will use the Radial Basis Function (RBF) kernel,
\(K_{\sigma }(\textbf{x}_i,\textbf{x}_j) = \exp \left( -\frac{\Vert \textbf{x}_i-\textbf{x}_j\Vert ^2}{\sigma ^2}\right) . \qquad (4)\)
Notice that the RBF kernel is very similar to the Gaussian kernel, except that it does not have the normalizing term in front, and \(\sigma \) is proportional to the bandwidth h of the Gaussian kernel often used in nonparametric local polynomial regression. This functional form is justified by a regularized least squares problem with a feature mapping function that maps \(\textbf{x}\) into a higher dimension (Hainmueller and Hazlett 2014), where this derivation of KRLS is also known as Kernel Ridge Regression (KRR). Overall, KRLS uses a quadratic loss with a weighted \(L_2\)-regularization. Then, in matrix notation, the minimization problem is
\(\min _{\textbf{c}}\; (\textbf{y}-\textbf{K}_{\sigma }\textbf{c})^\top (\textbf{y}-\textbf{K}_{\sigma }\textbf{c}) + \lambda \textbf{c}^\top \textbf{K}_{\sigma }\textbf{c}, \qquad (5)\)
where \(\textbf{y}\) is the vector of training data corresponding to the dependent variable, \(\textbf{K}_{\sigma }\) is the kernel matrix, with \(K_{\sigma ,i,j} = K_{\sigma }(\textbf{x}_i,\textbf{x}_j)\) for \(i,j=1,\ldots ,n\), and \(\textbf{c}\) is the vector of coefficients that is optimized over. The solution to this minimization problem is
\(\widehat{\textbf{c}}_{1} = (\textbf{K}_{\sigma _1}+\lambda _1\textbf{I}_n)^{-1}\textbf{y}. \qquad (6)\)
The kernel function can be user specified, but in this paper we only consider the RBF kernel in Eq. (4). The kernel function’s hyperparameter \(\sigma \) and the regularization parameter \(\lambda \) can also be user specified or can be found via cross validation. The subscript of one denotes the KRLS estimator, or the first stage estimation. Finally, predictions for KRLS can be made by
\(\widehat{m}_{1}(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{1,i} K_{\sigma _1}(\textbf{x}_i,\textbf{x}_0). \qquad (7)\)
2.2 An efficient KRLS estimator
The KRLS estimator, \(\widehat{m}_{1}(\cdot )\), does not take into consideration any information in the error covariance structure and is therefore inefficient. As a result, consider the \(n\times n\) error covariance matrix, \(\Omega (\theta )\), where \(\omega _{ij}(\theta )\) denotes the (i, j)th element. Assume that \(\Omega (\theta )=P(\theta )P(\theta )'\) for some square matrix \(P(\theta )\), and let \(p_{ij}(\theta )\) and \(v_{ij}(\theta )\) denote the (i, j)th elements of \(P(\theta )\) and \(P(\theta )^{-1}\), respectively. Let \(\textbf{m}\equiv (m(X_1), \ldots , m(X_n))^\prime \) and \(\textbf{U} \equiv (U_1, \ldots , U_n)^\prime \). Now, premultiply the model in Eq. (1) by \(P^{-1}\), where we write \(P^{-1}=P^{-1}(\theta )\) to condense the notation, so that the dependence on \(\theta \) is implied:
\(P^{-1}\textbf{y} = P^{-1}\textbf{m} + P^{-1}\textbf{U}. \qquad (8)\)
The transformed error term, \(P^{-1}\textbf{U}\), has mean \(\varvec{0}\) and covariance matrix equal to the identity matrix. Therefore, we consider a regression of \(P^{-1}\textbf{y}\) on \(P^{-1}\textbf{m}\). This simply re-scales the variables by the inverses of the square roots of their variances. Since \(\textbf{m}=\textbf{K}_{\sigma }\textbf{c}\), the quadratic loss function with \(L_2\) regularization under the transformed variables is
\(\min _{\textbf{c}}\; (P^{-1}\textbf{y}-P^{-1}\textbf{K}_{\sigma _2}\textbf{c})^\top (P^{-1}\textbf{y}-P^{-1}\textbf{K}_{\sigma _2}\textbf{c}) + \lambda _2\textbf{c}^\top \textbf{K}_{\sigma _2}\textbf{c}. \qquad (9)\)
The solution for the vector \(\textbf{c}\) is
\(\widehat{\textbf{c}}_{2} = (\textbf{K}_{\sigma _2}+\lambda _2\Omega )^{-1}\textbf{y}. \qquad (10)\)
Note that the solution obtained depends on the bandwidth parameter \(\sigma _2\) and ridge parameter \(\lambda _2\), which can be different from the hyperparameters used in the KRLS estimator. In practice, cross validation can be used to obtain estimates for both hyperparameters. Here, it is assumed that \(\Omega \) is known if \(\theta \) is known. However, if \(\theta \) is unknown, it can be estimated consistently and \(\Omega \) can be replaced by \(\widehat{\Omega }=\widehat{\Omega }(\hat{\theta })\).Footnote 1
Furthermore, predictions for the generalized KRLS estimator can be made by
\(\widehat{m}_{2}(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{2,i} K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0). \qquad (11)\)
The two step procedure is outlined below:
1. Estimate Eq. (1) by KRLS as in Eq. (7), with bandwidth parameter \(\sigma _1\) and ridge parameter \(\lambda _1\). Obtain the residuals, which can then be used to get a consistent estimate for \(\Omega \).
2. Estimate Eq. (8) by KRLS under the transformed variables as in Eqs. (9) and (11). Denote these estimates as GKRLS.
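The two-step procedure can be sketched end to end. This is a hedged illustration, not the paper's code: the second-step coefficients use the closed form implied by the transformed least squares problem, \(\widehat{\textbf{c}}_2=(\textbf{K}_{\sigma _2}+\lambda _2\widehat{\Omega })^{-1}\textbf{y}\), and a simple heteroskedastic diagonal \(\widehat{\Omega }\) is estimated from first-step residuals; the data-generating process and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

def gkrls_fit(X, y, Omega, sigma2, lam2):
    """Second-step coefficients: c_2 = (K + lam2 * Omega)^{-1} y,
    the solution of the transformed (GLS-weighted) ridge problem."""
    K = rbf_kernel(X, X, sigma2)
    return np.linalg.solve(K + lam2 * Omega, y)

# Illustrative heteroskedastic data.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 2))
sd = np.exp(0.5 * X[:, 0])                       # true conditional std. dev.
y = X[:, 0] + X[:, 1] ** 2 + sd * rng.normal(size=n)

# Step 1: KRLS fit and residuals.
K = rbf_kernel(X, X, 2.0)
c1 = np.linalg.solve(K + 0.1 * np.eye(n), y)
resid = y - K @ c1

# Parametric variance function: regress log squared residuals on X,
# then exponentiate the fitted values to form a diagonal Omega_hat.
Z = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Z, np.log(resid ** 2 + 1e-12), rcond=None)[0]
Omega_hat = np.diag(np.exp(Z @ beta))

# Step 2: GKRLS using the estimated error covariance.
c2 = gkrls_fit(X, y, Omega_hat, 2.0, 0.1)
fitted2 = K @ c2
```

The only change from the first step is that the identity matrix in the ridge solve is replaced by the estimated error covariance.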
2.3 Selection of hyperparameters
Throughout this paper, we focus on the RBF kernel in Eq. (4), which contains the hyperparameter \(\sigma _1\) (and \(\sigma _2\)). Since these parameters are squared in the RBF kernel in Eq. (4), we can instead search for the hyperparameters \(\sigma _1^2\) and \(\sigma _2^2\). The hyperparameters \(\lambda _1,\lambda _2,\sigma _1^2\), and \(\sigma _2^2\) are selected via leave one out cross validation (LOOCV). However, prior to cross validation, it is common in penalized methods to scale the data to have mean 0 and standard deviation 1. This way, the penalty parameters \(\lambda _1\) and \(\lambda _2\) do not depend on the scale of the data or the magnitude of the coefficients. Note that the scaling of the data does not affect the interpretation of predictions and marginal effects, since the estimates can be translated back to their original scale and location.
For the hyperparameters \(\sigma _1^2\) and \(\sigma _2^2\), Hainmueller and Hazlett (2014) suggest setting \(\sigma ^2=q\), the number of regressors. Therefore, in items 1 and 2 of the two step procedure, \(\sigma _1^2=q\) and \(\sigma _2^2=q\). Then, only the penalty hyperparameters \(\lambda _1\) and \(\lambda _2\) need to be chosen. \(\lambda _1\) is chosen via LOOCV in item 1 of the two step procedure using Eq. (5), and \(\lambda _2\) is then chosen via LOOCV in item 2 using Eq. (9). If one wishes to also search for \(\sigma _1^2\) and \(\sigma _2^2\), one would perform LOOCV to find \(\lambda _1\) and \(\sigma _1^2\) simultaneously in item 1 using Eq. (5), and then perform another LOOCV to find \(\lambda _2\) and \(\sigma _2^2\) simultaneously in item 2 using Eq. (9).
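A brute-force LOOCV search for the ridge penalty can be sketched as follows; the grid, sample, and \(\sigma ^2=q=3\) below are illustrative, and a refit-per-observation loop is used for clarity rather than any closed-form shortcut:

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

def loocv_lambda(X, y, sigma2, grid):
    """Leave-one-out CV: for each penalty, refit without observation i
    and score the squared prediction error at i."""
    n = len(y)
    K = rbf_kernel(X, X, sigma2)
    scores = []
    for lam in grid:
        errs = []
        for i in range(n):
            keep = np.arange(n) != i
            c = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
            pred = K[i, keep] @ c
            errs.append((y[i] - pred) ** 2)
        scores.append(np.mean(errs))
    return grid[int(np.argmin(scores))], scores

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))                  # q = 3 regressors, so sigma^2 = 3
y = np.cos(X[:, 0]) + 0.1 * rng.normal(size=40)
grid = np.array([1e-3, 1e-2, 1e-1, 1.0, 10.0])
best_lam, scores = loocv_lambda(X, y, sigma2=3.0, grid=grid)
```

The same loop can be wrapped around a two-dimensional grid when \(\sigma ^2\) is searched jointly with \(\lambda \).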
3 Finite sample properties
In this section, finite sample properties of both KRLS and GKRLS estimators, including the estimation procedures of bias and variance, are discussed in detail.
3.1 Estimation of bias and variance
In this subsection, we estimate the bias and variance of the two step estimator. Following De Brabanter et al. (2011), notice that the GKRLS estimator is a linear smoother.
Definition 1
An estimator \(\widehat{m}\) of m is a linear smoother if, for each \(\textbf{x}_0\in \mathbb {R}^q\), there exists a vector \(L(\textbf{x}_0)=(l_1(\textbf{x}_0),\ldots ,l_n(\textbf{x}_0))^\top \in \mathbb {R}^n\) such that
\(\widehat{m}(\textbf{x}_0) = \sum _{i=1}^n l_i(\textbf{x}_0) Y_i = L(\textbf{x}_0)^\top \textbf{y}, \qquad (12)\)
where \(\widehat{m}(\cdot ):\mathbb {R}^{q}\rightarrow \mathbb {R}\).
For in sample data, Eq. (12) can be written in matrix form as \(\widehat{\textbf{m}}=\textbf{Ly}\), where \(\widehat{\textbf{m}}=(\widehat{m}(X_1),\ldots ,\widehat{m}(X_n))^\top \in \mathbb {R}^n\) and \(\textbf{L} = (l({X_1})^\top ,\ldots ,l({X_n})^\top )^\top \in \mathbb {R}^{n\times n}\), where \(\textbf{L}_{ij}=l_j(X_i)\). The ith row of \(\textbf{L}\) shows the weights given to each \(Y_j\) in estimating \(\widehat{m}(X_i)\). For the rest of the paper, we will denote by \(\widehat{m}_2(\cdot )\) the prediction made by GKRLS for a single observation and by \(\widehat{\textbf{m}}_2\) the \(n\times 1\) vector of predictions made for the training data.
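The linear smoother representation can be checked numerically. The sketch below does this for a first-stage KRLS fit, using the standard KRR smoother matrix \(\textbf{L}=\textbf{K}(\textbf{K}+\lambda \textbf{I})^{-1}\); the data are simulated for illustration:

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

rng = np.random.default_rng(3)
n = 30
X = rng.normal(size=(n, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)

sigma2, lam = 2.0, 0.5
K = rbf_kernel(X, X, sigma2)
c = np.linalg.solve(K + lam * np.eye(n), y)

# Smoother matrix: fitted values are m_hat = L y with L = K (K + lam I)^{-1},
# so each fitted value is a fixed linear combination of the responses.
L = K @ np.linalg.inv(K + lam * np.eye(n))
m_hat = K @ c
```

Each row of `L` gives the weights placed on the observed responses when fitting one training point, exactly the object used in the bias and variance formulas that follow.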
To obtain the bias and variance of the GKRLS estimator, we assume the following:
Assumption 1
The regression function \(m(\cdot )\) to be estimated falls in the space of functions represented by \(m(\textbf{x}_0) = \sum _{i=1}^n {c}_i K_\sigma (\textbf{x}_i,\textbf{x}_0)\) and assume the model in Eq. (1).
Assumption 2
\(\mathbb {E}[U_i| \textbf{X}] = 0\) and \(\mathbb {E}[U_iU_j|\textbf{X}] = \omega _{ij}(\theta ) \text { for some }\theta \in \mathbb {R}^p, i, j = 1,\ldots ,n \)
Using Definition 1, Assumption 1, and Assumption 2, the conditional mean and variance can be obtained by the following theorem.
Theorem 1
The GKRLS estimator in Eq. (11) is
\(\widehat{m}_{2}(\textbf{x}_0) = L(\textbf{x}_0)^\top \textbf{y}, \qquad (13)\)
and \(L(\textbf{x}_0)=(l_1(\textbf{x}_0),\ldots , l_n(\textbf{x}_0))^\top \) is the smoother vector,
\(L(\textbf{x}_0)^\top = K_{\sigma _2,\textbf{x}_0}^{*\top }(\textbf{K}_{\sigma _2}+\lambda _2\Omega )^{-1}, \qquad (14)\)
with \(K_{\sigma _2,\textbf{x}_0}^{*}= (K_{\sigma _2}(\textbf{x}_1,\textbf{x}_0),\ldots ,K_{\sigma _2}(\textbf{x}_n,\textbf{x}_0))^\top \) the kernel vector evaluated at point \(\textbf{x}_0\).
Then, the estimator, under the model in Eq. (1), has conditional mean
\(\mathbb {E}[\widehat{m}_{2}(\textbf{x}_0)|\textbf{X}] = L(\textbf{x}_0)^\top \textbf{m}, \qquad (15)\)
and conditional variance
\({\text {Var}}[\widehat{m}_{2}(\textbf{x}_0)|\textbf{X}] = L(\textbf{x}_0)^\top \Omega L(\textbf{x}_0). \qquad (16)\)
Proof
See Appendix A. \(\square \)
From Theorem 1, the conditional bias can be written as
\({\text {Bias}}[\widehat{m}_{2}(\textbf{x}_0)|\textbf{X}] = L(\textbf{x}_0)^\top \textbf{m} - m(\textbf{x}_0). \qquad (17)\)
Following De Brabanter et al. (2011), we will estimate the conditional bias and variance by the following:
Theorem 2
Let \(L(\textbf{x}_0)\) be the smoother vector evaluated at \(\textbf{x}_0\) and let \(\widehat{\textbf{m}}_2 = (\widehat{m}_2(\textbf{x}_1), \ldots , \widehat{m}_2(\textbf{x}_n))^\top \) be the in sample GKRLS predictions. For a consistent estimator of the covariance matrix such that \(\widehat{\Omega }\rightarrow \Omega \), the estimated conditional bias and variance for GKRLS are obtained by
\(\widehat{{\text {Bias}}}[\widehat{m}_{2}(\textbf{x}_0)|\textbf{X}] = L(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}_{2}(\textbf{x}_0) \qquad (18)\)
and
\(\widehat{{\text {Var}}}[\widehat{m}_{2}(\textbf{x}_0)|\textbf{X}] = L(\textbf{x}_0)^\top \widehat{\Omega } L(\textbf{x}_0). \qquad (19)\)
Proof
See Appendix B. \(\square \)
3.2 Bias and variance of KRLS
First, note that the KRLS estimator is also a linear smoother, so the bias and the variance take the same form as in Eqs. (18) and (19), except that the linear smoother vector \(L(\textbf{x}_0)\) will be different. Let
\(L_1(\textbf{x}_0) = (\textbf{K}_{\sigma _1}+\lambda _1\textbf{I}_n)^{-1}K_{\sigma _1,\textbf{x}_0}^{*} \qquad (20)\)
be the smoother vector for KRLS. Then, Eq. (7) can be rewritten as
\(\widehat{m}_{1}(\textbf{x}_0) = L_1(\textbf{x}_0)^\top \textbf{y}. \qquad (21)\)
Using Theorems 1 and 2 and applying them to the KRLS estimator, the estimated conditional bias and variance of KRLS are
\(\widehat{{\text {Bias}}}[\widehat{m}_{1}(\textbf{x}_0)|\textbf{X}] = L_1(\textbf{x}_0)^\top \widehat{\textbf{m}}_{1} - \widehat{m}_{1}(\textbf{x}_0) \qquad (22)\)
and
\(\widehat{{\text {Var}}}[\widehat{m}_{1}(\textbf{x}_0)|\textbf{X}] = L_1(\textbf{x}_0)^\top \widehat{\Omega } L_1(\textbf{x}_0), \qquad (23)\)
where \(\widehat{\textbf{m}}_{1}\) is the \(n\times 1\) vector of fitted values for KRLS. Note that the estimate of the covariance matrix, \(\Omega \), will be the same for both KRLS and GKRLS.
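The plug-in bias and variance estimates above amount to a few matrix-vector products once the smoother vector is available. A sketch for the first-stage KRLS smoother, with simulated data and an assumed diagonal \(\widehat{\Omega }\):

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

rng = np.random.default_rng(4)
n = 40
X = rng.normal(size=(n, 2))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=n)

sigma2, lam = 2.0, 0.3
K = rbf_kernel(X, X, sigma2)
A_inv = np.linalg.inv(K + lam * np.eye(n))
m_hat = K @ A_inv @ y                       # in-sample fitted values
Omega_hat = np.diag(np.full(n, 0.04))       # assumed (estimated) error covariance

def bias_var_at(x0):
    """Plug-in bias and variance estimates at a test point x0:
    bias_hat = L(x0)' m_hat - m_hat(x0), var_hat = L(x0)' Omega_hat L(x0)."""
    k_star = rbf_kernel(x0[None, :], X, sigma2).ravel()
    l_x0 = A_inv @ k_star                   # smoother vector L(x0)
    pred = l_x0 @ y
    bias_hat = l_x0 @ m_hat - pred
    var_hat = l_x0 @ Omega_hat @ l_x0
    return bias_hat, var_hat

b, v = bias_var_at(X[0])
```

The same function applies to the GKRLS smoother after replacing the identity in the matrix inverse by the estimated error covariance.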
4 Asymptotic properties
The asymptotic properties of GKRLS, including consistency, asymptotic normality, and bias corrected confidence intervals, are covered in this section. To obtain consistency of the GKRLS estimator, we also assume:
Assumption 3
Let \(\lambda _1,\lambda _2,\sigma _1,\sigma _2>0\), and suppose that, as \(n\rightarrow \infty \), the sum of the squared singular values \(d_i\) of \(\textbf{L}P\), \(\sum _{i=1}^n d_i^2\), grows slower than n once \(n>M\) for some \(M<\infty \).
Theorem 3
Under Assumptions 1–3, let the bias corrected fitted values be denoted by
\(\widehat{m}_{c,n}(\textbf{x}_{i}) = \widehat{m}_{2}(\textbf{x}_{i}) - \widehat{{\text {Bias}}}[\widehat{m}_{2}(\textbf{x}_{i})|\textbf{X}]; \qquad (24)\)
then
and the bias corrected GKRLS estimator is \(\sqrt{n}\)-consistent with \(\underset{n\rightarrow \infty }{{\text {plim}}} \; \widehat{m}_{c,n}(\textbf{x}_{i})=m(\textbf{x}_i)\) for all i.
Proof
See Appendix C. \(\square \)
The estimated conditional bias from Eq. (18) and conditional variance from Eq. (19) can be used to construct pointwise confidence intervals. Asymptotic normality of the proposed estimator is given via the central limit theorem.
Theorem 4
Under Assumptions 1–3, \(\widehat{\textbf{m}}_2\) is asymptotically normal by the central limit theorem:
\({\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]^{-1/2}\left( \widehat{\textbf{m}}_2 - \textbf{m} - {\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]\right) \overset{d}{\rightarrow } N(\varvec{0},\textbf{I}_n), \qquad (26)\)
where \({\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}] = \textbf{Lm}-\textbf{m}\) and \({\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}] = \textbf{L}\Omega \textbf{L}^\top \).
Proof
See Appendix D. \(\square \)
Since GKRLS is a biased estimator for m, we need to adjust the pointwise confidence intervals to allow for bias. Since the exact conditional bias and variance are unknown, we can use Eqs. (18) and (19) as estimates and can construct approximate bias corrected \(100(1-\alpha )\%\) pointwise confidence intervals from Theorem 4 as
\(\widehat{m}_{2}(\textbf{x}_i) - \widehat{{\text {Bias}}}[\widehat{m}_{2}(\textbf{x}_i)|\textbf{X}] \pm z_{1-\alpha /2}\sqrt{\widehat{{\text {Var}}}[\widehat{m}_{2}(\textbf{x}_i)|\textbf{X}]} \qquad (27)\)
for all i. Furthermore, to test the significance of the estimated regression function at an observation point, we can use the bias corrected confidence interval to see if 0 is in the interval.
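The bias corrected interval construction can be sketched for the in-sample points of a linear smoother; the data, smoother, and error covariance below are illustrative assumptions, and 1.96 stands in for the normal quantile \(z_{0.975}\):

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

rng = np.random.default_rng(5)
n = 40
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.2 * rng.normal(size=n)

sigma2, lam = 2.0, 0.3
K = rbf_kernel(X, X, sigma2)
L = K @ np.linalg.inv(K + lam * np.eye(n))   # in-sample smoother matrix
m_hat = L @ y
Omega_hat = 0.04 * np.eye(n)                 # assumed error covariance estimate

bias_hat = L @ m_hat - m_hat                 # plug-in bias estimate L m_hat - m_hat
var_hat = np.diag(L @ Omega_hat @ L.T)       # pointwise variance estimates
z = 1.96                                     # approximate 97.5% normal quantile
m_corrected = m_hat - bias_hat               # bias corrected fitted values
lower = m_corrected - z * np.sqrt(var_hat)
upper = m_corrected + z * np.sqrt(var_hat)
```

Checking whether 0 lies inside `[lower[i], upper[i]]` gives the pointwise significance test described above.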
5 Partial effects and derivatives
We also derive an estimator for pointwise partial derivatives with respect to a given variable \(\textbf{x}^{(r)}\). The partial derivative of the GKRLS estimator, \(\widehat{m}_{2}(\textbf{x}_0)\), with respect to the rth variable is
\(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{2,i}\frac{2}{\sigma _2^2}\left( \textbf{x}_i^{(r)}-\textbf{x}_0^{(r)}\right) K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0), \qquad (28)\)
using the RBF kernel in Eq. (4) and where \(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)\equiv \frac{\partial \widehat{m}_{2}(\textbf{x}_0)}{\partial \textbf{x}^{(r)}}\). To find the conditional bias and variance of the derivative estimator, we use the following:
Theorem 5
The GKRLS derivative estimator in Eq. (28) with the RBF kernel in Eq. (4) can be rewritten as
\(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0) = S_r(\textbf{x}_0)^\top \textbf{y}, \qquad (29)\)
where \(\Delta _r \equiv \frac{2}{\sigma _2^2}{\text {diag}} (\textbf{x}_1^{(r)}-\textbf{x}_0^{(r)},\ldots ,\textbf{x}_n^{(r)}-\textbf{x}_0^{(r)})\) is a \(n\times n\) diagonal matrix, and
\(S_r(\textbf{x}_0)^\top = \left( \Delta _r K_{\sigma _2,\textbf{x}_0}^{*}\right) ^\top (\textbf{K}_{\sigma _2}+\lambda _2\Omega )^{-1} \qquad (30)\)
is the smoother vector for the first partial derivative with respect to the rth variable. Then, the conditional mean of the GKRLS derivative estimator is
\(\mathbb {E}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|\textbf{X}] = S_r(\textbf{x}_0)^\top \textbf{m}, \qquad (31)\)
and conditional variance is
\({\text {Var}}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|\textbf{X}] = S_r(\textbf{x}_0)^\top \Omega S_r(\textbf{x}_0). \qquad (32)\)
Proof
See Appendix E. \(\square \)
Using Theorem 5, the conditional bias and variance can be estimated as follows
Theorem 6
Let \(S_r(\textbf{x}_0)\) be the smoother vector for the partial derivative evaluated at \(\textbf{x}_0\) and let \(\widehat{\textbf{m}}_2 = (\widehat{m}_2(\textbf{x}_1), \ldots , \widehat{m}_2(\textbf{x}_n))^\top \) be the in sample GKRLS predictions. For a consistent estimator of the covariance matrix such that \(\widehat{\Omega }\rightarrow \Omega \), the estimated conditional bias and variance for the GKRLS derivative estimator in Eq. (28) are obtained by
\(\widehat{{\text {Bias}}}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|\textbf{X}] = S_r(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}_{2,r}^{(1)}(\textbf{x}_0) \qquad (33)\)
and
\(\widehat{{\text {Var}}}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|\textbf{X}] = S_r(\textbf{x}_0)^\top \widehat{\Omega } S_r(\textbf{x}_0). \qquad (34)\)
Proof
See Appendix F. \(\square \)
The average partial derivative with respect to the rth variable is
\(\widehat{m}_{avg,r}^{(1)} = \frac{1}{n^\prime }\sum _{j=1}^{n^\prime } \widehat{m}_{2,r}^{(1)}(\textbf{x}_{0,j}) = \frac{1}{n^\prime }\varvec{\iota }_{n^\prime }^\top \textbf{S}_{0,r}\textbf{y}. \qquad (35)\)
The bias and variance of the average partial derivative estimator are given by
\({\text {Bias}}[\widehat{m}_{avg,r}^{(1)}|\textbf{X}] = \frac{1}{n^\prime }\varvec{\iota }_{n^\prime }^\top \left( \textbf{S}_{0,r}\textbf{m} - \textbf{m}_{0,r}^{(1)}\right) \qquad (36)\)
and
\({\text {Var}}[\widehat{m}_{avg,r}^{(1)}|\textbf{X}] = \frac{1}{n^{\prime 2}}\varvec{\iota }_{n^\prime }^\top \textbf{S}_{0,r}\Omega \textbf{S}_{0,r}^\top \varvec{\iota }_{n^\prime }, \qquad (37)\)
where \(n^\prime \) is the number of observations in the testing set, \(\varvec{\iota }_{n^\prime }\) is a \(n^\prime \times 1\) vector of ones, \(\textbf{S}_{0,r}\) is the \(n^\prime \times n\) smoother matrix with the jth row as \(S_r(\textbf{x}_{0,j}), j=1,\ldots ,n^\prime \), and \(\textbf{m}_{0,r}^{(1)}\) is the \(n^\prime \times 1\) vector of derivatives evaluated at each \(\textbf{x}_{0,j},j=1,\ldots ,n^\prime \).
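The analytic derivative of the kernel expansion can be sketched and verified against a finite difference; the linear target function and hyperparameters below are illustrative choices:

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

rng = np.random.default_rng(6)
n = 50
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

sigma2, lam = 2.0, 0.1
K = rbf_kernel(X, X, sigma2)
c = np.linalg.solve(K + lam * np.eye(n), y)

def predict(x0):
    return rbf_kernel(x0[None, :], X, sigma2).ravel() @ c

def partial_derivative(x0, r):
    """Pointwise derivative of sum_i c_i exp(-||x_i - x0||^2 / sigma2)
    with respect to x0^(r): (2/sigma2) * (x_i^(r) - x0^(r)) enters each term."""
    k_star = rbf_kernel(x0[None, :], X, sigma2).ravel()
    return np.sum(c * (2.0 / sigma2) * (X[:, r] - x0[r]) * k_star)

x0 = np.zeros(2)
d_analytic = partial_derivative(x0, 0)
h = 1e-6
d_numeric = (predict(x0 + np.array([h, 0.0])) - predict(x0 - np.array([h, 0.0]))) / (2 * h)

# Average partial derivative over a set of evaluation points.
avg_pd = np.mean([partial_derivative(x, 0) for x in X])
```

Since the fitted function is increasing in the first regressor here, the average partial derivative with respect to it comes out positive.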
5.1 First differences for binary independent variables
Unlike in the continuous case, partial effects for binary independent variables should be interpreted as, and estimated by, first differences. That is, the estimated effect of going from \(x^{(b)}=0\) to \(x^{(b)}=1\) can be determined by
\(\widehat{m}_{FD_b}(\textbf{x}_0) = \widehat{m}_{2}(x^{(b)}=1,\textbf{x}_0) - \widehat{m}_{2}(x^{(b)}=0,\textbf{x}_0) = L_{FD_b}(\textbf{x}_0)^\top \textbf{y}, \qquad (38)\)
where \(\widehat{m}_{FD_b}(\cdot )\) is the first difference estimator for the bth binary independent variable, \(x^{(b)}\) is a binary variable that takes the values 0 or 1, \(\textbf{x}_0\) is the \((q-1)\times 1\) vector of the other independent variables evaluated at some test observation, and \(L_{FD_b}(\textbf{x}_0) \equiv L(x^{(b)}=1,\textbf{x}_0)-L(x^{(b)}=0,\textbf{x}_0)\) is the first difference smoother vector. The conditional bias and variance of the first difference GKRLS estimator in Eq. (38) are shown in the following theorem.
Theorem 7
Using Theorems 1 and 2, the conditional bias and variance for the GKRLS first difference estimator in Eq. (38) are obtained by
\({\text {Bias}}[\widehat{m}_{FD_b}(\textbf{x}_0)|\textbf{X}] = L_{FD_b}(\textbf{x}_0)^\top \textbf{m} - m_{FD_b}(\textbf{x}_0) \qquad (39)\)
and
\({\text {Var}}[\widehat{m}_{FD_b}(\textbf{x}_0)|\textbf{X}] = L_{FD_b}(\textbf{x}_0)^\top \Omega L_{FD_b}(\textbf{x}_0), \qquad (40)\)
where \(m_{FD_b}(\textbf{x}_0)={m}(x^{(b)}=1,\textbf{x}_0) - {m}(x^{(b)}=0,\textbf{x}_0)\).
Proof
See Appendix G. \(\square \)
Then, the conditional bias and variance can be estimated as follows:
\(\widehat{{\text {Bias}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|\textbf{X}] = L_{FD_b}(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}_{FD_b}(\textbf{x}_0) \qquad (41)\)
and
\(\widehat{{\text {Var}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|\textbf{X}] = L_{FD_b}(\textbf{x}_0)^\top \widehat{\Omega } L_{FD_b}(\textbf{x}_0). \qquad (42)\)
Note that Eq. (38) provides the pointwise first difference estimates. If one is interested in the average partial effect of going from \(x^{(b)}=0\) to \(x^{(b)}=1\), the following average first difference GKRLS estimator would be used:
\(\widehat{m}_{FD_{avg,b}} = \frac{1}{n^\prime }\sum _{j=1}^{n^\prime } \widehat{m}_{FD_b}(\textbf{x}_{0,j}) = \frac{1}{n^\prime }\varvec{\iota }_{n^\prime }^\top \textbf{L}_{FD_{0,b}}\textbf{y}. \qquad (43)\)
This average partial effect of a discrete variable is similar to the continuous case and can be compared to traditional parametric partial effects, as in the case of least squares coefficients. The conditional bias and variance of the average first difference GKRLS estimator in Eq. (43) are
\({\text {Bias}}[\widehat{m}_{FD_{avg,b}}|\textbf{X}] = \frac{1}{n^\prime }\varvec{\iota }_{n^\prime }^\top \left( \textbf{L}_{FD_{0,b}}\textbf{m} - \textbf{m}_{FD_{0,b}}\right) \)
and
\({\text {Var}}[\widehat{m}_{FD_{avg,b}}|\textbf{X}] = \frac{1}{n^{\prime 2}}\varvec{\iota }_{n^\prime }^\top \textbf{L}_{FD_{0,b}}\Omega \textbf{L}_{FD_{0,b}}^\top \varvec{\iota }_{n^\prime },\)
where \(\textbf{L}_{FD_{0,b}}\) is the \(n^\prime \times n\) smoother matrix with the jth row as \(L_{FD_b}(\textbf{x}_{0,j}), j=1,\ldots ,n^\prime \), and \(\textbf{m}_{FD_{0,b}}\) is the \(n^\prime \times 1\) vector of first differences evaluated at each \(\textbf{x}_{0,j},j=1,\ldots ,n^\prime \). The conditional bias and variance of the average first difference estimator can be estimated using Eqs. (41) and (42).
6 Simulations
We conduct simulations that show the efficiency gains of the proposed generalized KRLS estimator. Consider the data generating process from Eq. (1):
\(Y_i = m(X_i) + U_i, \quad i=1,\ldots ,n.\)
We consider a sample size of \(n=200\) and three independent variables \(X_i\) that are generated from
The specification for m is:
and the partial derivatives with respect to each independent variable are given by
For the error terms, we consider two cases.
and
First, in Eq. (49), \(U_i\) is generated by an AR(1) process. Second, \(U_i\) is heteroskedastic but independent of each other with \({\text {Var}}[U_i|\textbf{X}]={\text {exp}}(X_{i,1}+0.2X_{i,2}-0.3X_{i,3})\).
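The two error designs can be simulated as follows. The heteroskedastic variance function matches the specification above; the AR(1) coefficient \(\rho =0.7\) is an assumed illustrative value, since the paper's coefficient is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200

# Case 1: AR(1) errors, U_t = rho * U_{t-1} + e_t (rho = 0.7 is assumed
# for illustration only).
rho = 0.7
e = rng.normal(size=n)
u_ar = np.empty(n)
u_ar[0] = e[0] / np.sqrt(1 - rho ** 2)      # stationary initialization
for t in range(1, n):
    u_ar[t] = rho * u_ar[t - 1] + e[t]

# Case 2: heteroskedastic, independent errors with
# Var[U_i | X] = exp(X1 + 0.2 X2 - 0.3 X3).
X = rng.normal(size=(n, 3))
var_u = np.exp(X[:, 0] + 0.2 * X[:, 1] - 0.3 * X[:, 2])
u_het = np.sqrt(var_u) * rng.normal(size=n)
```

The AR(1) draws are serially correlated with constant variance, while the heteroskedastic draws are independent with variance driven by the regressors.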
In addition to the proposed estimator, we consider four other nonparametric estimators: the KRLS estimator (KRLS), the Local Polynomial (LP) estimator with degree zero, Random Forest (RF), and Support Vector Machine (SVM). The KRLS estimator is used as a comparison to GKRLS to show the magnitude of the efficiency loss from ignoring the information in the error covariance matrix. In addition, the KRLS, LP, RF, and SVM estimators do not utilize the covariance matrix in estimating the regression function and exclude heteroskedasticity or autocorrelation of the errors. For the GKRLS and KRLS estimators, we set \(\sigma _1^2=\sigma _2^2=3\), the number of independent variables in this example, and implement leave one out cross validation to select the hyperparameters \(\lambda _1\) and \(\lambda _2\).Footnote 2 The variance function under the heteroskedastic case is estimated by least squares from the regression of the log squared residuals on X; taking the exponential of the fitted values gives the predicted variance estimates. Under the case of AR(1) errors, the covariance function is estimated from an AR(1) model. We run 200 simulations for each of the two cases, and the bias corrected results are reported below in Table 1.Footnote 3 To evaluate the estimators, mean squared error is used as the main criterion, where we also investigate the bias and variance. To compare results, all estimators are evaluated at 300 data points generated from Eqs. (46) and (47).
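The first-step covariance estimation described above can be sketched for both cases; the residual series here are simulated stand-ins for KRLS residuals, and the Toeplitz AR(1) covariance construction is a standard choice:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
X = rng.normal(size=(n, 3))

# Heteroskedastic case: regress log squared residuals on X by least squares,
# then exponentiate the fitted values to get predicted variances.
true_var = np.exp(X[:, 0] + 0.2 * X[:, 1] - 0.3 * X[:, 2])
resid_het = np.sqrt(true_var) * rng.normal(size=n)     # stand-in residuals
Z = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Z, np.log(resid_het ** 2 + 1e-12), rcond=None)[0]
Omega_het = np.diag(np.exp(Z @ beta))                  # diagonal covariance estimate

# AR(1) case: estimate rho from lagged residuals, build the Toeplitz covariance.
rho = 0.6                                              # assumed true value
e = rng.normal(size=n)
resid_ar = np.empty(n)
resid_ar[0] = e[0]
for t in range(1, n):
    resid_ar[t] = rho * resid_ar[t - 1] + e[t]
rho_hat = (resid_ar[:-1] @ resid_ar[1:]) / (resid_ar[:-1] @ resid_ar[:-1])
s2_hat = np.var(resid_ar)
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Omega_ar1 = s2_hat * rho_hat ** lags
```

Either estimate of \(\Omega \) can then be plugged into the second-step GKRLS solve.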
Table 1 displays the evaluations, including bias, variance, and MSE of the estimators for the regression function under both error cases. Note that the GKRLS and KRLS estimates in Table 1 are bias corrected. All estimates are averaged across all simulations. Estimates based on GKRLS exhibit finite sample bias similar to that of KRLS, and there is a clear reduction in variability, with the proposed estimator having smaller variance than KRLS. Note that GKRLS estimation provides a 31.6% and a 3.6% decrease in the variance for estimating the regression function under the autocorrelated and heteroskedastic errors, respectively, relative to KRLS. With smaller variance, GKRLS also has a smaller MSE, making GKRLS superior to KRLS. Compared to the other nonparametric estimators, LP, RF, and SVM, the GKRLS estimator outperforms the others in terms of MSE and is the preferred method in the presence of heteroskedasticity or autocorrelation.
Table 2 displays the evaluations, including bias, variance, and MSE of the bias corrected GKRLS and KRLS estimators for the partial derivatives of the regression function with respect to each of the independent variables under both error cases.Footnote 4 Since \(X_1\) is discrete, the partial derivative is estimated by the first differences discussed in Sect. 5.1. Similar to the regression estimates, for both heteroskedastic and AR(1) errors, the variability from estimating the derivative is reduced by GKRLS estimation relative to KRLS estimation. In addition, the efficiency gain in estimating both the regression and the derivative seems to be more evident in the AR(1) case compared to the heteroskedastic case. A possible explanation for this is that the covariance matrix contains more information in the off-diagonal elements compared to the diagonal covariance matrix in the heteroskedastic case. Overall, when estimating the regression function and its derivative for this simulation example, the reduction in variance and therefore MSE is clearly evident in Tables 1 and 2, making the GKRLS the preferred estimator.
Table 3 shows the simulation results for the consistency of GKRLS. The bias, variance, and MSE are reported for sample sizes of \(n=100,200,400\). In this example, we set \(\sigma _1^2=\sigma _2^2=3\), and the hyperparameters \(\lambda _1\) and \(\lambda _2\) are found by LOOCV. For the regression function and the derivative, and for both error covariance structures, the squared bias, variance, and MSE all decrease as the sample size increases, which implies that the GKRLS estimator is consistent in this simulation exercise.
7 Application
We implement an empirical application from the U.S. airline industry with heteroskedastic and autocorrelated errors using a panel of 6 firms over 15 years.Footnote 5 We set aside a portion of the data for training and the rest for testing. We estimate the model with four methods, GKRLS, KRLS, LP, and Generalized Least Squares (GLS), and compare their results in terms of mean squared error (MSE). To evaluate the out of sample performance of each method, the predicted out of sample MSEs are computed as follows
where \(MSE_e\) is the mean squared error for the \(e^{th}\) estimator, \(n^\prime \) is the number of firms in the testing data set, and \(j=1,\ldots ,n^\prime \). In this empirical exercise, \(n^\prime =1\) and \(T=15\), since we leave out the first firm as a test set. To assess the estimated average derivatives, we use the bootstrap to calculate the MSEs for the average partial effects. We report the bootstrapped MSEs for the average derivative by the following.Footnote 6
where B is the number of bootstraps with \(b=1,\ldots ,B\), \(\widehat{m}_{avg,e,r,b}^{(1)}(\cdot )\) is the \(b^{th}\) bootstrapped average partial first derivative with respect to the \(r^{th}\) variable for the \(e^{th}\) estimator, and \(\frac{1}{4}\sum _e\widehat{m}^{(1)}_{avg,e,r}\) is the simple average of the average partial first derivatives with respect to the \(r^{th}\) variable from the four estimators (GLS, GKRLS, KRLS, and LP):
7.1 U.S. airline industry
We obtain the data on the efficiency in production of airline services from Greene (2018). Since the data are a panel of 6 firms for 15 years, we consider the one way random effects model:
\(Y_{it} = m(X_{it}) + \alpha _i + \varepsilon _{it}, \quad i=1,\ldots ,n,\; t=1,\ldots ,T, \qquad (54)\)
where the dependent variable \(Y_{it} = \log C_{it}\) is the logarithm of total cost, the independent variables \(X_{it} = (\log Q_{it}, \log P_{it})^{\top }\) are the logarithms of output and the price of fuel, respectively, \(\alpha _i\) is the firm specific effect, and \(\varepsilon _{it}\) is the idiosyncratic error term. In this empirical setting, we assume \(\mathbb {E}[\varepsilon _{it}|\textbf{X}]=0,\; \mathbb {E}[\varepsilon _{it}^2|\textbf{X}]=\sigma ^2_ {\varepsilon _{i}},\; \mathbb {E}[\alpha _i|\textbf{X}]=0,\; \mathbb {E}[\alpha _i^2|\textbf{X}]=\sigma ^2_{\alpha _i},\; \mathbb {E}[\varepsilon _{it}\alpha _j|\textbf{X}]=0\) for all i, t, j, \(\mathbb {E}[\varepsilon _{it}\varepsilon _{js}|\textbf{X}]=0\) if \(t\ne s\) or \(i\ne j\), and \(\mathbb {E}[\alpha _i\alpha _j|\textbf{X}]=0\) if \(i\ne j\). Consider the composite error term \(U_{it}\equiv \alpha _i+\varepsilon _{it}\). Then, the model in Eq. (54) can be rewritten as
\(Y_{it} = m(X_{it}) + U_{it}. \qquad (55)\)
In Eq. (55), the independent variables are strictly exogenous to the composite error term, \(\mathbb {E}[U_{it}|\textbf{X}]=0\). The variance of the composite error term is \(\mathbb {E}[U_{it}^2|\textbf{X}]=\sigma ^2_{\alpha _i}+\sigma ^2_{\varepsilon _{i}}\). Therefore, in this empirical example, we allow for firm specific heteroskedasticity. In other words, the variances of the error terms are not constant across firms, but they are constant over time for each firm. Since there is a time component, we allow an individual firm to be correlated across time but not with other firms, that is, \(\mathbb {E}[U_{it}U_{is}|\textbf{X}]=\sigma ^2_{\alpha _i}, \; t\ne s\) and \(\mathbb {E}[U_{it}U_{js}|\textbf{X}]=0\) for all t and s if \(i\ne j\). Note that the correlation across time can be different for every firm. Therefore, in this empirical framework, we allow the error terms to be heteroskedastic across firms and correlated across time.
To estimate Eq. (55) by GKRLS and KRLS in the framework set up in this paper, we can write the model in matrix notation. Consider
\(\textbf{y} = \textbf{m} + \textbf{U}, \qquad (56)\)
where \(\textbf{y}\) is the \(nT\times 1\) vector of \(\log C_{it}\), \(\textbf{m}\) is the \(nT\times 1\) vector of the regression function \(m(X_{it})\), and \(\textbf{U}\) is the \(nT\times 1\) vector of \(U_{it}\), \(i=1,\ldots ,n\) and \(t=1,\ldots ,T\). Then, the \(nT\times nT\) error covariance matrix \(\Omega \) is
\(\Omega = {\text {diag}}(\Sigma _1,\ldots ,\Sigma _n), \qquad (57)\)
where \(\Sigma _i=\sigma ^2_{\varepsilon _{i}}\textbf{I}_T +\sigma ^2_{\alpha _i} \varvec{\iota }_T\varvec{\iota }^\top _T, i=1,\ldots ,n\) has dimension \(T\times T\), \(\textbf{I}_T\) is a \(T\times T\) identity matrix and \(\varvec{\iota }_T\) is a \(T\times 1\) vector of ones. To use the GKRLS estimator in this empirical framework, we first estimate Eqs. (55) or (56) by KRLS and obtain the residuals, denoted by \(\widehat{u}_{it}\). To estimate the error covariance matrix \(\Omega \), the variances of the firm specific error and the idiosyncratic error, \(\sigma ^2_{\alpha _i}\) and \(\sigma ^2_{\varepsilon _{i}}\) need to be estimated. Consider the following consistent estimators using time averages,
where \(\widehat{\textbf{u}}_i\) is the \(T\times 1\) vector of residuals for the ith firm. Now, plugging these estimates in for \(\Omega \), the GKRLS estimator can be estimated as in the previous sections. For further details, please see Appendix H.
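One standard construction of the variance-component estimates and the block-diagonal \(\widehat{\Omega }\) is sketched below. The estimators based on within-firm products of residuals are a common choice and are labeled as an illustration, not necessarily the paper's exact formulas; the residuals are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(10)
n_firms, T = 5, 15
sigma_alpha = np.array([0.5, 0.8, 0.3, 0.6, 0.4])    # firm-effect std. dev. (illustrative)
sigma_eps = np.array([0.2, 0.3, 0.25, 0.2, 0.35])    # idiosyncratic std. dev. (illustrative)

# Stand-in for first-stage KRLS residuals u_hat[i, t] = alpha_i + eps_it.
alpha = sigma_alpha * rng.normal(size=n_firms)
u_hat = alpha[:, None] + sigma_eps[:, None] * rng.normal(size=(n_firms, T))

blocks = []
for i in range(n_firms):
    u_i = u_hat[i]
    # Average of off-diagonal products estimates sigma_alpha_i^2; the
    # time-average of squares estimates sigma_alpha_i^2 + sigma_eps_i^2.
    s2_total = np.mean(u_i ** 2)
    s2_alpha = (np.sum(np.outer(u_i, u_i)) - np.sum(u_i ** 2)) / (T * (T - 1))
    s2_eps = max(s2_total - max(s2_alpha, 0.0), 1e-8)
    Sigma_i = s2_eps * np.eye(T) + max(s2_alpha, 0.0) * np.ones((T, T))
    blocks.append(Sigma_i)

# Block-diagonal covariance for the stacked nT x 1 error vector.
Omega_hat = np.zeros((n_firms * T, n_firms * T))
for i, S in enumerate(blocks):
    Omega_hat[i * T:(i + 1) * T, i * T:(i + 1) * T] = S
```

Each block has the equicorrelated form \(\sigma ^2_{\varepsilon _i}\textbf{I}_T + \sigma ^2_{\alpha _i}\varvec{\iota }_T\varvec{\iota }_T^\top \) from Eq. (57), and errors across firms remain uncorrelated.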
With regards to the other comparable estimators, the KRLS and LP estimators are used to estimate Eqs. (55) or (56) ignoring the heteroskedasticity and correlation in the composite error, \(\textbf{U}\). Note that the KRLS estimator uses the error covariance matrix in the variances and standard errors but does not use the error covariance in estimating the regression function. Lastly, the GLS estimator is used as a parametric benchmark to compare to the standard random effects panel data model.Footnote 7
The data contain 90 observations of 6 firms for 15 years, from 1970–1984. We split the data into two parts: the first 15 observations, which correspond to the first firm, are used as testing data, and the remaining 75 observations, which correspond to the last five firms, are used as training data to evaluate out of sample performance. Thus, the training data, \(i=1,\ldots ,5\) and \(t=1,\ldots ,15\), have a total of 75 observations. For the GKRLS and KRLS estimators, all hyperparameters are chosen via LOOCV.Footnote 8
The bias corrected average partial derivatives and corresponding standard errors are reported in Table 4. These averages are calculated by training each estimator on the five firms with 75 observations in the training data set. The estimates are bias corrected, and the results from Sect. 5 are used in our calculations. All estimators display positive and significant relationships between cost and each of the regressors, output and price, with their average partial derivatives being positive. The elasticity with respect to output ranges from 0.5885 to 0.8436, and with respect to price it ranges from 0.2260 to 0.4581. More specifically, for the GKRLS estimator, a 10% increase in output would increase the total cost by an average of 8.13% and a 10% increase in fuel price would increase the total cost by an average of 4.25%, holding all else fixed. Comparing the GKRLS and KRLS methods, the estimates of the average partial derivatives are similar, but the standard errors are significantly reduced for GKRLS for both output and fuel price, implying a gain in efficiency. Therefore, using the information and the structure of the error covariance in Eq. (57) in estimating the regression function allows GKRLS to provide more precise estimates of the average partial effects of each independent variable compared to KRLS.
Table 4 shows that the GLS estimator slightly overestimates the elasticity with respect to output and underestimates the elasticity with respect to fuel price compared to those of GKRLS. The LP estimator provides noticeably different average partial effect estimates compared to the rest of the estimators. One possible explanation is that the bandwidths may not be optimal, since data-driven bandwidth selection methods (e.g., cross validation) fail when there is correlation in the errors (De Brabanter et al. 2018). Since the data have a panel structure, there is correlation across time, making bandwidth selection for LP estimators difficult. The LP estimates are from the local constant estimator; however, the local linear estimator provides similar estimates of the average partial effects. Nevertheless, the LP average partial effects of each variable are positive and significant, consistent with the other methods. Furthermore, GKRLS provides similar average partial effects with respect to output and price but is more efficient, with smaller standard errors, relative to the other considered estimators.
To assess the estimators in terms of out of sample performance, we calculate the MSEs using the 15 observations in the testing data set. Table 5 reports the MSEs for the four considered estimators. The first column reports the out of sample MSEs using the 15 observations from the first firm. Out of all the considered estimators, the GKRLS estimator outperforms the others in terms of MSE and can therefore be seen as the superior method for estimating the regression function in this empirical example. The bootstrapped MSEs for the average partial derivatives, calculated by Eq. (52), are reported in the second and third columns of Table 5. For the average partial derivatives with respect to both output and price, GKRLS produces the lowest MSE, outperforming the other estimators. In addition, since GKRLS incorporates the error covariance structure, efficiency is gained and MSEs are reduced relative to KRLS. Overall, GKRLS is the best method in terms of MSE for estimating both the airline cost function and the average partial effects with respect to output and price.
8 Conclusion
Overall, this paper proposes a nonparametric regression function estimator via KRLS under a general parametric error covariance. The two-step procedure allows for heteroskedastic and serially correlated errors: in the first step, KRLS is used to estimate the regression function and the parametric error covariance, and in the second step, KRLS is used to estimate the regression function using the information in the error covariance. The method improves efficiency in the regression estimates as well as the partial effects estimates compared to standard KRLS. The conditional bias and variance, pointwise marginal effects, consistency, and asymptotic normality of GKRLS are provided. Simulations show improvements in variance and MSE when considering GKRLS relative to KRLS. The method is illustrated with an empirical example estimating an airline cost function under a random effects model with heteroskedastic and correlated errors, in which the derivatives are evaluated and the average partial effects of the inputs are determined. In the empirical exercise, GKRLS is more efficient than KRLS and is the most preferred method, in terms of MSE, for estimating the airline cost function and its average partial derivatives.
Notes
\(\widehat{\Omega }\) can be thought of as a working covariance matrix since the parametric functional form may be subject to misspecification. One method to avoid misspecification is to estimate \(\Omega \) nonparametrically. For example, under heteroskedasticity, one can estimate \(\Omega \) by a semiparametric KRLS estimator of the conditional variance (Dang and Ullah 2022). Other solutions may be explored as future work.
The hyperparameters of the LP, RF, and SVM estimators are chosen by their default methods in their respective R packages.
The derivatives are not reported for LP, RF, and SVM since derivative estimation for RF and SVM methods is uncommon. The derivative estimates for LP can be obtained, but in this simulation the GKRLS estimator is superior with respect to MSE.
The data for the application is from Greene (2018) and can be downloaded at https://pages.stern.nyu.edu/~wgreene/Text/Edition7/tablelist8new.htm.
The R package by Callaway (2022) was used to obtain the bootstrap samples.
The R package by Croissant and Millo (2008) was used to obtain the Random Effects GLS estimator.
For the LP estimator, cross validation is used to select the hyperparameters. The local constant estimator is used, although one can use the local linear estimator, which gives similar results to that of the local constant.
We follow the proof similar to the case of dependent identically distributed observations provided by White (2001).
References
Ahn SC, Schmidt P (1995) A separability result for GMM estimation, with applications to GLS prediction and conditional moment tests. Econ Rev 14(1):19–34. https://doi.org/10.1080/07474939508800301
Aigner D, Lovell C, Schmidt P (1977) Formulation and estimation of stochastic frontier production function models. J Econ 6(1):21–37
Amsler C, Prokhorov A, Schmidt P (2017) Endogenous environmental variables in stochastic frontier models. J Econ 199(2):131–140. https://doi.org/10.1016/j.jeconom.2017.05.005
Amsler C, Schmidt P, Tsay W-J (2019) Evaluating the cdf of the distribution of the stochastic frontier composed error. J Prod Anal 52(1–3):29–35. https://doi.org/10.1007/s11123-019-00554-9
Arabmazar A, Schmidt P (1981) Further evidence on the robustness of the tobit estimator to heteroskedasticity. J Econ 17(2):253–258. https://doi.org/10.1016/0304-4076(81)90029-4
Boos DD, Nychka D (2022) Rlab: functions and datasets required for ST370 class, R package version 4.0
Borchers HW (2021) pracma: Practical Numerical Math Functions, R package version 2.3.3
Callaway B (2022) BMisc: miscellaneous functions for panel data, quantiles, and printing results. R package version 1.4.5
Croissant Y, Millo G (2008) Panel data econometrics in R: the plm package. J Stat Softw 27(2):1–43. https://doi.org/10.18637/jss.v027.i02
Dang J, Ullah A (2022) Machine-learning-based semiparametric time series conditional variance: Estimation and forecasting. J Risk Financ Manag. https://doi.org/10.3390/jrfm15010038
De Brabanter K, Cao F, Gijbels I, Opsomer J (2018) Local polynomial regression with correlated errors in random design and unknown correlation structure. Biometrika 105(3):681–690. https://doi.org/10.1093/biomet/asy025
De Brabanter K, De Brabanter J, Suykens JAK, De Moor B (2011) Approximate confidence and prediction intervals for least squares support vector regression. IEEE Trans Neural Netw 22(1):110–120. https://doi.org/10.1109/TNN.2010.2087769
Greene W (2018) Econometric analysis. Pearson, ISBN 9780134461366
Guilkey DK, Schmidt P (1973) Estimation of seemingly unrelated regressions with vector autoregressive errors. J Am Stat Assoc 68(343):642–647
Hainmueller J, Hazlett C (2014) Kernel regularized least squares: Reducing misspecification bias with a flexible and interpretable machine learning approach. Polit Anal 22(2):143–168
Hayfield T, Racine JS (2008) Nonparametric econometrics: the np package. J Stat Softw 27(5):1–32. https://doi.org/10.18637/jss.v027.i05
Hyndman RJ, Khandakar Y (2008) Automatic time series forecasting: the forecast package for R. J Stat Softw 26(3):1–22
Liaw A, Wiener M (2002) Classification and regression by randomforest. R News 2(3):18–22
McLeod AI, Yu H, Krougly Z (2007) Algorithms for linear time series analysis: with R package. J Stat Softw 23(5):1
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2022) e1071: misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.7-12
Schmidt P (1976a) Econometrics. Marcel Dekker Inc, New York
Schmidt P (1976b) On the statistical estimation of parametric frontier production functions. Rev Econ Stat 58(2):238–239
Schmidt P (1977) Estimation of seemingly unrelated regressions with unequal numbers of observations. J Econ 5(3):365–377. https://doi.org/10.1016/0304-4076(77)90045-8
Schmidt P, Witte AD (1984) An economic analysis of crime and justice. Academic Press, New York
Schmidt P, Witte AD (1988) Predicting recidivism using survival models. Springer-Verlag, New York
White H (2001) Asymptotic theory for econometricians. Economic theory, econometrics, and mathematical economics. Emerald Group Publishing Limited. ISBN 9780127466521
Funding
Open access funding provided by SCELC, Statewide California Electronic Library Consortium. This research received no funding.
Ethics declarations
Conflicts of interest
Justin Dang declares that he has no conflict of interest. Aman Ullah declares that he has no conflict of interest.
Human or animal rights
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The authors are thankful to the two referees for their helpful comments and suggestions on the paper.
Appendices
A Proof of Theorem 1
First, we note that the GKRLS estimator is a linear smoother; substituting Eq. (10) into Eq. (11) gives \(\widehat{m}_2(\textbf{x}_0)=L(\textbf{x}_0)^\top \textbf{y},\)
where \(L(\textbf{x}_0)=\left[ K_{\sigma _2,\textbf{x}_0}^{*\top } (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top \) and \(K_{\sigma _2,\textbf{x}_0}^{*}= (K_{\sigma _2}(\textbf{x}_1,\textbf{x}_0),\ldots ,K_{\sigma _2}(\textbf{x}_n,\textbf{x}_0))^\top \) is the kernel vector evaluated at the point \(\textbf{x}_0\).
Then, the conditional mean and variance of GKRLS can be derived as follows: \(\mathbb {E}[\widehat{m}_2(\textbf{x}_0)|\textbf{X}]=L(\textbf{x}_0)^\top \textbf{m}\) and \({\text {Var}}[\widehat{m}_2(\textbf{x}_0)|\textbf{X}]=L(\textbf{x}_0)^\top \Omega L(\textbf{x}_0).\)
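To make the linear-smoother form concrete, the following minimal NumPy sketch (our illustration, assuming the Gaussian kernel \(K_{\sigma _2}(\textbf{x}_i,\textbf{x}_j)=\exp (-\Vert \textbf{x}_i-\textbf{x}_j\Vert ^2/\sigma _2^2)\)) computes \(L(\textbf{x}_0)\), the fitted value \(L(\textbf{x}_0)^\top \textbf{y}\), and the conditional variance \(L(\textbf{x}_0)^\top \Omega L(\textbf{x}_0)\):

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / sigma2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def gkrls_smoother(X, x0, Omega, sigma2, lam):
    """Smoother vector L(x0), i.e. L(x0)^T = K*^T (Omega^{-1} K + lam I)^{-1} Omega^{-1}."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma2)
    k_star = gaussian_kernel(X, x0[None, :], sigma2).ravel()
    Oinv = np.linalg.inv(Omega)
    M = np.linalg.solve(Oinv @ K + lam * np.eye(n), Oinv)  # (Omega^{-1}K + lam I)^{-1} Omega^{-1}
    return M.T @ k_star

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=50)
Omega = 0.01 * np.eye(50)                  # homoskedastic special case
L0 = gkrls_smoother(X, X[0], Omega, sigma2=1.0, lam=1e-3)
m_hat = L0 @ y                             # fitted value L(x0)^T y
var_hat = L0 @ Omega @ L0                  # conditional variance L(x0)^T Omega L(x0)
```

When \(\Omega =\sigma ^2\textbf{I}\), the smoother collapses to \(K_{\sigma _2,\textbf{x}_0}^{*\top }(\textbf{K}_{\sigma _2}+\lambda _2\sigma ^2\textbf{I})^{-1}\), so the sketch nests the standard KRLS case.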
B Proof of Theorem 2
The exact bias for GKRLS for the training data is given by \({\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]=\mathbb {E}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m}=\textbf{L}\textbf{m}-\textbf{m},\) and observe that the residuals are obtained by \(\widehat{\textbf{u}}=\textbf{y}-\widehat{\textbf{m}}_2=(\textbf{I}-\textbf{L})\textbf{y}.\) The expectation of the residuals is then \(\mathbb {E}[\widehat{\textbf{u}}|\textbf{X}]=(\textbf{I}-\textbf{L})\textbf{m}=-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}].\)
De Brabanter et al. (2011) suggest estimating the conditional bias by smoothing the negative residuals. Therefore, the conditional bias can be estimated at any point \(\textbf{x}_0\) by \(\widehat{{\text {Bias}}}[\widehat{m}_2(\textbf{x}_0)|\textbf{X}]=-L(\textbf{x}_0)^\top \widehat{\textbf{u}}.\)
For the conditional variance, we assume that the error covariance matrix \(\Omega =\Omega (\theta )\) can be consistently estimated by \(\widehat{\Omega }=\widehat{\Omega }(\widehat{\theta })\). Then, using a consistent estimator of the error covariance matrix, the conditional variance of GKRLS can be estimated by \(\widehat{{\text {Var}}}[\widehat{m}_2(\textbf{x}_0)|\textbf{X}]=L(\textbf{x}_0)^\top \widehat{\Omega }L(\textbf{x}_0).\)
C Proof of Theorem 3
Since the bias corrected fitted values, \(\widehat{\textbf{m}}_c\), have zero conditional bias, we can focus on the conditional variance. From Theorem 1, the conditional variance of the GKRLS estimator is \({\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]=\textbf{L}\Omega \textbf{L}^\top =\textbf{L}PP^\top \textbf{L}^\top =\textbf{A}\textbf{A}^\top ,\) where \(\textbf{A}\equiv \textbf{L}P\) and \(P\) satisfies \(\Omega =PP^\top \). Consider the singular value decomposition \(\textbf{A}=\textbf{U}\textbf{D}\textbf{V}^\top \), where \(\textbf{D}\), \(\textbf{U}\), \(\textbf{V}\) are the singular values, left singular vectors, and right singular vectors respectively.
where \(d_i\), \(i=1,\ldots ,n\), denotes the ith diagonal element of \(\textbf{D}\), i.e. the ith singular value of \(\textbf{L}P\). To examine the sum of the variances of \(\widehat{\textbf{m}}_2\), the trace of the variance matrix is evaluated: \({\text {tr}}\left( {\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]\right) ={\text {tr}}(\textbf{A}\textbf{A}^\top )={\text {tr}}(\textbf{D}^2)=\sum _{i=1}^n d_i^2.\)
For large enough n, \({\text {tr}}(\textbf{D}^2)\) slows in growth and converges to some constant M, and the average variance of \(\widehat{m}(\textbf{x}_i)\) is \(\frac{1}{n}\sum _{i=1}^n d_i^2\). Recall that \(d_i^2\) denotes the ith squared singular value of \(\textbf{L}P\) and is proportional to the variance explained by a given singular vector of \(\textbf{L}P\). Given the construction of \(\textbf{L}P\), the columns of this product matrix can be thought of as weights on the data, scaled by the standard deviation of the error term. Therefore, the number of large singular values grows with n initially, but the number of important dimensions, or singular values, eventually grows slowly with n. As a result, the average variance of \(\widehat{m}(\textbf{x}_i)\), which is \(\frac{1}{n}\sum _{i=1}^n d_i^2\), shrinks to zero as \(n\rightarrow \infty \). Since the average variance shrinks to zero and variances are nonnegative, each individual variance must also approach zero as n becomes large.
We also provide an alternative proof of consistency. Consider the GKRLS coefficient estimator of \(\textbf{c}\) in Eq. (10), \(\widehat{\textbf{c}}_2=(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\textbf{y}.\)
Again, since we consider the bias corrected estimator, \(\widehat{\textbf{m}}_{2,c}\), we can focus on the conditional variance. However, below we also show that the non-bias corrected estimator has zero conditional bias in the limit. Taking the conditional bias of \(\widehat{\textbf{c}}_2\), with \(\textbf{m}=\textbf{K}_{\sigma _2}\textbf{c}\), \({\text {Bias}}[\widehat{\textbf{c}}_2|\textbf{X}]=\mathbb {E}[\widehat{\textbf{c}}_2|\textbf{X}]-\textbf{c}=(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\textbf{K}_{\sigma _2}\textbf{c}-\textbf{c}=-\lambda _2(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\textbf{c},\)
where the strict exogeneity assumption \(\mathbb {E}[\textbf{u}|\textbf{X}]=\varvec{0}\) is used. Furthermore, if we assume \(\lambda _2\) is fixed or does not grow as fast as n and \(\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}\right) \rightarrow \textbf{Q}\), a positive definite matrix with finite elements, when \(n\rightarrow \infty \), then \({\text {Bias}}[\widehat{\textbf{c}}_2|\textbf{X}]\rightarrow \varvec{0}\) as \(n\rightarrow \infty \).
Taking the conditional variance of \(\widehat{\textbf{c}}_2\), \({\text {Var}}[\widehat{\textbf{c}}_2|\textbf{X}]=(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\Omega \,\Omega ^{-1}(\textbf{K}_{\sigma _2}\Omega ^{-1}+\lambda _2\textbf{I})^{-1}=(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}(\textbf{K}_{\sigma _2}\Omega ^{-1}+\lambda _2\textbf{I})^{-1}.\)
Again, we assume that \(\lambda _2\) is fixed or does not grow as fast as n and \(\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}\right) \rightarrow \textbf{Q}\), a positive definite matrix with finite elements. Furthermore, if we assume that \(\left( \frac{1}{n}\Omega ^{-1}\right) \rightarrow \textbf{Q}_\Omega \), a matrix with finite elements when \(n\rightarrow \infty \), then \({\text {Var}}[\widehat{\textbf{c}}_2|\textbf{X}]\rightarrow \varvec{0}\) as \(n\rightarrow \infty \). Therefore, \(\underset{n\rightarrow \infty }{{\text {plim}}} \; \widehat{\textbf{c}}_2 = \textbf{c}\).
Now, consider the GKRLS estimator \(\widehat{\textbf{m}}_2 = \textbf{K}_{\sigma _2} \widehat{\textbf{c}}_2\). Then, \(\underset{n\rightarrow \infty }{{\text {plim}}}\; \widehat{\textbf{m}}_2=\textbf{K}_{\sigma _2}\,\underset{n\rightarrow \infty }{{\text {plim}}}\; \widehat{\textbf{c}}_2=\textbf{K}_{\sigma _2}\textbf{c}=\textbf{m},\)
proving consistency of \(\widehat{\textbf{m}}_2\). Note that since the variance is O(1/n), \(\widehat{\textbf{m}}_2\) is \(\sqrt{n}\)-consistent.
D Proof of Theorem 4
Consider the difference between the bias corrected fitted values and the true values, \(\widehat{\textbf{m}}_2 - {\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m}\), where \({\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]=\textbf{Lm}-\textbf{m}\), so that \(\widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m}=\textbf{L}\textbf{y}-\textbf{L}\textbf{m}=\textbf{L}\textbf{u}.\)
Note that \({\text {E}}[\textbf{Lu}|\textbf{X}]=\varvec{0}\) and \({\text {Var}}[\textbf{Lu}|\textbf{X}]=\textbf{L}\Omega \textbf{L}^\top \). The following results will be for the case of heteroskedastic errors, where observations are independent and heterogeneously distributed. Consider the individual variances for each observation, \({\text {Var}}[L(\textbf{x}_i)^\top \textbf{u}|\textbf{X}]=L(\textbf{x}_i)^\top \Omega L(\textbf{x}_i)\), \(i=1,\ldots ,n\), and let \(s_n^2\) be the sum of the variances, \(s_n^2=\sum _{i=1}^n L(\textbf{x}_i)^\top \Omega L(\textbf{x}_i).\)
As long as the sum is not dominated by any particular term and if \(L(\textbf{x}_i)u_i\) are independent vectors distributed with mean \(\varvec{0}\) and variance \(L(\textbf{x}_i)^\top \Omega L(\textbf{x}_i)<\infty \) and \(s_n^2\rightarrow \infty \) as \(n\rightarrow \infty \), then
by the Lindeberg-Feller central limit theorem. It then follows that \(\sqrt{n}\left( \widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m}\right) \overset{d}{\rightarrow }\ N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top ).\)
The following results will be for the case of autocorrelated errors, where observations are dependent and identically distributed.Footnote 9 Define \(\textbf{L}_n \equiv \textbf{K}_{\sigma _2} \left( \frac{\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2 \textbf{I}}{n}\right) ^{-1}\Omega ^{-1}\) and \({L}_n(\textbf{X}_t)\) as the \(t\)th row of \(\textbf{L}_n\). Given (i) \(Y_t = m(\textbf{X}_t) + u_t, t=1,2,\ldots \); (ii) \(\left\{ (\textbf{X}_t, u_t) \right\} \) is a stationary ergodic sequence; (iii) (a) \(\left\{ L_n(X_{thi}) u_{th}, {\mathcal {F}}_t \right\} \) is an adapted mixingale of size \(-1\), \(h=1,\ldots ,p\), \(i=1,\ldots ,n\); (b) \(\mathbb {E}|L_n(X_{thi}) u_{th} |^2 < \infty \), \(h=1,\ldots ,p\), \(i=1, \ldots , n\); (c) \(\textbf{V}_n \equiv {\text {Var}}\left( \frac{1}{\sqrt{n}} \textbf{L}_n\textbf{u}\right) \) is uniformly positive definite; (iv) \(\mathbb {E}|L_n(X_{thi})|^2 < \infty \), \(h=1,\ldots ,p\), \(i=1,\ldots ,n\); (v) \(\underset{n\rightarrow \infty }{\lim }\ L_n(\textbf{X}_t)=L(X_t)\) and \(\underset{n\rightarrow \infty }{\lim }\ \textbf{L}_n=\textbf{L}\).
Consider \(n^{-1/2}\sum _{t=1}^n \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_t) u_t\), where \(\textbf{V}\) is any finite positive definite matrix, and define \(Z_t \equiv \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_t) u_t\). By Theorem 3.35 of White (2001), \(\left\{ Z_t, {\mathcal {F}}_t \right\} \) is an adapted stochastic sequence because \(Z_t\) is measurable with respect to \({\mathcal {F}}_t\). To see that \(\mathbb {E}(Z_t^2)<\infty \), note that we can write
where \(\tilde{\lambda }_i\) is the ith element of the \(n\times 1\) vector \(\tilde{\varvec{\lambda }} \equiv \textbf{V}^{-1/2} \varvec{\lambda }\). By definition of \(\varvec{\lambda }\) and \(\textbf{V}\), there exists \(\Delta < \infty \) such that \(|\tilde{\lambda }_i|<\Delta \) for all i. It follows from Minkowski’s inequality that
since for \(\Delta \) sufficiently large, \(\mathbb {E} | L_n(X_{thi}) u_{th} |^2< \Delta < \infty \) given (iii.b) and the stationarity assumption. Next, we show \(\left\{ Z_t, {\mathcal {F}}_t \right\} \) is a mixingale of size \(-1\). Using the expression for \(Z_t\) just given, we can write
Applying Minkowski’s inequality, it follows that
where \(\bar{c}_0 = \max _{h,i} c_{0hi} < \infty \) and \(\bar{\gamma }_m = \max _{h,i} \gamma _{mhi}\) is of size \(-1\). Thus, \(\lbrace Z_t, {\mathcal {F}}_t\rbrace \) is a mixingale of size \(-1\). Note that
Hence \(\textbf{V}_n\) converges to a finite matrix. Set \(\textbf{V}=\lim _{n\rightarrow \infty } \textbf{V}_n=\textbf{L}\Omega \textbf{L}^{\top }\), which is positive definite given (iii.c). Then, \(\bar{\sigma }^2 = \varvec{\lambda }^\top \textbf{V}^{-1/2} \textbf{V} \textbf{V}^{-1/2} \varvec{\lambda }=1\). Then by the martingale central limit theorem, \(n^{-1/2}\sum _{t=1}^n \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_t) u_t \overset{d}{\rightarrow }\ N(0,1)\). Since this holds for every \(\varvec{\lambda }\) such that \(\varvec{\lambda }^\top \varvec{\lambda }=1\), it follows from the Cramér-Wold theorem that \(n^{-1/2}\textbf{V}^{-1/2} \sum _{t=1}^n L_n(\textbf{X}_t) u_t \overset{d}{\rightarrow }\ N(\varvec{0}, \textbf{I})\). Hence, \(\sqrt{n}\textbf{Lu}\overset{d}{\rightarrow } N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top )\) and it then follows that \(\sqrt{n}\left( \widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m}\right) \overset{d}{\rightarrow }\ N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top ).\)
E Proof of Theorem 5
First, we note that the GKRLS derivative estimator is a linear smoother; substituting Eq. (10) into Eq. (28) gives \(\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)=L^{(1)}_r(\textbf{x}_0)^\top \textbf{y},\)
where \(\Delta _r \equiv \frac{2}{\sigma _2^2}{\text {diag}} (\textbf{x}_1^{(r)} -\textbf{x}_0^{(r)},\ldots ,\textbf{x}_n^{(r)}-\textbf{x}_0^{(r)})\) is an \(n\times n\) diagonal matrix and \(L^{(1)}_r(\textbf{x}_0)=\left[ (\Delta _r K_{\sigma _2,\textbf{x}_0}^{*})^{\top } (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top \) is the smoother vector for the first partial derivative with respect to the rth variable. Then, the conditional mean and variance of the GKRLS derivative can be derived as follows: \(\mathbb {E}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|\textbf{X}]=L^{(1)}_r(\textbf{x}_0)^\top \textbf{m}\) and \({\text {Var}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|\textbf{X}]=L^{(1)}_r(\textbf{x}_0)^\top \Omega L^{(1)}_r(\textbf{x}_0).\)
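A sketch of the derivative estimator (our illustration, assuming the Gaussian kernel, for which \(\partial K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0)/\partial x_0^{(r)}=\frac{2}{\sigma _2^2}(x_i^{(r)}-x_0^{(r)})K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0)\), matching \(\Delta _r\) above; function names are ours):

```python
import numpy as np

def gkrls_coef(X, y, Omega, sigma2, lam):
    """GKRLS coefficients (Omega^{-1} K + lam I)^{-1} Omega^{-1} y."""
    n = X.shape[0]
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) / sigma2)
    Oinv = np.linalg.inv(Omega)
    return np.linalg.solve(Oinv @ K + lam * np.eye(n), Oinv @ y)

def gkrls_predict(X, c_hat, x0, sigma2):
    """Fitted value k*(x0)^T c_hat at an arbitrary point x0."""
    k_star = np.exp(-((X - x0) ** 2).sum(axis=1) / sigma2)
    return k_star @ c_hat

def gkrls_deriv(X, c_hat, x0, r, sigma2):
    """First partial derivative w.r.t. variable r: the kernel vector is
    scaled elementwise by the diagonal of Delta_r = (2/sigma2)(x_i^(r) - x0^(r))."""
    k_star = np.exp(-((X - x0) ** 2).sum(axis=1) / sigma2)
    delta = (2.0 / sigma2) * (X[:, r] - x0[r])
    return (delta * k_star) @ c_hat
```

The analytic derivative agrees with a central finite difference of the fitted surface, which is a convenient internal check of the formula.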
F Proof of Theorem 6
The bias of the GKRLS derivative estimator in Eq. (28) is \({\text {Bias}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|\textbf{X}]=L^{(1)}_r(\textbf{x}_0)^\top \textbf{m}-m_r^{(1)}(\textbf{x}_0),\) where \({m}_r^{(1)}(\textbf{x}_0)\) is the true first partial derivative of m with respect to the rth variable. Since this quantity as well as \(\textbf{m}\) is unknown, we estimate both to calculate the conditional bias: \(\widehat{{\text {Bias}}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|\textbf{X}]=L^{(1)}_r(\textbf{x}_0)^\top \widehat{\textbf{m}}_2-\widehat{m}_{2,r}^{(1)}(\textbf{x}_0),\)
where \(\widehat{\textbf{m}}_2\) is the \(n\times 1\) vector of in sample GKRLS predictions of \(\textbf{m}\) and \(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)\) is the estimated GKRLS derivative prediction evaluated at point \(\textbf{x}_0\).
For the conditional variance, we assume that the error covariance matrix \(\Omega =\Omega (\theta )\) can be consistently estimated by \(\widehat{\Omega }=\widehat{\Omega }(\widehat{\theta })\). Then, using a consistent estimator of the error covariance matrix, the conditional variance of the GKRLS derivative estimator can be estimated by \(\widehat{{\text {Var}}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|\textbf{X}]=L^{(1)}_r(\textbf{x}_0)^\top \widehat{\Omega }L^{(1)}_r(\textbf{x}_0).\)
G Proof of Theorem 7
The conditional bias of the GKRLS first difference estimator in Eq. (38) is \({\text {Bias}}[\widehat{m}_{FD_b}(\textbf{x}_0)|\textbf{X}]=L_{FD_b}(\textbf{x}_0)^\top \textbf{m}-m_{FD_b}(\textbf{x}_0),\)
where \(m_{FD_b}(\textbf{x}_0) = m(x^{(b)}=1,\textbf{x}_0) - m(x^{(b)}=0, \textbf{x}_0)\) is the true first difference of m with respect to the bth variable and \(L_{FD_b}(\textbf{x}_0) = L(x^{(b)}=1,\textbf{x}_0) - L(x^{(b)}=0,\textbf{x}_0)\) is the first difference smoother vector.
The conditional variance of the GKRLS first difference estimator in Eq. (38) is \({\text {Var}}[\widehat{m}_{FD_b}(\textbf{x}_0)|\textbf{X}]=L_{FD_b}(\textbf{x}_0)^\top \Omega L_{FD_b}(\textbf{x}_0).\)
H A random effects model for airline data used in Sect. 7
Consider the following random effects model for an airline cost function: \(Y_{it}=m(X_{it})+\alpha _i+\varepsilon _{it}\), \(i=1,\ldots ,n\), \(t=1,\ldots ,T\), where \(Y_{it} = \log C_{it}\), \(X_{it} = (\log Q_{it}, \log P_{it})^{\top }\), \(\alpha _i\) is the firm specific effect, and \(\varepsilon _{it}\) is the idiosyncratic error term. In this empirical setting, we assume \(\mathbb {E}[\alpha _i|\textbf{X}]=\mathbb {E}[\varepsilon _{it}|\textbf{X}]=0\), \({\text {Var}}[\alpha _i|\textbf{X}]=\sigma ^2_{\alpha _i}\), and \({\text {Var}}[\varepsilon _{it}|\textbf{X}]=\sigma ^2_{\varepsilon _{i}}\).
Consider the composite error term \(U_{it}\equiv \alpha _i+\varepsilon _{it}\). Then, the model with the composite error term is \(Y_{it}=m(X_{it})+U_{it}.\)
Note that the independent variables are strictly exogenous; the regressors are mean independent of each error term and therefore of the composite error term: \(\mathbb {E}[U_{it}|\textbf{X}]=\mathbb {E}[\alpha _i|\textbf{X}]+\mathbb {E}[\varepsilon _{it}|\textbf{X}]=0.\)
In this framework, we allow for the errors to be heteroskedastic and correlated across time. The variance of the composite error term is \({\text {Var}}[U_{it}|\textbf{X}]=\sigma ^2_{\alpha _i}+\sigma ^2_{\varepsilon _{i}},\) where \(\mathbb {E}[\alpha _i\varepsilon _{it}|\textbf{X}]=0\) by assumption. The covariance of the composite errors is \({\text {Cov}}[U_{it},U_{is}|\textbf{X}]=\sigma ^2_{\alpha _i}\) for \(t\ne s\) and \({\text {Cov}}[U_{it},U_{js}|\textbf{X}]=0\) for \(i\ne j\).
Therefore, this framework allows for heteroskedasticity with respect to firms and correlation across time and the correlation across time can be firm specific.
Define the \(T\times 1\) vector of errors for firm i as \(\textbf{u}_i=(u_{i1},\ldots ,u_{iT})^\top \), \(i=1,\ldots ,n\), where we stack the errors over time for each firm. Then define the \(T\times T\) error covariance matrix for each firm as \(\Sigma _i=\mathbb {E}[\textbf{u}_i\textbf{u}_i^\top |\textbf{X}]=\sigma ^2_{\varepsilon _{i}}\textbf{I}_T+\sigma ^2_{\alpha _i}\varvec{\iota }_T\varvec{\iota }^\top _T.\) Therefore, the \(nT\times nT\) error covariance matrix \(\Omega \) is block diagonal, \(\Omega ={\text {diag}}(\Sigma _1,\ldots ,\Sigma _n).\)
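The block-diagonal structure is mechanical to assemble; a small sketch (function name ours) building \(\Omega \) from firm-level \((\sigma ^2_{\alpha _i},\sigma ^2_{\varepsilon _{i}})\):

```python
import numpy as np

def build_omega(sig2_alpha, sig2_eps, T):
    """Block-diagonal nT x nT covariance with blocks
    Sigma_i = sig2_eps[i] * I_T + sig2_alpha[i] * iota_T iota_T^T."""
    n = len(sig2_alpha)
    Omega = np.zeros((n * T, n * T))
    for i in range(n):
        Sigma_i = sig2_eps[i] * np.eye(T) + sig2_alpha[i] * np.ones((T, T))
        Omega[i * T:(i + 1) * T, i * T:(i + 1) * T] = Sigma_i
    return Omega

# two firms, three periods each, with firm-specific variances
Omega = build_omega([1.0, 2.0], [0.5, 0.25], T=3)
```

Within a block, the diagonal entries equal \(\sigma ^2_{\alpha _i}+\sigma ^2_{\varepsilon _{i}}\) and the off-diagonal entries equal \(\sigma ^2_{\alpha _i}\); entries across firms are zero, matching the covariance structure above.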
To estimate the random effects model of airline cost by GKRLS, first, we follow item 1 of the two step procedure outlined in Sect. 2. To get a consistent estimate of the error covariance matrix \(\Omega \), we can estimate the error variances using the residuals from the first step as
Since time averages are used to estimate the variances, \(\widehat{\sigma }^2_{\alpha _i}\) and \(\widehat{\sigma }^2_{\varepsilon _{i}}\) are consistent estimators of \({\sigma }^2_{\alpha _i}\) and \({\sigma }^2_{\varepsilon _{i}}\) by the law of large numbers. Then, using these estimates for the error covariance, we follow item 2 of the two step procedure to get GKRLS estimates of the cost function.
In order to apply the asymptotic results established in Sect. 4, we must have \(nT\rightarrow \infty \), so that consistency and asymptotic normality of the GKRLS estimator under the random effects model discussed in Sect. 7 apply. In addition, since time averages are used to estimate the variances, we also need \(T\rightarrow \infty \) to apply the law of large numbers and obtain consistent estimates of \({\sigma }^2_{\alpha _i}\) and \({\sigma }^2_{\varepsilon _{i}}\). Since \(T\rightarrow \infty \) implies \(nT\rightarrow \infty \), applying Theorems 3 and 4, the GKRLS estimator is consistent and asymptotically normal.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Dang, J., Ullah, A. Generalized kernel regularized least squares estimator with parametric error covariance. Empir Econ 64, 3059–3088 (2023). https://doi.org/10.1007/s00181-023-02411-z
Keywords
- Nonparametric estimator
- Semiparametric models
- Machine learning
- Kernel regularized least squares
- Covariance
- Heteroskedasticity
- Serial correlation