Generalized kernel regularized least squares estimator with parametric error covariance

A two-step estimator of a nonparametric regression function via kernel regularized least squares (KRLS) with a parametric error covariance is proposed. KRLS, which ignores any information in the error covariance, is improved upon by incorporating a parametric error covariance, allowing for both heteroskedasticity and autocorrelation, when estimating the regression function. A two-step procedure is used: in the first step, a parametric error covariance is estimated using KRLS residuals, and in the second step, a model transformed by the error covariance is estimated by KRLS. Theoretical results, including bias, variance, and asymptotics, are derived. Simulation results show that the proposed estimator outperforms KRLS in both the heteroskedastic and autocorrelated error cases. The method is illustrated with an empirical example: estimating an airline cost function under a random effects model with heteroskedastic and correlated errors. The derivatives are evaluated, and the average partial effects of the inputs are determined in the application.


Introduction
The rest of the paper is organized as follows: Sect. 2 introduces the generalized KRLS (GKRLS) estimator, Sects. 3, 4, and 5 present finite sample properties, asymptotic properties, and partial effects and derivatives of the GKRLS estimator, respectively, Sect. 6 runs through a simulation example, Sect. 7 illustrates an empirical example for a random effects model with heteroskedastic and correlated errors, and Sect. 8 concludes the paper.

Generalized KRLS estimator
Consider the nonparametric regression model

Y_i = m(X_i) + U_i, i = 1, . . . , n, (1)

where X_i is a q × 1 vector of exogenous regressors, and U_i is the error term such that E[U_i | X_1, . . . , X_n] = E[U_i | X] = 0, where X = (X_1, . . . , X_n)′ and E[UU′ | X] = Ω, the n × n error covariance matrix. In this framework, we allow the error covariance to be parametric, where the errors can be autocorrelated or non-identically distributed across observations.

KRLS estimator
For KRLS, the function m(·) can be approximated by some function in the space of functions constituted by

m(x_0) = Σ_{i=1}^n c_i K_σ(x_i, x_0)

for some test observation x_0, where c_i, i = 1, . . . , n are the parameters of interest, which can be thought of as the weights of the kernel functions K_σ(·). The subscript of the kernel function, K_σ(·), indicates that the kernel depends on the bandwidth parameter, σ. We will use the Radial Basis Function (RBF) kernel,

K_σ(x_i, x_j) = exp(−‖x_i − x_j‖² / σ²). (4)

Notice that the RBF kernel is very similar to the Gaussian kernel, except that it does not have the normalizing constant in front, and σ is proportional to the bandwidth h in the Gaussian kernel often used in nonparametric local polynomial regression. This functional form is justified by a regularized least squares problem with a feature mapping function that maps x into a higher dimension (Hainmueller and Hazlett 2014); this derivation of KRLS is also known as Kernel Ridge Regression (KRR).
Overall, KRLS uses a quadratic loss with a weighted L₂-regularization. Then, in matrix notation, the minimization problem is

arg min_c (y − K_σ c)′(y − K_σ c) + λ c′K_σ c,

where y is the vector of training data corresponding to the dependent variable, K_σ is the kernel matrix, with K_{σ,ij} = K_σ(x_i, x_j) for i, j = 1, . . . , n, and c is the vector of coefficients that is optimized over. The solution to this minimization problem is

ĉ₁ = (K_σ₁ + λ₁ I)⁻¹ y.
The kernel function can be user specified, but in this paper we only consider the RBF kernel in Eq. (4). The kernel function's hyperparameter σ and the regularization parameter λ can also be user specified or can be found via cross validation. The subscript of one denotes the KRLS estimator, or the first stage estimation. Finally, predictions for KRLS can be made by

m̂₁(x_0) = Σ_{i=1}^n ĉ_{1,i} K_σ₁(x_i, x_0). (7)
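As a concrete illustration, the closed form above can be implemented in a few lines. This is a minimal sketch under the stated formulas, not the authors' code; the function names (`rbf_kernel`, `krls_fit`, `krls_predict`) and the toy data are ours.

```python
# Sketch of KRLS with the RBF kernel: solve c_hat = (K + lam I)^{-1} y and
# predict via m_hat(x0) = sum_i c_i K(x_i, x0).
import numpy as np

def rbf_kernel(A, B, sigma2):
    # K[i, j] = exp(-||a_i - b_j||^2 / sigma2), with sigma2 playing sigma^2
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def krls_fit(X, y, lam, sigma2):
    K = rbf_kernel(X, X, sigma2)
    # c_hat = (K + lam I)^{-1} y, via a linear solve rather than an inverse
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krls_predict(X_train, c, X0, sigma2):
    # m_hat at the rows of X0: kernel evaluations against the training points
    return rbf_kernel(X0, X_train, sigma2) @ c
```

A linear solve is used instead of an explicit inverse, which is the standard numerically stable choice for this kind of ridge system.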

An efficient KRLS estimator
The KRLS estimator, m̂₁(·), does not take into consideration any information in the error covariance structure and is therefore inefficient. As a result, consider the n × n error covariance matrix, Ω(θ), where ω_ij(θ) denotes the (i, j)th element. Assume that Ω(θ) = P(θ)P(θ)′ for some square matrix P(θ), and let p_ij(θ) and v_ij(θ) denote the (i, j)th elements of P(θ) and P(θ)⁻¹, respectively. Let m ≡ (m(X₁), . . . , m(X_n))′ and U ≡ (U₁, . . . , U_n)′. Now, premultiply the model in Eq. (1) by P⁻¹:

P⁻¹y = P⁻¹m + P⁻¹U, (8)

where P⁻¹ = P⁻¹(θ); we condense the notation and the dependence on θ is implied.
The transformed error term, P⁻¹U, has mean 0 and covariance matrix equal to the identity matrix. Therefore, we consider a regression of P⁻¹y on P⁻¹m. This simply re-scales the variables by the inverse of the square root of their variances. Since m = K_σ c, the quadratic loss function with L₂ regularization under the transformed variables is

arg min_c (P⁻¹y − P⁻¹K_σ₂ c)′(P⁻¹y − P⁻¹K_σ₂ c) + λ₂ c′K_σ₂ c. (9)

The solution for the coefficient vector is

ĉ₂ = (Ω⁻¹K_σ₂ + λ₂ I)⁻¹ Ω⁻¹ y, (10)

since Ω⁻¹ = (P⁻¹)′P⁻¹. Note that the solution obtained depends on the bandwidth parameter σ₂ and ridge parameter λ₂, which can be different from the hyperparameters used in the KRLS estimator. In practice, cross validation can be used to obtain estimates of both hyperparameters. Here, it is assumed that Ω is known if θ is known. However, if θ is unknown, it can be estimated consistently and Ω can be replaced by Ω̂ = Ω(θ̂).¹ Furthermore, predictions for the generalized KRLS estimator can be made by

m̂₂(x_0) = Σ_{i=1}^n ĉ_{2,i} K_σ₂(x_i, x_0). (11)

The two step procedure is outlined below:
1. Estimate Eq. (1) by KRLS as in Eq. (7) with bandwidth parameter σ₁ and ridge parameter λ₁. Obtain the residuals, which can then be used to get a consistent estimate of Ω.
2. Estimate Eq. (8) by KRLS under the transformed variables as in Eqs. (9) and (11).
Denote these estimates as GKRLS.
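The two step procedure above can be sketched as follows, assuming a user-supplied routine for the parametric covariance fit (here `estimate_omega`, an illustrative name). With Ω = I and λ₁ = λ₂, the second step reproduces the first, which makes a convenient sanity check.

```python
# Minimal sketch of the two-step GKRLS procedure. Step 1: plain KRLS; step 2:
# re-fit using the estimated error covariance Omega, giving
# c2_hat = (Omega^{-1} K + lam2 I)^{-1} Omega^{-1} y.
import numpy as np

def rbf(A, B, s2):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(2) / s2)

def gkrls_fit(X, y, lam1, lam2, s2, estimate_omega):
    n = len(y)
    K = rbf(X, X, s2)
    c1 = np.linalg.solve(K + lam1 * np.eye(n), y)    # step 1: KRLS
    resid = y - K @ c1                               # KRLS residuals
    Omega = estimate_omega(resid, X)                 # parametric covariance fit
    Oinv = np.linalg.inv(Omega)
    c2 = np.linalg.solve(Oinv @ K + lam2 * np.eye(n), Oinv @ y)  # step 2
    return c1, c2, Omega
```

Any covariance model can be plugged in through `estimate_omega`, e.g. a diagonal fit for heteroskedasticity or an AR(1) fit for serial correlation.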

Selection of hyperparameters
Throughout this paper, we focus on the RBF kernel in Eq. (4), which contains the hyperparameter σ₁ (and σ₂). Since these parameters are squared in the RBF kernel in Eq. (4), we can instead search for the hyperparameters σ₁² and σ₂². The hyperparameters λ₁, λ₂, σ₁², and σ₂² are selected via leave one out cross validation (LOOCV). However, prior to cross validation, it is common in penalized methods to scale the data to have mean 0 and standard deviation 1. This way, the penalty parameters λ₁ and λ₂ do not depend on the scale of the data or the magnitude of the coefficients. Note that the scaling of the data does not affect the interpretation of predictions and marginal effects, since the estimates can be translated back to their original scale and location.
For the hyperparameters σ₁² and σ₂², Hainmueller and Hazlett (2014) suggest setting σ² = q, the number of regressors. Therefore, in items 1. and 2. of the two step procedure, σ₁² = q and σ₂² = q. Then, only the penalty hyperparameters λ₁ and λ₂ need to be chosen: λ₁ is chosen via LOOCV in item 1. of the two step procedure using Eq. (5), and λ₂ is then chosen via LOOCV in item 2. using Eq. (9). If one wishes to also search over σ₁² and σ₂², one would perform LOOCV to find λ₁ and σ₁² simultaneously in item 1. using Eq. (5) and then perform another LOOCV to find λ₂ and σ₂² simultaneously in item 2. using Eq. (9).
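One way to implement the LOOCV search for λ₁ with σ² fixed at q is the closed form leave-one-out identity for linear smoothers (leave-one-out residual_i = (y_i − m̂_i)/(1 − L_ii)), a common shortcut for ridge-type fits. The function name and the simple grid are ours, and this is a sketch rather than the paper's implementation.

```python
# LOOCV over lambda for the first-stage KRLS fit, with sigma^2 = q as
# suggested above, using the closed-form leave-one-out residuals of a
# linear smoother instead of refitting n times.
import numpy as np

def loocv_lambda(X, y, lambdas):
    n, q = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(2)
    K = np.exp(-d2 / q)                              # RBF with sigma^2 = q
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        L = K @ np.linalg.inv(K + lam * np.eye(n))   # smoother matrix
        resid = y - L @ y
        loo = resid / (1.0 - np.diag(L))             # leave-one-out residuals
        err = np.mean(loo ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

The diagonal entries of the smoother matrix lie strictly below 1 for a positive semidefinite RBF Gram matrix and λ > 0, so the division is safe.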

Finite sample properties
In this section, finite sample properties of both KRLS and GKRLS estimators, including the estimation procedures of bias and variance, are discussed in detail.
Footnote 1 (continued): Under heteroskedasticity, one can estimate Ω by a semiparametric KRLS estimator of the conditional variance (Dang and Ullah 2022). Other solutions may be explored as future work.

Estimation of bias and variance
In this subsection, we estimate the bias and variance of the two step estimator. Following De Brabanter et al. (2011), note that the GKRLS estimator is a linear smoother.
Definition 1 An estimator m̂ of m is a linear smoother if, for each x_0 ∈ R^q, there exists a vector L(x_0) = (l₁(x_0), . . . , l_n(x_0))′ such that

m̂(x_0) = L(x_0)′y, (12)

where m(·) : R^q → R.
For in sample data, Eq. (12) can be written in matrix form as m̂ = Ly, where m̂ = (m̂(X₁), . . . , m̂(X_n))′ ∈ R^n and L = (l(X₁), . . . , l(X_n))′ ∈ R^{n×n}, with L_ij = l_j(X_i). The ith row of L shows the weights given to each Y_j in estimating m(X_i). For the rest of the paper, we will denote m̂₂(·) as the prediction made by GKRLS for a single observation and m̂₂ as the n × 1 vector of predictions made for the training data.
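The linear smoother representation can be checked numerically: for GKRLS the smoother matrix is L = K_σ₂(Ω̂⁻¹K_σ₂ + λ₂I)⁻¹Ω̂⁻¹, so the fitted values K_σ₂ĉ₂ must equal Ly. A sketch with an illustrative diagonal Ω (names are ours):

```python
# Build the GKRLS smoother matrix L = K (Omega^{-1} K + lam I)^{-1} Omega^{-1}
# and confirm the linear-smoother identity m_hat_2 = L y.
import numpy as np

def gkrls_smoother(K, Omega, lam):
    Oinv = np.linalg.inv(Omega)
    # solve (Oinv K + lam I) X = Oinv, so that L = K X
    return K @ np.linalg.solve(Oinv @ K + lam * np.eye(len(K)), Oinv)
```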
To obtain the bias and variance of the GKRLS estimator, we assume the following:

Assumption 1
The regression function m(·) to be estimated falls in the space of functions represented by m(x_0) = Σ_{i=1}^n c_i K_σ(x_i, x_0), and assume the model in Eq. (1).
Using Definition 1, Assumption 1, and Assumption 2, the conditional mean and variance can be obtained by the following theorem.

Theorem 1 Under Assumptions 1 and 2, the GKRLS estimator is a linear smoother, m̂₂(x_0) = L(x_0)′y, with conditional mean E[m̂₂(x_0) | X] = L(x_0)′m and conditional variance Var[m̂₂(x_0) | X] = L(x_0)′ΩL(x_0).

Proof See Appendix A.
From Theorem 1, the conditional bias can be written as

Bias[m̂₂(x_0) | X] = L(x_0)′m − m(x_0).

Following De Brabanter et al. (2011), we will estimate the conditional bias and variance by the following:

Theorem 2 Let L(x_0) be the smoother vector evaluated at x_0 and let m̂₂ = (m̂₂(x₁), . . . , m̂₂(x_n))′ be the in sample GKRLS predictions. For a consistent estimator Ω̂ of the covariance matrix such that Ω̂ → Ω, the estimated conditional bias and variance for GKRLS are obtained by

Bias-hat[m̂₂(x_0) | X] = L(x_0)′m̂₂ − m̂₂(x_0) (18)

and

Var-hat[m̂₂(x_0) | X] = L(x_0)′Ω̂L(x_0). (19)

Proof See Appendix B.
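The plug-in estimates in Theorem 2 are cheap to compute once the smoother vector is available. A hedged sketch, with argument names of our choosing:

```python
# Plug-in bias and variance at a test point x0, following Theorem 2:
# bias_hat = L(x0)' m_hat_2 - m_hat_2(x0), var_hat = L(x0)' Omega_hat L(x0).
import numpy as np

def bias_var_at(L_x0, m_hat2_train, m_hat2_x0, Omega_hat):
    bias = L_x0 @ m_hat2_train - m_hat2_x0   # L(x0)' m_hat_2 - m_hat_2(x0)
    var = L_x0 @ Omega_hat @ L_x0            # L(x0)' Omega_hat L(x0)
    return bias, var
```

For a positive semidefinite Ω̂ the returned variance is nonnegative, which is a quick check that the covariance estimate was assembled correctly.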

Bias and variance of KRLS
First, note that the KRLS estimator is also a linear smoother, so the bias and the variance take the same form as in Eqs. (18) and (19), except that the linear smoother vector L(x_0) will be different. Let L₁(x_0) be the smoother vector for KRLS, so that Eq. (7) can be rewritten as m̂₁(x_0) = L₁(x_0)′y. Using Theorem 1 and Theorem 2 and applying them to the KRLS estimator, the estimated conditional bias and variance of KRLS are

Bias-hat[m̂₁(x_0) | X] = L₁(x_0)′m̂₁ − m̂₁(x_0)

and

Var-hat[m̂₁(x_0) | X] = L₁(x_0)′Ω̂L₁(x_0),

where m̂₁ is the n × 1 vector of fitted values for KRLS. Note that the estimate of the covariance matrix, Ω̂, will be the same for both KRLS and GKRLS.

Asymptotic properties
The asymptotic properties of GKRLS, including consistency, asymptotic normality, and bias corrected confidence intervals, are covered in this section. To obtain consistency of the GKRLS estimator, we also assume:

Assumption 3 Let λ₁, λ₂, σ₁, σ₂ > 0 and, as n → ∞, for singular values of LP given by d_i, Σ_{i=1}^n d_i² grows slower than n once n > M for some M < ∞.

Theorem 3 Under Assumptions 1-3, the bias corrected fitted values, denoted by m̂_{2,c} = m̂₂ − Bias-hat[m̂₂ | X], are consistent for m.

Proof See Appendix C.
The estimated conditional bias from Eq. (18) and conditional variance from Eq. (19) can be used to construct pointwise confidence intervals. Asymptotic normality of the proposed estimator is given via the central limit theorem.
Theorem 4 Under Assumptions 1-3, m̂₂ is asymptotically normal by the central limit theorem, with Bias[m̂₂ | X] = Lm − m and Var[m̂₂ | X] = LΩL′.
Proof See Appendix D.
Since GKRLS is a biased estimator of m, we need to adjust the pointwise confidence intervals to allow for bias. Since the exact conditional bias and variance are unknown, we can use Eqs. (18) and (19) as estimates and construct approximate bias corrected 100(1 − α)% pointwise confidence intervals from Theorem 4 as

m̂_{2,c}(x_i) ± z_{1−α/2} (Var-hat[m̂₂(x_i) | X])^{1/2}

for all i, where m̂_{2,c}(x_i) = m̂₂(x_i) − Bias-hat[m̂₂(x_i) | X] is the bias corrected fit. Furthermore, to test the significance of the estimated regression function at an observation point, we can use the bias corrected confidence interval to see if 0 is in the interval.
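A sketch of the bias corrected intervals, using the standard normal quantile from Python's standard library; the function name and the vectorized interface are our choices:

```python
# Bias corrected 100(1 - alpha)% pointwise intervals: subtract the estimated
# bias, then add/subtract z_{1-alpha/2} estimated standard deviations.
import numpy as np
from statistics import NormalDist

def bias_corrected_ci(m_hat, bias_hat, var_hat, alpha=0.05):
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)        # standard normal quantile
    center = np.asarray(m_hat) - np.asarray(bias_hat)  # bias corrected fit
    half = z * np.sqrt(np.asarray(var_hat))
    return center - half, center + half
```

Significance at a point can then be read off by checking whether 0 lies inside the returned interval.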

Partial effects and derivatives
We also derive an estimator for pointwise partial derivatives with respect to a certain variable x^(r). The partial derivative of the GKRLS estimator m̂₂(x_0) with respect to the r-th variable is

m̂^(1)_{2,r}(x_0) ≡ ∂m̂₂(x_0)/∂x^(r) = Σ_{i=1}^n ĉ_{2,i} (−2(x_0^(r) − x_i^(r))/σ₂²) K_σ₂(x_i, x_0), (28)

using the RBF kernel in Eq. (4). To find the conditional bias and variance of the derivative estimator, we use the following:

Theorem 5 The GKRLS derivative estimator in Eq. (28) with the RBF kernel in Eq. (4) can be rewritten as

m̂^(1)_{2,r}(x_0) = S_r(x_0)′y,

where D_r ≡ diag(−2(x_0^(r) − x_i^(r))/σ₂²) is an n × n diagonal matrix, k_σ₂(x_0) ≡ (K_σ₂(x₁, x_0), . . . , K_σ₂(x_n, x_0))′, and S_r(x_0)′ ≡ (D_r k_σ₂(x_0))′(Ω̂⁻¹K_σ₂ + λ₂I)⁻¹Ω̂⁻¹ is the smoother vector for the first partial derivative with respect to the r-th variable. Then, the conditional mean of the GKRLS derivative estimator is

E[m̂^(1)_{2,r}(x_0) | X] = S_r(x_0)′m

and the conditional variance is

Var[m̂^(1)_{2,r}(x_0) | X] = S_r(x_0)′ΩS_r(x_0).

Proof See Appendix E.
Using Theorem 5, the conditional bias and variance can be estimated as follows:

Theorem 6 Let S_r(x_0) be the smoother vector for the partial derivative evaluated at x_0 and let m̂₂ = (m̂₂(x₁), . . . , m̂₂(x_n))′ be the in sample GKRLS predictions. For a consistent estimator Ω̂ of the covariance matrix such that Ω̂ → Ω, the estimated conditional bias and variance for the GKRLS derivative estimator in Eq. (28) are obtained by

Bias-hat[m̂^(1)_{2,r}(x_0) | X] = S_r(x_0)′m̂₂ − m̂^(1)_{2,r}(x_0)

and

Var-hat[m̂^(1)_{2,r}(x_0) | X] = S_r(x_0)′Ω̂S_r(x_0).

Proof See Appendix F.
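Since the RBF kernel is K(x_i, x_0) = exp(−‖x_i − x_0‖²/σ²), the partial of the fitted function with respect to x_0^(r) is Σ_i c_i (−2(x_0^(r) − x_i^(r))/σ²) K(x_i, x_0), and this can be verified against a finite difference. A sketch with illustrative names (the coefficient vector c stands for ĉ₂):

```python
# Pointwise RBF-kernel derivative estimator and the fitted function it
# differentiates; the derivative is checked against a central difference.
import numpy as np

def rbf_vec(X, x0, s2):
    return np.exp(-((X - x0) ** 2).sum(axis=1) / s2)

def krls_prediction(X, c, x0, s2):
    return float(c @ rbf_vec(X, x0, s2))

def krls_derivative(X, c, x0, r, s2):
    # sum_i c_i * (-2 (x0_r - x_ir) / s2) * K(x_i, x0)
    k = rbf_vec(X, x0, s2)
    return float(c @ (-2.0 * (x0[r] - X[:, r]) / s2 * k))
```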
The average partial derivative with respect to the r-th variable is

m̄^(1)_r = (1/n*) Σ_{j=1}^{n*} m̂^(1)_{2,r}(x_{0,j}) = (1/n*) ι′_{n*} S_{0,r} y.

The bias and variance of the average partial derivative estimator are given by

Bias[m̄^(1)_r | X] = (1/n*) ι′_{n*}(S_{0,r}m − m^(1)_{0,r})

and

Var[m̄^(1)_r | X] = (1/n*²) ι′_{n*} S_{0,r} Ω S′_{0,r} ι_{n*},

where n* is the number of observations in the testing set, ι_{n*} is an n* × 1 vector of ones, S_{0,r} is the n* × n smoother matrix with jth row S_r(x_{0,j})′, j = 1, . . . , n*, and m^(1)_{0,r} is the n* × 1 vector of derivatives evaluated at each x_{0,j}, j = 1, . . . , n*.

First differences for binary independent variables
Unlike in the continuous case, partial effects for binary independent variables should be interpreted as, and estimated by, first differences. That is, the estimated effect of going from x^(b) = 0 to x^(b) = 1 is

m̂^FD_b(x_0) = m̂₂(x^(b) = 1, x_0) − m̂₂(x^(b) = 0, x_0) = L^FD_b(x_0)′y, (38)

where m̂^FD_b(·) is the first difference estimator for the bth binary independent variable, x^(b) is a binary variable that takes the values 0 or 1, x_0 is the (q−1) × 1 vector of the other independent variables evaluated at some test observation, and L^FD_b(x_0) ≡ L(x^(b) = 1, x_0) − L(x^(b) = 0, x_0) is the first difference smoother vector. The conditional bias and variance of the first difference GKRLS estimator in Eq. (38) are shown in the following theorem.

Theorem 7 Using Theorems 1 and 2, the conditional bias and variance for the GKRLS first difference estimator in Eq. (38) are obtained by

Bias[m̂^FD_b(x_0) | X] = L^FD_b(x_0)′m − m^FD_b(x_0)

and

Var[m̂^FD_b(x_0) | X] = L^FD_b(x_0)′ΩL^FD_b(x_0).

Then, the conditional bias and variance can be estimated as follows:

Bias-hat[m̂^FD_b(x_0) | X] = L^FD_b(x_0)′m̂₂ − m̂^FD_b(x_0) (41)

and

Var-hat[m̂^FD_b(x_0) | X] = L^FD_b(x_0)′Ω̂L^FD_b(x_0). (42)

Note that Eq. (38) provides the pointwise first difference estimates. If one is interested in the average partial effect of going from x^(b) = 0 to x^(b) = 1, the following average first difference GKRLS estimator would be used:

m̄^FD_b = (1/n*) Σ_{j=1}^{n*} m̂^FD_b(x_{0,j}). (43)
This average partial effect of a discrete variable is similar to the continuous case and can be compared to traditional parametric partial effects, as in the case of least squares coefficients. The conditional bias and variance of the average first difference GKRLS estimator in Eq. (43) are

Bias[m̄^FD_b | X] = (1/n*) ι′_{n*}(L^FD_{0,b}m − m^FD_{0,b})

and

Var[m̄^FD_b | X] = (1/n*²) ι′_{n*} L^FD_{0,b} Ω L^FD′_{0,b} ι_{n*},

where L^FD_{0,b} is the n* × n smoother matrix with jth row L^FD_b(x_{0,j})′, j = 1, . . . , n*, and m^FD_{0,b} is the n* × 1 vector of first differences evaluated at each x_{0,j}, j = 1, . . . , n*. The conditional bias and variance of the average first difference estimator can be estimated using Eqs. (41) and (42).
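The first difference effect, pointwise and averaged over test observations, can be sketched generically for any fitted prediction function (the names are ours, and the linear test function below is only for illustration):

```python
# First-difference partial effect for a binary regressor: evaluate the fitted
# function at x^(b) = 1 and x^(b) = 0, holding the other regressors fixed.
import numpy as np

def first_difference(predict, x0, b):
    x1, x0_ = x0.copy(), x0.copy()
    x1[b], x0_[b] = 1.0, 0.0
    return predict(x1) - predict(x0_)

def average_first_difference(predict, X0, b):
    # average the pointwise first differences over the test observations
    return float(np.mean([first_difference(predict, x, b) for x in X0]))
```

For a linear function the first difference recovers the coefficient on the binary variable exactly, which is the comparison to least squares coefficients mentioned above.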

Simulations
We conduct simulations to show the efficiency gains of the proposed generalized KRLS estimator. Consider the data generating process from Eq. (1). We use a sample size of n = 200 and three independent variables X; the data generating process for X and the specification for m are given in Eqs. (46) and (47), along with the partial derivatives of m with respect to each independent variable. For the error terms, we consider two cases: first, in Eq. (49), U_i is generated by an AR(1) process; second, U_i is heteroskedastic but independent across observations.

In addition to the proposed estimator, we consider four other nonparametric estimators: the KRLS estimator (KRLS), the Local Polynomial (LP) estimator with degree zero, Random Forest (RF), and Support Vector Machine (SVM). The KRLS estimator is used as a comparison to GKRLS to show the magnitude of the efficiency loss from ignoring the information in the error covariance matrix. The KRLS, LP, RF, and SVM estimators do not utilize the covariance matrix in estimating the regression function and ignore heteroskedasticity or autocorrelation in the errors. For the GKRLS and KRLS estimators, we set σ₁² = σ₂² = 3, the number of independent variables in this example, and implement leave one out cross validation to select the hyperparameters λ₁ and λ₂.² The variance function under the heteroskedastic case is estimated by least squares from the regression of the log squared residuals on X; taking the exponential gives the predicted variance estimates. Under the case of AR(1) errors, the covariance function is estimated from an AR(1) model. We run 200 simulations for each of the two cases, and the bias corrected results are reported below in Table 1.³ To evaluate the estimators, mean squared error (MSE) is used as the main criterion, and we also investigate the bias and variance. To compare results, all estimators are evaluated at 300 data points generated from Eqs. (46) and (47).
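The two covariance fits described above can be sketched as follows. The exact specifications used in the paper are not reproduced here, so treat the moment estimators (a lag-one regression for the AR(1) coefficient, a log squared residual regression for the variance function) as plausible stand-ins with illustrative names:

```python
# Two covariance fits from first-stage residuals: an AR(1) covariance built
# from an estimated autoregressive coefficient, and a diagonal heteroskedastic
# covariance from a least-squares fit of log squared residuals on X.
import numpy as np

def ar1_omega(resid):
    # lag-one least squares estimate of the AR(1) coefficient
    rho = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)
    s2 = np.var(resid)                       # marginal variance of the process
    n = len(resid)
    # stationary AR(1): Cov(u_t, u_s) = s2 * rho^{|t - s|}
    return s2 * rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

def hetero_omega(resid, X):
    # regress log squared residuals on X, exponentiate the fit for variances
    Z = np.column_stack([np.ones(len(resid)), X])
    beta, *_ = np.linalg.lstsq(Z, np.log(resid ** 2 + 1e-12), rcond=None)
    return np.diag(np.exp(Z @ beta))
```

Either function can serve as the `estimate_omega` input to a two-step fit.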
Table 1 displays the evaluations, including bias, variance, and MSE of the estimators for the regression function under both error cases. Note that the GKRLS and KRLS estimates in Table 1 are bias corrected. All estimates are averaged across all simulations. Estimates based on GKRLS exhibit similar finite sample bias to KRLS, and there is a clear reduction in variability, with the proposed estimator having smaller variance than KRLS. GKRLS estimation provides a 31.6% and a 3.6% decrease in the variance of the estimated regression function for the autocorrelated and heteroskedastic error cases, respectively, relative to KRLS. With smaller variance, GKRLS also has a smaller MSE, making GKRLS superior to KRLS. Compared to the other nonparametric estimators, LP, RF, and SVM, the GKRLS estimator outperforms the others in terms of MSE and is the preferred method in the presence of heteroskedasticity or autocorrelation.

Table 2 displays the evaluations, including bias, variance, and MSE of the bias corrected GKRLS and KRLS estimators for the partial derivatives of the regression function with respect to each of the independent variables under both error cases.⁴ Since X₁ is discrete, the partial derivative is estimated by the first differences discussed in Sect. 5.1. Similar to the regression estimates, for both heteroskedastic and AR(1) errors, the variability of the derivative estimates is reduced by GKRLS estimation relative to KRLS estimation. In addition, the efficiency gain in estimating both the regression and the derivative is more evident in the AR(1) case than in the heteroskedastic case. A possible explanation is that the covariance matrix contains more information in the off-diagonal elements compared to the diagonal covariance matrix in the heteroskedastic case.
Overall, when estimating the regression function and its derivative for this simulation example, the reduction in variance and therefore MSE is clearly evident in Tables 1 and 2, making the GKRLS the preferred estimator. Table 3 shows the simulation results for the consistency of GKRLS. The bias, variance, and MSE are reported for sample sizes of n = 100, 200, 400. In this example, we set σ 2 1 = σ 2 2 = 3 and the hyperparameters λ 1 and λ 2 are found by LOOCV. For the regression function and the derivative and for both error covariance structures, the squared bias, variance, and MSE all decrease as the sample size increases, which implies that the GKRLS estimator is consistent in this simulation exercise.

Application
We implement an empirical application from the U.S. airline industry with heteroskedastic and autocorrelated errors, using a panel of 6 firms over 15 years.⁵ We set aside a portion of the data for training and the rest for testing. We estimate the model with four methods, GKRLS, KRLS, LP, and Generalized Least Squares (GLS), and compare their results in terms of mean squared error (MSE). To evaluate the out of sample performance of each method, the predicted out of sample MSEs are computed as

MSE_e = (1/(n*T)) Σ_{j=1}^{n*T} (Y_{0,j} − m̂_e(x_{0,j}))²,

where MSE_e is the mean squared error for the e-th estimator and j indexes the n*T observations in the testing data set. In this empirical exercise, n* = 1 and T = 15, since we leave out the first firm as a test set. To assess the estimated average derivatives, we use the bootstrap to calculate the MSEs for the average partial effects; the bootstrapped MSEs for the average derivative are computed as in Eq. (52).

U.S. airline industry
We obtain the data on the efficiency in production of airline services from Greene (2018). Since the data are a panel of 6 firms over 15 years, we consider the one way random effects model:

Y_it = m(X_it) + α_i + ε_it, (54)

where the dependent variable Y_it = log C_it is the logarithm of total cost, the independent variables X_it = (log Q_it, log P_it)′ are the logarithms of output and the price of fuel, respectively, α_i is the firm specific effect, and ε_it is the idiosyncratic error term. In this empirical setting, we assume that α_i and ε_it have conditional mean zero with firm specific variances σ²_αi and σ²_εi. Consider the composite error term U_it ≡ α_i + ε_it. Then, the model in Eq. (54) can be rewritten as

Y_it = m(X_it) + U_it. (55)

In Eq. (55), the independent variables are strictly exogenous with respect to the composite error term, E[U_it | X] = 0. The variance of the composite error term is E[U²_it | X] = σ²_αi + σ²_εi. Therefore, in this empirical example, we allow for firm specific heteroskedasticity. In other words, the variances of the error terms are not constant across firms, but are constant over time for each firm. Since there is a time component, we allow an individual firm's errors to be correlated across time but not with other firms, that is, E[U_it U_is | X] = σ²_αi for t ≠ s, and E[U_it U_js | X] = 0 for all t and s if i ≠ j. Note that the correlation across time can be different for every firm. Therefore, in this empirical framework, we allow the error terms to be heteroskedastic across firms and correlated across time.
To estimate Eq. (55) by GKRLS and KRLS in the framework set up in this paper, we can write the model in matrix notation:

y = m + U, (56)

where y is the nT × 1 vector of log C_it, m is the nT × 1 vector of the regression function m(X_it), and U is the nT × 1 vector of U_it, i = 1, . . . , n and t = 1, . . . , T. Then, the nT × nT error covariance matrix is

Ω = diag(Σ₁, . . . , Σ_n), (57)

where Σ_i = σ²_εi I_T + σ²_αi ι_T ι′_T, i = 1, . . . , n, has dimension T × T, I_T is a T × T identity matrix, and ι_T is a T × 1 vector of ones. To use the GKRLS estimator in this empirical framework, we first estimate Eq. (55) or (56) by KRLS and obtain the residuals, denoted by û_it. To estimate the error covariance matrix Ω, the variances of the firm specific error and the idiosyncratic error, σ²_αi and σ²_εi, need to be estimated; consistent estimators based on time averages of the residual vectors can be used, where û_i is the T × 1 vector of residuals for the ith firm. Now, plugging these estimates in for Ω, the GKRLS estimator can be computed as in the previous sections. For further details, please see Appendix H. With regard to the other comparable estimators, the KRLS and LP estimators are used to estimate Eq. (55) or (56) ignoring the heteroskedasticity and correlation in the composite error, U. Note that the KRLS estimator uses the error covariance matrix in the variances and standard errors but does not use the error covariance in estimating the regression function. Lastly, the GLS estimator is used as a parametric benchmark corresponding to the standard random effects panel data model.⁷

The data contain 90 observations on 6 firms over 15 years, 1970-1984. We split the data into two parts: the first 15 observations, corresponding to the first firm, are used as testing data, and the remaining 75 observations, corresponding to the last five firms (i = 1, . . . , 5 and t = 1, . . . , 15), are used as training data to evaluate out of sample performance.
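The block diagonal covariance in Eq. (57) can be assembled from per-firm variance estimates. The moment estimators below (an off diagonal average for σ²_αi) are one plausible reading of the time average estimators referenced in the text, not necessarily the paper's exact formulas, and the function name is ours:

```python
# Assemble the nT x nT random-effects covariance Omega = blockdiag(Sigma_i),
# Sigma_i = s2_eps_i * I_T + s2_alpha_i * ones((T, T)), from stacked residuals.
import numpy as np

def re_omega(resid, n, T):
    O = np.zeros((n * T, n * T))
    for i in range(n):
        u = resid[i * T:(i + 1) * T]
        s2_total = np.mean(u ** 2)                   # estimate of E[u_it^2]
        # average of off-diagonal products estimates s2_alpha_i
        s2_alpha = max((np.sum(np.outer(u, u)) - u @ u) / (T * (T - 1)), 0.0)
        s2_eps = max(s2_total - s2_alpha, 1e-8)      # guard against negatives
        O[i * T:(i + 1) * T, i * T:(i + 1) * T] = (
            s2_eps * np.eye(T) + s2_alpha * np.ones((T, T))
        )
    return O
```

The clipping guards are a practical detail: the moment estimates can turn slightly negative in small samples, which would make Ω̂ non-invertible.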
For the GKRLS and KRLS estimators, all hyperparameters are chosen via LOOCV.⁸ The bias corrected average partial derivatives and corresponding standard errors are reported in Table 4. These averages are calculated by training each estimator on the five firms with 75 observations in the training data set. The estimates are bias corrected, and the results from Sect. 5 are used in our calculations. All estimators display positive and significant relationships between cost and each of the regressors, output and price, with their average partial derivatives being positive. The elasticity with respect to output ranges from 0.5885 to 0.8436, and the elasticity with respect to price ranges from 0.2260 to 0.4581. More specifically, for the GKRLS estimator, a 10% increase in output would increase total cost by an average of 8.13%, and a 10% increase in fuel price would increase total cost by an average of 4.25%, holding all else fixed. Comparing the GKRLS and KRLS methods, the estimates of the average partial derivatives are similar, but the standard errors are significantly reduced for GKRLS for both output and fuel price, implying a gain in efficiency. Therefore, using the information and the structure of the error covariance in Eq. (57) in estimating the regression function allows GKRLS to provide more robust estimates of the average partial effects of each independent variable compared to KRLS. Table 4 shows that the GLS estimator slightly overestimates the elasticity with respect to output and underestimates the elasticity with respect to fuel price compared to those of GKRLS. The LP estimator appears to provide different average partial effect estimates compared to the rest of the estimators. One possible explanation is that the bandwidths may not be optimal, since data-driven bandwidth selection methods (e.g., cross validation) fail when there is correlation in the errors (De Brabanter et al. 2018).
Since the data are panel structured, there is correlation across time, making bandwidth selection for LP estimators difficult. The LP estimates are from the local constant estimator; however, the local linear estimator provides similar estimates of the average partial effects. Nevertheless, the LP average partial effects of each variable are positive and significant, consistent with the other methods. Furthermore, GKRLS provides similar average partial effects with respect to output and price but is more efficient, with smaller standard errors, relative to the other considered estimators.
To assess the estimators in terms of out of sample performance, we calculate the MSEs using the 15 observations in the testing data set. Table 5 reports MSEs for the four considered estimators. The first column reports the out of sample MSEs using the 15 observations from the first firm. Out of all the considered estimators, the GKRLS estimator outperforms the others in terms of MSE. In other words, the GKRLS estimator can be seen as the superior method for estimating the regression function in this empirical example. The bootstrapped MSEs for the average partial derivatives, calculated by Eq. (52), are reported in the second and third columns of Table 5. For both average partial derivatives, with respect to output and price, GKRLS produces the lowest MSE, outperforming the other estimators. In addition, since GKRLS incorporates the error covariance structure, efficiency is gained and reductions in MSE are achieved relative to KRLS. Overall, GKRLS is the best method in terms of MSE for estimating both the airline cost function and the average partial effects with respect to output and price.

Conclusion
Overall, this paper proposes a nonparametric regression function estimator via KRLS under a general parametric error covariance. The two step procedure allows for heteroskedastic and serially correlated errors: in the first step, KRLS is used to estimate the regression function and the parametric error covariance, and in the second step, KRLS is used to estimate the regression function using the information in the error covariance. The method improves efficiency in the regression estimates as well as the partial effects estimates compared to standard KRLS. The conditional bias and variance, pointwise marginal effects, consistency, and asymptotic normality of GKRLS are provided. Simulations show reductions in variance and MSE for GKRLS relative to KRLS. The method is illustrated with an empirical example: estimating an airline cost function under a random effects model with heteroskedastic and correlated errors. The average derivatives are evaluated, and the average partial effects of the inputs are determined in the application. In the empirical exercise, GKRLS is more efficient than KRLS and is the preferred method for estimating the airline cost function and its average partial derivatives in terms of MSE.
Funding Open access funding provided by SCELC, Statewide California Electronic Library Consortium. This research received no funding.

Declarations
Conflicts of interest Justin Dang declares that he has no conflict of interest. Aman Ullah declares that he has no conflict of interest.

Human or animal rights This article does not contain any studies with human participants performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Proof of Theorem 1
First, we note that the GKRLS estimator is a linear smoother. Substituting Eq. (10) into (11),

m̂₂(x_0) = k_σ₂(x_0)′ĉ₂ = k_σ₂(x_0)′(Ω̂⁻¹K_σ₂ + λ₂I)⁻¹Ω̂⁻¹y ≡ L(x_0)′y,

where k_σ₂(x_0) ≡ (K_σ₂(x₁, x_0), . . . , K_σ₂(x_n, x_0))′. Then, the conditional mean and variance of GKRLS can be derived as follows:

E[m̂₂(x_0) | X] = L(x_0)′E[y | X] = L(x_0)′m,

Var[m̂₂(x_0) | X] = L(x_0)′Var[y | X]L(x_0) = L(x_0)′ΩL(x_0).

B Proof of Theorem 2
The exact bias of GKRLS for the training data is given by

Bias[m̂₂ | X] = E[m̂₂ | X] − m = Lm − m.

Therefore, the conditional bias can be estimated at any point x_0 by

Bias-hat[m̂₂(x_0) | X] = L(x_0)′m̂₂ − m̂₂(x_0).

For the conditional variance, we assume that the error covariance matrix Ω = Ω(θ) can be consistently estimated by Ω̂ = Ω(θ̂). Then, using a consistent estimator of the error covariance matrix, the conditional variance of GKRLS can be estimated by

Var-hat[m̂₂(x_0) | X] = L(x_0)′Ω̂L(x_0).

C Proof of Theorem 3
Since the bias corrected fitted values, m̂_{2,c}, have zero conditional bias, we can focus on the conditional variance. From Theorem 1, the conditional variance of the GKRLS estimator is Var[m̂₂ | X] = LΩL′. For large enough n, tr(D²) slows in growth and converges to some constant, M, and the average variance of m̂(x_i) is (1/n)Σ_{i=1}^n d_i². Recall that d_i² denotes the ith squared singular value of LP and is proportional to the variance explained by a given singular vector of LP. Given the construction of LP, the columns of this product matrix can be thought of as weights on the data, scaled by the standard deviation of the error term. Therefore, the number of large singular values will grow initially with n, but the number of important dimensions or singular values will start to grow slowly with n. As a result, the average variance of m̂(x_i), which is (1/n)Σ_{i=1}^n d_i², shrinks to zero as n → ∞. Since the average variance shrinks to zero, each individual variance must approach zero as n becomes large.
We also provide an alternative proof of consistency. Consider the GKRLS coefficient estimator of c in Eq. (10): ĉ₂ = (Ω⁻¹K_σ₂ + λ₂I)⁻¹Ω⁻¹y. Again, since we consider the bias corrected estimator, m̂_{2,c}, we can focus on the conditional variance. However, below we also show that the non-bias corrected estimator has zero conditional bias in the limit. Taking the conditional bias of ĉ₂:

Bias[ĉ₂ | X] = (Ω⁻¹K_σ₂ + λ₂I)⁻¹Ω⁻¹K_σ₂c − c = −λ₂(Ω⁻¹K_σ₂ + λ₂I)⁻¹c,

where the strict exogeneity assumption E[u | X] = 0 is used. Furthermore, if we assume λ₂ is fixed or does not grow as fast as n and (1/n)Ω⁻¹K_σ₂ → Q, a positive definite matrix with finite elements, as n → ∞, then Bias[ĉ₂ | X] → 0 as n → ∞.
Taking the conditional variance of ĉ₂,

Var[ĉ₂ | X] = (Ω⁻¹K_σ₂ + λ₂I)⁻¹Ω⁻¹(K_σ₂Ω⁻¹ + λ₂I)⁻¹,

we again assume that λ₂ is fixed or does not grow as fast as n and that (1/n)Ω⁻¹K_σ₂ → Q, a positive definite matrix with finite elements, so that the variance vanishes as n → ∞. As long as the sum is not dominated by any particular term, and if the L(x_i)u_i are independent vectors distributed with mean 0 and variance L(x_i)′L(x_i) < ∞ and s_n² → ∞ as n → ∞, then a central limit theorem applies.

D Proof of Theorem 4

The following results are for the case of autocorrelated errors, where observations are dependent and identically distributed.⁹ Define L_n ≡ K_σ₂(Ω⁻¹K_σ₂ + λ₂I_n)⁻¹Ω⁻¹ and L_n(X_t) as the t-th row of L_n. Given (i) Y_t = m(X_t) + u_t, t = 1, 2, . . .; (ii) {(X_t, u_t)} is a stationary ergodic sequence; (iii) (a) {L_n(X_th)_i u_th, F_t} is an adapted mixingale of size −1, h = 1, . . . , p, i = 1, . . . , n; (b) E|L_n(X_th)_i u_th|² < ∞, h = 1, . . . , p, i = 1, . . . , n; (c) V_n ≡ Var[(1/√n)L_n u] is uniformly positive definite; (iv) E|L_n(X_th)_i|² < ∞, h = 1, . . . , p, i = 1, . . . , n; (v) lim_{n→∞} L_n(X_t) = L(X_t) and lim_{n→∞} L_n = L.
where $V$ is any finite positive definite matrix. By Theorem 3.35 of White (2001), $\{Z_t, \mathcal{F}_t\}$ is an adapted stochastic sequence because $Z_t$ is measurable with respect to $\mathcal{F}_t$. To see that $E(Z_t^2) < \infty$, note that we can write $Z_t$ in terms of $\bar{\lambda}_i$, the $i$th element of the $n \times 1$ vector $\bar{\lambda} \equiv V^{-1/2}\lambda$. By the definitions of $\lambda$ and $V$, there exists $\Delta < \infty$ such that $|\bar{\lambda}_i| < \Delta$ for all $i$. It then follows from Minkowski's inequality that $E(Z_t^2) < \infty$, since for $\Delta$ sufficiently large, $E|L_n(X_{thi})u_{th}|^2 < \Delta < \infty$ given (iii.b) and the stationarity assumption. Next, we show that $\{Z_t, \mathcal{F}_t\}$ is a mixingale of size $-1$, using the expression for $Z_t$ just given (here we follow a proof similar to the case of dependent identically distributed observations in White 2001). Applying Minkowski's inequality, it follows that $V_n$ converges to a finite matrix. Set $V = \lim_{n \to \infty} V_n = LL'$, which is positive definite given (iii.c). Then $\bar{\sigma}^2 = \lambda' V^{-1/2} V V^{-1/2} \lambda = 1$. By the martingale central limit theorem, $n^{-1/2}\sum_{t=1}^{n} \lambda' V^{-1/2} L_n(X_t) u_t \xrightarrow{d} N(0, 1)$. Since this holds for every $\lambda$ such that $\lambda'\lambda = 1$, it follows from the Cramér–Wold theorem that $n^{-1/2} V^{-1/2} L_n u \xrightarrow{d} N(0, I)$.

E Proof of Theorem 5
First, we note that the GKRLS derivative estimator is a linear smoother, obtained by substituting Eq. (10) into Eq. (28), where $S_r(x_0)$ is the smoother vector for the first partial derivative with respect to the $r$th variable. Then, the conditional mean and variance of the GKRLS derivative estimator can be derived as follows
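To make the linear-smoother representation concrete, the sketch below constructs a derivative smoother vector for the RBF kernel, using $\partial K_\sigma(x_0, x_i)/\partial x_{0r} = -(2/\sigma^2)(x_{0r} - x_{ir})K_\sigma(x_0, x_i)$, in the simplified unweighted KRLS case ($\Omega = I$), and checks it against a finite difference of the fitted function; the simulated data, bandwidth, and penalty are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, sigma2):
    # RBF kernel matrix between the rows of A and the rows of B
    d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def derivative_smoother(x0, X, sigma2, lam, r=0):
    # S_r(x0): row vector with  m_hat^(1)_r(x0) = S_r(x0) @ y, using
    # dK(x0, x_i)/dx0_r = -(2/sigma2) * (x0_r - x_ir) * K(x0, x_i).
    n = X.shape[0]
    k = rbf(x0[None, :], X, sigma2).ravel()
    dk = -(2.0 / sigma2) * (x0[r] - X[:, r]) * k
    return dk @ np.linalg.inv(rbf(X, X, sigma2) + lam * np.eye(n))

# Simulated data (illustrative), fitted function, and a check of the
# analytic derivative against a central finite difference.
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(100, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
x0 = np.array([0.3, -0.5])
sigma2, lam, h = 2.0, 0.1, 1e-5

def m_hat(x):
    c = np.linalg.solve(rbf(X, X, sigma2) + lam * np.eye(100), y)
    return float(rbf(x[None, :], X, sigma2).ravel() @ c)

analytic = float(derivative_smoother(x0, X, sigma2, lam, r=0) @ y)
step = np.array([h, 0.0])
numeric = (m_hat(x0 + step) - m_hat(x0 - step)) / (2.0 * h)
print(abs(analytic - numeric))  # small: the two derivatives agree
```

Because the derivative estimator is linear in $y$, its conditional mean and variance follow directly from the smoother vector, as in the proof above.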

F Proof of Theorem 6
The conditional bias of the GKRLS derivative estimator in Eq. (28) involves $m^{(1)}_r(x_0)$, the true first partial derivative of $m$ with respect to the $r$th variable. Since this quantity, as well as $m$, is unknown, we estimate both to calculate the conditional bias.
where $\hat{m}_2$ is the $n \times 1$ vector of in-sample GKRLS predictions of $m$ and $\hat{m}^{(1)}_{2,r}(x_0)$ is the estimated GKRLS derivative prediction evaluated at the point $x_0$.
For the conditional variance, we assume that the error covariance matrix $\Omega = \Omega(\theta)$ can be consistently estimated by $\hat{\Omega} = \Omega(\hat{\theta})$. Then, using this consistent estimator of the error covariance matrix, the conditional variance of the GKRLS derivative estimator can be estimated by $\widehat{\operatorname{Var}}[\hat{m}^{(1)}_{2,r}(x_0) \mid X = x_0] = S_r(x_0)\, \hat{\Omega}\, S_r(x_0)'$
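A small sketch of this plug-in variance, with a hypothetical smoother vector and a purely illustrative diagonal (heteroskedastic) $\hat{\Omega}$ built from squared residuals:

```python
import numpy as np

def deriv_variance(S_r, omega_hat):
    # Plug-in variance of the derivative estimator:
    # Var_hat = S_r(x0) @ Omega_hat @ S_r(x0)'
    return float(S_r @ omega_hat @ S_r)

# Hypothetical smoother vector and residuals (illustration only)
S_r = np.array([0.2, -0.1, 0.05, 0.3])
residuals = np.array([0.5, 1.0, 0.2, 0.8])
Omega_hat = np.diag(residuals**2)  # heteroskedastic diagonal estimate

print(deriv_variance(S_r, Omega_hat))  # approximately 0.0777
```

With a diagonal $\hat{\Omega}$ the quadratic form reduces to $\sum_i S_{r,i}^2\,\hat{u}_i^2$; a non-diagonal $\hat{\Omega}$ (autocorrelated errors) uses the same expression.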

G Proof of Theorem 7
The conditional bias of the GKRLS first difference estimator in Eq. (38) involves the true first difference of $m$ with respect to the $b$th variable and the first difference smoother vector. The conditional variance of the GKRLS first difference estimator in Eq. (38) is derived analogously to that of the derivative estimator.

H A random effects model for airline data used in Sect. 7

Consider the following random effects model for an airline cost function:

$$Y_{it} = m(X_{it}) + u_{it}, \qquad u_{it} = \alpha_i + \varepsilon_{it}, \qquad i = 1, \ldots, n, \; t = 1, \ldots, T,$$

where $\alpha_i$ is the firm specific effect and $\varepsilon_{it}$ is the idiosyncratic error term. In this empirical setting, we assume $E[\alpha_i \mid X] = E[\varepsilon_{it} \mid X] = 0$. Note that the independent variables are strictly exogenous; the regressors are mean independent of each error term and therefore of the composite error term: $E[u_{it} \mid X] = 0$.

In this framework, we allow for the errors to be heteroskedastic and correlated across time. The variance of the composite error term is $\operatorname{Var}[u_{it} \mid X] = \sigma_{\alpha_i}^2 + \sigma_{\varepsilon_i}^2$, where $E[\alpha_i \varepsilon_{it} \mid X] = 0$ by assumption. The covariance of the composite errors is $\operatorname{Cov}[u_{it}, u_{is} \mid X] = \sigma_{\alpha_i}^2$ for $t \neq s$. Therefore, this framework allows for heteroskedasticity with respect to firms and correlation across time, and the correlation across time can be firm specific.

Define the $T \times 1$ vector of errors for firm $i$ as $u_i = (u_{i1}, \ldots, u_{iT})'$, $i = 1, \ldots, n$, where we stack the errors over time for each firm. Then define the $T \times T$ error covariance matrix for each firm, $\Omega_i$, as $\Omega_i = E[u_i u_i' \mid X]$. Therefore, the $nT \times nT$ error covariance matrix is block diagonal, $\Omega = \operatorname{diag}(\Omega_1, \ldots, \Omega_n)$.

To estimate the random effects model of airline cost by GKRLS, we first follow step 1 of the two step procedure outlined in Sect. 2. To get a consistent estimate of the error covariance matrix $\Omega$, we can estimate the error variances using the residuals from the first step.
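The block-diagonal covariance described above can be assembled directly. The sketch below builds each firm's $T \times T$ block from firm-specific variance components ($\sigma_{\alpha_i}^2$ on every entry, from the common firm effect, plus $\sigma_{\varepsilon_i}^2$ on the diagonal) and stacks the blocks; the particular variance numbers are illustrative assumptions:

```python
import numpy as np

def firm_covariance(s2_alpha, s2_eps, T):
    # One firm's T x T block: the firm effect alpha_i contributes
    # s2_alpha to every entry (correlation across time), while the
    # idiosyncratic error adds s2_eps on the diagonal only.
    return s2_alpha * np.ones((T, T)) + s2_eps * np.eye(T)

def build_omega(s2_alphas, s2_epss, T):
    # nT x nT block-diagonal error covariance; firm-specific variance
    # components allow heteroskedasticity across firms.
    n = len(s2_alphas)
    omega = np.zeros((n * T, n * T))
    for i, (a, e) in enumerate(zip(s2_alphas, s2_epss)):
        omega[i * T:(i + 1) * T, i * T:(i + 1) * T] = firm_covariance(a, e, T)
    return omega

# Two firms, three periods, firm-specific variances (illustrative)
Omega = build_omega([1.0, 2.0], [0.5, 0.25], T=3)
print(Omega.shape)   # (6, 6)
print(Omega[0, 1])   # 1.0  (within-firm covariance, firm 1)
print(Omega[0, 3])   # 0.0  (across firms)
```

In the two-step procedure, the variance components in each block would be replaced by estimates computed from the first-step KRLS residuals.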