Bandwidth selection for statistical matching and prediction

While there exist many bandwidth selectors for estimation, bandwidth selection for statistical matching and prediction has hardly been studied so far. We introduce a computationally attractive selector for nonparametric out-of-sample prediction problems like data matching, impact evaluation, scenario simulations or imputing missings. Even though the method is bootstrap based, we can derive closed expressions for the criterion function which avoids the need of Monte Carlo approximations. We study both, asymptotic and finite sample performance. The derived consistency, convergence rate and extensive simulation studies show the successful operation of the selector. The method is illustrated by applying it to real data for studying the gender wage gap in Spain. Specifically, the salary of Spanish women is predicted nonparametrically by the wage equation estimated for men while conditioned on their own (i.e., women’s) characteristics. An important discrepancy between observed and predicted wages is found, exhibiting a serious gender wage gap.


Introduction
While there exists a considerable literature on bandwidth selection for kernel based nonparametric densities and regression, the problem of bandwidth selection for nonparametric prediction has largely been ignored. To our knowledge, such selection methods do not exist despite the relevance and frequency of such prediction problems in practice. Examples are statistical matching or data matching (see Eurostat 2013, and references therein), problems of imputation of missing data (for recent compendia see, Rubin 2004;Su et al. 2010, andvan Buuren 2018), and the simulation of scenarios such as those mentioned below. Another prominent example is the prediction of counterfactuals like in treatment effect estimation, see the recent compendium of Frölich and Sperlich (2019). This relates our selection problem also to Vansteelandt et al. (2012) who look more generally on model selection for causal inference. We are not thinking of extrapolation outside of the support of observed covariates (Li and Heckman 2003), a problem that would go beyond the here described one. We neither refer to the bandwidth selection problem for doing forecasting with time series, see the review of Antoniadis et al. (2009) or Tschernig and Yang (2000).
To be more specific, the situations we are thinking of have the following features in common: one has two samples referring to potentially different populations, say the source and the target group. In the source group one has observed the response Y , being the variable of interest, and some auxiliary information X , i.e., some observed covariates. However, for the target group the (potential) response Y is not observedor the actually observed response is not the wanted counterpart of what we observed in the source group. If one is willing to assume (Y |X ) target ∼ (Y |X ) source one can use the source sample to predict Y for the target in order to calculate some parameter of interest like the treatment effect in causal analysis. Consider the following examples: In counterfactual exercises one typically has a response Y also observed for the target sample, but under a different situation, say, under 'treatment,' and needs therefore to predict it under 'no treatment.' The average difference between the observed and imputed Y gives the so-called 'treatment effect for the treated.' In data matching, response Y is not sampled in the target sample but in different sources. Similarly, when facing missing values where one assumes that they are missing at random when conditioning on X , one may consider as target sample the part of your data set that contains Y , and as source sample the rest. Finally, one can imagine scenarios where the explanatory random variable X of the target refers to an artificial, maybe future, population, for which we still cannot observe Y . In most of these examples, and especially for the first and the last one, the distribution of X in the target will be different from the one in the source. For being so-called 'confounders' in causal inference, this is even a requirement. However, one needs either to suppose that support(X ) target ⊆ support(X ) source , or to use methods made for extrapolating to obtain predictions of Y for the entire support(X ) target . For simplicity, one can simply think of both a common support (for source and target).
In this article, we concentrate on the mean functions, so that the underlying hypothesis is not to have (Y |X ) target ∼ (Y |X ) source but the same regression function, say E[Y |X = x], in both groups. For example, in our application, Y is salary, and we estimate for men (source population) E[Y |X ] containing education, sector and professional experience. This is used then to predict the corresponding expected wages of women (target population) to compare those predictions with the women's realized wages. The difference gives an estimate of the gender wage gap. Evidently, this strategy can be used in many different contexts, also for artificial populations like in scenarios. This describes a quite relevant statistical problem for which we need a regularization parameter like a bandwidth for kernel smoothing. In treatment effect, estimation via matching this type of a bandwidth selection problem has been studied by Frölich (2005), Häggström and Luna (2014) and Galdo et al. (2008). The former two were not interested in the nonparametric estimators but only the total averages. Only the latter considered a problem similar to ours, but proposed weighted crossvalidation (CV). Moreover, Frölich (2005) only studied if the use of standard selectors for regression yield acceptable results but did not propose a problem-specific alternative. Also for flexible methods for imputing missings this seems still to be an open problem, see van Buuren (2018). For scenario calculations, we could not even find a profound discussion of this problem in the existing literature.
However, there certainly exists a vast literature on bandwidth selection, see the reviews of Heidenreich et al. (2013) and Köhler et al. (2014). As the prediction problem we discuss is regression based, we should first recall the existing bandwidth selection procedures for regression. As outlined in the above reviews, they can essentially be divided into two groups: cross-validation (CV) and plug-in methods. Likewise one may distinguish between those that try to minimize the integrated or averaged squared error (ISE or ASE) on the one side, and those that try to minimize the expected ISE or ASE, known as MISE and MASE, on the other side. Without pronouncing in favor of one or the other in general, in many regression problems there are good reasons to be more interested in minimizing the ASE (rather than the MASE) as people want to get the best bandwidth for their specific sample fit. Also the implementation seems to be simpler for CV methods; both together explains most of their popularity. In our case, however, the first argument is no longer valid as we may want to get predictions for different samples, and then would prefer the MASE criterion. Also the second argument does no longer hold as adapting CV to our problem is not easy as can be seen from Galdo et al. (2008). Our proposal relies on MASE minimization via smooth bootstrap (Cao and González-Manteiga 1993).
Next, we concentrate on global bandwidths as this is the most popular approach in practice. We call a bandwidth global if it is supposed to be applied over the whole support of the regressor, and is therefore supposed to minimize some global loss function. Local bandwidths are optimal for (just) one point of the support and had therefore to be calculated for all regressor values at which one wants to estimate (or predict). It is therefore hardly ever used. We further limit our presentation to so-called local constant prediction, i.e., based on the classical Nadaraya-Watson estimator. For an extension to local linear methods, see the thesis of Barbeito (2020). While local linear estimators have doubtless their advantages, she showed that for bandwidth selection in our context they render the method much more cumbersome and computationally quite unattractive. Note finally that, although the theory for our method is developed along the problem of a particular bandwidth choice, it also applies more generally to model selection for prediction.
Smooth bootstrap aims to draw bootstrap samples from a nonparametric pilot estimate of the joint distribution of (X , Y ). For the source and each bootstrap sample m(x) := E[Y |X ] is estimated. The estimates allow us to approximate the mean squared error ofm(x). These errors are averaged over the x i observed in the target sample. It can be shown that there exists a closed analytical form for the resulting MASE bootstrap estimate. This simplifies the procedure importantly making it quite attractive. One may argue that the exactness of this MASE approximation hinges on the pilot estimate. Yet, in order to find the optimal bandwidth or model, it suffices that the MASE approximations take their minimum at the same point as the true, unknown MASE. Our simulations show that this is indeed the case.
In Sect. 2, we introduce the prediction problem and bandwidth selector considered, introduced for the case with one explanatory variable. Assumptions and asymptotic properties are also provided in that section. The simulations in Sect. 3 demonstrate an excellent performance of this method. Section 4 extends our approach to situations with one continuous explanatory variable and several categorical ones and illustrates the method along a study of the gender wage gap in Spain. Section 5 concludes and discusses further extensions like the one to boundary kernels, multivariate continuous covariates and local linear estimation. Asymptotic theory on the pilot choice is deferred to the Appendix, and technical proofs to the Supplementary Material.

Closed-expression for the criterion functions
For the target population, we are provided with observations {X 1 i } n 1 i=1 from density f 1 which is potentially different from f 0 , and we assume E Y 1 |X 1 = x = m(x). Then, we may consider two different, though related problems: (a) predicting the unobserved Y 1 . 1 Recall, if some outcome is observed for the target group, its conditional expectation might nonetheless seriously differ from m(·), like in our study in which we observe women's wages but want to predict their wages as if they were paid like men.
We estimate m(·) by a Nadaraya-Watson estimatorm h with bandwidth, h. Let us suppress for a moment the upper index and concentrate on the source sample. The challenge is to find a bandwidth, h, which is optimal for problems (a) and (b). The pointwise MSE, and afterward the MASE are approximated by their bootstrap versions. Similarly as Cao and González-Manteiga (1993), consider two pilot bandwidths g X , g Y . They propose to obtain the bootstrap resamples either sampling from: , or 1 Our notation differs from the one used in the treatment literature: There, Y 1 refers to outcome under treatment, Y 0 to outcome under control. In contrast, we use the upper index to indicate the group.
The latter is equivalent to resampling from the bivariate densityf g In the following, only SB2 will be considered because SB1 is just the limit case of SB2 when g Y → 0 + , for fixed n.
. The second term on the right hand side of (1) is negligible. So it suffices to consider a proxy estimator of m(x) that corresponds to considering only the first term, and we get (2) where X * has bootstrap marginal densityf g X . Using convolution (K h * q x ) (x) := K h (x − y)q x (y) dy, the point-wise MSE and its bootstrap analog are in Theorem 1, assuming (A1) K is a bounded symmetric density function sample, and f (x) = 0, then the point-wise mean squared error ofm h can be expressed as The proof of Theorem 1 is provided in the Supplementary Material. It is to be emphasized that (5) is exact and not just an approximation. In other words, there is no need to draw bootstrap resamples as (5) can be calculated directly.
The next step is notationally cumbersome because for the MASE, we need to carefully distinguish between source and target sample. Therefore, we use the upper indices again. As said in the introduction, for the sake of presentation suppose f 0 and f 1 have common support. In order to select an optimal bandwidth, our aim is to minimize where, for any random variable Z , E 0 [Z ] = E Z | X 1 j , ∀ j ∈ {1, . . . , n 1 } refers to the expectation in the source population conditioned on the target sample. Similarly to Theorem 1, we can state for the MASE and its bootstrap analog: be a simple random sample coming from the source population, and {X 1 i } n 1 i=1 a simple random sample coming from the target population. Then, the MASE prediction error is Furthermore, working out the convolutions in expression (9) we get (9) can be rewritten as follows:

Corollary 1 If K is a Gaussian kernel, then expression
In the special case of considering SB1, that is, if g Y = 0, one gets Proofs of Theorem 2 and Corollary 1 are in the Supplementary Material, whereas the proof of Corollary 2 is immediate. For the sake of simplicity, we consider g = g X = g Y . A bootstrap bandwidth selector for prediction can now be defined as As we could see, computation of h B O OT does not require the use of Monte Carlo approximation nor the nonparametric estimation of the density f 1 of the target population.
In order to verify whether this procedure works, we have first to answer two questions: Is the MASE of approximation (2), i.e., whenm h substitutesm h , a useful approximation for the MASE ofm h ? And if so, is the MASE bootstrap analog a useful approximation for the former? Figure 1 reveals that the optimal bandwidths for both estimators are very close. It also shows that the first approximation is much more sensitive to the bandwidth than the original one which makes it numerically even easier to find the minimum. Furthermore, for almost all simulations the minimizer in the bootstrap world is very close to the true one. This answers both questions with a clear 'yes.'

Asymptotic theory
We now derive the asymptotic rate of convergence for the bandwidth selector. To do this we have to look at the asymptotic analogs of MASEm h ,X 1 and its bootstrap version, say ) 2 dF 1 (x) and its bootstrap version. Note that MISE a is the MISE but with the proxy estimator (2). The proof of theorems, lemmas, propositions and corollaries of this section are in the Supplementary Material.

Asymptotic expression for the criterion function
We need some regularity conditions for the kernel, source density and regression function: Using a Taylor expansion and a change of variable, we obtain for the MISE a (h):

Lemma 1 Under the regularity conditions (B1)-(B4), the MISE a can be expressed as
. The asymptotic version of expression (11) is given by: Minimizing expression (12), we obtain the AMISE a bandwidth, namely With (11) and (13) which tends to zero at the rate n −4/5 0 as n 0 −→ ∞. From this we can conclude: Theorem 3 Under the regularity conditions (B1)-(B3), the bandwidth which minimizes MISE a has the asymptotic expression: It remains to study how accurate the approximation MISE a for the MISE is. As M refers to the 'mean,' we have to study the difference between ISE a and ISE. For this we need some conditions on the kernel (C1) and density function (C2), cf. Silverman (1978); as well as conditions for the regression function (C3), cf. Mack and Silverman (1982): (C1) (a) K is uniformly continuous with modulus of continuity w K and bounded variation V (K ). K is absolutely integrable with respect to the Lebesgue measure. where This proposition tells us that the difference rapidly vanishes.

Asymptotic expressions for the bootstrap criterion functions
Next, the smoothed bootstrap version of MISE a is to be studied. We start by com- , the theoretical analog of MASE * m h ,X 1 . For studying MISE a * , i.e., the bootstrap MISE ofm h , given in (2), Lemma 2 Under the regularity conditions (B1)-(B4), the function MISE a * is Thus, the dominant part of expression (16), namely AMISE a * (h), is given by: By minimizing expression (17), we can state an AMISE a * bandwidth, namely Next, similarly as in the non-bootstrap context, it remains to check the accuracy of the theoretical approximation we have considered for MISE * , i.e., our MISE a * . Again we can show that this approximation is appropriate in terms of ISE * .

Proposition 3 Assume conditions (C1) to (C3), and suppose further n
almost sure with respect to P, where We can also establish the convergence rate for (h MISE a * − h MISE a ) /h MISE a , using expressions (16), (18) and (36) (11), (16) and (17), respectively. Under the regularity conditions (B1)-(B4) and assuming that g is of order n It remains to discuss the selection of g. Expression (16) reveals that this pilot bandwidth should be selected to minimize the error produced by estimating expression (12) via (17). So one should minimize the expected squaredδ It is easy to see that implying that g OPT depends in turn on bandwidth h. This suggests to consider where h AMISE a was defined in (13). It can be shown that which leads us to where , 2}, and m = 1 if = 0. Unfortunately, the computation of the expectations in (25) is extremely tedious, implying the calculation of more than three hundred U -statistics of order n 0 · n 1 . An alternative is to find an upper bound for expression (24), say This requires to work withÂ g − A andB g − B. Some technical results concerning the quantification of the error of both approximations are collected in Lemmas 3, 4 and Corollary 3 in the Appendix. As can be seen in the Supplementary Material, the optimal g for the upper bounds obtained for the termsÂ g − A andB g − B happens to be n −1/2 0 .

Simulation study of performance
Our method is not only, but particularly interesting for the so-called matching methods and scenarios. Both are applied in many different situations. Although we certainly studied many more simulation designs, we found out that they could all be decomposed and/or classified into the following three different situations.
We therefore decided to limit our presentation to these exemplifying situations which can furthermore be motivated by the following examples. Imagine an impact evaluation problem with unbalanced treatment vs control group in which one has to estimate the effect of an IT training course for unemployed on their chance to find a job within the next 4 months (after the training) or on their next salary. As these courses are not compulsory, the age of participants may have a right skewed distribution, whereas the one of the nonparticipants may have a left-skewed one. It is well known that the likelihood to find a job and the next salary plotted on age are inverted U-shaped curves. This situation is well reflected in Scenario 1.
For many poverty or inequality as well as for health studies, one needs to perform data matching or imputing missing values as in the main data set the variable of interest is simply not observed or exhibits many missing values but one believes these can be 'explained' (not necessarily in the causal but a stochastic sense) by observed X (see, e.g., Dai et al. (2016) for related problems we consider). For certain tax interventions on luxury goods, one may expect most variation of the regression curve in the areas of higher expenditures. However, in surveys regarding income (X ) and expenditures (say for luxury goods, Y ), lower-and middle income groups have much less missing values. Here, the source is the subsample with, the target the subsample without Y . This gives scenario 2.
Similarly, for scenarios and ex-ante evaluation one considers the present society versus a potential future one. Imagine the regression function in Scenarios 2 and 3 had an upward bump (what for the simulation outcome does not make any difference). We are not only concerned about the pension system but also look at other aspects like the health system. Most of the variation in doctor visits and health expenditures happens (a) for people above 65, mainly men, and (b) young women. Thinking only of men, for a strongly aging society one has a similar scenario as in 2 but with an upward bump instead of a downward one in the regression function. In contrast, when looking only on men and heavy road accidents, in many countries one has an important upward bump between the age of 18 and 28, followed by a relatively flat curve. Then, one has to flip the source and target group and end up in a (numerically) equivalent situation as in Scenario 3.
Next, think of a counterfactual exercise to study discrimination in wages by gender or race. To do so one may want to estimate the wage gap controlling for (by conditioning on) certain factors like sector, studies and years of experience or age. Such conditioning is not only important for a fair comparison, but also to understand better channels of wage differences and discrimination. Notice that again, the distributions of the three mentioned covariates differ a lot between gender. We do, however, not know if the regression is equally smooth on the main support of both distributions (Scenario 1), or if we are in Scenarios 2 or 3. So it is clear that we should apply therefore our bandwidth selection method. This is the situation we face in the next section, but is covered by these three simulation scenarios.
It is important to notice that Scenario 1 is equivalent to a situation where the distributions f 0 , f 1 are quite similar, independently of the functional form of the regression function. In such situation, the density-weighted smoothness of the regression function is the same in both groups as it is in Scenario 1. It is clear that in such situation our bandwidth selector cannot do better than the corresponding counterpart for regression in the source sample. Unless one has done some prior studies about equality of distributions or smoothness of the regression, in practice one does not know if one is in such a special situation. However, our Scenario 1 will show that even then our method does not much worse, and Scenarios 2 and 3 will show that else one does much better with our method.

Description of the study
For each scenario, 100 random samples of size n 0 = n 1 = 500 are drawn. The Gaussian kernel is used to avoid divisions by zero. Recall that the bandwidth selector is the minimizer of an empirical function. As it does not have an explicit expression, numerical methods are used to approximate it. The algorithm is as follows: Step 1 Consider a grid of 50 values of h in the [0.01, 0.2], equispaced on logarithmic scale.
Step 2 h OPT 1 is the value that minimizes the MASE * m h , given in expression (10). Selection of g X is discussed in Sect. 3.2.
Step 3 From the bandwidth grid take the previous and the next one to h OPT 1 , and construct an equally spaced grid of five values in logarithmic scale between them.
Step 4 Steps 2 and 3 are repeated three times, retaining the optimal bandwidth value in the last stage. That is taken as our numerical approximate of the optimal h BOOT .
Recall from Fig. 1 that for numerical reasons, the calculated MASE * m h might decrease for very large h. In order to avoid oversmoothing due to this phenomenon, h BOOT is considered as the local minimizer of MASE * m h closest to but larger than zero. We are interested in the performance of our bandwidth selector in terms of prediction or data matching. To this aim, we compare our h BOOT , made for prediction, (h 1 in the following) with a bandwidth selector made for regression, say h 0 . Clearly, there exist many bandwidth selectors for regression. To make it comparable to h 1 , notice that a different way to look at the bandwidth selection problem for nonparametric regression is to minimize the MASE in the source population. Then, h 0 is the result of minimizing As before, estimators in (26)  . The selector is Certainly, if one wanted to use our method to select a bandwidth for regression in prac- . This, however, produces an additional bias which disadvantages the resulting selector compared to h 1 . One might conclude then in favor of h 1 only because of this bias. Notice that by construction, h 0 ≈ h 1 if source and target population are very similar. Bandwidths h 0 and h 1 are compared by means of wherem h N W j stands for the Nadaraya-Watson regression estimator obtained with source sample X 0 , Y 0 . The results of are presented in Figs. 2, 3and 4.

Selection of g
The pilot bandwidth, setting g = g X = g Y , was g = h SJ n 4/45 0 , where h SJ is the plug-in bandwidth selector proposed by Sheather and Jones (1991) for kernel density estimation. Accordingly, g has order n −1/9 0 , being the optimal rate for smoothed bootstrap, see Cao and González-Manteiga (1993). Nonetheless, an additional simulation study was carried out for this pilot bandwidth g. Consider g = h C n 1/5−α 0 with α ∈ 1 5 , 1 9 , where h C = h SJ , h CV , the latter being the cross-validation bandwidth for regression. Our criteria are for θ wherem h j is the Nadaraya-Watson estimator computed with the source sample. The results in Table 1suggest that α and h C do not have a great impact on the performance of h 1 . The values for (30) and (31) in Table 1 suggest to choose α = 1/9 and h C = h SJ .

Discussion of simulation results
Recall that for Scenario 1 is equivalent to a situation where both densities are the same, as densities and regression function are just mirrored; one would therefore expect h 0 to perform at least as good as h 1 . For all other cases, we expect h 1 to outperform h 0 (if our method works in practice). And indeed, Figs. 2, 3 and 4 reveal that in Scenario 1, h 0 is at least as good as h 1 , but else h 1 clearly beats h 0 . Not unexpected we find the following situations:    (Scenario 3, Fig. 4). This is the opposite situation to the one mentioned before. Now m(x) is quite flat in the main support of the target population, but oscillates a lot in the main support of the source population. 3. h 0 and h 1 are close (Scenario 1, Fig. 2). As discussed, this is expected because m(·) is just mirrored for the main supports of the respective populations.
Results of CPU times are collected in Table 2. This shows the practical behavior of the closed expression obtained for MASE * m h ,X 1 of the Nadaraya-Watson estimator given in expression (9) in terms of computer time. The order of computational complexity of this method is O(n 0 · n 1 ), as shown in Table 2. This is empirically compared with the efficiency in terms of computer time of expression (5), which is the closed expression of MSE * x . Packages parallel and Bolstad of the free software R have been used. Notice that selecting our global bandwidth is about as fast as calculating one MSE-based local one. Finally, as said in the introduction and in Sect. 5, when using our selection method for the local linear estimator, see Barbeito (2020), computation time increases by a factor > 60.
In Table 3 , we study the performance of h * MASE for Scenario 3 (calculated from 100 random samples) over different sample sizes, namely n 0 ∈ {25, 50, 100, 200} combined with n 1 ∈ {25, 50, 100}, as well as the case of n 0 = n 1 = 1000 (for computational reasons without combinations). As expected, the performance of the bandwidth selector clearly improves as n 0 increases, while the target sample size, n 1 , does not matter much. If we consider a classic approach such as the cross-validation bandwidth (as presented in Marron 1985), we have for Scenario 3, 100 runs and n 0 = n 1 = 100 a very small median for the values obtained for the bandwidths selected, but large values for expressions (28) and (29), see Table 4, i.e., our method beats the classical CV selector by far.

Estimating the gender wage gap in Spain
Consider the problem of estimating the gender wage gap based on a 2014 survey in Spain, i.e., after its recovery from the financial crisis. Wage gaps have been of long-standing political concern as an indicator of discrimination. Yet, the fact that women are paid lower wages than men may well be due to differences in education, experience and/or the sector they work in. The challenge is then to account for factors X that might explain differences in wages. Until today, gender wage gap studies that account for those characteristics are mainly based on fully parametric models. Often they separate the male and female populations for looking at the Blinder-Oaxaca decomposition, see Blau and Kahn (2017) for a recent review. Moral-Arce et al. (2012) introduced a semiparametric version of this decomposition and extended it to quantiles to study the heterogeneity of the gap. They studied the gender wage gap in Spain before the economic crisis, specifically the development around the millennium. For more references consult these two articles.
Our approach is different in several aspects. First, it is fully nonparametric. Having done all nonparametrically, any found difference cannot be explained by the choice of the (semi-)parametric wage model. One may argue that we missed other important factors. However, as gender is externally given, such objection only shifts the discussion to the definition of the gender wage gap. Second, instead of deriving a nonparametric Blinder-Oaxaca decomposition, we explore the heterogeneity of the wage gap over sectors and education. Third, we calculate the counterfactual wages of women as if they were paid like men, to compare these with their realized wages. One may say that in the parametric approach one could do something similar but with the simpler linear model. However, apart from the fact that this can be a poor predictor, its parameters are the result of least square projections inside one population. That means, the coefficients of the linear approximation of men's wages are not made for predicting the counterfactual wages of women. Consequently, they are inappropriate, except if the characteristics of women and their jobs were the same as for men.
The data set used is the EPA of 2014 of the National Institute for Statistics (INE) in Spain, provided at webpage http://www.ine.es. It consists of 64383 observations for women, and 108134 for men, giving the gross annual wage, which is our response variable Y . We only consider full-time employees, being aware that already the distribution of full-and part-time jobs might be discriminatory; that is, our findings are conditioned on the full-time employed population. We are provided with the following covariates: -Years and months of service, which is a quantitative variable that indicates the professional experience of the individual. Consider the two different populations; men (source) and women (target). We have one (pseudo-)continuous covariate (Years and months of service, denoted as X 0 in the source with density f 0 , X 1 in the target with density f 1 ). The other variables are categorical and denoted as Z 0 and Z 1 , respectively. We observe the men's Y 0 such that we can estimate their m(·). Under the hypothesis of no wage discrimination, women are paid by the same wage function, such that m(·) can be used to predict their wages Y 1 . Therefore, we are not interested in the optimal bandwidth for predicting men's mean wage function, but for predicting women's counterfactual wages, i.e., we need h 1 .
In order to account for the categorical variables CNAE and education nonparametrically, and because we want to explore the heterogeneity of the wage gap over these variables, we split the sample into the eighteen times seven subsamples. However, those with no studies were tiny; and it is not always clear what 'no studies' actually means. Therefore, these were not further considered. One may ask if this is necessary or whether educational levels may have similar distributions for men and women once we fixed the sector. Figure 5shows the differences in distribution of Studies between men and women for the examples of sectors F and R. Indeed, in some sectors like R the distributions looks somewhat more similar; but in most sectors they clearly do not. Therefore, we keep all information and split along sector and education. Notice that such split is equivalent to a nonparametric estimation with the full sample, and setting the smoothing parameter for these two variables to zero. We are left now in each subsample with wage and professional experience. For the overall average wage gap, one can still integrate over Z . Expressing all this in formulas looks as follows: Say Z is the categorical variable with r modes, X 0 , Z 0 , Y 0 is observed in the source population, X 1 , Z 1 in the target population. We want being estimated from the source population. Then, one predicts the average counterfactual wage of women byθ being the average counterfactual wage of women given z, i.e., for a specific educational level in a given sector. The estimator ofθ is a weighted average of theθ ẑ Certainly, it is up to the practitioner which of the r dimensions of Z (s)he wants to integrate out and which to keep. In our case study, we looked at all combinations but present here only the results of allθ z to explore the heterogeneity of the wage gap over educational level and sectors, see Figs. 6 and 7 . We plotted the absolute and relative wage gap, calculated as the (absolute and relative) difference between the average realized minus predicted wages per sector and educational level. In the Supplementary Material and in Barbeito (2020) are given many more details like all numerical numbers together with n 0,z , n 1,z and MSE estimates. Note that for each mode z (i.e., all combinations of sectors and education) one may want to calculate the optimal h B O OT ,z . So we did, and the values are given in Table 6 in the Supplementary Material. We used the Gaussian kernel and g z = h SJ n 4/45 0,z . We have compared these results to those provided by h 0 , given in expression (27), and a plug-in bandwidth minimizing the MASE (namely, h PI ). The bandwidths obtained were quite different. For instance, considering sector F (Construction) and level of studies 3, h 0 = 2.01, h 1 = 0.17, h PI = 5.26. Considering sector P (Education) and level of studies 5, h 0 = 2.96, h 1 = 8.05, h PI = 3.62. Not surprisingly, they suggest for those sectors accordingly different wage gaps compared to our method. Recalling the findings of the simulation study, we then trust more in the results provided by our method.
From Figs. 6 and 7, we see that for most of the educational levels and sectors, the observed salaries of women are between 10 and 30% lower than the wage equation obtained from men's salaries would predict. This calculus does not account for potential discrimination by gender regarding equal opportunities, for instance, whether it is harder for women to get into better paying sectors. The same holds for equal opportunities in education and professional experience, e.g., due to maternal leaves. That is, we only study the monetary discrimination in a given sector, once a certain educational level and professional experience was achieved. Discrimination in opportunities comes on top, but is harder to measure and often not translated into monetary terms. This limitation applies in general, and is not particular to our method.
Finally, a reviewer suggested to compare our results with those one obtains using the gam command from package mgcv in R calling Y=Z1+Z2+Z1Z2+s(x,by=Z1Z2), i.e., using continuous variable X and two factors (Z 1 ,Z 2 ). This was done using the estimated weightsf 1 X 0 i /f 0 X 0 i with both densities being estimated by kernel density estimation with their optimal bandwidths along Sheather and Jones (1991). This weighting tries to optimize the prediction similar to the CV modification in Galdo et al. (2008). As can be seen in Fig. 8, this gives by far a larger variability over sectors and education with little credible results like 'positive wage discrimination' of up to 40% and many cases of negative wage discrimination between 40 to almost 60%. Strangely, when we rerun this exercise without weights, these extremes remained but some figures changed. Note also that for these estimates, especially in the case of using estimated weights, we do not know how to obtain confidence intervals or standard errors. We guess that some bootstrap methods could be developed. A check of the manual reveals that the simulated confidence bands provided in R are not helpful here.

Conclusions and discussion
This paper contributes to the existing, large literature on bandwidth selection, by providing a selector designed for prediction and data-matching problems when the distribution of the covariates, X , in the target population is potentially different from the one in the source population. This is a standard problem in counterfactual exercises like causal analysis and studies on discrimination, but also quite frequent for the calculus of scenarios. It offers thereby an interesting alternative to Häggström and Luna (2014) and Galdo et al. (2008) which seem to be the only existing contributions in this direction so far. Our method has a closed form, is computationally attractive, asymptotically well understood and shows a very satisfying behavior in simulations, even for moderate and small sample sizes. Along the quite exemplifying problem of studying the gender wage gap in Spain after the last big economic crises, we do not only motivate our method but also illustrate its usefulness and application for such a highly relevant issue. Moreover, we show a first extension toward its use in a multivariate context. We succeed to carve out the heterogeneity and depth of this wage gap over different sectors and education levels. This application study is completed with some comparisons using alternative selectors or estimation methods.
There are certainly several interesting extensions thinkable, most of them having already been mentioned at some point in this paper. First note that the proposed selection procedure could be interesting for more general model selection problems in the context of prediction and data matching, not only for finding the bandwidth. Recall that modified cross-validation is frequently used for all kind of features selection, in classical kernel regression as well as in various recent machine learning procedures.
A second extension would be to explicitly account for potential boundary effects. Certainly, if the support of f 0 covers the support of f 1 such that all observations x i made in the target sample are interior points of f 0 or if the conditional expectation is relatively flat at the boundaries, such correction is not needed. Also, unless many x i of the target sample suffer strongly from boundary effects, the selector itself should hardly be affected as it just searches the minimum of the MASE which in turn is calculated from a comparison of two estimates that would both suffer from the same boundary effects. However, especially for the final estimate, an obvious remedy would be to use boundary kernels like the local linear (or 'equivalent') one of Jones (1993). For our selector, we had then to replace assumption (B1) by (B1') K is a second order kernel, a density for interior points, else a left, respectively, right boundary kernel. A careful check of the proof suggests that this would not change our results but make our presentation and implementation somewhat more cumbersome. For our application, we realized that it had hardly any impact for the above mentioned reasons ( f 0 spans typically over f 1 , etc.).
A third extension has already been studied in the thesis of Barbeito (2020), namely the one toward local linear estimators (Fan and Gijbels 1992). While the empirical results turned out to be alike to the simulation results for the Nadaraya-Watson, the expressions for the bootstrap version of the MASE become extremely complex, leading to a quite important loss of efficiency in the computation of the bandwidth selector.
A fourth extension would be allow for various continuous covariate simultaneously. In practice one could follow then the suggestion of Köhler et al. (2014) and use a multivariate kernel with a bandwidth matrix proportional to −1/2 X with X being the variance-covariance matrix of X in the source group, times a constant to be chosen along our criterion. This approach, however, needs some deeper examinations.
A fifth extension would be to use our selector if one wants to reveal the distribution of the unobserved Y for the target group similar to Dai et al. (2016). This is not only interesting for analyzing poverty and vulnerability; we noticed that also in our discrimination study, in some sectors the residual variance was still pretty large. For those sectors, one would like to compare real with counterfactual distributions, not just the mean.

dx
Similarly, we can state a result for the differenceB g − B.
Lemma 4 Given the expressions forB g and B, thenB g − B consists of a sum of 60 terms similar to those in expression (34). Specifically, where k 1 = 66. The functions ν(x) and the values of r , , a and [s] are collected in Tables 1-4 in the Supplementary Material. Additionally, term B 1 is of order O r 1,n 0 with r 1,n 0 being also given in the Supplementary Material.
As an immediate consequence of the Tchebycheff inequality, we can conclude: