Abstract
We consider inference for linear regression models estimated by weighted-average least squares (WALS), a frequentist model averaging approach with a Bayesian flavor. We propose a new simulation method that yields re-centered confidence and prediction intervals by exploiting the bias-corrected posterior mean as a frequentist estimator of a normal location parameter. We investigate the performance of WALS and several alternative estimators in an extensive set of Monte Carlo experiments that allow for increasing complexity of the model space and heteroskedastic, skewed, and thick-tailed regression errors. In addition to WALS, we include unrestricted and fully restricted least squares, two post-selection estimators based on classical information criteria, a penalization estimator, and Mallows and jackknife model averaging estimators. We show that, compared to the other approaches, WALS performs well in terms of the mean squared error of point estimates, and also in terms of coverage errors and lengths of confidence and prediction intervals.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Many empirical studies in economics assume that the data are generated by a linear regression model where a distinction is made between ‘focus regressors’ and ‘auxiliary regressors’. The focus regressors are included because we believe the model is not credible without them or because they are the subject of our investigation, while the number and the identity of the auxiliary regressors is less certain. The parameters of primary interest are the coefficients on the focus regressors (the ‘focus parameters’), while the coefficients on the auxiliary regressors are treated as nuisance parameters. Instead of a single model for the data generating process (DGP), there is a ‘model space’ containing a finite but potentially large number of models, namely the unrestricted model that includes all auxiliary regressors, the fully restricted model that includes none, and all intermediate models. Adding auxiliary regressors tends to reduce omitted variable bias in estimating the focus parameters, but tends to increase sampling variability. Examples include studies concerning the determinants of economic growth (Sala-i-Martin et al. 2004; Magnus et al. 2010), risk premia (Sousa and Sousa 2019), product and labor market reforms (Duval et al. 2021), the impact of legalized abortion on crime (Donohue and Levitt 2001), and the relationship between body mass and income (Dardanoni et al. 2011).
Model uncertainty can be approached via ‘model selection’ or via ‘model averaging’. In the model selection approach we attempt to find the ‘best’ model given the data, the model space, and a specific purpose (e.g., estimation of particular parameters or prediction of future outcomes). Given this best model, one then employs its estimates for the intended purpose. Like any other data-driven statistical decision, model selection is subject to sampling uncertainty which, if ignored, can lead to overestimate accuracy (Kabaila and Mainzer 2018). Typical examples are the classical pre-test estimator and post-selection estimators that select the model with the smallest value of some information criterion, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). More recently, considerable attention has been devoted to penalization estimators based on model sparsity and an absolute penalty criterion, such as the least absolute shrinkage and selection operator (LASSO), which address the sampling uncertainty problem by performing variable selection and regularization at the same time. These estimators typically require the choice of some ‘tuning’ parameters that control the trade-off between bias and variance. They also tend to be biased and to have nonstandard sampling distributions, so that inference based on the normal approximation can be misleading (Knight and Fu 2000; Claeskens and Hjort 2008).
The second approach is model averaging. In contrast to model selection, one is not concerned with finding a ‘best’ model but with finding a ‘best’ estimator of the focus parameters or a ‘best’ predictor of the outcome. The (well-established) terminology is a little confusing because we don’t average over models but over estimators. In fact, one takes a weighted average of the estimators from all the available models, with data-dependent weights to account for the uncertainty associated with each model. There are many proposed model averaging estimators, typically obtained either from a Bayesian perspective (Bayesian model averaging: BMA) or from a frequentist perspective (frequentist model averaging: FMA). BMA weights can be interpreted as posterior model probabilities, while FMA weights are decreasing functions of some measure of predictive inaccuracy, such as Mallows’ \(C_p\) (Hansen 2007) or leave-one-out cross-validation (Hansen and Racine 2012). There also exist Bayesian-frequentist ‘fusions’, such as weighted-average least squares (WALS), introduced by Magnus et al. (2010), which is frequentist but with a Bayesian flavor. We refer to Steel (2020) for an extensive survey of the various types of model averaging estimators and their use in economics. Like for model selection estimators, most of these estimators tend to be biased and their sampling distribution is not well approximated by the normal distribution. Furthermore, there is increasing evidence that, even after correcting for bias, inference for model averaging estimators can be misleading if based on the normal approximation (see, among others, Claeskens and Hjort 2008; Hansen 2014; Liu 2015; and DiTraglia 2016).
The finite-sample bias and variance of WALS have recently been analyzed by De Luca et al. (2021), who exploit results on the frequentist properties of the Bayesian posterior mean in a normal location model. The current paper extends their results to inference by proposing a simulation-based approach that yields re-centered confidence and prediction intervals using the bias-corrected posterior mean as a frequentist estimator of the normal location parameter. We assess its finite-sample performance by an extensive Monte Carlo experiment. To facilitate comparisons with the simulation study by Zhang and Liu (2019), we stay close to their framework and consider a finite model space that contains the true data-generating process (M-closed environment) but has little additional structure. Unlike Zhang and Liu (2019), who restrict attention to inference about a single auxiliary parameter, we consider inference about a single focus parameter, interpreted as the causal effect of a policy or intervention in the presence of a potentially large number of auxiliary parameters. This is likely to be the most interesting case for applied economists. We compare the performance of WALS point estimates and confidence intervals with the performance of several competing approaches, including least squares estimators for the unrestricted and fully restricted models, post-selection estimators based on AIC and BIC, Mallows and jackknife model averaging estimators, and one version of the LASSO (the adaptive LASSO). In addition, we discuss prediction intervals for the outcome of interest, which involves linear combinations of all focus and auxiliary parameters. The main conclusion of our Monte Carlo experiment is that, compared to other estimators, the coverage errors for WALS are small and confidence and prediction intervals are short, centered correctly, and allow for asymmetry. They are also easy and fast to compute.
The remainder of this paper is organized as follows. Section 2 introduces the framework and briefly describes the estimators that we consider. Section 3 discusses how to construct confidence intervals for a single parameter of interest. Section 4 describes the Monte Carlo experiment. Sections 5–7 contain the simulation results, separately for point estimates (Sect. 5), confidence intervals (Sect. 6), and prediction intervals (Sect. 7). Section 8 concludes. There are two appendices. Appendix A formalizes the nine estimators introduced in Sect. 2, while Appendix B describes the algorithm for simulation-based WALS confidence intervals.
2 Framework and Estimators
Our framework is the linear regression model
where \(\varvec{y}\) \((n \times 1)\) is the vector of observations on the outcome of interest, \(\varvec{X}_{1}^{}\) \((n \times k_{1}^{})\) and \(\varvec{X}_{2}^{}\) \((n \times k_{2}^{})\) are matrices of nonrandom regressors, \(\varvec{\beta }_{1}^{}\) and \(\varvec{\beta }_{2}^{}\) are unknown parameter vectors, and \(\varvec{\epsilon }\) is a vector of random disturbances. The \(k_{1}^{}\) columns of \(\varvec{X}_{1}^{}\) contain the ‘focus regressors’ which we want in the model on theoretical or other grounds, while the \(k_{2}^{}\) columns of \(\varvec{X}_{2}^{}\) contain the ‘auxiliary regressors’ of which we are less certain. These auxiliary regressors could be controls that are added to avoid omitted-variable bias or transformations and interactions of the set of original regressors to allow for nonlinearities. We assume that \(k_{1}^{} \ge 1\), \(k_{2}^{} \ge 0\), and that \(\varvec{X}=(\varvec{X}_{1}^{},\varvec{X}_{2}^{})\) has full column-rank \(k=k_{1}^{}+k_{2}^{} \le n\). The disturbance vector \(\varvec{\epsilon }\) has zero mean and a positive definite variance matrix, diagonal but not necessarily scalar. The DGP thus allows for nonnormality and heteroskedasticity.
Table 1 lists the nine estimators of \(\varvec{\beta }=(\varvec{\beta }_1',\varvec{\beta }_2')'\) that we consider in this paper. Except for LS-R and WALS, all other estimators also appear in Zhang and Liu (2019). In the remainder of this section, we describe briefly the various estimators with some emphasis on WALS. Appendix A provides a more detailed description of all estimators.
Our first two estimators are the least squares (LS) estimators of \(\varvec{\beta }\) in the unrestricted model that includes all auxiliary regressors and the fully restricted model that includes none. We shall refer to these two estimators as the unrestricted LS (LS-U) estimator and the fully restricted LS (LS-R) estimator, respectively. Under an \({\mathcal {M}}\)-closed environment, the LS-U estimator is unbiased but is likely to have a large variance, especially when the sample size is small, the number of auxiliary variables is large, and the regressors are highly correlated. The LS-R estimator is subject to omitted variable bias when \(\varvec{X}_1' \varvec{X}_2\ne \varvec{0}\) and \(\varvec{\beta }_2 \ne \varvec{0}\), but has a smaller variance than the LS-U estimator under homoskedastic errors. These estimators require neither model selection nor model averaging as they rely on two ad hoc specifications of the unknown DGP.
When we account explicitly for uncertainty about the auxiliary regressors, the model space contains \(J=2_{}^{k_{2}^{}}\) possible models. Model selection tries to find a single ‘best’ model based on a specific criterion, while model averaging takes a weighted average of the estimators from all the models in the model space. For example, if \(\widehat{\varvec{\beta }}_{1j}^{}\) and \(\widehat{\varvec{\beta }}_{2j}^{}\) are the LS estimators of \(\varvec{\beta }_{1}^{}\) and \(\varvec{\beta }_{2}^{}\) in model j, then a model averaging estimator takes the form
where the \(\lambda _{j}^{}\) are nonnegative data-dependent model weights that add up to one. Even for moderate values of \(k_{2}^{}\) the computational burden of calculating \(2_{}^{k_{2}^{}}\) estimates and the associated model weights can be substantial.
One possibility is to reduce the number of models by preordering, as suggested by Hansen (2007). If we can order the auxiliary regressors a priori, then we only need to consider \(k_2^{}+1\) nested models, with model p containing the focus regressors and the first p auxiliary regressors. Except for a few cases in which the auxiliary regressors admit a natural preordering (e.g., polynomial regression models), the question of how we should order the auxiliary regressors is difficult to answer, and if we use preliminary regressions to order the regressors then the statistical noise generated by these preliminary investigations should not be ignored.
Two common model selection strategies are based on information criteria such as AIC and BIC. AIC and BIC are known to represent two extreme strategies favoring, respectively, more and less complicated model structures. The IC-A and IC-B post-selection estimators are the LS estimators in the models with the smallest \(\text {AIC}\) and \(\text {BIC}\) respectively. As implemented in Zhang and Liu (2019), these estimators require preordering and the assumption of homoskedastic errors. There is no model averaging here, only model selection.
The adaptive LASSO (ALASSO) estimator, proposed by Zou (2006), does not rely on preordering. It solves a penalized LS problem with a penalty on the weighted sum of the absolute values of the estimated components of the full vector \(\varvec{\beta }\) and weights that depend on the LS-U estimates and a tuning parameter selected by generalized cross-validation. Following Zhang and Liu (2019), this version of the ALASSO estimator does not distinguish between focus and auxiliary regressors.
The Mallows model averaging (MMA) estimator was introduced by Hansen (2007). Although it can be applied to the full model space consisting of \(2_{}^{k_2^{}}\) models, it is typically based on preordering in order to reduce the computational burden. This estimator is asymptotically efficient in the mean squared error (MSE) sense when the errors in (1) are homoskedastic (Hansen 2007; Wan et al. 2010; Zhang 2021).
The jackknife model averaging (JMA) estimator, introduced by Hansen and Racine (2012) , is also generally based on preordering but allows for heteroskedasticity. Under homoskedasticity it has the same (nonstandard) limiting distribution as MMA (Zhang and Liu 2019, p. 824) and it remains asymptotically efficient under heteroskedasticity (Hansen and Racine 2012; Zhang 2021). The modified JMA (JMA-M) estimator, introduced by Zhang and Liu (2019), is similar but is defined by weights that minimize a penalized cross-validation criterion.
The weighted-average least squares (WALS) estimator was introduced by Magnus et al. (2010) and reviewed by Magnus and De Luca (2016). Unlike other model averaging estimators, the WALS approach exploits a preliminary transformation of the auxiliary regressors that reduces the computational burden from order \(2^{k_2}\) to order \(k_2\) and leads to other important simplifications. In particular, after this transformation, model (1) may equivalently be written as
where \(\varvec{Z}_{2}' \varvec{M}_{1}^{} \varvec{Z}_{2}^{}\) is equal to the identity matrix of order \(k_2^{}\). The WALS estimator \(\widehat{\varvec{\gamma }}=(\widehat{\varvec{\gamma }}_{1}', \widehat{\varvec{\gamma }}_{2}')\) of \(\varvec{\gamma }=(\varvec{\gamma }_{1}^{}, \varvec{\gamma }_{2}^{})\) is a weighted average of the LS estimators of \(\varvec{\gamma }\) over the J models in the model space.
From Theorem 2 of Magnus and Durbin (1999), the MSE of \(\widehat{\varvec{\gamma }}_{1}^{}\) depends on the MSE of \(\widehat{\varvec{\gamma }}_{2}^{}\). Thus, if we can choose the model weights optimally such that \(\widehat{\varvec{\gamma }}_{2}^{}\) is a ‘good’ estimator of \(\varvec{\gamma }_{2}^{}\) (in the MSE sense), the same weights will also provide a ‘good’ estimator of \(\varvec{\gamma }_{1}^{}\). Moreover, the dependence of \(\widehat{\varvec{\gamma }}\) on the estimators from all possible models is completely captured by a random diagonal matrix \(\varvec{W}\), whose \(k_2^{}\) diagonal elements are partial sums of the model weights \(\lambda _j^{}\) in (2). It follows that we can restrict attention to the WALS estimator of \(\varvec{\gamma }_2\), whose computational burden is of order \(k_{2}^{}\) as we need to determine only the diagonal elements of \(\varvec{W}\), not the full set of model weights.
The components of \(\widehat{\varvec{\gamma }}_{2}^{}\) are shrinkage estimators of the components of \(\varvec{\gamma }_{2}^{}\). Under the assumption of homoskedastic normal errors in (1) and the additional restriction that the hth diagonal element of the matrix \(\varvec{W}\) depends only on the hth component of the LS-U estimator of \(\varvec{\gamma }_{2}^{}\), our shrinkage estimators are also independent. The initial \(k_{2}^{}\)-dimensional problem then reduces to \(k_{2}^{}\) identical one-dimensional problems, namely: given a single observation x from the normal location model \({{\mathcal {N}}}(\eta , \sigma _{}^2)\), what is the estimator m(x) of \(\eta \) with minimum MSE? Since the risk properties of m(x) are little affected by estimating the variance parameter (Danilov 2005), we assume that \(\sigma _{}^2\) is known.
A Bayesian approach to the above problem requires two elements: a normal location model for the independently and identically distributed (i.i.d.) elements \(\{x_h^{}\}\) of the vector of t-ratios \(\varvec{x}=\widehat{\varvec{\gamma }}_{2,u}^{}/s_u^{}\), where \(s_u^2\) is the unbiased LS estimator of the error variance; and a prior distribution for the i.i.d. elements \(\{\eta _h^{}\}\) of the vector of ‘theoretical’ t-ratios \(\varvec{\eta }= {\varvec{\gamma }}_{2}^{}/\sigma _u^{}\). For a proper treatment of admissibility, robustness, near-optimality in terms of minimax regret, and ignorance about \(\eta _h^{}\), we select a prior that is symmetric, leads to bounded risk, and satisfies the ‘neutrality condition’ \({\mathbb {P}}[|\eta _h^{}|< 1] = 1/2\). The Bayesian approach to the normal location problem then yields the posterior mean \(m_{h}^{}=m(x_{h}^{})\) as an estimator of \(\eta _{h}^{}\), from which the WALS estimators of \(\varvec{\gamma }_{1}^{}\) and \(\varvec{\gamma }_{2}^{}\) and therefore of \(\varvec{\beta }_{1}^{}\) and \(\varvec{\beta }_{2}^{}\) are easily derived (see Appendix A for the details).
The mixture of Bayesian and frequentist approaches requires special attention when assessing the sampling properties of our model averaging estimator. First, for a prior which is symmetric around zero, the posterior mean \(m_{h}^{}\) suffers from attenuation bias, so \(\widehat{\varvec{\beta }}_{1}^{}\) and \(\widehat{\varvec{\beta }}_{2}^{}\) are in general biased estimators of \(\varvec{\beta }_1^{}\) and \(\varvec{\beta }_2^{}\). Second, for any nonnegative bounded prior density, the posterior variance of \(\eta _h^{}\) represents a first-order approximation to the sampling standard deviation (not the sampling variance) of the posterior mean \(m_h^{}\).
De Luca et al. (2021) presented Monte Carlo tabulations of the bias and variance of \(m_h^{}\) under three neutral priors: Laplace, Weibull, and Subbotin. For each prior considered, they also compared two alternative plug-in estimators of these sampling moments of \(m_h\): the frequentist maximum likelihood (ML) estimator and the Bayesian double shrinkage (DS) estimator. Based on these plug-in estimators, they derived new estimators for the sampling bias and variance of the WALS estimator. This paper investigates the implications of their findings for the construction of WALS confidence and prediction intervals.
3 Confidence Intervals
We concentrate on \((1-\alpha )\)-level confidence intervals for the lth component \(\beta _{l}^{}\) of \(\varvec{\beta }\), which could be either a focus or an auxiliary parameter. All confidence intervals take the form
where \(\check{\beta }_{l}^{}\) is an estimator of \(\beta _{l}\) and the quantities \({\underline{c}}_{l}\) and \({\overline{c}}_{l}\) are chosen to attain the desired coverage level. If \({\underline{c}}_l^{} = {\overline{c}}_l^{}\), the interval is called symmetric. We consider sixteen types of confidence intervals — ten from Zhang and Liu (2019) and six based on WALS — that differ depending on the choice of \(\check{\beta }_{l}^{}\), \({\underline{c}}_{l}\), and \({\overline{c}}_{l}\).
LS-U and LS-R: \(\check{\beta }_{l}^{}\) is either the LS-U or the LS-R estimator and \({\underline{c}}_{l}^{} = {\overline{c}}_{l}^{} = z_{1-\alpha /2}^{}\,s_l^{}\), where \(z_{1-\alpha /2}^{}\) is the \((1 - \alpha /2)\)th quantile of the standard normal distribution and \(s_l^{}\) is the standard error of \(\check{\beta }_{l}^{}\).
IC-A and IC-B: \(\check{\beta }_{l}^{}={\widehat{\beta }}_{l}^{}({\widehat{p}})\) and \({\underline{c}}_{l}^{} = {\overline{c}}_{l}^{} = z_{1-\alpha /2}^{}\,s_l^{}\), where \({\widehat{\beta }}_{l}^{}({\widehat{p}})\) is the LS estimator in the model with the smallest AIC or BIC, \({\widehat{p}}\) is the number of auxiliary regressors in the selected model, and \(s_l^{}\) is the standard error of \(\check{\beta }_{l}\). Zhang and Liu (2019) call these confidence intervals ‘naive’ because they ignore model selection noise.
ALASSO: \(\check{\beta }_{l}^{}\) is the ALASSO estimator and \({\underline{c}}_{l}^{} = {\overline{c}}_{l}^{} = n^{-1/2} q_{l}^*(\alpha )\), where \(q_{l}^*(\alpha )\) is the \(\alpha \)th quantile of the conditional distribution of \(\vert \sqrt{n}(\check{\beta }_{l}^* - \check{\beta }_{l}^{}) \vert \) given the data and \(\check{\beta }_{l}^*\) is the ALASSO estimate from a bootstrap sample. These confidence intervals and rely on the asymptotic validity of the bootstrap for the ALASSO estimator (Chatterjee & Lahiri, Chatterjee and Lahiri (2011); Camponovo 2015).
MMA: \(\check{\beta }_{l}^{}\) is the MMA estimator and we consider two alternative approaches to the choice of \({\underline{c}}_{l}\) and \({\overline{c}}_{l}\). In the bootstrap approach (MMA-B) we set \({\underline{c}}_{l} = {\overline{c}}_{l} = n^{-1/2} q_{l}^*(\alpha )\), where \(q_{l}^*(\alpha )\) is the \(\alpha \)th quantile of the bootstrap distribution of \(\vert \sqrt{n}(\check{\beta }_{l}^* - \check{\beta }_{l}) \vert \) and \(\check{\beta }_{l}^*\) is the MMA estimate from a bootstrap sample, while in the simulation-based approach (MMA-S) we set \({\underline{c}}_{l} = n^{-1/2} q_{l}(1 - \alpha /2)\) and \({\overline{c}}_{l} = - n^{-1/2} q_{l}(\alpha /2)\), where \(q_{l}(\alpha )\) is the \(\alpha \)th quantile of the simulated asymptotic distribution of the estimator based on Zhang and Liu (2019, Theorem 2). The first interval is symmetric, the second is not.
JMA: \(\check{\beta }_{l}\) is the JMA estimator and we again consider two alternative approaches to the choice of \({\underline{c}}_{l}\) and \({\overline{c}}_{l}\). In the first (JMA-B) we set \({\underline{c}}_{l}^{} = {\overline{c}}_{l}^{} = n_{}^{-1/2} q_{l}^*(\alpha )\), where \(q_{l}^*(\alpha )\) is the \(\alpha \)th quantile of the bootstrap distribution of \(\vert \sqrt{n}(\check{\beta }_{l}^* - \check{\beta }_{l}^{}) \vert \) and \(\check{\beta }_{l}^*\) is the JMA estimate from a bootstrap sample, while in the second (JMA-S) we set \({\underline{c}}_{l}^{} = n_{}^{-1/2} q_{l}^{}(1 - \alpha /2)\) and \({\overline{c}}_{l} = - n_{}^{-1/2} q_{l}^{}(\alpha /2)\), where \(q_{l}^{}(\alpha )\) is based on Zhang and Liu (2019, Theorem 4).
JMA-M: \(\check{\beta }_{l}\) is the JMA-M estimator and \({\underline{c}}_{l} = {\overline{c}}_{l} = z_{1-\alpha /2}\,s_{l}^*\), where \(s_{l}^*\) is the standard error in the ‘just-fitted’ model, that is, the model obtained from the ordered sequence of models by deleting all redundant regressors at the end of the sequence.Footnote 1 Symmetry of these intervals is justified by the asymptotic normality of the JMA-M estimator (Zhang and Liu 2019, Theorem 5).
WALS: We consider three different methods for constructing confidence intervals, namely uncentered-and-naive (UN), centered-and-naive (CN), and simulation-based (S). The algorithm underlying the last two methods is presented in Appendix B.
In the UN method, \(\check{\beta }_{l}\) is the WALS estimator \({{\widehat{\beta }}}_l^{}\), \({\underline{c}}_{l}^{} = {\overline{c}}_{l}^{} = z_{1-\alpha /2}^{}\,s_l^{}\), and \(s_l^{}\) is either the plug-in ML estimator or the plug-in DS estimator of the standard error of \({{\widehat{\beta }}}_l^{}\). The resulting intervals take the classical normal approximation to the sampling distribution of \({{\widehat{\beta }}}_{l}^{}\) at face value and neglect the bias of the WALS estimator.
In the CN method, \(\check{\beta }_{l}\) is the bias-corrected WALS estimator, \(\check{\beta }_{l} = {{\widehat{\beta }}}_{l}^{} - b_l^{}\), where \(b_l^{}\) is either the plug-in ML estimator or the plug-in DS estimator of the bias of \({{\widehat{\beta }}}_l^{}\). As in the UN method, \({\underline{c}}_{l}^{} = {\overline{c}}_{l}^{} = z_{1-\alpha /2}^{}\,s_l^{}\), but now \(s_l^{}\) depends on the bias-corrected WALS estimator and is computed by the simulation-based algorithm discussed in Appendix B. The CN method again takes the classical normal approximation at face value but re-centers to correct for estimation bias and accounts for randomness in the estimated bias.
The S method also yields re-centered confidence intervals by using the bias-corrected posterior mean as an estimator of the normal location parameter, and accounts for its randomness by exploiting a large set of pseudo-random Monte Carlo draws. However, since it does not require critical values from the normal distribution, its confidence intervals are not necessarily symmetric.
4 Monte Carlo Design
Our setup closely follows Zhang and Liu (2019) with some exceptions explained later in this section. We have \(k_1=2\) focus regressors: \(\varvec{x}_{11}\) (the constant term) and \(\varvec{x}_{12}\); and \(k_2\) auxiliary regressors: \(\varvec{x}_{21},\dots ,\varvec{x}_{2k_2}\). Our parameter of interest is the coefficient \(\beta _{12}\) on \(\varvec{x}_{12}\), which may be interpreted as the causal effect of \(\varvec{x}_{12}\) on \(\varvec{y}\).
The \(k_2+1\) regressors \(\varvec{x}_{12},\varvec{x}_{21},\dots , \varvec{x}_{2k_2}\) are drawn from a multivariate normal distribution with mean zero and variance \(\sigma _x^2\,\varvec{\Sigma }_x^{}(\rho )\), where
with \(-1/k_2< \rho <1\). We set \(\sigma _x^2 = \rho = 0.7\).
The error term is generated by \(\epsilon _i = \sigma _i u_i\), where the \(u_i\) are independently distributed following either a standard normal distribution or a skewed \(t^*\)-distribution \(t^*(\mu ,\sigma ,d,\lambda )\) with mean \(\mu \), variance \(\sigma ^2\), d degrees of freedom and skewness parameter \(|\lambda |<1\) (defined for \(d>3\)). In addition to the standard normal distribution, we consider three skewed \(t^*\)-distributions with \(\mu =0\) and \(d=5\): (i) the standard t(5)-distribution, which is obtained by setting \(\sigma = \sqrt{5/3}\) and \(\lambda =0\), (ii) a distribution with moderate positive skewness (\(\sigma =1\) and \(\lambda =0.5\)), and (iii) a distribution with large positive skewness (\(\sigma =1\) and \(\lambda =0.8\)). We also consider four homoskedastic and four heteroskedastic error distributions, as shown in Table 2. In the homoskedastic cases we take \(\sigma _i=2.5\) when the distribution of \(u_i\) has variance one. For the standard t(5)-distribution the variance is 5/3, so we need the correction factor \(2.5/\sqrt{5/3}=\sqrt{15/4}\). In the heteroskedastic cases we define
where \(x_{12}^{(i)}\) and \(x_{21}^{(i)}\) respectively denote observation i on the second focus regressor and the first auxiliary regressor, and the scaling is chosen such that \({{\mathbb {E}}}[\tau _i]=1\) for all i.
Setting \(k_2=8\), we have \(2^{k_{2}^{}}=256\) possible models that include the two focus regressors and a subset of the eight auxiliary regressors. We fix \(\varvec{\beta }_1=(1,1)'\) and consider four configurations of the eight auxiliary parameters, as shown in Table 3.
Our setup is intentionally similar to that in Zhang and Liu (2019) with three important exceptions:
-
Our parameter of interest is one of the focus parameters, not one of the auxiliary parameters, because it is focus parameters that we are primarily interested in.
-
Zhang and Liu (2019) ignore the possibility of skewness in the error distribution. In fact, of the eight cases in Table 2 they only consider two: homoskedastic under normality (case 1) and heteroskedastic under a t-distribution (case 6). In the heteroskedastic setup we take 5 rather than 4 degrees of freedom, so as to ensure the existence of both skewness and kurtosis. In addition, our scaling in design 6 gives \({{\mathbb {E}}}[\sigma _i]=2.5/\sqrt{{{\mathbb {V}}}[t(5)]}\approx 1.94\) thus ensuring comparability with the other designs, whereas in the case considered by Zhang and Liu (2019) we would have \({{\mathbb {E}}}[\sigma _i]\approx 3.23\). Finally, we let \(\tau _i\) depend on one focus and one auxiliary regressor (instead of two auxiliary regressors).
-
To the three cases (a)–(c) in Table 3, we have added case (d) to show what can happen when the preliminary ordering is poor. As in case (b), the auxiliary regressors with nonzero coefficients enter with a decreasing order of importance as measured by the magnitude of their coefficients (since we set \(|\xi |<1\)). In addition, case (d) implies that all submodels in the preordered sequence of \(k_2+1\) nested models (except for the unrestricted model) are subject to omitted-variable bias.
We set \(\xi =0.5\) and consider sample sizes of \(n=100\) and \(n=400\). By combining the eight specifications of the regression error in Table 2 with the four configurations of the auxiliary parameters in Table 3, we obtain 32 simulation designs for \(n=100\) and 32 simulation designs for \(n=400\). Using 5,000 Monte Carlo replications for each design (instead of 500 replications as in Zhang and Liu 2019), we compute the bias, variance, and MSE of the nine estimators discussed in Sect. 2: LS-U, LS-R, IC-A, IC-B, ALASSO, MMA, JMA, JMA-M, and WALS. The LS-U, LS-R and WALS estimators are implemented in Stata, the other estimators in MATLAB.Footnote 2 Since WALS has been shown to be robust to different choices of the prior (De Luca et al. 2018, 2021), we focus on the Laplace prior to exploit its computational advantages in computing the posterior mean.
5 Monte Carlo Results: Point Estimates
In this and the next two sections we present the results of the Monte Carlo experiment in a number of graphs. The current section discusses point estimates; confidence intervals and prediction intervals are discussed in Sects. 6 and 7 , respectively.
In Figs. 1 and 2 we present the first two sampling moments of the nine estimators for \(n=100\). The sixteen plots in Fig. 1 represent the homoskedastic designs, the sixteen plots in Fig. 2 the heteroskedastic designs. Each plot contains the squared bias–variance decomposition of the MSE of the nine estimators and, in addition, two ‘iso-MSE’ lines, which consist of all points with the same MSE as the LS-U estimator (red dash-dotted line) and the WALS estimator (blue dashed line). Design 1a refers to distribution 1 (normal, homoskedastic) and configuration (a), and so on, as described in Tables 2 and 3 .
The similarity of the sixteen plots in Fig. 1 is remarkable. The LS-U, LS-R, ALASSO, and WALS estimators are not affected by preordering, hence their moments and MSEs are the same across configurations. This is not the case for the other five estimators, IC-A, IC-B, MMA, JMA, and JMA-M, for which the effect of preordering can be substantial (comparing across rows), but the effect of nonnormality (skewness and excess kurtosis) appears to be small (comparing across columns). The LS-R estimator has a large bias which dominates the small variance, and hence its MSE is large. ALASSO has a small bias but a large variance, hence a large MSE. The MSE is also large for IC-B based on the BIC criterion because of its large bias, especially in configurations (b) and (d) where the ordering is unfavorable. The IC-A estimator based on the AIC criterion behaves about the same as the LS-U estimator in configurations (a) and (c), but considerably worse in configurations (b) and (d). As predicted by the asymptotic theory, MMA (Mallows) and JMA (jackknife) perform essentially the same under homoskedasticity and are indistinguishable in the figure, but again their performance deteriorates when the preordering is unfavorable. Unlike Zhang and Liu (2019), we find that JMA is 7–14% more efficient relative to JMA-M (in MSE sense) in the sixteen designs of Fig. 1.
The dominating estimator is WALS, whose bias is more than offset by a much smaller variance, thus capturing the essence of model averaging. The efficiency of WALS relative to the next-best JMA estimator is about 12% in configurations (a) and (c), 23% in configuration (b) and 31% in configuration (d). The MSE of WALS is 0.23–0.24 depending on the error distribution, hence showing considerable robustness to violations of the normality assumption, probably because \(n=100\) is already large enough to justify asymptotic approximations.
Now consider the case of heteroskedastic errors, still for \(n=100\), as plotted in Fig. 2. Averaging over all estimators and all designs, this leads to a deterioration of the MSE by about 30% but does not change the ordering of estimators. Contrary to what the asymptotic theory predicts, MMA is 2% more efficient than JMA under heteroskedasticity. WALS remains the preferred estimator in terms of MSE.
When the sample size increases, things change. Since we work in an M-closed environment and the number of models remains fixed, the LS-U estimator remains unbiased and its variance and MSE decrease at the rate of 1/n. So eventually it dominates all other estimators unless we also let \(k_2^{}\) increase.
Fig. 3 only presents designs 1 and 5 because the t- and skewed \(t^*\)-distributions produce moments that are almost identical. For example, in the homoskedastic case the MSE ranges from 0.066 to 0.070 for LS-U and from 0.083 to 0.087 for WALS, while in the heteroskedastic case it ranges from 0.095 to 0.101 for LS-U and from 0.110 to 0.115 for WALS. When n increases from 100 to 400, one would expect the variance to decrease by about 75%, and this is more or less what happens. Averaged over all estimators, the variance decreases by about 73% in both the homoskedastic and the heteroskedastic cases. The (absolute) bias also decreases but at a lower speed. The LS-U estimator is unbiased, while the bias of the LS-R estimator does not change with n. Averaging over the remaining estimators we find a decrease of the absolute bias of about 35% in the homoskedastic case and 29% in the heteroskedastic case. The decrease in absolute bias of the WALS estimator is particularly slow. The resulting MSE decreases by about 60% averaged over all estimators, both under homoskedasticity and heteroskedasticity. The preferred estimator is now the LS-U estimator, with ALASSO as second-best and WALS as third-best. These three estimators are not influenced by the order of the auxiliary variables. For the other estimators (except LS-R which clearly performs badly) a poor choice of preordering may lead to poor behavior of the estimator.
Let us now extend our design in four directions. First, we consider not only \(n=100\) and \(n=400\) but also two intermediate values 200 and 300. Second, we extend the number of auxiliary variables from \(k_2=8\) to \(16, 24, 32, \dots , 64\) by setting \(\varvec{\beta }_2=(\xi ,\xi ^2,\dots ,\xi ^{k_2/2},0,0,\dots ,0)'\). Third, we consider not only \(\xi =0.5\) but also \(\xi =-0.5\), so that we allow for both positive and negative influences or, what is the same, for positive and negative correlations between the regressors. Fourth, in addition to \(\sigma _x^2=\rho =0.7\) (high correlation), we also consider \(\sigma _x^2=\rho =0.3\) (low correlation). In total, our second Monte Carlo experiment includes 128 simulation designs for the different combinations of n, \(k_2\), \(\xi \), and \(\rho \), each with 5,000 Monte Carlo replications. To simplify the presentation, we restrict ourselves to homoskedastic normal errors and two estimators: LS-U and WALS.
Fig. 4 considers the efficiency of the WALS estimator relative to the LS-U estimator, i.e. the ratio of the MSE of LS-U and WALS. Theory predicts that, in every setup, WALS will dominate LS-U when n is ‘small’ and LS-U will dominate WALS when n is ‘large’. The question is where to draw the line between small and large. We see that LS-U dominates when n is larger than about 250. But when \(\xi \) is negative or the correlation is small, WALS also dominates LS-U for large values of n, certainly larger than 400. As expected, we also see that an increase in \(k_2^{}\) increases the efficiency of WALS relative to LS-U.
6 Monte Carlo Results: Confidence Intervals
We now consider confidence intervals for \(\beta _{12}\) of the form (4) with nominal coverage probability of (at least) \(1 - \alpha \). We compare the sixteen methods discussed in Sect. 3. The confidence intervals for ALASSO, MMA-B, and JMA-B are based on 499 bootstrap replications, those for MMA-S and JMA-S are based on 499 Monte Carlo replications, those for WALS (DS-S, ML-S, DS-CN, and ML-CN) on 5,000 Monte Carlo replications. For given \(\alpha \), we calculate \(\check{\beta }_{12}\), \({\underline{c}}_{12}(\alpha )\), and \({\overline{c}}_{12}(\alpha )\) for each method and each replication of the 32 simulation designs. We then obtain the coverage probability and the length of the interval by averaging over the 5,000 Monte Carlo replications for each simulation design.
Figs. 5 and 6 summarize the simulation results for \(n=100\) and \(n=400\), respectively. Both figures contain 16 panels, one for each method. On the horizontal axis we plot the coverage probabilities for the three values of \(\alpha \): 10% (red long-dashed line), 5% (green dashed line), and 1% (blue dash-dotted line). The lengths of the intervals are plotted on the vertical axis. Since there are 32 designs (labeled 1a–8d), there are 32 points in each panel for each level of \(\alpha \) (marked as triangles for \(\alpha =10\%\), squares for \(\alpha =5\%\), and circles for \(\alpha =1\%\)). The markers are full for the homoskedastic designs and empty for the heteroskedastic designs. Not all points are visible because many overlap, but what really matters is how much the coverage probabilities differ from their nominal levels and how short the confidence intervals are.
Regarding the coverage probabilities we see that there are five methods that produce accurate coverage probabilities, namely LS-U and the four centered versions of WALS: centered-and-naive (WALS-DS-CN and WALS-ML-CN) and simulation-based (WALS-DS-S and WALS-ML-S). The other eleven methods are much less accurate. In particular, the naive confidence intervals for IC-A and IC-B lead to large undercoverage errors because they ignore model selection noise. The MMA-B and JMA-B confidence intervals are more accurate than the simulation-based algorithms proposed by Zhang and Liu (2019) , but the underlying undercoverage errors are still sizeable and increase with the sample size.Footnote 3 The JMA-M confidence intervals also have nonnegligible undercoverage errors which tend to increase with the sample size. ALASSO performs well for \(n=400\), but the undercoverage errors of its 90% and 95% confidence intervals for \(n=100\) are rather large (\(-0.19\) for \(\alpha =10\%\) and \(-0.08\) for \(\alpha =5\%\)). The UN confidence intervals for WALS do not perform well because they use critical values from the normal distribution and ignore estimation bias. Ignoring estimation bias is much more important than naively using critical values from the normal distribution, as shown by first comparing UN and CN intervals (large difference) and then CN and S intervals (small difference). Obviously to use the correct critical values is better, but the improvement is very small. Similar conclusions are obtained when looking at higher moments of the bias-corrected WALS estimator of the focus parameter \(\beta _{12}\), computed via the simulation-based algorithm discussed in Appendix B. We find that this estimator is left-skewed and exhibits positive excess kurtosis, but the deviations from zero are in general very small.
Regarding the interval lengths for our five favourite methods we see that for \(n=100\) the interval lengths in the homoskedastic designs are about 1.7 when \(\alpha =10\%\), 2.1 when \(\alpha =5\%\), and 2.7 when \(\alpha =1\%\); about 12% higher in the heteroskedastic designs. For \(n=400\) the interval lengths decrease by about 50%. WALS performs slightly better than LS-U, but the differences are small and require further investigation in an extended design.
In the extended design defined in the Sect. 5 we consider only the classical LS-U confidence interval and the two simulation-based WALS confidence intervals, WALS-DS-S and WALS-ML-S, based on the plug-in DS and ML estimators of the bias of the posterior mean in the normal location model.Footnote 4 The coverage probabilities of the three methods (LS-U, WALS-DS-S, WALS-ML-S) are compared in Fig. 7 for the 90%, 95% and 99% confidence levels. The coverage errors of the three methods are in general small. In the 128 simulation designs considered in our second Monte Carlo experiment they are always smaller than 0.03 in absolute value and they are more often positive (overcoverage) than negative (undercoverage). The fact that WALS-DS-S yields slightly larger coverage errors than WALS-ML-S is consistent with the finite-sample properties of the underlying plug-in estimators of the bias of the posterior mean in the normal location model. Specifically, under the Laplace prior, the plug-in DS estimator of the bias of the posterior mean has always a larger bias than the plug-in ML estimator. We also find that the absolute value of the coverage errors for WALS-DS-S increases with \(\alpha \), reaching a maximum of 0.006 when \(\alpha =1\%\), 0.018 when \(\alpha =5\%\), and 0.028 when \(\alpha =10\%\).
Fig. 8 shows the relative length of the confidence intervals: LS-U divided by WALS-DS-S and LS-U divided by WALS-ML-S. We only present results for the 95% level since they are indistinguishable from those for the 90% and 99% levels. For all cases we find that the simulation-based WALS confidence intervals are always smaller than the classical LS-U confidence intervals, even when the LS-U estimator dominates the WALS estimator in terms of MSE. The average length reduction with respect to classical LS-U confidence intervals is about 1.8% for WALS-ML-S and about 5.4% for WALS-DS-S. This result agrees with the fact that, although more biased, the plug-in DS estimator of the bias of the posterior mean has better MSE performance than the plug-in ML estimator, at least for small or moderate values of the location parameter.
The relative gains of WALS on LS-U in terms of confidence interval length are much smaller than those in terms of MSE obtained for the point estimators, which agrees with the findings of Kabaila and Leeb (2006) and Wang and Zhou (2013) for other model averaging approaches to inference. A possible explanation is the randomness of the estimated bias. We have seen that re-centering based on the bias-corrected estimator is important to obtain small coverage errors. However, correcting for bias increases sampling variability, which is reflected in the length of the confidence interval.
7 Monte Carlo Results: Prediction Intervals
Finally we consider the problem of predicting a single observation \(y_f\) from model (1) and covariate vector \(\varvec{x}_f^{} = (\varvec{x}_{1f}', \varvec{x}_{2f}')_{}'\), that is,
where \(\varvec{\epsilon }\) and \(\epsilon _f\) are independent of each other and jointly normally distributed with zero means and variances \({{\mathbb {V}}}[\varvec{\epsilon }]=\sigma _{}^2 \varvec{I}_n^{}\) and \({{\mathbb {V}}}[\epsilon _f^{}]=\sigma _{}^2\). If \(\widehat{\varvec{\beta }}_1^{}\) and \(\widehat{\varvec{\beta }}_2^{}\) denote the WALS estimators of \(\varvec{\beta }_1^{}\) and \(\varvec{\beta }_2^{}\), then the WALS predictor of \(y_f^{}\) is defined as
and its prediction error is
Because of (9), the WALS predictor of \(y_f\) may be viewed as a weighted average of the predictors from all \(2_{}^{k_2^{}}\) models in the model space. We are interested in constructing prediction intervals for \({{\mathbb {E}}}[y_f^{}]= \varvec{x}_{1f}'\varvec{\beta }_{1}^{} + \varvec{x}_{2f}'\varvec{\beta }_{2}^{}\). Unlike the confidence intervals described in Sect. 3, these prediction intervals require dealing with the sampling uncertainty on all model parameters, focus and auxiliary.
We consider two variants of WALS prediction intervals. The first, which we call the naive approach, starts from the bias-corrected WALS estimator \(\check{\varvec{\beta }}=\widehat{\varvec{\beta }} - b(\widehat{\varvec{\beta }})\) and then constructs a symmetric prediction interval with nominal coverage probability \(1 - \alpha \):
where \(\check{\varvec{V}}\) is the Monte Carlo variance of \(\check{\varvec{\beta }}\) estimated from \(\varvec{B}_{}^*\), the \(R \times k\) matrix containing the replications of the bias-corrected WALS estimator in step (iv) of the algorithm described in Appendix B. This approach assumes normality of the bias-corrected WALS estimator, which is why it is called naive. The other, which we call the simulation-based approach, does not assume normality of the bias-corrected WALS estimator and builds the prediction interval directly from the quantiles of the empirical distribution of the elements of the vector \(\varvec{B}_{}^* \varvec{x}_f^{}\). This prediction interval need not be symmetric around \(\varvec{x}_f'\check{\varvec{\beta }}\).
Fig. 9 presents the relative efficiency of the WALS predictor of \({{\mathbb {E}}}[y_f^{}]=\varvec{x}_f'\varvec{\beta }\) relative to the LS-U predictor in the 128 simulation designs with homoskedastic normal errors under alternative values of n, \(k_{2}^{}\), \(\xi \), and \(\rho \). In each design, \(\varvec{x}_f\) is drawn randomly from a multivariate normal distribution with mean zero and variance \(\sigma _x^2 \varvec{\Sigma }_x^{}(\rho )\) and then kept fixed for all replications of the same simulation design. Thus, \(\varvec{x}_f\) changes with \(k_2\) and \(\rho \). The figure has the same format as Fig. 4, except that efficiency is now measured by the ratio of the mean squared prediction errors (LS-U relative to WALS). WALS clearly dominates LS-U in all designs, and by an even larger margin than what we have seen for the focus parameter. As expected, the relative efficiency of WALS increases with the number of auxiliary parameters in the DGP. The typical profile of the relative efficiency of the WALS predictor is concave in \(k_2^{}\), revealing very large gains when moving from a small number (\(k_2^{} = 8\)) to a moderate number (\(k_2^{} = 24\)) of auxiliary parameters.
Fig. 10 shows the actual coverage probabilities of the prediction intervals for LS-U and WALS for nominal probabilities of 90%, 95%, and 99%. For WALS we only present the simulation-based intervals, both DS and ML, because the naive prediction intervals are always very close. This figure is the analog of Fig. 7 and, perhaps not surprisingly, prediction interval coverage errors are only slightly larger than confidence interval coverage errors. There is only one design (\(n=400\), \(\xi =-0.5\), \(\rho =0.7\)) out of the 128 considered for which the coverage error is sizable (around 6%), and this coverage error is not much larger than for LS-U in the same design.
Fig. 11 plots the relative lengths of the 95% prediction intervals based on LS-U and WALS, hence the analog of Fig. 8. The disadvantage of using LS-U is now even more evident than before. LS-U prediction intervals are 2–3% larger than WALS-ML and 5–10% larger than WALS-DS. Furthermore, the relative length of the LS-U prediction intervals, viewed as a function of \(k_{2}^{}\), is concave for all designs, again revealing large gains when moving from \(k_2^{} = 8\) to \(k_2^{} = 24\).
8 Conclusions
In this paper we extend the theory of WALS estimation to inference by proposing a simulation-based method for confidence and prediction intervals. To highlight the properties of WALS and put them in perspective we also consider its main competitors. We discuss both confidence intervals for a focus parameter and prediction intervals for the outcome of interest by an extensive set of Monte Carlo experiments that allow for increasing complexity of the model space and include heteroskedastic, skewed, and thick-tailed error distributions.
In the homoskedastic case the dominating estimator is WALS, whose bias is more than offset by a smaller variance, especially when the sample size is small, thus capturing the essence of model averaging. In the heteroskedastic case, the performance of all estimators deteriorates but their relative position in terms of MSE changes little. With a large sample size, the preferred estimator is the unrestricted estimator LS-U, closely followed by ALASSO and WALS.
Regarding coverage probabilities, we find that LS-U and WALS perform well, while all other methods are much less accurate. Comparing the length of confidence intervals, WALS performs slightly better than the LS-U estimator, though differences are small. Finally, regarding prediction intervals, WALS clearly dominates LS-U. The relative efficiency of WALS increases with the number \(k_2^{}\) of auxiliary parameters and its typical profile is concave in \(k_2^{}\). Coverage errors of prediction intervals are only slightly larger than of confidence intervals, and when we compare the relative lengths of 95% prediction intervals based on LS-U and WALS the dominance of WALS is even stronger.
Post-selection/averaging inference is a challenging issue, which is likely to play a prominent role in future developments and applications of model selection/averaging techniques. In addition to estimating the coefficients of interest accurately, many economic problems require us to evaluate the precision of the estimated relationships and their statistical significance. Our new methods for WALS confidence and prediction intervals provide an easy, accurate, and computational convenient solution for these difficult tasks. For the latest set of Stata, R, and Python routines covering WALS inference of (univariate) generalized linear models (linear, logistic and Poisson regressions) we refer the reader to De Luca et al. (2022).
Code availability
All routines are available from the authors upon request.
Notes
The just-fitted model is unknown in practice, so \(s_{l}^*\) is not a feasible estimator. In the simulations we follow Zhang and Liu (2019) and assume that the just-fitted model is known. As a consequence, the correct intervals will be larger than reported since some of the model selection noise has been ignored.
The MATLAB routines were kindly provided by Xinyu Zhang and Chu-An Liu. All Stata routines are available from the authors upon request.
With \(n=100\) the undercoverage errors of MMA-B and JMA-B are \(-0.03\) for \(\alpha =10\%\) and \(-0.02\) for \(\alpha =5\%\), while with \(n=400\) the undercoverage errors become \(-0.07\) for \(\alpha =10\%\) and \(-0.05\) for \(\alpha =5\%\).
The centered-and-naive results for WALS-DS-CN and WALS-ML-CN are again very close to those obtained with the simulation-based approach.
To ensure nonsingularity of \(\varvec{D}_{p}\) we must add the requirement that the ith unit vector in \({\mathbb {R}}^n\) (the vector whose ith component is 1 and all other components are 0) does not lie in the column space of \(\varvec{X}\) for any i.
References
Camponovo, L. (2015). On the validity of the pairs bootstrap for lasso estimators. Biometrika, 102, 981–987.
Chatterjee, A., & Lahiri, S. N. (2011). Bootstrapping Lasso estimators. Journal of the American Statistical Association, 106, 608–625.
Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging. Cambridge University Press.
Danilov, D. (2005). Estimation of the mean of a univariate normal distribution when the variance is not known. Econometrics Journal, 8, 277–291.
Dardanoni, V., Modica, S., & Peracchi, F. (2011). Regression with imputed covariates: A generalized missing-indicator approach. Journal of Econometrics, 162, 362–368.
De Luca, G., Magnus, J. R., & Peracchi, F. (2018). Weighted-average least squares estimation of generalized linear models. Journal of Econometrics, 204, 1–17.
De Luca, G., Magnus, J. R., & Peracchi, F. (2021). Sampling properties of the Bayesian posterior mean with an application to WALS estimation. Journal of Econometrics (to appear). https://doi.org/10.1016/j.jeconom.2021.04.008.
De Luca, G., Magnus, J. R., & Yue, Y. (2022). Computational aspects of weighted-average least squares (WALS) inference. Mimeo.
DiTraglia, F. (2016). Using invalid instruments on purpose: Focused moment selection and averaging for GMM. Journal of Econometrics, 195, 187–208.
Donohue, J. J., & Levitt, S. D. (2001). The impact of legalized abortion on crime. Quarterly Journal of Economics, 116, 379–420.
Duval, R., Furceri, D., & Miethe, J. (2021). Robust political economy correlates of major product and labor market reforms in advanced economies: Evidence from BAMLE for logit models. Journal of Applied Econometrics, 36, 98–124.
Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75, 1175–1189.
Hansen, B. E. (2014). Model averaging, asymptotic risk, and regressor groups. Quantitative Economics, 5, 495–530.
Hansen, B. E., & Racine, J. (2012). Jackknife model averaging. Journal of Econometrics, 167, 38–46.
Kabaila, P., & Leeb, H. (2006). On the large-sample minimal coverage probability of confidence intervals after model selection. Journal of the American Statistical Association, 101, 619–629.
Kabaila, P., & Mainzer, R. (2018). Two sources of poor coverage of confidence intervals after model selection. Statistics and Probability Letters, 140, 185–190.
Knight, K., & Fu, W. (2000). Asymptotics for Lasso-type estimators. Annals of Statistics, 28, 1356–1378.
Liu, C.-A. (2015). Distribution theory of the least squares averaging estimator. Journal of Econometrics, 186, 142–159.
Magnus, J. R., & De Luca, G. (2016). Weighted-average least squares (WALS): A survey. Journal of Economic Surveys, 30, 117–148.
Magnus, J. R., & Durbin, J. (1999). Estimation of regression coefficients of interest when other regression coefficients are of no interest. Econometrica, 67, 639–643.
Magnus, J. R., Powell, O., & Prüfer, P. (2010). A comparison of two averaging techniques with an application to growth empirics. Journal of Econometrics, 154, 139–153.
Sala-i-Martin, X., Doppelhofer, G., & Miller, R. I. (2004). Determinants of long-term growth: A Bayesian averaging of classical estimates (BACE) approach. American Economic Review, 94, 813–835.
Sousa, J. M., & Sousa, R. M. (2019). Asset returns under model uncertainty: Evidence from the Euro area, the US and the UK. Computational Economics, 54, 139–176.
Steel, M. F. J. (2020). Model averaging and its use in economics. Journal of Economic Literature, 58, 644–719.
Wan, A. T. K., Zhang, X., & Zou, G. (2010). Least squares model averaging by Mallows criterion. Journal of Econometrics, 156, 277–283.
Wang, H., & Zhou, S. Z. F. (2013). Interval estimation by frequentist model averaging. Communications in Statistics-Theory and Methods, 42, 4342–4356.
Zhang, X. (2021). A new study on asymptotic optimality of least squares model averaging. Econometric Theory, 37, 388–407.
Zhang, X., & Liu, C.-A. (2019). Inference after model averaging in linear regression models. Econometric Theory, 35, 816–841.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Acknowledgements
Giuseppe De Luca acknowledges financial support from the MIUR PRIN (Grant Number PRJ-0324).
Funding
Open access funding provided by Universitá degli Studi di Palermo.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank Paolo Li Donni, Chu-An Liu, and Xinyu Zhang for useful discussions; and the referees for their constructive comments.
Appendices
Appendix A: The Nine Estimators
In this appendix, we formalize the nine estimators which were introduced in Sect. 2 and listed in Table 1.
Least squares (LS). The LS-U estimator of \(\varvec{\beta }=(\varvec{\beta }_1',\varvec{\beta }_2')'\) is \(\widehat{\varvec{\beta }}_u=(\varvec{X}' \varvec{X})^{-1} \varvec{X}' \varvec{y}\) and the LS-R estimator is \(\widehat{\varvec{\beta }}_r=(\widehat{\varvec{\beta }}_{1,r}',\varvec{0}')'\), where \(\widehat{\varvec{\beta }}_{1,r}=(\varvec{X}_1' \varvec{X}_1^{})^{-1} \varvec{X}_1' \varvec{y}\).
Post-selection estimators based on information criteria. Suppose that, after preordering, the pth model (\(p=0,1,\dots ,k_2\)) has \(k_1\) focus regressors and p auxiliary regressors. Assume that the underlying error is homoskedastic and let \({\widehat{\sigma }}^2_{p} = \varvec{y}' \varvec{M}_{p}^{*} \varvec{y}/n\) be the maximum likelihood (ML) estimator of its variance, where
is the usual idempotent matrix in model p, \(\varvec{I}_n\) is the identity matrix of order n, and \(\varvec{X}_{p}^{*}=(\varvec{X}_1,\varvec{X}_{2,p})\) is the matrix containing the first \(k_1+p\) regressors. The AIC for model p is
and the BIC is
The IC-A and IC-B estimators are the LS estimators in the models with the lowest value of \(\text {AIC}_{p}\) and \(\text {BIC}_{p}\) respectively.
Adaptive LASSO (ALASSO). This estimator solves the optimization problem
where \(\beta _{l}\) is the lth component of \(\varvec{\beta }\) and the weight \(\psi _{l,n} = \psi _n^{}/{\widehat{\beta }}_{l,u}^2\) depends on a tuning penalty parameter \(\psi _n^{}\) selected by generalized cross-validation and the lth component \({\widehat{\beta }}_{l,u}\) of \(\widehat{\varvec{\beta }}_u\).
Mallows model averaging (MMA). To reduce the computational burden this estimator is typically based on preordering. Letting \(s_u^2\) be the unbiased LS estimator of \(\sigma ^2\) in the unrestricted model and \(\varvec{M}^{*}(\varvec{w})=\sum _{p=0}^{k_2} w_{p} \varvec{M}_{p}^{*}\), where \(\varvec{M}_{p}^{*}\) is defined in (5), the MMA weights are obtained by solving
subject to \(\sum _{p} w_{p}=1\) and \(0\le w_{p}\le 1\) for all p. Denoting the optimal weights by \(\widehat{\varvec{w}}=({\widehat{w}}_{0}^{}, \ldots , {\widehat{w}}_{k_2^{}}^{})\), the MMA estimator takes the form
Jackknife model averaging (JMA). Let \(\varvec{D}_{p}\) be the diagonal matrix containing the diagonal elements of \(\varvec{M}_{p}^{*}\) on its diagonal and zeros elsewhere and define \(\varvec{M}^{\dagger }(\varvec{w})=\sum _{p=0}^{k_2} w_{p} \varvec{D}_{p}^{-1} \varvec{M}_{p}^{*}\).Footnote 5 Then the JMA weights are obtained by solving
subject to \(\sum _{p} w_{p}=1\) and \(0\le w_{p}\le 1\) for all p. The modified JMA (JMA-M) estimator is defined by weights that solve
subject to the same constraints as in (7), where the tuning parameter \(\psi _n\) is set equal to \(\log n\). The JMA and JMA-M estimators take the same form as (6) where \({\widehat{w}}_{p}\) is given by the solution of (7) and (8), respectively.
Weighted-average least squares (WALS). We first transform \(\varvec{X}_{2}^{}\) and \(\varvec{\beta }_{2}^{}\) by defining \(\varvec{Z}_{2}^{} = \varvec{X}_{2}^{} \varvec{\Delta }_{2}^{} \varvec{\Psi }_{}^{-1/2}\) and \(\varvec{\gamma }_{2}^{}= \varvec{\Psi }_{}^{1/2} \varvec{\Delta }_{2}^{-1} \varvec{\beta }_{2}^{}\), where \(\varvec{\Delta }_{2}^{}\) is a diagonal \(k_{2}^{}\times k_{2}^{}\) matrix such that all diagonal elements of \(\varvec{\Psi }= \varvec{\Delta }_{2}^{} \varvec{X}_{2}' \varvec{M}_{1}^{} \varvec{X}_{2}^{} \varvec{\Delta }_{2}^{}\) are equal to one and \(\varvec{M}_{1}^{}=\varvec{I}_{n}^{} - \varvec{X}_{1}^{} (\varvec{X}_{1}' \varvec{X}_{1}^{})^{-1} \varvec{X}_{1}'\). We also rescale \(\varvec{X}_{1}^{}\) and \(\varvec{\beta }_{1}^{}\) by defining \(\varvec{Z}_{1}^{}=\varvec{X}_{1}^{} \varvec{\Delta }_{1}^{}\) and \(\varvec{\gamma }_{1}^{}=\varvec{\Delta }_{1}^{-1}\varvec{\beta }_{1}^{}\), where \(\varvec{\Delta }_{1}^{}\) is a diagonal \(k_{1}^{}\times k_{1}^{}\) matrix such that all diagonal elements of \(\varvec{Z}_{1}' \varvec{Z}_{1}^{}\) are equal to one. The equivalence between models (1) and (3) follows from the fact that \(\varvec{Z}_{1}^{} \varvec{\gamma }_{1}^{}= \varvec{X}_{1}^{} \varvec{\beta }_{1}^{}\) and \(\varvec{Z}_{2}^{} \varvec{\gamma }_{2}^{}= \varvec{X}_{2}^{} \varvec{\beta }_{2}^{}\). In addition, the transformations ensure that \(\varvec{Z}_{2}' \varvec{M}_{1}^{} \varvec{Z}_{2}^{}=\varvec{I}_{k_{2}^{}}^{}\).
Averaging the LS estimators \(\widehat{\varvec{\gamma }}_{1j}\) and \(\widehat{\varvec{\gamma }}_{2j}\) over the \(J=2_{}^{k_{2}^{}}\) models in the model space gives the WALS estimators
where the \(\lambda _{j}^{}=\lambda _{j}^{}(\widehat{\varvec{\gamma }}_{2,u}^{})\) are nonnegative data-dependent model weights that add up to one, \(\widehat{\varvec{\gamma }}_{1,r}^{} = (\varvec{Z}_{1}' \varvec{Z}_{1}^{})_{}^{-1} \varvec{Z}_{1}' \varvec{y}\) is the LS-R estimator of \(\varvec{\gamma }_{1}^{}\), \(\widehat{\varvec{\gamma }}_{2,u}^{} = \varvec{Z}_{2}' \varvec{M}_{1}^{} \varvec{y}\) is the LS-U estimator of \(\varvec{\gamma }_{2}^{}\), \(\varvec{Q}=(\varvec{Z}_{1}' \varvec{Z}_{1}^{})^{-1}\varvec{Z}_{1}' \varvec{Z}_{2}^{}\), \(\varvec{W} = \sum _{j}\lambda _{j}^{} (\varvec{I}_{k_{2}^{}}^{}-\varvec{S}_{j}^{} \varvec{S}_{j}')\), \(\varvec{S}_{j}^{}\) is a \(k_{2}^{} \times r_{j}^{}\) selection matrix of rank \(0\le r_{j}^{}\le k_{2}^{}\) (that is, \(\varvec{S}_{j}'=[\varvec{I}_{r_{j}^{}}:\varvec{0}]\) or a column-permutation thereof), and \(r_j^{}\) is the number of exclusion restrictions implied by model j.
The results in (9) show that the dependence of \(\widehat{\varvec{\gamma }}_{1}^{}\) and \(\widehat{\varvec{\gamma }}_{2}^{}\) on the estimators from all the available models is completely captured by the random diagonal matrix \(\varvec{W}\), whose \(k_{2}^{}\) diagonal elements \(w_{h}^{}\) are partial sums of the \(\lambda _j\). Since \(0\le w_h \le 1\), this implies that the components of \(\widehat{\varvec{\gamma }}_{2}^{}\) in (9) are shrinkage estimators of the components of \(\varvec{\gamma }_{2}^{}\). The assumption that \(\varvec{\epsilon }\sim {{\mathcal {N}}}(\varvec{0}, \sigma ^2 \varvec{I}_n)\) implies that \(\widehat{\varvec{\gamma }}_{2,u}\sim {{\mathcal {N}}}(\varvec{\gamma }_2, \sigma ^2 \varvec{I}_{k_{2}^{}}^{})\). Hence, if we restrict each \(w_{h}^{}\) to depend only on the hth component of \(\widehat{\varvec{\gamma }}_{2,u}^{}\), the components of \(\widehat{\varvec{\gamma }}_{2}^{}\) will also be independent.
If we again treat the error variance \(\sigma _{}^2\) as known and equal to its unbiased LS estimator \(s_u^2\), it follows that the \(k_{2}^{}\)-vector of t-ratios \(\varvec{x}=\widehat{\varvec{\gamma }}_{2,u}^{}/s_u^{}\) is distributed as \({{\mathcal {N}}}(\varvec{\eta },\varvec{I}_{k_{2}^{}}^{})\), where \(\varvec{\eta }= {\varvec{\gamma }}_{2}^{}/\sigma _u^{}\) is the \(k_{2}^{}\)-vector of ‘theoretical’ t-ratios. The individual components \(x_h^{}\) of \(\varvec{x}\) are therefore independently distributed as \({{\mathcal {N}}}(\eta _{h}^{}, 1)\). As for the prior information on the vector \(\varvec{\eta }\), we assume that its components are i.i.d. with a density that is symmetric around zero, positive and nonincreasing on \((0,\infty )\), differentiable (except possibly at 0), and satisfies the ‘neutrality condition’ \({\mathbb {P}}[|\eta _h|<1] = 1/2\).
Given the normal location model for \(x_h^{}\) and the chosen prior for \(\eta _h^{}\), the Bayesian approach yields the posterior mean \(m_{h}^{}=m(x_{h}^{})\) as the estimator of \(\eta _{h}^{}\). The WALS estimators of \(\varvec{\gamma }_{1}^{}\) and \(\varvec{\gamma }_{2}^{}\) then take the form
with \(\varvec{m} = (m_{1}^{},\dots ,m_{k_{2}^{}})'\), and the WALS estimators of \(\varvec{\beta }_{1}^{}\) and \(\varvec{\beta }_{2}^{}\) are
Appendix B: Algorithm for Simulation-Based WALS Confidence Intervals
Let \(\varvec{x}=\widehat{\varvec{\gamma }}_{2,u}^{}/s_{u}=(x_{1}^{},\dots ,x_{k_{2}}^{})_{}'\) be the \(k_2^{}\)-vector of t-ratios from the unrestricted model and \(\widehat{\varvec{\eta }}=({{\widehat{\eta }}}_{1}^{},\dots ,{{\widehat{\eta }}}_{k_{2}}^{})_{}'\) an estimator of the \(k_2^{}\)-vector of parameters \(\varvec{\eta }=(\eta _{1}^{},\dots ,\eta _{k_{2}}^{})_{}'\) in the multivariate normal location model \(\varvec{x}\sim {{\mathcal {N}}}(\varvec{\eta }, \varvec{I}_{k_2})\).
The simulation-based WALS confidence intervals for \(\varvec{\beta }=(\varvec{\beta }_{1}',\varvec{\beta }_{2}')_{}'\) are obtained by the following five-step algorithm:
-
(i)
Compute \(\widehat{\varvec{\eta }}\) and use its generic element \({{\widehat{\eta }}}_{h}^{}\) to generate the R-vectors \(\varvec{x}_{h}^*=(x_{1h}^*,\dots ,x_{Rh}^*)_{}'\) of independent pseudo-random draws from the \({{\mathcal {N}}}({{\widehat{\eta }}}_{h}^{}, 1)\) distribution.
-
(ii)
Compute the \(R \times k_{2}\) matrix \({\check{\varvec{M}}}^{*}\) of pseudo-random draws for the bias-corrected posterior means with generic element
$$\begin{aligned} \check{m}_{rh}^{*}=m_{rh}^*-\delta _{rh}^* \qquad (r=1,\dots ,R; \, h=1,\dots ,k_{2}^{}), \end{aligned}$$where \(m_{rh}^*=m(x_{rh}^*)\) is the posterior mean evaluated at \(x_{rh}^*\) and \(\delta _{rh}^*\) is either the plug-in ML estimator \(\delta (x_{rh}^*)\) or the plug-in DS estimator \(\delta (m_{rh}^*)\) of the bias of \(m_{rh}^*\).
-
(iii)
Generate the \(R\times k_1^{}\) matrix \(\varvec{B}_{1r}^{*}\) of independent pseudo-random draws from the distribution \({{\mathcal {N}}}(\widehat{\varvec{\beta }}_{1r}^{}, \varvec{V}_{1r}^{})\), where
$$\begin{aligned} \widehat{\varvec{\beta }}_{1,r}^{}=\varvec{\Delta }_{1}^{} (\varvec{Z}_{1}' \varvec{Z}_{1}^{})_{}^{-1} \varvec{Z}_{1}' \varvec{y}, \qquad \varvec{V}_{1r}^{}=s_{u}^2 \varvec{\Delta }_{1}^{} (\varvec{Z}_{1}' \varvec{Z}_{1}^{})_{}^{-1} \varvec{\Delta }_{1}^{} \end{aligned}$$are the LS-R estimate of \(\varvec{\beta }_{1}^{}\) in the fully restricted model and its estimated variance matrix, respectively.
-
(iv)
Compute the \(R \times k\) matrix \(\check{\varvec{B}}^{*}=(\check{\varvec{B}}_{1}^{*}, \check{\varvec{B}}_{2}^{*})\) of pseudo-random draws for the bias-corrected WALS estimator \(\check{\varvec{\beta }}=(\check{\varvec{\beta }}_{1}',\check{\varvec{\beta }}_{2}')_{}'\) of \(\varvec{\beta }\), where
$$\begin{aligned} \check{\varvec{B}}_{1}^{*}=\varvec{B}_{1r}^{*} - s_{u}^{} \check{\varvec{M}}_{}^{*} \varvec{Z}_{2}' \varvec{Z}_{1} (\varvec{Z}_{1}' \varvec{Z}_{1}^{})_{}^{-1} \varvec{\Delta }_{1}^{}, \qquad \check{\varvec{B}}_{2}^*=s_{u}^{} \check{\varvec{M}}_{}^{*} \varvec{\Psi }_{}^{-1/2} \varvec{\Delta }_{2}^{}. \end{aligned}$$ -
(v)
Compute the (\(1-\alpha \))-level confidence interval for the generic component \(\beta _{l}^{}\) of \(\varvec{\beta }\) \((l=1,\dots ,k)\) as \([q_{l}^{*}(\alpha /2),q_{l}^{*}(1-\alpha /2)]\), where \(q_{l}^{*}(\alpha /2)\) and \(q_{l}^{*}(1-\alpha /2)\) are, respectively, the \(\alpha /2\) and \((1-\alpha /2)\) empirical percentiles of the R replications corresponding to the lth column \(\check{\varvec{b}}_{l}^{*}\) of \(\check{\varvec{B}}^{*}\).
Remark 1
To achieve good performance in terms of coverage probabilities, the initial estimator \(\widehat{\varvec{\eta }}\) in the first step of the algorithm must be (approximately) unbiased for \(\varvec{\eta }\). This leaves us three possible choices: (i) the ML estimator \(\varvec{x}\), (ii) the DS bias-corrected posterior mean \(\varvec{m}(\varvec{x})- \varvec{\delta }(\varvec{m}(\varvec{x}))\), and (iii) the ML bias-corrected posterior mean \(\varvec{m}(\varvec{x})- \varvec{\delta }(\varvec{x})\). In our experience, the differences between these three estimators are small. In the simulations, we use (ii) for the WALS-DS-S confidence intervals and (iii) for the WALS-ML-S confidence intervals. The main difference between these two methods is the choice of the plug-in estimator for the bias of the posterior mean in the second stage of the algorithm, namely \(\delta (m_{rh}^*)\) for WALS-DS-S or \(\delta (x_{rh}^*)\) for WALS-ML-S.
Remark 2
Like in other parametric bootstrap approaches, our simulation-based method ignores uncertainty caused by randomness of the regressors. Thus, as typically assumed in the WALS theory for point estimation, we treat the regressors as fixed.
Remark 3
An important difference with the MMA-S and JMA-S confidence intervals proposed by Zhang and Liu (2019) is that they simulate from the limiting distribution of the model averaging estimator, while in the simulation-based WALS algorithm we don’t. The WALS confidence intervals are based on the finite-sample properties of the plug-in estimators of the frequentist bias of the posterior mean in the normal location model (De Luca et al. 2021).
Remark 4
The \(R \times k\) matrix \(\check{\varvec{B}}^*\) of Monte Carlo replications obtained from step (iv) of the algorithm can be used to estimate any aspect of the sampling distribution of the bias-corrected WALS estimator. For example, we use it to compute the standard error of \(\check{\beta }_l\) in the CN method for confidence intervals, and the complete variance matrix of \(\check{\varvec{\beta }}\) in the naive approach to prediction intervals.
Remark 5
Our algorithm is very fast, especially with the Laplace prior. For example, in applications with \(n=400\) observations and \(k_2=40\) auxiliary regressors, we can compute point estimates, their estimated moments, and confidence intervals for all parameters based on 100,000 Monte Carlo replications in about 3.5 seconds by using a workstation with one Intel(R) Core(TM) i7-4790 CPU/3.60 GHz processor and 32 GB of RAM.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
De Luca, G., Magnus, J.R. & Peracchi, F. Weighted-Average Least Squares (WALS): Confidence and Prediction Intervals. Comput Econ 61, 1637–1664 (2023). https://doi.org/10.1007/s10614-022-10255-5
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10614-022-10255-5