Abstract
In linear regression, ridge regression has become the standard treatment when the covariates are highly collinear, and the choice of the ridge parameter plays a central role in it. In this paper, instead of selecting a single ridge parameter, we consider a model averaging method that combines multiple ridge estimators with \(M_n\) different ridge parameters, where \(M_n\) may grow to infinity with the sample size \(n\). We show that when the fitting model is correctly specified, the resulting model averaging estimator is \(n^{1/2}\)-consistent. When the fitting model is misspecified, the asymptotic optimality of the model averaging estimator is also established rigorously. The results of simulation studies and our case study concerning the urbanization level of Chinese ethnic areas demonstrate the usefulness of the model averaging method.
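As a rough numerical sketch of the idea (not the paper's actual weight-choice criterion, which is developed in the main text and not reproduced in this excerpt), one can compute \(M_n\) ridge estimators on a grid of ridge parameters and combine them with weights on the simplex; the weights below are simply equal, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # highly collinear covariates
beta = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = X @ beta + rng.normal(size=n)

ks = np.array([0.1, 1.0, 10.0])                  # M_n = 3 ridge parameters
betas = np.column_stack(
    [np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y) for k in ks]
)

# Illustrative weight vector on the simplex (equal weights); the paper
# instead chooses the weights by a data-driven criterion.
w = np.full(len(ks), 1.0 / len(ks))
beta_avg = betas @ w                             # model averaging estimator
mu_hat = X @ beta_avg                            # averaged fit of the mean
```
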
References
Buckland ST, Burnham KP, Augustin NH (1997) Model selection: an integral part of inference. Biometrics 53:603–618
Chen X, Zou G, Zhang X (2013) Frequentist model averaging for linear mixed-effects models. Front Math China 8:497–515
Claeskens G, Croux C, van Kerckhoven J (2006) Variable selection for logistic regression using a prediction-focused information criterion. Biometrics 62:972–979
Cule E, Vineis P, De Iorio M (2011) Significance testing in ridge regression for genetic data. BMC Bioinform 12:372
Dempster A, Schatzoff M, Wermuth N (1975) A simulation study of alternatives to ordinary least squares. J Am Stat Assoc 70:77–106
Flynn CJ, Hurvich CM, Simonoff JS (2013) Efficiency for regularization parameter selection in penalized likelihood estimation of misspecified models. J Am Stat Assoc 108:1031–1043
Gao Y, Zhang X, Wang S, Zou G (2016) Model averaging based on leave-subject-out cross-validation. J Econ 192:139–151
Ghosh S, Yuan Z (2009) An improved model averaging scheme for logistic regression. J Multivar Anal 100:1670–1681
Golub G, Heath M, Wahba G (1979) Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21:215–223
Hansen BE (2007) Least squares model averaging. Econometrica 75:1175–1189
Hansen BE, Racine J (2012) Jackknife model averaging. J Econ 167:38–46
Hoerl A, Kennard R (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
Lee T (1987) Algorithm AS 223: optimum ridge parameter selection. J R Stat Soc C 36:112–118
Leung G, Barron AR (2006) Information theory and mixing least-squares regressions. IEEE Trans Inf Theory 52:3396–3410
Liu Q, Okui R (2013) Heteroskedasticity-robust Cp model averaging. Econ J 16:463–472
Lu X, Su L (2015) Jackknife model averaging for quantile regressions. J Econ 188:40–58
Magnus J, De Luca G (2016) Weighted-average least squares (WALS): a survey. J Econ Surv 30:117–148
Moral-Benito E (2015) Model averaging in economics: an overview. J Econ Surv 29:46–75
Nordberg L (1982) A procedure for determination of a good ridge parameter in linear regression. Commun Stat Simul Comput 11:285–309
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Schomaker M (2012) Shrinkage averaging estimation. Stat Pap 53(4):1015–1034
Schomaker M, Wan ATK, Heumann C (2010) Frequentist model averaging with missing observations. Comput Stat Data Anal 54:3336–3347
Wan ATK, Zhang X, Zou G (2010) Least squares model averaging by Mallows criterion. J Econ 156:277–283
Wang H, Zhang X, Zou G (2009) Frequentist model averaging estimation: a review. J Syst Sci Complex 22:732–748
Yu Y, Thurston S, Hauser R, Liang H (2013) Model averaging procedure for partially linear single-index models. J Stat Plan Inference 143:2160–2170
Yuan Z, Yang Y (2005) Combining linear regression models: when and how? J Am Stat Assoc 100:1202–1214
Zhang X (2015) Consistency of model averaging estimators. Econ Lett 130:120–123
Zhang X, Liang H (2011) Focused information criterion and model averaging for generalized additive partial linear models. Ann Stat 39:174–200
Zhang X, Wang W (2017) Optimal model averaging estimation for partially linear models. Stat Sin (forthcoming)
Zhang X, Wan ATK, Zhou SZ (2012) Focused information criteria, model selection and model averaging in a Tobit model with a non-zero threshold. J Bus Econ Stat 30:132–142
Zhang X, Wan A, Zou G (2013) Model averaging by jackknife criterion in models with dependent data. J Econ 174:82–94
Zhang X, Zou G, Liang H (2014) Model averaging and weight choice in linear mixed-effects models. Biometrika 101:205–218
Zhang X, Zou G, Carroll R (2015) Model averaging based on Kullback-Leibler distance. Stat Sin 25:1583–1598
Acknowledgements
The authors would like to thank two anonymous referees for their insightful comments and very constructive suggestions that have substantially improved earlier versions of this paper. Zhao’s research was supported by a grant from the Ministry of Education of China (Grant No. 17YJC910011) and a grant from Minzu University of China (Grant No. 2017QNPY34). Yu’s research was supported by the National Natural Science Foundation of China (Grant Nos. 11661079 and 11301463).
Electronic supplementary material
Appendices
A.1 Notation and regularity conditions
Let \(\lambda _{\min }(\mathbf{B})\) and \(\lambda _{\max }(\mathbf{B})\) denote the minimum and maximum eigenvalues of a general real matrix \(\mathbf{B}\), and let \(\Vert \mathbf{B}\Vert \) denote its spectral norm, i.e., \(\Vert \mathbf{B}\Vert = \lambda _{\max }^{1/2}(\mathbf{B}'\mathbf{B})\). Let \(R(\mathbf{w})=\text {E}\{L(\mathbf{w})|\tilde{\mathbf{X}}\}=\text {E}\left\{ \Vert \varvec{\mu }-\widehat{\varvec{\mu }}(\mathbf{w})\Vert ^2|\tilde{\mathbf{X}}\right\} \), let \(\xi _n=\inf _{\mathbf{w}\in \mathcal {W}} R(\mathbf{w})\), and let \(\mathbf{w}_m^0\) be the weight vector whose m-th element is one and whose other elements are zero. We need the following regularity conditions, where all limiting processes are with respect to \(n\rightarrow \infty \).
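The spectral-norm definition above can be checked numerically; a minimal sketch with a randomly generated matrix, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 3))

# ||B|| = lambda_max^{1/2}(B'B), the spectral norm of B
spec = np.sqrt(np.linalg.eigvalsh(B.T @ B).max())
assert np.isclose(spec, np.linalg.norm(B, 2))    # matches NumPy's 2-norm
```
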
Condition (C.1)
\(\lambda _{\min }(\mathbf{X}'\mathbf{X}/n)\) and \(\lambda _{\max }(\mathbf{X}'\mathbf{X}/n)\) are bounded below and above by positive constants \(c_0\) and \(c_1\), a.s., respectively, and \(n^{-1/2}\mathbf{X}'\mathbf{e}=O_p(1)\).
Condition (C.2)
There exists a \( {m^*}\in \{1,\ldots ,M_n\}\) such that \(n^{-1/2}k_{m^*}=O_p(1) \).
Condition (C.3)
\(p^*={O(n^{-1})}\), a.s., where \( p^*=\max _{1\le m\le M_n}\max _{1\le i\le n}P^{m}_{ii}\), and \(P^{m}_{ii}\) is the i-th diagonal element of \(\mathbf{P}_m = \mathbf{X}(\mathbf{X}'\mathbf{X}+k_m\mathbf{I}_p)^{-1}\mathbf{X}'\).
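The quantity \(p^*\) in Condition (C.3) can be computed without ever forming the \(n\times n\) matrices \(\mathbf{P}_m\); a minimal sketch with a simulated design (the ridge parameters below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))
ks = [0.5, 5.0]                                   # illustrative ridge parameters

def ridge_hat_diag(X, k):
    """Diagonal of P_m = X (X'X + k I_p)^{-1} X' via an n x p solve."""
    G = np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T).T  # n x p
    return np.einsum('ij,ij->i', X, G)            # row-wise x_i' (...)^{-1} x_i

# p* = max over m and i of the i-th diagonal element of P_m
p_star = max(ridge_hat_diag(X, k).max() for k in ks)
```

Condition (C.3) requires this maximum leverage-type quantity to shrink at rate \(1/n\).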
Condition (C.4)
\(\sup _{1\le i\le n}\text {E}(e^4_i|\tilde{\mathbf{x}}_i)=O(1)\), a.s., where \(e_i\) is the random error defined in (8).
Condition (C.5)
\(\varvec{\mu }'\varvec{\mu }/n=O(1)\), a.s.
Condition (C.6)
\(\xi _n^{-2}\sum _{m=1}^{M_n} R(\mathbf{w}_m^0)=o(1)\), a.s.
The first part of Condition (C.1) guarantees the identifiability of the model and is common in the model selection literature (Flynn et al. 2013). The second part of Condition (C.1) is mild and holds in typical situations, e.g., when the \(\{\mathbf{X}_i,e_i\}\)'s are independent and satisfy some moment conditions. Condition (C.2) is also mild: it requires only that there exist an \(m^*\) such that \(k_{m^*}\) grows at a rate no faster than \(n^{1/2}\). In fact, the ridge parameters adopted in Sect. 3 satisfy this condition, because \(p\max _{1\le j\le p}\widehat{\alpha }_j^2 \ge \Vert \varvec{\beta }_0\Vert ^2 + \varepsilon _n\), with \(\varepsilon _n = 2\varvec{\beta }_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e}+ \Vert (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e}\Vert ^2 = O_p(n^{-1/2})\) under Condition (C.1), and therefore \(k^* = \widehat{\sigma }^2/\max _{1\le j\le p}\widehat{\alpha }_j^2 \le p\widehat{\sigma }^2/(\Vert \varvec{\beta }_0\Vert ^2 + \varepsilon _n) = O_p(1)\). Condition (C.3) is reasonable and weaker than Condition (C.2) of Zhang (2015), because \(\{\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'-\mathbf{P}_m\}\) is positive semi-definite and thus \(\mathbf{l}_{i}'\{\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'-\mathbf{P}_m\}\mathbf{l}_{i}\ge 0\), where \(\mathbf{l}_{i}\) is the i-th column of \(\mathbf{I}_n\). Condition (C.4) is commonly used in the literature (Wan et al. 2010); it excludes random error distributions from certain heavy-tailed families, such as the t-distribution with no more than 4 degrees of freedom or the Pareto distribution with shape parameter no greater than 4. Conditions (C.5) and (C.6) are the same as (23) and (21) in Zhang et al. (2013), respectively.
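For concreteness, the ridge parameter \(k^*=\widehat{\sigma }^2/\max _j\widehat{\alpha }_j^2\) discussed above can be computed as follows. This is a minimal sketch under the assumption, consistent with the Hoerl–Kennard proposal, that the \(\widehat{\alpha }_j\) are the OLS coefficients expressed in the canonical coordinates given by the eigenvectors of \(\mathbf{X}'\mathbf{X}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 4
X = rng.normal(size=(n, p))
beta0 = np.array([2.0, -1.0, 0.5, 1.0])
y = X @ beta0 + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_ols
sigma2_hat = resid @ resid / (n - p)              # usual residual variance estimate

# alpha-hat: OLS coefficients in the canonical (eigenvector) coordinates
_, Gamma = np.linalg.eigh(X.T @ X)
alpha_hat = Gamma.T @ beta_ols

k_star = sigma2_hat / np.max(alpha_hat ** 2)      # remains O_p(1) as n grows
```
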
A.2 Proof of Theorem 1
This proof follows the framework in Zhang (2015). Define \(\mathbf{D}_m\) as the \(n\times {n}\) diagonal matrix with \(h^{m}_{ii} = (1-P^{m}_{ii})^{-1}\) being its i-th diagonal element. From (6) and the fact that
we have
Now we define \(\mathbf{Q}_m\) as the \(n\times {n}\) diagonal matrix whose i-th diagonal element is \(Q_{m,ii}=P^m_{ii}/(1-P^m_{ii})\). Then \( \mathbf{D}_m=\mathbf{Q}_m + \mathbf{I}_n\). By (A.2),
Let \(\mathbf{V}\) be the \(M_n \times M_n \) matrix whose (m, j)-th entry is
It follows from (A.3)–(A.4) that
Next, we will show that
with \(m^*\) being defined in Condition (C.2), and that for any \(\mathbf{w}\in \mathcal{W}\),
To show (A.6), note that
then (A.6) holds under Conditions (C.1)–(C.2). Moreover, by the definition of \(\mathbf{Q}_m\), Conditions (C.1)–(C.3), and Eq. (A.4), it is seen that (A.7) also holds because of the fact that
uniformly for every \(m,j\in \{1,\ldots ,M_n\}\). Denote
Based on (A.6), (A.7), and Condition (C.1), we have
In addition, by the definitions of \(\widehat{\mathbf{w}}\) and \(\eta _n\), we have that
which, together with (A.5), implies that
and thus
Then, by Condition (C.1) and (A.7)–(A.8), we have
which concludes the proof.
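The diagonal matrices \(\mathbf{D}_m\) and \(\mathbf{Q}_m\) used in this proof satisfy \(\mathbf{D}_m=\mathbf{Q}_m+\mathbf{I}_n\) by construction; a minimal numerical check on a simulated design, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 50, 3, 2.0
X = rng.normal(size=(n, p))
P = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)  # P_m, an n x n matrix

d = 1.0 / (1.0 - np.diag(P))                      # diagonal of D_m
q = np.diag(P) / (1.0 - np.diag(P))               # diagonal of Q_m
assert np.allclose(d, q + 1.0)                    # D_m = Q_m + I_n entrywise
```
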
A.3 Proof of Theorem 2
Let
where \(\mathbf{Q}_m\) is defined in Appendix A.2, and \(\mathbf{B}(\mathbf{w})=\sum _{m=1}^{M_n} w_m\mathbf{B}_m\). By (A.3), we have
Define \(\mathbf{M}(\mathbf{w})=\mathbf{A}(\mathbf{w})\mathbf{B}(\mathbf{w})+\mathbf{B}(\mathbf{w})\mathbf{A}(\mathbf{w})+\mathbf{B}(\mathbf{w})\mathbf{B}(\mathbf{w})\), then it is seen from (A.9) that
where
with \(\mathbf{P}(\mathbf{w}) = \sum _{m=1}^{M_n} w_m \mathbf{P}_m\). Since \(\mathbf{e}'\mathbf{e}\) is independent of \(\mathbf{w}\), to prove Theorem 2, by (A.10), it suffices to show that
and
We first prove (A.11). Recall that
By (A.13), the Dominated Convergence Theorem, Conditions (C.4) and (C.6), and the assumption that there exists a positive constant \(\bar{\sigma }^2\) such that \(\lambda _{\tiny \max }(\varOmega )=\bar{\sigma }^2<\infty \) a.s., we have that, for any fixed \(\delta >0\), as \(n\rightarrow \infty \)
where the fourth inequality follows from Chebyshev's inequality and the last inequality is a direct result of (A.13). Similarly, by Conditions (C.3), (C.4) and (C.6) and the Dominated Convergence Theorem, we obtain
where we have used the fact that
Moreover, by Condition (C.3) and Eq. (A.16)
In addition, it follows from Conditions (C.1) and (C.3) that uniformly in m,
and
Therefore,
Then, from Condition (C.5) and Eq. (A.16),
By Conditions (C.4)–(C.5) and Eqs. (A.16) and (A.18), we have
Likewise, with Condition (C.4) and Eqs. (A.16) and (A.18), we have
Equation (A.11) can then be proved by combining Eqs. (A.14)–(A.17) and (A.19)–(A.21) together.
We now prove (A.12). Note that
then, to show (A.12), it remains to verify that
and
Since \(\mathbf{P}_m\) is positive semi-definite and \(\lambda _{\tiny \max }(\mathbf{P}_m)\le 1\) for any \(1\le m\le M_n\), for any \(\mathbf{w}\in \mathcal{W}\), \(\mathbf{P}(\mathbf{w})\) is positive semi-definite and
Then, by Conditions (C.3)–(C.4), (A.16) and (A.26), we can verify (A.23) by noting that
where we have used the fact that for any fixed \(\delta >0\)
In addition, it follows from Condition (C.3) and (A.26) that,
Combining (A.16) and (A.27) will lead to (A.24). In addition, recalling that \( \left\| {\mathbf{A}(\mathbf{w})\varvec{{\mu }}} \right\| ^2 \le R(\mathbf{w})\), we have
which with (A.23) will lead to (A.25). This concludes the proof.
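The eigenvalue bound used above, namely that \(\mathbf{P}(\mathbf{w})\) is positive semi-definite with \(\lambda _{\max }\{\mathbf{P}(\mathbf{w})\}\le 1\) for any weight vector on the simplex, can also be checked numerically; a minimal sketch with arbitrary illustrative ridge parameters and weights:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 3
X = rng.normal(size=(n, p))
ks = [0.1, 1.0, 10.0]
w = np.array([0.2, 0.5, 0.3])                     # weights on the simplex

# P(w) = sum_m w_m P_m with P_m = X (X'X + k_m I_p)^{-1} X'
Pw = sum(wm * X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)
         for wm, k in zip(w, ks))
eig = np.linalg.eigvalsh((Pw + Pw.T) / 2.0)       # symmetrize for stability
assert eig.min() >= -1e-10 and eig.max() <= 1.0   # PSD with lambda_max <= 1
```
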
Zhao, S., Liao, J. & Yu, D. Model averaging estimator in ridge regression and its large sample properties. Stat Papers 61, 1719–1739 (2020). https://doi.org/10.1007/s00362-018-1002-4