Abstract
Varying coefficient models are important tools for exploring the hidden structure between a response variable and its predictors. However, variable selection and identification of the varying coefficients of such models are poorly understood. In this paper, we develop a novel method that overcomes these difficulties by combining local polynomial smoothing with the SCAD penalty. Under some regularity conditions, we show that the proposed procedure is consistent in separating the varying coefficients from the constant ones, and that the resulting estimator can be as efficient as the oracle. Simulation results confirm our theory. Finally, we apply the proposed method to the Boston housing data.
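For readers unfamiliar with the SCAD penalty mentioned above, the following minimal Python sketch implements the standard penalty of Fan and Li (2001) and its derivative; the function names and the use of NumPy are illustrative only and are not part of the paper.

import numpy as np

def scad_penalty(theta, lam, a=3.7):
    # SCAD penalty p_lambda(|theta|) of Fan and Li (2001); a = 3.7 is the value they recommend.
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
            (a + 1) * lam ** 2 / 2,
        ),
    )

def scad_derivative(theta, lam, a=3.7):
    # p'_lambda(t) = lam * { I(t <= lam) + (a*lam - t)_+ / ((a-1)*lam) * I(t > lam) }, for t >= 0.
    t = np.abs(np.asarray(theta, dtype=float))
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

The penalty is linear near the origin (which produces exact zeros), concave in between, and constant beyond \(a\lambda \) (which leaves large coefficients essentially unpenalized).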
References
Breiman L (1995) Better subset selection using nonnegative garrote. Technometrics 37:373–384
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35:2313–2351
Fan J-Q, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J-Q, Huang T (2005) Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 11:1031–1057
Fan J-Q, Zhang J-T (2000) Two-step estimation of functional linear models with applications to longitudinal data. J R Stat Soc Ser B 62:303–322
Fan J-Q, Zhang W-Y (2008) Statistical methods with varying coefficient models. Stat Interface 1:179–195
Hastie T-J, Tibshirani R-J (1993) Varying-coefficient models. J R Stat Soc B 55:757–796
Härdle W, Liang H, Gao J-T (2000) Partially linear models. Springer, Heidelberg
Noh H, Van Keilegom I (2012) Efficient model selection in semivarying coefficient models. Electron J Stat 6:2519–2534
Hu T, Xia Y-C (2012) Adaptive semi-varying coefficient model selection. Stat Sin 22:575–599
Leng C-L (2009) A simple approach for varying-coefficient model selection. J Stat Plan Inference 139:2138–2146
Li R, Liang H (2008) Variable selection in semiparametric regression modeling. Ann Stat 36:261–286
Mack Y-P, Silverman B-W (1982) Weak and strong uniform consistency of kernel regression estimates. Z Wahrsch verw Gebiete 61:405–415
Tang Y-L, Wang H, Zhu Z-Y, Song X-Y (2012) A unified variable selection approach for varying coefficient models. Stat Sin 22:601–628
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc B 67:91–108
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
Wang H-S, Xia Y-C (2009) Shrinkage estimation of the varying coefficient model. J Am Stat Assoc 104:747–757
Xia Y-C, Li W-K (1999) On the estimation and testing of functional coefficient linear models. Stat Sin 9:735–757
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zhang W-Y, Lee S-Y, Song X-Y (2002) Local polynomial fitting in semivarying coefficient model. J Multivar Anal 82:166–188
Zhao P-X, Xue L-G (2009) Variable selection for semiparametric varying coefficient partially linear models. Stat Probab Lett 79:2148–2157
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1416–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67:301–320
Acknowledgments
The authors would like to thank the Editor and the referee for their careful reading and for their comments, which greatly improved the paper. We also thank Bingyi Jing and Hansheng Wang for beneficial discussions, and Yanlin Tang for sending the R code for the procedures proposed in their papers.
Additional information
This work was supported by the Program for New Century Excellent Talents in University (NCET-12-0536).
Appendix: Assumptions and proofs
To study the asymptotic properties of the proposed method, we write \(H=(A,B)=(\beta (U_{1}), \ldots , \beta (U_{n}))^{T}\). The following standard regularity conditions are needed (Fan and Huang 2005).
(C1) For some \(s>2\), \(E|Y_{i}|^{2s}<\infty \) and \(E||\mathbf {X}_{i}||^{2s}<\infty \).
(C2) The density function of \(U_{i}, f(u)\), is continuous and positively bounded away from 0 on \([0,1]\).
(C3) The matrix \(\Omega (u)=E(\mathbf {X}_{i}\mathbf {X}_{i}^{T}|U_{i}=u)\) is nonsingular and has bounded second order derivatives on \([0,1]\). The function \(E(||\mathbf {X}_{i}||^{4}|U_{i}=u)\) is also bounded.
(C4) The second order derivatives of \(f(u)\) and \(\sigma ^{2}(u)=E(\varepsilon _{i}^{2}|U_{i}=u)\) are bounded.
(C5) \(K(u)\) is a symmetric density function with a compact support.
(C6) The second order derivatives of coefficients \(a_{j}(u),j=1,\ldots , p\), are continuous.
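The model itself is not restated in this appendix. For orientation, the following is a hedged reconstruction, consistent with the abstract and with the design vector \(D_{u_{t},i}\) used in the proofs below, rather than a verbatim quote of the paper:
\[
Y_{i}=\mathbf {X}_{i}^{T}a(U_{i})+\varepsilon _{i},\qquad a(u)=(a_{1}(u),\ldots ,a_{p}(u))^{T},
\]
where some of the coefficient functions \(a_{j}(\cdot )\) genuinely vary with \(u\) and the remaining ones are constant. A local linear expansion \(a_{j}(U_{i})\approx a_{j}(u)+a_{j}'(u)(U_{i}-u)\) then suggests reading \(\beta (u)=(\mathbf {a}(u)^{T},\mathbf {b}(u)^{T})^{T}\in \mathbf {R}^{2p}\), with \(\mathbf {b}(u)\) proportional to the derivative \(a'(u)\) (up to the sign convention implicit in \(D_{u_{t},i}\)); a zero column of \(A\) then corresponds to a coefficient that is identically zero, and a zero column of \(B\) corresponds to a constant coefficient.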
Note that (C2) guarantees that the maximal distance between two consecutive index values is \(O_{p}(\log n/n)\). For an arbitrary index value \(u\in [0,1]\), let \(u^{*}\) be its nearest neighbor among the observed index values, i.e., \(u^{*}=\arg \min _{\bar{u}\in \{U_{t}:1\le t\le n\}}|u-\bar{u}|\). Under the smoothness assumption (C6), we then also have \(||\beta (u)-\beta (u^{*})||=O_{p}(\log n/n)\), an order substantially smaller than the optimal nonparametric convergence rate \(n^{-2/5}\). Practically, this means that the observed index values are sufficiently dense on the support; thus, it suffices to approximate the entire coefficient curve \(\beta (u)\) by \(\{\beta (U_{t}):1\le t\le n\}\).
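The \(O_{p}(\log n/n)\) maximal spacing is easy to check numerically. The short Monte Carlo sketch below (illustrative only; it assumes uniformly distributed index values, which satisfy (C2)) compares the average maximal gap between consecutive sorted index values with \(\log n/n\).

import numpy as np

rng = np.random.default_rng(0)

def max_spacing(n, reps=200):
    # Average (over Monte Carlo replications) of the largest gap between
    # consecutive order statistics of n Uniform(0, 1) index values.
    gaps = []
    for _ in range(reps):
        u = np.sort(rng.uniform(size=n))
        gaps.append(np.max(np.diff(u)))
    return np.mean(gaps)

for n in (100, 1000, 10000):
    print(n, max_spacing(n), np.log(n) / n)

As \(n\) grows, the two quantities stay of the same order, in line with the claim above.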
Lemma 1
Suppose \((\xi _{i},U_{i}), i=1,\ldots ,n\), are i.i.d. random vectors, where the \(\xi _{i}\) are scalar random variables. Suppose further that \(E|\xi _{i}|^{s}<\infty \) and \(\sup _{u}\int |v|^{s}f(u,v)\,dv<\infty \), where \(f\) denotes the joint density of \((U_{i},\xi _{i})\). Let \(K\) be a bounded positive function with bounded support satisfying a Lipschitz condition. Then
provided \(n^{2\delta -1}h\rightarrow \infty \) for some \(\delta <1-s^{-1}\).
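The displayed bound of Lemma 1 is omitted above. In the standard form given by Mack and Silverman (1982) and Fan and Zhang (2000), and with the notation adapted to the present setting, it reads
\[
\sup _{u\in [0,1]}\left| \frac{1}{n}\sum _{i=1}^{n}\Big [K_{h}(U_{i}-u)\xi _{i}-E\big \{K_{h}(U_{i}-u)\xi _{i}\big \}\Big ]\right| =O_{p}\left( \left\{ \frac{\log (1/h)}{nh}\right\} ^{1/2}\right) .
\]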
The proof of this lemma can be found in Mack and Silverman (1982) or Fan and Zhang (2000).
Lemma 2
If (C1)–(C6) hold, and \(nh^{-1/2}a_{1n}\rightarrow 0\) and \(nh^{-1/2}b_{1n}\rightarrow 0\), then we must have
Proof
For an arbitrary matrix \(G=(g_{ij})\), write \(||G||^{2}=\sum g_{ij}^{2}\). We use \(S=(s_{ij})\in \mathbf {R}^{n\times 2p}\) to denote an arbitrary \(n\times 2p\) matrix with rows \(\mathbf {s}_{1}^{T},\ldots ,\mathbf {s}_{n}^{T}\) and columns \(\mathbf {v}_{1},\ldots ,\mathbf {v}_{2p}\). Let \(H_{0}=(\beta _{0}(U_{1}),\ldots ,\beta _{0}(U_{n}))^{T}\) with columns \(\mathbf {h}_{01},\ldots ,\mathbf {h}_{0,2p}\). By Fan and Li (2001), it suffices to show that for any \(\varepsilon >0\), we can find a constant \(C>0\) such that
By definition of \(Q_{\lambda }(H)\), we have
where \(D_{u_{t},i}=(\mathbf {X}_{i}^{T},(U_{t}-U_{i})\mathbf {X}_{i}^{T})^{T}\). By a simple algebraic calculation and the fact that \(||\mathbf {h}_{0j}||=0\) for any \(p_{1}<j\le p\) and any \(p+p_{0}<j\le 2p\), we have
where \(\hat{{\varSigma }}(U_{t})=n^{-1}\sum _{i=1}^{n}D_{u_{t},i}D_{u_{t},i}^{T}K_{h}(U_{t}-U_{i})\) and \(\hat{\mathbf {e}}_{t}=n^{-1/2}h^{1/2}\sum _{i=1}^{n}D_{u_{t},i}\left( D_{u_{t},i}^{T}[\beta (U_{t}) -\beta (U_{i})]+\varepsilon _{i}\right) K_{h}(U_{t}-U_{i})\). Let \(\hat{\lambda }_{t}^{min}\) be the smallest eigenvalue of \(\hat{{\varSigma }}(U_{t})\), let \(\hat{\lambda }_{min}=\min \{\hat{\lambda }_{t}^{min}: t=1,\ldots ,n\}\), and let \(\hat{\mathbf {e}}=(\hat{\mathbf {e}}_{1},\ldots ,\hat{\mathbf {e}}_{n})^{T}\in \mathbf {R}^{n\times 2p}\). Then we have
By the condition \(n^{-1}||S||^{2}=C^{2}\), we have
After some algebraic calculations, we have \(n^{-1}||\hat{\mathbf {e}}||=O_{p}(1)\). By Lemma 1 and (C3), \(\hat{\lambda }_{min}\rightarrow \lambda ^{min}_{0}\) in probability, where \(\lambda ^{min}_{0}=\inf _{u\in [0,1]}\lambda _{min}(f(u)\Omega (u))\) and \(\lambda _{min}(A)\) stands for the minimal eigenvalue of an arbitrary positive definite matrix \(A\). By (C2), (C3), and Lemma 1, we have \(\lambda ^{min}_{0}>0\). Consequently, the last term in (5) is dominated by the first two terms, because in the last term \(nh^{-1/2}(a_{1n}+b_{1n})\rightarrow 0\). Lastly, we note that the first term in (5) is quadratic in \(C\) while the second term is linear in \(C\). As long as \(C\) is sufficiently large, the right-hand side of (5) is guaranteed to be positive with probability arbitrarily close to 1. This proves (4), and the proof is complete. \(\square \)
Proof of Theorem 1
(1) We only need to prove that \(P(||\hat{\mathbf {b}}_{\lambda ,j}||=0)\rightarrow 1\) for \(j=p\); the proofs for \(p_{0}<j<p\) are similar. If the claim is not true (i.e., \(||\hat{\mathbf {b}}_{\lambda ,j}||\ne 0\)), then \(\hat{\mathbf {b}}_{\lambda ,j}\) must be the solution of the following normal equation
where \(\alpha _{1}\) is a n-dimensional vector with its \(t\)th component given by
and
By standard arguments of kernel smoothing, and applying Lemmas 1 and 2, we have \(||\alpha _{1}||=O_{p}(nh^{-1/2})\). On the other hand, under the conditions of the theorem, \(nh^{-1/2}||\alpha _{2}||\ge nh^{-1/2} b_{2n} \rightarrow \infty \). This implies that \(P(||\alpha _{1}||<||\alpha _{2}||)\rightarrow 1\). Consequently, with probability tending to one, the normal equation (6) cannot hold, which implies that \(\hat{\mathbf {b}}_{\lambda ,j}\) must be located at a point where the objective function \(Q_{\lambda }(H)\) is not differentiable. Since the only point where \(Q_{\lambda }(H)\) is not differentiable with respect to \(\mathbf {b}_{p}\) is the origin, we conclude that \(P(||\hat{\mathbf {b}}_{\lambda ,j}||=0)\rightarrow 1\).
Similarly, we can prove the second part of the theorem. This completes the proof. \(\square \)
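The sparsity argument above hinges on the singularity of the SCAD penalty at the origin. A univariate illustration may be helpful: for the penalized least-squares problem \(\min _{\theta }\{\frac{1}{2}(z-\theta )^{2}+p_{\lambda }(|\theta |)\}\), the SCAD solution of Fan and Li (2001) sets small inputs exactly to zero. The Python sketch below implements that standard thresholding rule componentwise; it is an illustration of the mechanism, not the estimation algorithm of this paper.

import numpy as np

def scad_threshold(z, lam, a=3.7):
    # Minimizer of 0.5*(z - theta)^2 + p_lambda(|theta|) for the SCAD penalty
    # (Fan and Li 2001); inputs with |z| <= lam are mapped exactly to zero.
    z = np.asarray(z, dtype=float)
    az = np.abs(z)
    soft = np.sign(z) * np.maximum(az - lam, 0.0)          # used when |z| <= 2*lam
    mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)   # used when 2*lam < |z| <= a*lam
    return np.where(az <= 2 * lam, soft, np.where(az <= a * lam, mid, z))

print(scad_threshold(np.array([-3.0, -0.4, 0.2, 0.9, 2.5]), lam=0.5))

With \(\lambda =0.5\) the inputs \((-3,-0.4,0.2,0.9,2.5)\) are mapped to \((-3,0,0,0.4,2.5)\): small values are set exactly to zero, moderate values are shrunk, and large values are left unpenalized.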
Proof of Theorem 2
By Theorem 1, we know that \(\hat{\mathbf {a}}_{\lambda ,j}=0\) for \(p_{1}<j\le p\) and \(\hat{\mathbf {b}}_{\lambda ,j}=0\) for \(p_{0}<j\le p\), with probability tending to one. Consequently, \(\hat{\mathbf {a}}_{a,\lambda }(u)\) must be the solution of the following normal equation
where \(L=\left( p'_{\lambda _{11}}(||\hat{\mathbf {a}}_{1,\lambda }||)\frac{\hat{\mathbf {a}}_{1}(u)}{||\hat{\mathbf {a}}_{1,\lambda }||},\ldots ,p'_{\lambda _{1p_{0}}}(||\hat{\mathbf {a}}_{p_{0},\lambda }||)\frac{\hat{\mathbf {a}}_{p_{0}}(u)}{||\hat{\mathbf {a}}_{p_{0},\lambda }||}\right) ^{T}\). This implies that \(\hat{\mathbf {a}}_{a,\lambda }\) is of the form
where \({\varSigma }_{1}(u)=n^{-1}\sum _{i=1}^{n}\mathbf{{X}}_{ia}\mathbf{{X}}_{ia}^{T}K_{h}(U_{i}-u)\). Comparing with the oracle estimator, we know that
where \({\varSigma }_{2}(u)=n^{-1}\sum _{i=1}^{n}\mathbf{{X}}_{ia}\mathbf{{X}}_{ia}^{T}(U_{i}-u)K_{h}(U_{i}-u)\), \({\varSigma }_{3}(u)= n^{-1}\sum _{i=1}^{n}\mathbf{{X}}_{ib}\mathbf{{X}}_{ib}^{T}K_{h}(U_{i}-u)\), \(\lambda _{1,min}=\min \{\lambda _{min}({\varSigma }_{1}(u)): u\in [0,1]\}\), \(\lambda _{2,max}=\max \{\lambda _{max}({\varSigma }_{2}(u)): u\in [0,1]\}\), and \(\lambda _{3,max}=\max \{\lambda _{max}({\varSigma }_{3}(u)): u\in [0,1]\}\). For \(J_{1}\), applying Lemma 1, we have \(J_{1}\le C \sqrt{p_{0}}\, a_{1n}=o_{p}(n^{-2/5})\). By Lemma 2, we have \(J_{2}=o_{p}(n^{-2/5})\) and \(J_{3}=o_{p}(n^{-2/5})\). This completes the proof. \(\square \)
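For context, the benchmark rate \(n^{-2/5}\) used above is the usual optimal nonparametric rate for twice-differentiable coefficient functions; this is a standard kernel-smoothing calculation rather than a statement from the paper. With bandwidth \(h\asymp n^{-1/5}\), the bias and the standard deviation of a local linear fit balance:
\[
\text{bias}=O(h^{2})=O(n^{-2/5}),\qquad \text{standard deviation}=O\big ((nh)^{-1/2}\big )=O(n^{-2/5}),
\]
so terms that are \(o_{p}(n^{-2/5})\), such as \(J_{1}\), \(J_{2}\), and \(J_{3}\) above, are negligible relative to the oracle estimator's leading terms.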