Abstract
Partially linear models (PLMs) have been widely used in statistical modeling, yet they typically require prior knowledge of which variables have linear effects and which have nonlinear effects. In this paper, we propose a model-free structure selection method for PLMs, which discovers the model structure by automatically identifying the variables that have linear or nonlinear effects on the response. The proposed method is formulated in a framework of gradient learning, equipped with a flexible reproducing kernel Hilbert space, and the resulting optimization task is solved by an efficient proximal gradient descent algorithm. More importantly, the asymptotic estimation and selection consistencies of the proposed method are established without specifying any explicit model assumption, ensuring that the true model structure of the PLMs can be correctly identified with high probability. The effectiveness of the proposed method is further supported by a variety of simulated and real-life examples.
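The proximal gradient descent algorithm mentioned above applies to composite objectives of the form "smooth loss plus nonsmooth group penalty." As a minimal illustration (not the paper's actual estimator; the data, step size, and group structure below are hypothetical), the sketch minimizes a least-squares loss plus a group-lasso penalty, whose proximal step is blockwise soft-thresholding (Yuan and Lin 2006; Combettes and Wajs 2005):

```python
# A minimal sketch of proximal gradient descent for a composite objective:
# smooth least-squares loss plus a group-lasso penalty. This illustrates the
# general algorithmic template only; it is not the paper's implementation.
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_2: shrinks the whole block toward zero."""
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v

def proximal_gradient(X, y, groups, lam, step, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum_g ||b_g||_2.

    `groups` is a partition of the coefficient indices; each block is
    shrunk jointly, so entire groups can be set exactly to zero.
    """
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)          # gradient of the smooth part
        z = b - step * grad               # forward (gradient) step
        for g in groups:                  # backward (proximal) step, blockwise
            b[g] = group_soft_threshold(z[g], step * lam)
    return b
```

With a sufficiently large penalty the proximal step zeroes out entire blocks at once, which is the mechanism that allows group-penalized methods to set a component's contribution exactly to zero.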
References
Boyd, S., Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Braun, M. (2006). Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7, 2303–2328.
Combettes, P., Wajs, V. (2005). Signal recovery by proximal forward–backward splitting. Multiscale Modeling and Simulation, 4, 1168–1200.
Engle, R. F., Granger, C. W. J., Rice, J., Weiss, A. (1986). Nonparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81, 310–320.
Fan, J., Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society Series B, 70, 849–911.
Fan, J., Feng, Y., Song, R. (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106, 544–557.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
He, X., Wang, J., Lv, S. (2018). Scalable kernel-based variable selection with sparsistency. arXiv:1802.09246.
Huang, J., Wei, F., Ma, S. (2012). Semiparametric regression pursuit. Statistica Sinica, 22, 1403–1426.
Jaakkola, T., Diekhans, M., Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of seventh international conference on intelligent systems for molecular biology (pp. 149–158).
Lian, H., Liang, H., Ruppert, D. (2015). Separation of covariates into nonparametric and parametric parts in high-dimensional partially linear additive models. Statistica Sinica, 25, 591–607.
Lin, D., Ying, Z. (1994). Semiparametric analysis of the additive risk model. Biometrika, 81, 61–71.
Moreau, J. (1962). Fonctions convexes duales et points proximaux dans un espace Hilbertien. Comptes Rendus de l'Académie des Sciences de Paris, Série A, 255, 2897–2899.
Prada-Sánchez, J., Febrero-Bande, M., Cotos-Yáñez, T., González-Manteiga, W., Bermúdez-Cela, J., Lucas-Dominguez, T. (2000). Prediction of SO2 pollution incidents near a power station using partially linear models and an historical matrix of predictor-response vectors. Environmetrics, 11, 209–225.
Raskutti, G., Wainwright, M., Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13, 389–427.
Rotnitzky, A., Jewell, N. (1990). Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika, 77, 485–497.
Schmalensee, R., Stoker, T. M. (1999). Household gasoline demand in the United States. Econometrica, 67, 645–662.
Sun, W., Wang, J., Fang, Y. (2013). Consistent selection of tuning parameters via variable selection stability. Journal of Machine Learning Research, 14, 3419–3440.
Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in kernel methods: support vector learning (pp. 69–88). Cambridge: MIT Press.
Wu, Y., Stefanski, L. (2015). Automatic structure recovery for additive models. Biometrika, 102, 381–395.
Xue, L. (2009). Consistent variable selection in additive models. Statistica Sinica, 19, 1281–1296.
Yafeh, Y., Yosha, O. (2003). Large shareholders and banks: Who monitors and how? The Economic Journal, 113, 128–146.
Yang, L., Lv, S., Wang, J. (2016). Model-free variable selection in reproducing kernel Hilbert space. Journal of Machine Learning Research, 17, 1–24.
Ye, G., Xie, X. (2012). Learning sparse gradients for variable selection and dimension reduction. Machine Learning, 87, 303–355.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68, 49–67.
Zhang, H., Cheng, G., Liu, Y. (2011). Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106, 1099–1112.
Zhou, D. (2007). Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220, 456–463.
Acknowledgements
This research is supported in part by HK GRF-11302615, HK GRF-11331016, and City SRG-7004865. The authors would like to thank the associate editor and two anonymous referees for their constructive suggestions. The authors would also like to thank Dr. Heng Lian (City University of Hong Kong) for sharing his code on the DPLM method.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Appendix: technical proofs
Proof of Theorem 1
For some constant \({a_1}\), denote
Then it suffices to bound \(P(\mathcal{C})\). First,
where \(U_n=\frac{1}{n(n-1)}\sum _{i,j=1}^n(y_i-y_j)^2\), and \(M_0 = 4A^2 + 2\sigma ^2+1\) with \(A\) being the upper bound of \(f^*(\mathbf{x})\) on \(\mathcal{X}\) and \(\sigma ^2=\mathrm{Var}(\epsilon )\). Next, we bound \(P_1, P_2,\) and \(P_3\) separately. To bound \(P_1\), we have \(P_1\le {E(|y|)}n^{-1/8}\) by Markov's inequality, where \(E(|y|)\) is a bounded quantity. To bound \(P_2\), note that \(E(U_n)=E(E(U_n|\mathbf{x}_i,\mathbf{x}_j))=E((f^*(\mathbf{x}_i)-f^*(\mathbf{x}_j))^2)+E((\epsilon _i-\epsilon _j)^2) \le 4A^2 + 2\sigma ^2\). Thus, by Bernstein's inequality for U-statistics (Hoeffding 1963), we have that
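For completeness, the inequality for U-statistics with a bounded kernel from Hoeffding (1963) can be stated as follows; the application to \(P_2\) sketched after the display is our reading of the argument, not the paper's exact constants.

```latex
% Hoeffding's inequality for a U-statistic U_n whose kernel takes values in [a, b]:
P\bigl(U_n - E(U_n) \ge t\bigr)
  \;\le\;
  \exp\!\left( -\frac{2\lfloor n/2 \rfloor\, t^2}{(b-a)^2} \right).
```

On the event \(\max_i |y_i|\le n^{1/8}\), the kernel \((y_i-y_j)^2\) lies in \([0,\,4n^{1/4}]\); since \(E(U_n)\le 4A^2+2\sigma^2=M_0-1\), taking \(t=1\) bounds the probability that \(U_n>M_0\) by roughly \(\exp(-n^{1/2}/16)\).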
To bound \(P_3\), within the set \(\{ |y| \le n^{\frac{1}{8}}~\text{ and }~ U_n \le M_0\}\), by equality (1) and Lemma 3 in the supplementary file, we have with probability at least \(1-\delta _n\) that
which implies \(P_3\le \delta _n\), and thus \(P(\mathcal {C})\le \delta _n+O(n^{-{1}/{8}})\) for some constant \(a_1\). In particular, when \(\lambda _0= n^{-{1}/{4}}\), \(\lambda _1=n^{-{1}/{4(p+2)}}\) and \(s=n^{-{1}/{4(p+6)(p+2+2\theta )}}\), there exists a constant \(c_5\) such that with probability at least \(1-\delta _n\)
Next, we establish the estimation consistency. By Assumptions 1 and 2 and equality (1), for some constant \(a_2\) there holds
where \(a_2=c_0^2c_4\int e^{-\mathbf{t}^\mathrm{T}\mathbf{t}}{\mathbf{t}}^\mathrm{T}{\mathbf{t}} \mathrm{d}\mathbf{t}\), \(\mathbf{t}=(\mathbf{u}-\mathbf{x})/s\) and \(\int e^{-\mathbf{t}^\mathrm{T}\mathbf{t}}{\mathbf{t}}^\mathrm{T}{\mathbf{t}} \mathrm{d}\mathbf{t}\) is a bounded quantity. In particular, with \(s=n^{-{1}/{4(p+6)(p+2+2\theta )}}\), we have \(\mathcal{E}({\mathbf{g}}^*,{\mathbf{H}}^*) - 2 \sigma _s^2\le a_2n^{-{1}/{4(p+2+2\theta )}}\). Therefore, for some constant \(c_6\), the triangle inequality implies that
This completes the proof of Theorem 1. \(\square \)
Proof of Theorem 2
First, we show that for any \(l \in \mathcal{L}^*\), \(\Vert \widehat{H}_{ll'}\Vert _{L_{\rho _{\mathbf{x}}}^2}=0\) for any \(l'\in \mathcal{S}\). Note that \(\Vert \widehat{\mathbf{c}}_{ll'}\Vert _2=0\) implies that \(\Vert \widehat{H}_{ll'}\Vert _{L_{\rho _{\mathbf{x}}}^2}=0\) based on the representer theorem for the RKHS, and thus it suffices to show \( \Vert \widehat{\mathbf{c}}_{ll'}\Vert _2=0\) for any \(l \in \mathcal{L}^*\) and \(l'\in \mathcal{S}\).
Suppose \(\Vert \widehat{\mathbf{c}}_{ll'}\Vert _2>0\) for some \(l \in \mathcal{L}^*\) and \(l' \in \mathcal{S}\). The derivative of (5) with respect to \({\mathbf{c}}_{ll'}\) yields that
where
\(A_2(\widehat{\mathbf{c}}_{ll'})=\frac{ \pi _{ll'} {\mathbf{K}} \widehat{\mathbf{c}}_{ll'} }{\left( \widehat{\mathbf{c}}_{ll'}^\mathrm{T} {\mathbf{K}} \widehat{\mathbf{c}}_{ll'} \right) ^{{1}/{2}}}\), and \(x_{ijl}=x_{il}-x_{jl}.\) For the right-hand side of (8), its norm divided by \(n^{{1}/{2}}\) is \(n^{-{1}/{2}}\lambda _1\Vert A_2(\widehat{\mathbf{c}}_{ll'})\Vert _2 \ge n^{-{1}/{2}} \lambda _{1}\pi _{ll'} \psi _{min}\psi _{max}^{-{1}/{2}}\), which diverges to infinity by Assumption 5. For the left-hand side of (8), by Assumption 1, \(x_{ijl}, x_{jil'}\), and every element of \({\mathbf{K}}_{\mathbf{x}}\) are bounded. Denote \(A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}},\widehat{\mathbf{c}})=\sum _{i,j=1}^n A_1 ({\mathbf{x}}_i,y_i,{\mathbf{x}}_j,y_j)\); we will show that \(|A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}},\widehat{\mathbf{c}})|\) is bounded as well.
For some constant \(a_{3}\) and \(\delta _n\in (0,1)\), denote
and thus it suffices to bound \(P(\mathcal{D})\). First, we have
where \(U_n\) and \(M_0\) are defined as in Theorem 1. Note that \(P_1+P_2=O(n^{-1/8})\) as in the proof of Theorem 1. To bound \(P_4\), by the Cauchy–Schwarz inequality, we conclude that
Within the set \(\{|y|\le n^{{1}/{8}} ~\text{ and }~U_n\le M_0\}\), following arguments similar to the proofs of Lemma 1 and Proposition 2, we have for some constant \(a_{3}\), with probability at least \(1-{\delta _n}\), that
which implies \(P_4\le \delta _n\), and thus \(P(\mathcal {D})\le {\delta _n} + O(n^{-{1}/{8}})\). Combining the above results, the norm of the left-hand side of (8) divided by \(n^{{1}/{2}}\) converges to zero in probability, which contradicts the fact that the right-hand side of (8) diverges to infinity as \(n\rightarrow \infty \). Therefore, for any \(l \in \mathcal{L}^*\) and \(l'\in \mathcal{S}\), \(\Vert \widehat{\mathbf{c}}_{ll'}\Vert _2\equiv 0\), implying \(\Vert \widehat{H}_{ll'}\Vert _{L_{\rho _{\mathbf{x}}}^2}=0\) for any \(l \in \mathcal{L}^*\) and \(l'\in \mathcal{S}\), and thus there holds \(\widehat{\mathcal{L}} \subset \mathcal{L}^*\).
Next, we show \(\Vert \widehat{H}_{ll'}\Vert _{L^2_{\rho _{\mathbf{x}}}}^2\ne 0\) for any \(l\in \mathcal{N}^* \) and some \(l'\in \mathcal{S}\). By Lemma 4, for the set \(\mathcal{X}_s=\{ \mathbf{x}\in \mathcal{X}: d(\mathbf{x},\partial \mathcal{X})>s, p(\mathbf{x})>s+c_1s^{\theta }\}\) and a constant specified in the supplementary file, there holds
which converges to zero in probability by Theorem 1. Suppose that there exists some \(l\in \mathcal{N}^*\) such that \(\Vert \widehat{H}_{ll'}\Vert ^2_{L_{\rho _X}^2}=0\) for any \(l'\in \mathcal{S}\); then
However, Assumption 4 implies that for some \(l'\in \mathcal{S}\), \(\int \nolimits _{\mathcal{X}_{s}} ({H}^*_{ll'}(\mathbf{x}))^2 \mathrm{d} \rho _{\mathbf{x}}>\int \nolimits _{\mathcal{X}\backslash \mathcal{X}_{s}} ({H}^*_{ll'}(\mathbf{x}))^2 \mathrm{d} \rho _{\mathbf{x}}\) when \(s\) is sufficiently small, so the left-hand side is bounded below by a positive constant, which leads to a contradiction. Therefore, \(\Vert \widehat{H}_{ll'}\Vert _{L^2_{\rho _{\mathbf{x}}}}^2\ne 0\) for any \(l\in \mathcal{N}^* \) and some \(l'\in \mathcal{S}\), and thus there holds \(\widehat{\mathcal{N}}\subset \mathcal{N}^*\).
Finally, since \(\mathcal{S}=\mathcal{L}^*\cup \mathcal{N}^*=\widehat{\mathcal{L}}\cup \widehat{\mathcal{N}}\) and \(\mathcal{L}^*\cap \mathcal{N}^*=\widehat{\mathcal{L}}\cap \widehat{\mathcal{N}}=\varnothing \), combining these facts with the above results yields \(P(\widehat{\mathcal{L}} = \mathcal{L}^*)\rightarrow 1\) and \(P(\widehat{\mathcal{N}}=\mathcal{N}^*)\rightarrow 1\) as \(n\) diverges. Moreover, we have \(P(\widehat{\mathcal{L}}=\mathcal{L}^*,\widehat{\mathcal{N}}=\mathcal{N}^*) \ge 1 - P(\widehat{\mathcal{L}} \ne \mathcal{L}^*) - P(\widehat{\mathcal{N}}\ne \mathcal{N}^*) \rightarrow 1\) as \(n \rightarrow \infty \). This completes the proof of Theorem 2. \(\square \)
Cite this article
He, X., Wang, J. Discovering model structure for partially linear models. Ann Inst Stat Math 72, 45–63 (2020). https://doi.org/10.1007/s10463-018-0682-9