Discovering model structure for partially linear models

Abstract

Partially linear models (PLMs) have been widely used in statistical modeling, where prior knowledge is often required on which variables have linear or nonlinear effects in the PLMs. In this paper, we propose a model-free structure selection method for the PLMs, which aims to discover the model structure in the PLMs through automatically identifying variables that have linear or nonlinear effects on the response. The proposed method is formulated in a framework of gradient learning, equipped with a flexible reproducing kernel Hilbert space. The resultant optimization task is solved by an efficient proximal gradient descent algorithm. More importantly, the asymptotic estimation and selection consistencies of the proposed method are established without specifying any explicit model assumption, which assure that the true model structure in the PLMs can be correctly identified with high probability. The effectiveness of the proposed method is also supported by a variety of simulated and real-life examples.

References

  1. Boyd, S., Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  2. Braun, M. (2006). Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7, 2303–2328.

  3. Combettes, P., Wajs, V. (2005). Signal recovery by proximal forward–backward splitting. Multiscale Modeling and Simulation, 4, 1168–1200.

  4. Engle, R. F., Granger, C. W. J., Rice, J., Weiss, A. (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81, 310–320.

  5. Fan, J., Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society Series B, 70, 849–911.

  6. Fan, J., Feng, Y., Song, R. (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106, 544–557.

  7. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.

  8. He, X., Wang, J., Lv, S. (2018). Scalable kernel-based variable selection with sparsistency. arXiv:1802.09246.

  9. Huang, J., Wei, F., Ma, S. (2012). Semiparametric regression pursuit. Statistica Sinica, 22, 1403–1426.

  10. Jaakkola, T., Diekhans, M., Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of seventh international conference on intelligent systems for molecular biology (pp. 149–158).

  11. Lian, H., Liang, H., Ruppert, D. (2015). Separation of covariates into nonparametric and parametric parts in high-dimensional partially linear additive models. Statistica Sinica, 25, 591–607.

  12. Lin, D., Ying, Z. (1994). Semiparametric analysis of the additive risk model. Biometrika, 81, 61–71.

  13. Moreau, J. (1962). Fonctions convexes duales et points proximaux dans un espace Hilbertien. Reports of the Paris Academy of Sciences, Series A, 255, 2897–2899.

  14. Prada-Sánchez, J., Febrero-Bande, M., Cotos-Yáñez, T., González-Manteiga, W., Bermúdez-Cela, J., Lucas-Dominguez, T. (2000). Prediction of SO2 pollution incidents near a power station using partially linear models and an historical matrix of predictor-response vectors. Environmetrics, 11, 209–225.

  15. Raskutti, G., Wainwright, M., Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13, 389–427.

  16. Rotnitzky, A., Jewell, N. (1990). Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika, 77, 485–497.

  17. Schmalensee, R., Stoker, M. (1999). Household gasoline demand in the United States. Econometrica, 67, 645–662.

  18. Sun, W., Wang, J., Fang, Y. (2013). Consistent selection of tuning parameters via variable selection stability. Journal of Machine Learning Research, 14, 3419–3440.

  19. Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in kernel methods: support vector learning (pp. 69–88). Cambridge: MIT Press.

  20. Wu, Y., Stefanski, L. (2015). Automatic structure recovery for additive models. Biometrika, 102, 381–395.

  21. Xue, L. (2009). Consistent variable selection in additive models. Statistica Sinica, 19, 1281–1296.

  22. Yafeh, Y., Yosha, O. (2003). Large shareholders and banks: Who monitors and how? The Economic Journal, 113, 128–146.

  23. Yang, L., Lv, S., Wang, J. (2016). Model-free variable selection in reproducing kernel Hilbert space. Journal of Machine Learning Research, 17, 1–24.

  24. Ye, G., Xie, X. (2012). Learning sparse gradients for variable selection and dimension reduction. Machine Learning, 87, 303–355.

  25. Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68, 49–67.

  26. Zhang, H., Cheng, G., Liu, Y. (2011). Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106, 1099–1112.

  27. Zhou, D. (2007). Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220, 456–463.

Acknowledgements

This research is supported in part by HK GRF-11302615, HK GRF-11331016, and City SRG-7004865. The authors would like to thank the associate editor and two anonymous referees for their constructive suggestions. The authors would also like to thank Dr. Heng Lian (City University of Hong Kong) for sharing his code on the DPLM method.

Author information

Correspondence to Xin He.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 230 KB)

Appendix: technical proofs

Proof of Theorem 1

For some constant \({a_1}\), denote

$$\begin{aligned} \mathcal {C}&=\Big \{ \mathcal{E}(\widehat{\mathbf{g}},\widehat{\mathbf{H}})-2\sigma ^2_s \ge a_{1} \left( \log \frac{4}{\delta _n}\right) ^{1/2} \\&\qquad \big ( n^{-1/4} + n^{-1/2} \lambda _0^{-1} + n^{-1/2}\lambda _1^{-2} +s^{p+6} +\lambda _0 +\lambda _1 \big ) \Big \}. \end{aligned}$$

Then it suffices to bound \(P(\mathcal{C})\). First,

$$\begin{aligned} P(\mathcal {C})&= P \left( \mathcal {C}\cap \{ |y| \le n^{1/8}~\text{ and }~ U_n \le M_0\} \right) \\&\quad +P \left( \mathcal {C}\cap \{ |y| \le n^{1/8}~\text{ and }~ U_n \le M_0\}^{C} \right) \\&\le P\left( |y|> n^{1/8}\right) + P\left( |y| \le n^{1/8}~\text{ and }~U_n> M_0 \right) \\&\quad +P\Big (\mathcal {C}\cap \{ |y| \le n^{1/8}~\text{ and }~ U_n \le M_0\}\Big )=P_1+P_2+P_3, \end{aligned}$$

where \(U_n=\frac{1}{n(n-1)}\sum _{i,j=1}^n(y_i-y_j)^2\), and \(M_0 = 4A^2 + 2\sigma ^2+1\) with \(A\) being the upper bound of \(f^*(\mathbf{x})\) on \(\mathcal{X}\) and \(\sigma ^2=\mathrm{Var}(\epsilon )\). Next, we bound \(P_1, P_2,\) and \(P_3\) separately. To bound \(P_1\), we have \(P_1\le {E(|y|)}n^{-1/8}\) by Markov’s inequality, where \(E(|y|)\) is a bounded quantity. To bound \(P_2\), note that \(E(U_n)=E(E(U_n|\mathbf{x}_i,\mathbf{x}_j))=E((f^*(\mathbf{x}_i)-f^*(\mathbf{x}_j))^2)+E((\epsilon _i-\epsilon _j)^2) \le 4A^2 + 2\sigma ^2\). Thus, by Bernstein’s inequality for U-statistics (Hoeffding 1963), we have that

$$\begin{aligned} P_2&\le P\left( U_n> M_0 \big | |y|\le n^{1/8}\right) \\&\le P\left( U_n -E(U_n) >1 \big | |y|\le n^{1/8}\right) \le \exp \left( -\frac{n^{{1}/{2}}}{16}\right) . \end{aligned}$$
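As a side illustration of the statistic \(U_n\) used above, the following minimal Python sketch (not part of the original proof) computes \(U_n\) by brute force and compares it with twice the unbiased sample variance of the responses, to which it is algebraically equal. The function name `u_statistic`, the simulated Gaussian responses, and the seed are assumptions made only for this illustration.

```python
import numpy as np

def u_statistic(y):
    """U_n = (1 / (n (n - 1))) * sum over all pairs (i, j) of (y_i - y_j)^2."""
    y = np.asarray(y, dtype=float)
    n = y.size
    diffs = y[:, None] - y[None, :]      # n x n matrix of pairwise differences
    return float((diffs ** 2).sum() / (n * (n - 1)))

# Simulated responses (hypothetical data, chosen only for illustration).
rng = np.random.default_rng(0)
y = rng.normal(size=200)

u_n = u_statistic(y)
twice_var = 2.0 * np.var(y, ddof=1)      # twice the unbiased sample variance
print(u_n, twice_var)                    # the two values agree up to rounding
```

Since \(U_n\) is simply a rescaled sample variance, the event \(\{U_n \le M_0\}\) amounts to requiring that the empirical spread of the responses does not greatly exceed its population counterpart.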

To bound \(P_3\), within the set \(\{ |y| \le n^{1/8}~\text{ and }~ U_n \le M_0\}\), by equality (1) and Lemma 3 in the supplementary file, we have with probability at least \(1-\delta _n\) that

$$\begin{aligned} 0&\le \mathcal{E}(\widehat{\mathbf{g}},\widehat{\mathbf{H}}) - 2\sigma ^2_s \le a_1 \left( \log \frac{4}{\delta _n}\right) ^{1/2} \\&\quad \left( n^{-{1}/{4}}+ M_0^2n^{-{1}/{2}}\lambda _0^{-1} + M_0^2n^{-{1}/{2}}\lambda _1^{-2} +s^{p+6} + \lambda _0 +\lambda _1 \right) , \end{aligned}$$

which implies \(P_3\le \delta _n\), and thus \(P(\mathcal {C})\le \delta _n+O(n^{-{1}/{8}})\) for some constant \(a_1\). In particular, when \(\lambda _0= n^{-{1}/{4}}\), \(\lambda _1=n^{-{1}/{4(p+2)}}\) and \(s=n^{-{1}/{4(p+6)(p+2+2\theta )}}\), there exists a constant \(c_5\) such that, with probability at least \(1-\delta _n\),

$$\begin{aligned} 0\le \mathcal{E}(\widehat{\mathbf{g}},\widehat{\mathbf{H}}) -2\sigma ^2_s \le c_5 \left( \log \frac{4}{\delta _n}\right) ^{{1}/{2}}n^{-\frac{1}{4(p+2+2\theta )}}. \end{aligned}$$
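The arithmetic behind this rate can be checked term by term. The sketch below is an illustration only: it reads the exponents as \(\lambda _1=n^{-1/(4(p+2))}\) and \(s=n^{-1/(4(p+6)(p+2+2\theta ))}\), assumes \(\theta \ge 0\), and treats \(M_0\) as a constant. It uses exact fractions to compute the exponent \(e\) of each term \(n^{-e}\) in the bound; the smallest exponent determines the overall rate. The function name `rate_exponents` and the values \(p=2\), \(\theta =1\) are hypothetical.

```python
from fractions import Fraction

def rate_exponents(p, theta):
    """Exponent e of each term n^(-e) in the bound, under the stated tuning."""
    theta = Fraction(theta)
    half = Fraction(1, 2)
    lam0 = Fraction(1, 4)                              # lambda_0 = n^(-1/4)
    lam1 = Fraction(1, 4 * (p + 2))                    # lambda_1 = n^(-1/(4(p+2)))
    s_exp = 1 / (4 * (p + 6) * (p + 2 + 2 * theta))    # s = n^(-s_exp)
    return {
        "n^(-1/4)": Fraction(1, 4),
        "n^(-1/2) * lambda_0^(-1)": half - lam0,
        "n^(-1/2) * lambda_1^(-2)": half - 2 * lam1,
        "s^(p+6)": (p + 6) * s_exp,
        "lambda_0": lam0,
        "lambda_1": lam1,
    }

# Example values of p and theta, chosen only for illustration.
exponents = rate_exponents(p=2, theta=1)
slowest = min(exponents.values())
for term, e in exponents.items():
    print(f"{term:26s} decays like n^(-{e})")
print("overall rate exponent:", slowest)
```

Under these assumptions the slowest-decaying term is \(s^{p+6}\), with exponent \(1/(4(p+2+2\theta ))\), which matches the rate displayed above.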

Next, we establish the estimation consistency. By Assumptions 1 and 2 and equality (1), for some constant \(a_2\) there holds

$$\begin{aligned} 0\le \mathcal{E}({\mathbf{g}}^*,{\mathbf{H}}^*) - 2 \sigma _s^2&\le \iint w(\mathbf{x},\mathbf{u})c_0^2\Vert \mathbf{x}-\mathbf{u}\Vert _2^6\, \mathrm{d}\rho _{\mathbf{x}}\mathrm{d}\rho _{\mathbf{u}}\\&\le c_0^2c_4s^{p+6}\int e^{-\mathbf{t}^\mathrm{T}\mathbf{t}}{\mathbf{t}}^\mathrm{T}{\mathbf{t}} \mathrm{d}\mathbf{t}\le a_2s^{p+6}, \end{aligned}$$

where \(a_2=c_0^2c_4\int e^{-\mathbf{t}^\mathrm{T}\mathbf{t}}{\mathbf{t}}^\mathrm{T}{\mathbf{t}} \mathrm{d}\mathbf{t}\), \(\mathbf{t}=(\mathbf{u}-\mathbf{x})/s\) and \(\int e^{-\mathbf{t}^\mathrm{T}\mathbf{t}}{\mathbf{t}}^\mathrm{T}{\mathbf{t}} \mathrm{d}\mathbf{t}\) is a bounded quantity. In particular, with \(s=n^{-{1}/{4(p+6)(p+2+2\theta )}}\), we have \(\mathcal{E}({\mathbf{g}}^*,{\mathbf{H}}^*) - 2 \sigma _s^2\le a_2n^{-{1}/{4(p+2+2\theta )}}\). Therefore, for some constant \(c_6\), the triangle inequality implies that

$$\begin{aligned} |\mathcal{E}(\widehat{\mathbf{g}},\widehat{\mathbf{H}}) - \mathcal{E}({\mathbf{g}}^*,{\mathbf{H}}^*)|&\le |\mathcal{E}(\widehat{\mathbf{g}},\widehat{\mathbf{H}})-2\sigma ^2_s| + |\mathcal{E}({\mathbf{g}}^*,{\mathbf{H}}^*)-2\sigma ^2_s|\\&\le c_6 \left( \log \frac{4}{\delta _n}\right) ^{{1}/{2}}n^{-\frac{1}{4(p+2+2\theta )}}. \end{aligned}$$

This completes the proof of Theorem 1. \(\square \)

Proof of Theorem 2

First, we show that for any \(l \in \mathcal{L}^*\), \(\Vert \widehat{H}_{ll'}\Vert _{L_{\rho _{\mathbf{x}}}^2}=0\) for any \(l'\in \mathcal{S}\). Note that \(\Vert \widehat{\mathbf{c}}_{ll'}\Vert _2=0\) implies that \(\Vert \widehat{H}_{ll'}\Vert _{L_{\rho _{\mathbf{x}}}^2}=0\) based on the representer theorem for the RKHS, and thus it suffices to show \( \Vert \widehat{\mathbf{c}}_{ll'}\Vert _2=0\) for any \(l \in \mathcal{L}^*\) and \(l'\in \mathcal{S}\).

Suppose \(\Vert \widehat{\mathbf{c}}_{ll'}\Vert _2>0\) for some \(l \in \mathcal{L}^*\) and \(l' \in \mathcal{S}\). The derivative of (5) with respect to \({\mathbf{c}}_{ll'}\) yields that

$$\begin{aligned} \sum _{i,j=1}^n x_{ijl}x_{jil'} A_1 ({\mathbf{x}}_i,y_i,{\mathbf{x}}_j,y_j){\mathbf{K}}_{{\mathbf{x}}_i} = \lambda _1 A_2(\widehat{\mathbf{c}}_{ll'}), \end{aligned}$$
(8)

where

$$\begin{aligned} A_1({\mathbf{x}}_i,y_i,{\mathbf{x}}_j,y_j)= & {} \frac{1}{n(n-1)}w_{ij} ( y_i -y_j - \widehat{\mathbf{g}}({\mathbf{x}}_i)^\mathrm{T} ({\mathbf{x}}_i - {\mathbf{x}}_j)\\&+ \frac{1}{2} ({\mathbf{x}}_i - {\mathbf{x}}_j)^\mathrm{T} \widehat{\mathbf{H}}({\mathbf{x}}_i)({\mathbf{x}}_i - {\mathbf{x}}_j) ), \end{aligned}$$

\(A_2(\widehat{\mathbf{c}}_{ll'})=\frac{ \pi _{ll'} {\mathbf{K}} \widehat{\mathbf{c}}_{ll'} }{\left( \widehat{\mathbf{c}}_{ll'}^\mathrm{T} {\mathbf{K}} \widehat{\mathbf{c}}_{ll'} \right) ^{{1}/{2}}}\), and \(x_{ijl}=x_{il}-x_{jl}.\) For the right-hand side of (8), its norm divided by \(n^{{1}/{2}}\) is \(n^{-{1}/{2}}\lambda _1\Vert A_2(\widehat{\mathbf{c}}_{ll'})\Vert _2 \ge n^{-{1}/{2}} \lambda _{1}\pi _{ll'} \psi _{min}\psi _{max}^{-{1}/{2}}\), which diverges to infinity by Assumption 5. For the left-hand side of (8), by Assumption 1, \(x_{ijl}, x_{jil'}\), and every element of \({\mathbf{K}}_{\mathbf{x}}\) are bounded. Denote \(A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}},\widehat{\mathbf{c}})=\sum _{i,j=1}^n A_1 ({\mathbf{x}}_i,y_i,{\mathbf{x}}_j,y_j)\); we will show that \(|A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}},\widehat{\mathbf{c}})|\) is bounded as well.
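The form of \(A_2\) and the lower bound on its norm can be checked numerically. The following Python sketch is an illustration under stated assumptions rather than the authors' code. It takes the penalty on a coefficient block to be \(\pi (\mathbf{c}^\mathrm{T}{\mathbf{K}}\mathbf{c})^{1/2}\), so that \(A_2\) is its gradient at a nonzero \(\mathbf{c}\). It also takes \(\psi _{min}\) and \(\psi _{max}\) to be the smallest and largest eigenvalues of \({\mathbf{K}}\). Neither definition is restated in this appendix. The Gaussian kernel matrix, the weight \(\pi =1\), and the random inputs are hypothetical.

```python
import numpy as np

def penalty(c, K, pi=1.0):
    """Group penalty pi * (c^T K c)^(1/2) on one coefficient block (assumed form)."""
    return pi * np.sqrt(c @ K @ c)

def penalty_gradient(c, K, pi=1.0):
    """A_2(c) = pi * K c / (c^T K c)^(1/2), defined for c != 0."""
    return pi * (K @ c) / np.sqrt(c @ K @ c)

# Hypothetical inputs: a Gaussian kernel matrix on random points.
rng = np.random.default_rng(1)
n = 6
X = rng.normal(size=(n, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists)
c = rng.normal(size=n)

# Central finite differences confirm that A_2 is the gradient of the penalty.
eps = 1e-6
numeric = np.array([
    (penalty(c + eps * e, K) - penalty(c - eps * e, K)) / (2 * eps)
    for e in np.eye(n)
])
analytic = penalty_gradient(c, K)

# Lower bound ||A_2(c)||_2 >= pi * psi_min / psi_max^(1/2), here with pi = 1.
eigvals = np.linalg.eigvalsh(K)          # eigenvalues in ascending order
psi_min, psi_max = eigvals[0], eigvals[-1]
grad_norm = np.linalg.norm(analytic)
lower_bound = psi_min / np.sqrt(psi_max)
print(grad_norm, lower_bound)
```

The bound follows from \(\Vert {\mathbf{K}}\mathbf{c}\Vert _2\ge \psi _{min}\Vert \mathbf{c}\Vert _2\) and \(\mathbf{c}^\mathrm{T}{\mathbf{K}}\mathbf{c}\le \psi _{max}\Vert \mathbf{c}\Vert _2^2\), so it holds for every nonzero \(\mathbf{c}\) regardless of its scale.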

For some constant \(a_{3}\) and \(\delta _n\in (0,1)\), denote

$$\begin{aligned} \mathcal {D}&=\Big \{ |A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}},\widehat{\mathbf{c}})| > a_{3}\left( \log {\frac{2}{\delta _n}}\right) ^{{1}/{2}} \\&\quad \left( n^{-{1}/{8(p+2+2\theta )}} + n^{-{3}/{8}} +n^{-{1}/{2}}\lambda _0^{-{1}/{2}}+n^{-{1}/{2}}\lambda _1^{-1} \right) \Big \}, \end{aligned}$$

and thus it suffices to bound \(P(\mathcal{D})\). First, we have

$$\begin{aligned} P(\mathcal {D})&=P\left( \mathcal {D}\cap \{|y|\le n^{{1}/{8}} ~\text{ and }~U_n\le M_0\} \right) \\&\quad + P\left( \mathcal {D}\cap \{|y|\le n^{{1}/{8}} ~\text{ and }~U_n\le M_0\}^{C} \right) \\&\le P\left( |y|> n^{{1}/{8}} \right) + P\left( |y|\le n^{{1}/{8}} ~\text{ and }~U_n> M_0 \right) \\&\quad + P\left( \mathcal {D}\cap \{|y|\le n^{{1}/{8}} ~\text{ and }~U_n\le M_0\}\right) \le P_1+P_2+P_4, \end{aligned}$$

where \(U_n\) and \(M_0\) are defined as in the proof of Theorem 1. Note that \(P_1+P_2=O(n^{-1/8})\), as shown in the proof of Theorem 1. To bound \(P_4\), by the Cauchy–Schwarz inequality, we conclude that

$$\begin{aligned} E(A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}}, \widehat{\mathbf{c}} ))&\le \Big (\iint w({\mathbf{x}},{\mathbf{u}}) \Big (f^*(\mathbf{x}) - f^*(\mathbf{u}) -\widehat{\mathbf{g}}(\mathbf{x})^\mathrm{T}({\mathbf{x}}-{\mathbf{u}}) \\&\quad + \frac{1}{2} ({\mathbf{x}}-{\mathbf{u}})^\mathrm{T}\widehat{\mathbf{H}}(\mathbf{x})({\mathbf{x}}-{\mathbf{u}}) \Big )^2 \mathrm{d} \rho _{\mathbf{x}}\mathrm{d} \rho _{\mathbf{u}} \Big )^{{1}/{2}}=\left( \mathcal{E}(\widehat{\mathbf{g}},\widehat{\mathbf{H}})-2\sigma ^2_s\right) ^{{1}/{2}}. \end{aligned}$$

Within the set \(\{|y|\le n^{{1}/{8}} ~\text{ and }~U_n\le M_0\}\), following arguments similar to the proofs of Lemma 1 and Proposition 2, for some constant \(a_{3}\), with probability at least \(1-{\delta _n}\) there holds

$$\begin{aligned} |A_{\mathcal{Z}^n}(\widehat{{\varvec{\alpha }}}, \widehat{\mathbf{c}})| \le a_{3} \left( \log \frac{2}{\delta _n}\right) ^{{1}/{2}}\left( n^{-{1}/{{8(p+2+2\theta )}}} + n^{-{3}/{8}} +n^{-{1}/{2}}\lambda _0^{-1}+n^{-{1}/{2}}\lambda _1^{-1} \right) , \end{aligned}$$

which implies \(P_4\le \delta _n\), and thus we have \(P(\mathcal {D})\le {\delta _n} + O(n^{-{1}/{8}})\). Combining the above results, the norm of the left-hand side of (8) divided by \(n^{{1}/{2}}\) converges to zero in probability, which contradicts the fact that the right-hand side of (8) diverges to infinity as \(n\rightarrow \infty \). Therefore, for any \(l \in \mathcal{L}^*\) and \(l'\in \mathcal{S}\), \(\Vert \widehat{\mathbf{c}}_{ll'}\Vert _2\equiv 0\), implying \(\Vert \widehat{H}_{ll'}\Vert _{L_{\rho _{\mathbf{x}}}^2}=0\) for any \(l \in \mathcal{L}^*\) and \(l'\in \mathcal{S}\), and thus there holds \(\widehat{\mathcal{L}} \subset \mathcal{L}^*\).

Next, we show \(\Vert \widehat{H}_{ll'}\Vert _{L^2_{\rho _{\mathbf{x}}}}^2\ne 0\) for any \(l\in \mathcal{N}^* \) and some \(l'\in \mathcal{S}\). By Lemma 4, for the set \(\mathcal{X}_s=\{ \mathbf{x}\in \mathcal{X}: d(\mathbf{x},\partial \mathcal{X})>s, p(\mathbf{x})>s+c_1s^{\theta }\}\) and some constant \(b_{6}\) as in the supplementary file, there holds

$$\begin{aligned} \int \nolimits _{\mathcal{X}_{s}} \Vert \widehat{\mathbf{H}}(\mathbf{x}) - {\mathbf{H}}^*(\mathbf{x})\Vert _F^2 \mathrm{d} \rho _{\mathbf{x}} \le \frac{b_{6}}{s^{p+5 }} \left( s^{6+p} + \mathcal{E}(\widehat{\mathbf{g}}, \widehat{\mathbf{H}})- 2\sigma _s^2\right) , \end{aligned}$$

which converges to zero in probability by Theorem 1. Suppose that there exists some \(l\in \mathcal{N}^*\) such that \(\Vert \widehat{H}_{ll'}\Vert ^2_{L_{\rho _{\mathbf{x}}}^2}=0\) for any \(l'\in \mathcal{S}\); then

$$\begin{aligned} \int \nolimits _{\mathcal{X}_{s}} ({H}^*_{ll'}(\mathbf{x}))^2 \mathrm{d} \rho _{\mathbf{x}}\le \int \nolimits _{\mathcal{X}_{s}} \Vert \widehat{\mathbf{H}}(\mathbf{x}) - {\mathbf{H}}^*(\mathbf{x})\Vert _F^2 \mathrm{d} \rho _{\mathbf{x}}. \end{aligned}$$

However, Assumption 4 implies that for some \(l'\in \mathcal{S}\), \(\int \nolimits _{\mathcal{X}_{s}} ({H}^*_{ll'}(\mathbf{x}))^2 \mathrm{d} \rho _{\mathbf{x}}>\int \nolimits _{\mathcal{X}\backslash \mathcal{X}_{t}} ({H}^*_{ll'}(\mathbf{x}))^2 \mathrm{d} \rho _{\mathbf{x}}\) when \(s\) is sufficiently small, which is a positive constant, and this leads to a contradiction. Therefore, \(\Vert \widehat{H}_{ll'}\Vert _{L^2_{\rho _{\mathbf{x}}}}^2\ne 0\) for any \(l\in \mathcal{N}^* \) and some \(l'\in \mathcal{S}\), and thus there holds \(\widehat{\mathcal{N}}\subset \mathcal{N}^*\).

Finally, since \(\mathcal{S}=\mathcal{L}^*\cup \mathcal{N}^*=\widehat{\mathcal{L}}\cup \widehat{\mathcal{N}}\) and \(\mathcal{L}^*\cap \mathcal{N}^*=\widehat{\mathcal{L}}\cap \widehat{\mathcal{N}}=\varnothing \), combining with the above results we have \(P(\widehat{\mathcal{L}} = \mathcal{L}^*)\rightarrow 1\) and \(P(\widehat{\mathcal{N}}=\mathcal{N}^*)\rightarrow 1\) as \(n \rightarrow \infty \). Moreover, by the union bound, \(P(\widehat{\mathcal{L}}=\mathcal{L}^*,\widehat{\mathcal{N}}=\mathcal{N}^*) \ge 1 - P(\widehat{\mathcal{L}} \ne \mathcal{L}^*) - P(\widehat{\mathcal{N}}\ne \mathcal{N}^*) \rightarrow 1\) as \(n \rightarrow \infty \). This completes the proof of Theorem 2. \(\square \)

Cite this article

He, X., Wang, J. Discovering model structure for partially linear models. Ann Inst Stat Math 72, 45–63 (2020) doi:10.1007/s10463-018-0682-9

Keywords

  • Lasso
  • Gradient learning
  • Partially linear models
  • Proximal gradient descent
  • Reproducing kernel Hilbert space (RKHS)