Abstract
This paper concerns regularized quantile regression for ultrahigh-dimensional data with responses missing not at random. The propensity score is specified by a semiparametric exponential tilting model. We use a Pearson Chi-square type test statistic to identify the important features in the sparse propensity score model, and employ the adjusted empirical likelihood method to estimate the parameters in the reduced model. With the estimated propensity score model, we propose an inverse probability weighted and penalized objective function for regularized estimation using the nonconvex SCAD and MCP penalty functions. Assuming the propensity score model is of low dimension, we establish the oracle properties of the proposed regularized estimators. The new method has several desirable advantages. First, it is robust to heavy-tailed errors and potential outliers in the responses. Second, it can accommodate nonignorable nonresponse data. Third, it can deal with ultrahigh-dimensional data with heterogeneity. A simulation study and a real data analysis are presented to examine the finite sample performance of the proposed approaches.
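To fix ideas, the following is a minimal numerical sketch of the inverse probability weighted, penalized quantile objective described in the abstract, with the SCAD penalty of Fan and Li (2001). All variable names are illustrative and this is not the authors' implementation; the paper additionally treats the MCP analogue.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (Fan and Li 2001), applied coordinate-wise to |beta_j|."""
    t = np.abs(beta)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 (a + 1) * lam**2 / 2))

def ipw_penalized_objective(beta, X, y, delta, pi_hat, tau, lam):
    """IPW quantile loss plus SCAD penalty.

    delta[i] = 1 marks an observed response and pi_hat[i] is its estimated
    propensity score; rows with delta[i] = 0 are skipped entirely, so
    missing y values (e.g. NaN placeholders) are never touched.
    """
    obs = delta == 1
    loss = np.sum(check_loss(y[obs] - X[obs] @ beta, tau) / pi_hat[obs])
    return loss / len(y) + scad_penalty(beta, lam).sum()
```

Minimizing this nonconvex objective over beta is the conceptual content of the regularized estimator; the oracle theory in the appendix concerns its local minimizers.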
References
An LTH, Tao PD (2005) The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann Oper Res 133:23–46
Belloni A, Chernozhukov V (2011) L1-penalized quantile regression in high-dimensional sparse models. Ann Stat 39:82–130
Chang T, Kott PS (2008) Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika 95:555–571
Chen J, Variyath AM, Abraham B (2008) Adjusted empirical likelihood and its properties. J Comput Gr Stat 17:426–443
Ding X, Tang N (2018) Adjusted empirical likelihood estimation of distribution function and quantile with nonignorable missing data. J Syst Sci Complex 31:820–840
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Fan Y, Barut E (2014) Adaptive robust variable selection. Ann Stat 42:324–351
Fan J, Li Q, Wang Y (2017) Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J R Stat Soc Ser B 79:247–265
Fang F, Zhao J, Shao J (2018) Imputation-based adjusted score equations in generalized linear models with nonignorable missing covariate values. Stat Sin 28:1677–1701
Gu Y, Fan J, Kong L, Ma S, Zou H (2018) ADMM for high-dimensional sparse penalized quantile regression. Technometrics 60:319–331
He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369
Hong Z, Hu Y, Lian H (2013) Variable selection for high-dimensional varying coefficient partially linear models via nonconcave penalty. Metrika 76:887–908
Huang J, Ma S, Zhang C (2008) Adaptive lasso for sparse high-dimensional regression. Stat Sin 18:1603–1618
Huang D, Li R, Wang H (2014) Feature screening for ultrahigh dimensional categorical data with applications. J Bus Econ Stat 32:237–244
Jiang D, Zhao P, Tang N (2016) A propensity score adjusted method for regression models with nonignorable missing covariates. Comput Stat Data Anal 94:98–119
Kim JK, Yu CL (2011) A semiparametric estimation of mean functionals with nonignorable missing data. J Am Stat Assoc 106:157–165
Kim Y, Choi H, Oh HS (2008) Smoothly clipped absolute deviation on high dimensions. J Am Stat Assoc 103:1665–1673
Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105:201–216
Lee ER, Noh H, Park BU (2014) Model selection via Bayesian information criterion for quantile regression models. J Am Stat Assoc 109:216–229
Ni L, Fang F (2016) Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification. J Nonparametr Stat 28:515–530
Ni L, Fang F, Wan F (2017) Adjusted Pearson Chi-square feature screening for multi-classification with ultrahigh dimensional data. Metrika 80:805–828
Owen AB (2001) Empirical likelihood. CRC Press, Boca Raton
Peng B, Wang L (2015) An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. J Comput Gr Stat 24:676–694
Qin J, Leung D, Shao J (2002) Estimation with survey data under nonignorable nonresponse or informative sampling. J Am Stat Assoc 97:193–200
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI et al (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346:1937–1947
Shao J, Wang L (2016) Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika 103:175–187
Sherwood B (2016) Variable selection for additive partial linear quantile regression with missing covariates. J Multivar Anal 152:206–223
Tang N, Zhao P, Zhu H (2014) Empirical likelihood for estimating equations with nonignorably missing data. Stat Sin 24:723–747
Wang Q, Li Y (2018) How to make model free feature screening approaches for full data applicable to the case of missing response? Scand J Stat 45:324–346
Wang L, Wu Y, Li R (2012) Quantile regression for analyzing heterogeneity in ultra-high dimension. J Am Stat Assoc 107:214–222
Wang S, Shao J, Kim JK (2014) An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat Sin 24:1097–1116
Yu L, Lin N, Wang L (2017) A parallel algorithm for large-scale nonconvex penalized quantile regression. J Comput Gr Stat 26:935–939
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zhang L, Lin C, Zhou Y (2018) Generalized method of moments for nonignorable missing data. Stat Sin 28:2107–2124
Zhao J, Shao J (2015) Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. J Am Stat Assoc 110:1577–1590
Zhao P, Zhao H, Tang N, Li Z (2017) Weighted composite quantile regression analysis for nonignorable missing data using nonresponse instrument. J Nonparametr Stat 29:189–212
Zhao J, Yang Y, Ning Y (2018) Penalized pairwise pseudo likelihood for variable selection with nonignorable missing data. Stat Sin 28:2125–2148
Acknowledgements
The authors thank the Editor and the anonymous reviewers for their valuable comments and constructive suggestions, which have helped greatly improve our paper. This work was supported by National Natural Science Foundation of China (Nos. 11601195, 11971204), Natural Science Foundation of Jiangsu Province of China (No. BK20160289), Jiangsu Qing Lan Project, Jiangsu Overseas Visiting Scholar Program for University Prominent Young & Middle-aged Teachers and Presidents, and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 19KJB110007).
Appendix
In this section, we provide the proofs of the results derived using the semiparametric weights of the propensity score (PS) model. First, we state some regularity conditions.
- Condition C1:
(Some regularity conditions on the estimating equations). The estimating equation \(\varphi _i(\gamma )\) satisfies: (1) \(E\{\varphi _i(\gamma )\varphi _i^{\top }(\gamma )\}\) is positive definite; (2) the second derivative \(\partial ^2\varphi _i(\gamma )/\partial \gamma ^2\) of \(\varphi _i(\gamma )\) is continuous in a neighborhood of the true value \(\gamma _0\), and \(|\partial \varphi _i(\gamma )/\partial \gamma |\) is bounded by some integrable function \(G({\mathbf {x}},Y)\) in this neighborhood; (3) \(E\{||\varphi _i(\gamma )||^{\kappa }\}\) is bounded for some \(\kappa >2\) and all \(\gamma \in {\varvec{\Gamma }}\).
- Condition C2:
(Some commonly used conditions on analysis of missing data).
(1) The marginal probability density function \(f({\varvec{x}}_{i({\mathcal {A}})})\) is bounded away from \(\infty \) in the support of \({\varvec{x}}_{i({\mathcal {A}})}\) and the second derivative of \(f({\varvec{x}}_{i({\mathcal {A}})})\) is continuous and bounded; (2) there exist \(\alpha _l>0\) and \(\alpha _u<1\) such that \(\alpha _l<\pi _{i0}<\alpha _u\) for all \(i \in \{1,2,\ldots ,n\}\); (3) the kernel function \(K(\cdot )\) is a probability density function such that (a) it is bounded and has compact support, (b) it is symmetric with \(\int \omega ^2K(\omega )d\omega <\infty \), (c) \(K(\cdot )\ge d_1\) for some \(d_1>0\) in some closed interval centered at zero, and (d) let \(b\ge 2\), \(h\rightarrow 0\), \(nh^{2s}\rightarrow \infty \), \(nh^{2b}\rightarrow 0\) and \(nh^s/{\mathrm {ln}}(n)\rightarrow \infty \) as \(n\rightarrow \infty \).
- Condition C3:
(Some regularity conditions on analyzing sparse ultrahigh-dimensional data with heterogeneity). (1) (Condition on the random error) The conditional probability density function \(f_i(\cdot |{\mathbb {S}}_i)\) is uniformly bounded away from 0 and infinity in a neighborhood of zero; (2) (Conditions on the design) there exists a constant \(K_1\) such that \(|X_{ij}|\le K_1\) for all \(i \in \{1,2,\ldots ,n\}\) and \(j \in \{1,2,\ldots ,p_n\}\). Also, \({1}/{n}X_j^{\top }X_j\le K_1\) for \(j=1,2,\ldots ,q_n\); (3) (Conditions on the true underlying model) there exist positive constants \(K_2<K_3\) such that \( K_2\le \lambda _{\min }(n^{-1}X_A^{\top }X_A)\le \lambda _{\max }(n^{-1}X_A^{\top }X_A)\le K_3, \) where \(\lambda _{\min }\) and \(\lambda _{\max }\) denote the smallest and largest eigenvalues, respectively. It is assumed that \(\max _{1\le i \le n}||{\mathbb {S}}_i||=O_p(\sqrt{q_n})\); (4) (Condition on model size) \(q_n=O(n^{C_1})\) for some \(0\le C_1<{1}/{2}\); (5) (Condition on the smallest signal) there exist positive constants \(C_2\) and \(K_4\) such that \(2C_1<C_2\le 1\) and \(n^{(1-C_2)/2}\min _{1\le j \le q_n}|\beta _{0j}|\ge K_4\).
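Condition C2(3)(a)-(c) can be checked concretely for a standard kernel. The sketch below verifies, by simple numerical integration, that the Epanechnikov kernel is a bounded, compactly supported, symmetric density with \(\int \omega ^2K(\omega )d\omega =1/5<\infty \) and is bounded below by \(d_1=0.5\) on the closed interval \([-0.5,0.5]\). The choice of kernel is ours, purely for illustration.

```python
import numpy as np

def epanechnikov(w):
    """Epanechnikov kernel: a bounded density supported on [-1, 1]."""
    w = np.asarray(w, dtype=float)
    return np.where(np.abs(w) <= 1.0, 0.75 * (1.0 - w**2), 0.0)

grid = np.linspace(-1.0, 1.0, 200001)
dw = grid[1] - grid[0]
# Riemann sums; the kernel vanishes at the endpoints, so these are
# effectively trapezoid-rule approximations.
mass = np.sum(epanechnikov(grid)) * dw                      # ~ 1
second_moment = np.sum(grid**2 * epanechnikov(grid)) * dw   # ~ 1/5
symmetric = bool(np.allclose(epanechnikov(grid), epanechnikov(-grid)))
bounded_below_near_zero = bool(
    np.all(epanechnikov(np.linspace(-0.5, 0.5, 101)) >= 0.5))
```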
Proof of Theorem 1
The proof of Theorem 1 follows by arguments similar to those in Ding and Tang (2018); we omit the details here. \(\square \)
Proof of Theorem 2
Note that \({\widehat{{\varvec{\beta }}}}_1^K={\arg \min }_{{\varvec{\beta }}_1} L_n({\varvec{\beta }}_1)\), where \(L_n({\varvec{\beta }}_1)=\sum _{i=1}^n{\delta _i}/{{\hat{\pi }}_i({\hat{\gamma }}_{el})} \rho _{\tau } (Y_i-{\mathbb {S}}_i^{\top }{\varvec{\beta }}_1)\). We will show that for any \(\epsilon >0\) there exists a constant L such that, for all n sufficiently large,
Since \(L_n({\varvec{\beta }}_1)\) is convex, this implies that with probability at least \(1-\epsilon \), \({\widehat{{\varvec{\beta }}}}_1^K\) lies in the ball \(\{{{\varvec{\beta }}}_1:||{\widehat{{\varvec{\beta }}}}_1^K -{{\varvec{\beta }}}_{01}||\le L n^{-1/2}q_n^{1/2}\}\). Let \({\mathbb {G}}_n({\mathbf {B}})=q_n^{-1} \{L_n({\varvec{\beta }}_{01}+n^{-1/2}q_n^{1/2}{\mathbf {B}})- L_n({\varvec{\beta }}_{01})\}\), then
where the second equality uses Knight's identity and \(\psi _i(\tau )=\tau -I(e_i<0)\). First we will show that \( I_{n1}=O_p(q_n^{-1/2}) L\). We first introduce some notation. Define \(m_{Y_i}^0({\varvec{x}}_{i({\mathcal {A}})}) =E(Y_i|{\varvec{x}}_{i({\mathcal {A}})},\delta _i=0)\), \(m_{\psi }^0({\varvec{x}}_{i({\mathcal {A}})}) =E({\mathbb {S}}_i^{\top }\psi _i(\tau )|{\varvec{x}}_{i({\mathcal {A}})},\delta _i=0)\) and \(H=E\{(1-\delta _i)(Y_i-m_{Y_i}^0({\varvec{x}}_{i({\mathcal {A}})})) ({\mathbb {S}}_i^{\top }\psi _i(\tau )- m_{\psi }^0({\varvec{x}}_{i({\mathcal {A}})}))\}\). Then, following the proof of Theorem 2 in Jiang et al. (2016) and recalling the fact that \({\mathrm {Pr}}({\hat{{\mathcal {A}}}}={\mathcal {A}})\rightarrow 1\), we have \(I_{n1}=-q_n^{-1/2} L^{\top }W\) with \(W {\mathop {\rightarrow }\limits ^{\mathcal{L}}} N(0,\Sigma _1)\), where \(\Sigma _1=\mathrm {Var}\{{\delta _i}/{\pi ({\varvec{x}}_{i({\mathcal {A}})},Y_i; \gamma _0)}{\mathbb {S}}_i^{\top }\psi _i(\tau ) +(1-{\delta _i}/{\pi ({\varvec{x}}_{i({\mathcal {A}})},Y_i;\gamma _0)})m_{Y_i}^0 ({\varvec{x}}_{i({\mathcal {A}})})+\phi _i(\gamma _0)H\}\) with \(\phi _i(\gamma _0)=({\mathbb {B}}^{\top }{\mathbb {A}}^{-1}{\mathbb {B}})^{-1} {\mathbb {B}}^{\top }{\mathbb {A}}^{-1} \varphi ({\varvec{x}}_{i},Y_i;\gamma _0)\) being the influence function. Thus, we have \( I_{n1}=O_p(q_n^{-1/2}) L\).
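Knight's identity, invoked above, states that \(\rho _{\tau }(u-v)-\rho _{\tau }(u)=-v\{\tau -I(u<0)\}+\int _0^v\{I(u\le s)-I(u\le 0)\}{\mathrm {d}}s\). The illustrative code below (not from the paper) verifies the identity numerically, approximating the integral term by a signed Riemann sum.

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

def knight_rhs(u, v, tau, n=200000):
    """Right-hand side of Knight's identity for scalar u, v.

    The integral over [0, v] is a signed Riemann sum with n points,
    so negative v is handled correctly as well.
    """
    psi = tau - (u < 0)
    s = np.linspace(0.0, v, n, endpoint=False)
    integral = np.sum((u <= s).astype(float) - float(u <= 0)) * (v / n)
    return -v * psi + integral
```

Because the integrand is a step function, the Riemann sum is accurate to about \(|v|/n\), so agreement to four decimals is expected for moderate \(u,v\).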
Next we evaluate \(I_{n2}\). Let \(F_i(\cdot |{\mathbb {S}}_i)\) be the conditional distribution function of \(e_i\) given \({\mathbb {S}}_i\). We have
where the second equality holds since \(|{\hat{\gamma }}_{el}-\gamma _0|=O_p(n^{-1/2})\) and \(\max _i|{\hat{\pi }}({\varvec{x}}_{i({\hat{{\mathcal {A}}}})},Y_i;\gamma ) -{\pi }({\varvec{x}}_{i({\mathcal {A}})},Y_i;\gamma )|=o_p(1)\), which follows by combining standard kernel regression theory with the fact that \({\mathrm {Pr}}({\hat{{\mathcal {A}}}}={\mathcal {A}})\rightarrow 1\) as n increases; the first inequality uses condition C3(1) and the last inequality uses condition C3(3). Furthermore, since \(\int _0^{n^{-1/2}q_n^{1/2}{\mathbb {S}}_i^{\top }{\mathbf {B}}} \{I(e_i\le v)-I(e_i\le 0)\}{\mathrm {d}}v\) is nonnegative for all i, we have
where the second inequality uses condition C2(2) and the last inequality uses condition C3(3). Therefore, \(I_{n2}\ge \frac{1}{2}C L^2+o_p(1)\) as \(n\rightarrow \infty \) by Chebyshev’s inequality. By choosing L sufficiently large, \(I_{n2}\) will asymptotically dominate \(I_{n1}\). Thus, we can choose a sufficiently large L such that \({\mathbb {G}}_n({\mathbf {B}})>0\) with probability at least \(1-\epsilon \) for \(||{\mathbf {B}}||= L\) and all n sufficiently large. \(\square \)
Lemma 1
Assume that conditions C2 and C3 given in the “Appendix” hold and that \(\log (p_n)=o(n\lambda ^2)\) and \(n\lambda ^2\rightarrow \infty \). As \(n\rightarrow \infty \), we have
Lemma 2
Assume that conditions C2 and C3 given in the “Appendix” hold and that \(q_n\log (n)=o(n\lambda )\), \(\log (p_n)=o(n\lambda ^2)\) and \(n\lambda \rightarrow \infty \). Then for \(\forall L>0\), as \(n\rightarrow \infty \), we have
Proofs of Lemmas 1 and 2
The proofs of Lemmas A.2 and A.3 in Wang et al. (2012) can be modified to prove Lemmas 1 and 2 using the fact that \(E\{\delta _i/\pi _{i0}|{\varvec{x}}_i,Y_i\}=1\), so we omit the details. \(\square \)
Define \(s({\varvec{\beta }})=(s_0({\varvec{\beta }}), s_1({\varvec{\beta }}),\ldots ,s_{p_n}({\varvec{\beta }}))^{\top }\) as the subgradient corresponding to the unpenalized objective function \(S_n({\varvec{\beta }})\) for the oracle model, which can be given by
for \(j=0,1,\ldots ,p_n\), with \(v_i=0\) if \(Y_i-{\varvec{x}}_i^{\top }{\varvec{\beta }}\ne 0\) and \(v_i\in [\tau -1,\tau ]\) otherwise. The following lemma presents the properties of the oracle estimator and the subgradient functions corresponding to the active and inactive variables.
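Before stating the lemma, note that at a point where no residual is exactly zero, all \(v_i\) vanish and the subgradient reduces to \(s_j({\varvec{\beta }})=n^{-1}\sum _i w_i x_{ij}\{I(Y_i-{\varvec{x}}_i^{\top }{\varvec{\beta }}<0)-\tau \}\) with \(w_i=\delta _i/{\hat{\pi }}_i\). The sketch below checks this formula against central finite differences on toy data of our own (residuals deliberately bounded away from zero, so no kink is crossed and the loss is locally linear).

```python
import numpy as np

def wqr_loss(beta, X, y, w, tau):
    """Weighted quantile loss n^{-1} sum_i w_i * rho_tau(y_i - x_i' beta)."""
    u = y - X @ beta
    return np.sum(w * u * (tau - (u < 0))) / len(y)

def wqr_subgradient(beta, X, y, w, tau):
    """s(beta) at a point where every residual is nonzero (all v_i = 0)."""
    u = y - X @ beta
    return X.T @ (w * ((u < 0) - tau)) / len(y)
```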
Lemma 3
Suppose that conditions C1, C2 and C3 given in the “Appendix” hold and that \(\lambda =o(n^{-(1-C_2)/2})\), \(n^{-1/2}q_n=o(\lambda )\), \(\log (p_n)=o(n{\lambda ^2})\), \(n\lambda ^2 \rightarrow \infty \), \(\max _{i}|{\hat{\pi }}({\varvec{x}}_{i({\mathcal {A}})},Y_i; \gamma _0)-\pi _{i0}|=O_p(h^b+(\mathrm {ln}n/h^sn)^{1/2})\) and \(h^b+({\mathrm {ln}}n/h^sn)^{1/2}=o(\lambda )\). For the oracle estimator \({\widehat{{\varvec{\beta }}}}_{ora}^K\), there exists \(v_i^*\), which satisfies \(v_i^*=0\) if \(Y_i-{\varvec{x}}_i^{\top }{\widehat{{\varvec{\beta }}}}_{ora}^K\ne 0\) and \(v_i^* \in [\tau -1,\tau ]\) if \(Y_i-{\varvec{x}}_i^{\top }{\widehat{{\varvec{\beta }}}}_{ora}^K= 0\), such that for \( s_j({\widehat{{\varvec{\beta }}}}_{ora}^K)\) with \(v_i=v_i^*\), with probability approaching one, we have
Proof of Lemma 3
The proof of Lemma 3 follows the proofs of Lemmas 2.2 and 2.3 in Wang et al. (2012). Convex optimization theory immediately yields that (A.2) holds, while (A.3) follows from the assumption that \(\lambda =o(n^{-(1-C_2)/2})\), the \(\sqrt{{q_n}/{n}}\)-consistency of \({\widehat{{\varvec{\beta }}}}_1^K\) stated in Theorem 2, and the lower bound on the smallest true signal in condition C3(5). By the definition of the oracle estimator, \({\widehat{\beta }}^K_{ora,j}=0\) for \(j=q_n+1,\ldots ,p_n\). We need only show that
as \(n\rightarrow \infty \). Let \({\mathcal {D}}=\{i:Y_i-{\mathbb {S}}_i^{\top }{\widehat{{\varvec{\beta }}}}_1^K=0, \delta _i=1\}\). Then for \(j=q_n+1,\ldots ,p_n\)
where \(v_i^*\in [\tau -1,\tau ]\) with \(i\in \mathcal {D}\) satisfies \(s_j({\widehat{{\varvec{\beta }}}}_{ora}^K)=0\). With probability one (Sherwood 2016), we have \(|\mathcal {D}|=q_n+1\). Therefore by conditions C2, C3(2), \(|{\hat{\gamma }}_{el}-\gamma _0|=O_p(n^{-1/2})\) stated in Theorem 1 and \(q_nn^{-1/2}=o(\lambda )\), we have
Thus, to prove (A.6), it suffices to show that
as \(n\rightarrow \infty \). First, we have
Note that
Thus, as \(n\rightarrow \infty \), we have
This proves
Further, condition C2 and the fact that \(|{\hat{\gamma }}_{el}-\gamma _0|=O_p(n^{-1/2})\) can be combined to derive an upper bound for \(\max _i|{\hat{\pi }}_i({\hat{\gamma }}_{el})|\). Thus we have
where the equality holds because \(\max _{i}|{\hat{\pi }}({\varvec{x}}_{i({\mathcal {A}})},Y_i;\gamma _0) -\pi _{i0}|=O_p(h^b+({\mathrm {log}}(n)/h^sn)^{1/2})\) by assumption, and \(\max _i|{\hat{\pi }}_i({\varvec{x}}_{i({\mathcal {A}})},Y_i;\gamma _0) -{\hat{\pi }}_i({\hat{\gamma }}_{el})|=o_p(1)\) by the facts that \({\mathrm {Pr}}({\hat{{\mathcal {A}}}}={\mathcal {A}})\rightarrow 1\) and \({\hat{\gamma }}_{el}=\gamma _0+o_p(1)\). Thus, recalling (A.7), we have \(\left| n^{-1}\sum _{i=1}^n{\delta _i}/{{\hat{\pi }}_i({\hat{\gamma }}_{el})}X_{ij} \{I(Y_i-{\mathbb {S}}_i^{\top }{\widehat{{\varvec{\beta }}}}_1^K\le 0)-\tau \}\right| =o(\lambda )\). This completes the proof. \(\square \)
Lemma 4
Suppose that a nonconvex, nonsmooth function \(f({\mathbf {x}})\) belongs to the class \( F=\{f({\mathbf {x}}):f({\mathbf {x}})=f_1({\mathbf {x}})-f_2({\mathbf {x}}), \ \ f_1, f_2 \text { are both convex}\}.\) Let \(dom(f_1)=\{{\mathbf {x}}: f_1({\mathbf {x}})<\infty \}\) be the effective domain of \(f_1\), and let \(\partial f_1({\mathbf {x}}_0)=\{{\mathbf {t}}:f_1({\mathbf {x}})\ge f_1({\mathbf {x}}_0)+({\mathbf {x}}-{\mathbf {x}}_0)^{\top }{\mathbf {t}}, \forall {\mathbf {x}}\}\) be the subdifferential of a convex function \(f_1({\mathbf {x}})\) at a point \({\mathbf {x}}_0\). If there exists a neighborhood U around the point \({\mathbf {x}}^*\) such that \(\partial f_2({\mathbf {x}})\cap \partial f_1({\mathbf {x}}^*)\ne \varnothing \), \(\forall {\mathbf {x}}\in U\cap dom(f_1)\), then \({\mathbf {x}}^*\) is a local minimizer of \(f_1({\mathbf {x}})-f_2({\mathbf {x}})\).
Proof of Lemma 4
Lemma 4 was presented and proved in An and Tao (2005); it provides a sufficient condition for local optimality in difference-of-convex (DC) programming, based on subdifferential calculus. We therefore omit the details here. \(\square \)
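To make Lemma 4 concrete, consider the toy DC function \(f(x)=|x|-x^2/4\) with \(f_1(x)=|x|\) and \(f_2(x)=x^2/4\) (our own example, not from the paper). Here \(\partial f_1(0)=[-1,1]\) and \(\partial f_2(x)=\{x/2\}\), so for every x in the neighborhood \(U=(-2,2)\) the intersection is nonempty, and Lemma 4 predicts that \(x^*=0\) is a local minimizer. The grid check below confirms this, and also that the minimum is only local.

```python
import numpy as np

f1 = lambda x: np.abs(x)        # convex, nonsmooth part
f2 = lambda x: x**2 / 4.0       # convex, smooth part
f = lambda x: f1(x) - f2(x)     # DC objective f1 - f2

U = np.linspace(-1.9, 1.9, 2001)           # grid inside U = (-2, 2)
# Lemma 4's condition: partial f2(x) = {x/2} meets partial f1(0) = [-1, 1]
condition_holds = bool(np.all(np.abs(U / 2.0) <= 1.0))
# Conclusion of the lemma: x* = 0 minimizes f over the neighborhood
is_local_min = bool(np.all(f(U) >= f(0.0)))
```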
Proof of Theorem 3
Note that the nonconvex, nonsmooth penalized quantile objective function \(Q({\varvec{\beta }})\) in Eq. (2.7) can be written as the difference of two convex functions of \({\varvec{\beta }}\):
where \(f_1({\varvec{\beta }})=n^{-1}\sum _{i=1}^n\delta _i/{\hat{\pi }}_i({\hat{\gamma }} _{el})\rho _{\tau } (Y_i-{\varvec{x}}_i^{\top }{\varvec{\beta }})+\lambda \sum _{j=1}^{p_n}|\beta _j|\) and \(f_2({\varvec{\beta }})=\sum _{j=1}^{p_n}h_{\lambda }(\beta _j)\). For the SCAD penalty, we have
while for the MCP function, we have
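The \(h_{\lambda }\) functions here are the standard DC decompositions (as in Wang et al. 2012): for SCAD, \(h_{\lambda }(t)=0\) on \(|t|\le \lambda \), \((|t|-\lambda )^2/\{2(a-1)\}\) on \(\lambda <|t|\le a\lambda \), and \(\lambda |t|-(a+1)\lambda ^2/2\) beyond; for MCP, \(h_{\lambda }(t)=t^2/(2a)\) on \(|t|\le a\lambda \) and \(\lambda |t|-a\lambda ^2/2\) beyond. The sketch below verifies numerically that \(\lambda |t|-h_{\lambda }(t)\) reproduces each penalty, so that \(Q({\varvec{\beta }})=f_1({\varvec{\beta }})-f_2({\varvec{\beta }})\) as claimed.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(t)."""
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def h_scad(t, lam, a=3.7):
    """Convex function subtracted from lam*|t| to recover SCAD."""
    t = np.abs(t)
    return np.where(t <= lam, 0.0,
           np.where(t <= a * lam, (t - lam)**2 / (2 * (a - 1)),
                    lam * t - (a + 1) * lam**2 / 2))

def mcp(t, lam, a=3.0):
    """MCP penalty p_lambda(t)."""
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)

def h_mcp(t, lam, a=3.0):
    """Convex function subtracted from lam*|t| to recover MCP."""
    t = np.abs(t)
    return np.where(t <= a * lam, t**2 / (2 * a), lam * t - a * lam**2 / 2)

grid = np.linspace(-3.0, 3.0, 1201)
lam = 0.5
scad_ok = bool(np.allclose(scad(grid, lam),
                           lam * np.abs(grid) - h_scad(grid, lam)))
mcp_ok = bool(np.allclose(mcp(grid, lam),
                          lam * np.abs(grid) - h_mcp(grid, lam)))
```

Convexity of each \(h_{\lambda }\) (needed for the DC framework) can be checked via nonnegative second differences on the grid.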
The subdifferential of \(f_1({\varvec{\beta }})\) at \({\varvec{\beta }}\) is defined as the following collection of vectors:
where \(s_j({\varvec{\beta }})\) is defined in (A.1), and \(l_0=0\); for \(1\le j \le p_n, l_j=\mathrm {sgn}(\beta _j)\) if \(\beta _j\ne 0\) and \(l_j\in [-1,1]\) otherwise. Here \(\mathrm {sgn}(t)\) is defined as \(\mathrm {sgn}(t)=I(t>0)-I(t<0)\). Furthermore, for both SCAD penalty and MCP functions, \(f_2({\varvec{\beta }})\) is differentiable everywhere. Thus, the subdifferential of \(f_2({\varvec{\beta }})\) at any point \({\varvec{\beta }}\) is a singleton:
For both penalty functions, \(\partial f_2({\varvec{\beta }})/\partial \beta _j=0\) for \(j=0\). For \(1\le j\le p_n\),
for the SCAD penalty; while for the MCP function,
Next, we will check the condition in Lemma 4. From Lemma 3, there exists \(v_i^*\), \(i=1,2,\ldots ,n\), such that the subgradient function \(s_j({\widehat{{\varvec{\beta }}}}_{ora}^K)\) defined with \(v_i=v_i^*\) satisfies \({\mathrm {Pr}}(s_j({\widehat{{\varvec{\beta }}}}_{ora}^K)=0,j=0,1,\ldots ,q_n)\rightarrow 1\). Therefore, by the definition of the set \(\partial f_1(\varvec{{\widehat{{\varvec{\beta }}}}}_{ora}^K)\), we have \({\mathrm {Pr}}({\mathbb {G}}\subseteq \partial f_1(\varvec{{\widehat{{\varvec{\beta }}}}}_{ora}^K))\rightarrow 1\), where
and \(l_j\) ranges over \([-1,1]\), \(j=q_n+1,\ldots ,p_n\).
Consider any \({\varvec{\beta }}\) in a ball \({\mathbb {R}}^{p_n+1}\) with the center \(\varvec{{\widehat{{\varvec{\beta }}}}}_{ora}^K\) and radius \(\lambda /2\). To prove the theorem it is sufficient to show that there exists a vector \(\varvec{\xi }^*=(\xi _0^*,\xi _1^*,\ldots ,\xi _{p_n}^*)^{\top }\) in \({\mathbb {G}}\) such that
as \(n\rightarrow \infty \).
By Lemma 3, \({\mathrm {Pr}}(|s_j({\widehat{{\varvec{\beta }}}}_{ora}^K)|\le \lambda , j=q_n+1,\ldots ,p_n)\rightarrow 1\), thus we can always find \(l_j^*\in [-1,1]\) such that \(s_j({\widehat{{\varvec{\beta }}}}_{ora}^K)+\lambda l_j^*=0\), for \(j=q_n+1,\ldots ,p_n\). Let \(\varvec{\xi }^*\) be the vector in \({\mathbb {G}}\) with \(l_j=l_j^*\), \(j=q_n+1,\ldots ,p_n\). We will verify that \(\varvec{\xi }^*\) satisfies (A.8).
(1) For \(j=0\), we have \(\xi _0^*=0\). Since \(\partial f_2({\varvec{\beta }})/\partial \beta _0=0\) for both penalty functions, it is immediate that \(\partial f_2({\varvec{\beta }})/\partial \beta _0=\xi _0^*=0\).
(2) For \(j=1,2,\ldots ,q_n\), we have \(\xi _j^*=\lambda \mathrm {sgn}(\widehat{\beta }_{ora,j}^K)\). We note that \(\min _{1\le j\le {q_n}}|\beta _j|\ge \min _{1\le j\le {q_n}}|\widehat{\beta }_{ora,j}^K|-\max _{1\le j\le {q_n}}|\widehat{\beta }_{ora,j}^K-\beta _j|\ge (a+1/2)\lambda -\lambda /2=a\lambda \) with probability approaching one by Lemma 3. Therefore, \({\mathrm {Pr}}({\partial f_2({\varvec{\beta }})}/{\partial \beta _j}=\lambda \mathrm {sgn}(\beta _j),j=1,\ldots ,q_n)\rightarrow 1\) as \(n\rightarrow \infty \) for both the SCAD penalty and MCP functions. For n sufficiently large, \(\widehat{\beta }_{ora,j}^K\) and \(\beta _j\) have the same sign. Thus, \({\mathrm {Pr}}(\xi _j^*={\partial f_2({\varvec{\beta }})}/{\partial \beta _j},j=1,\ldots ,q_n)\rightarrow 1\) as \(n\rightarrow \infty \).
(3) For \(j=q_n+1,\ldots ,p_n\), we have \(\xi _j^*=0\) following the definition of \(\varvec{\xi }^*\). By Lemma 3, \({\mathrm {Pr}}(|\beta _j|\le |\widehat{\beta }_{ora,j}^K|+|\widehat{\beta }_{ora,j}^K-\beta _j|\le \lambda ,\quad j=q_n+1,\ldots ,p_n)\rightarrow 1\) as \(n\rightarrow \infty \). Therefore \({\mathrm {Pr}}({\partial f_2({\varvec{\beta }})}/{\partial \beta _j}=0,j=q_n+1,\ldots ,p_n)\rightarrow 1\) as \(n\rightarrow \infty \) for the SCAD penalty; and \({\mathrm {Pr}}({\partial f_2({\varvec{\beta }})}/{\partial \beta _j}=\beta _j/a,j=q_n+1,\ldots ,p_n)\rightarrow 1\) for the MCP function. Note that for both penalty functions, we have \({\mathrm {Pr}}(|{\partial f_2({\varvec{\beta }})}/{\partial \beta _j}|\le \lambda )\rightarrow 1\), for \(j=q_n+1,\ldots ,p_n\). By Lemma 3, with probability approaching one \(|s(\widehat{\beta }_{ora,j}^K)|\le \lambda \), for \(j=q_n+1,\ldots ,p_n\). Thus, we can always find \(l_j^*\in [-1,1]\) such that \({\mathrm {Pr}}(\xi _j^*=s(\widehat{\beta }_{ora,j}^K)+\lambda l_j^*={\partial f_2({\varvec{\beta }})}/{\partial \beta _j},j=q_n+1,\ldots ,p_n)\rightarrow 1\), as \(n\rightarrow \infty \), for both penalty functions. This completes the proof. \(\square \)
Ding, X., Chen, J. & Chen, X. Regularized quantile regression for ultrahigh-dimensional data with nonignorable missing responses. Metrika 83, 545–568 (2020). https://doi.org/10.1007/s00184-019-00744-3