Robust estimation in regression and classification methods for large dimensional data

Published in: Machine Learning

Abstract

Statistical data analysis and machine learning rely heavily on error measures for regression, classification, and forecasting. Bregman divergence (\({\text{BD}}\)) is a widely used family of error measures, but it is not robust to outlying observations or high leverage points in large- and high-dimensional datasets. In this paper, we propose a new family of robust Bregman divergences, called “robust-\({\text{BD}}\)”, that are less sensitive to data outliers. We explore their suitability for sparse large-dimensional regression models with incompletely specified response variable distributions and propose a new estimate, the “penalized robust-\({\text{BD}}\) estimate”, which achieves the same oracle property as the ordinary non-robust penalized least-squares and penalized-likelihood estimates. Extensive numerical experiments evaluate the performance of the penalized robust-\({\text{BD}}\) estimate and show that it improves on classical approaches. Finally, we analyze a real dataset to illustrate the practicality of the proposed method. Our findings suggest that the proposed method is a useful tool for robust statistical data analysis and machine learning in the presence of outliers and large-dimensional data.

Data availability

The Lymphoma data studied in Sect. 7 are publicly available from Alizadeh et al. (2000); the Colon cancer dataset analyzed in Appendix 1.2.2 is available at http://genomics-pubs.princeton.edu/oncology/.

Code availability

MATLAB codes are available at https://github.com/ChunmingZhangUW/Robust_penalized_BD_high_dim_GLM.

References

  • Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., … Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

  • Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the USA, 96, 6745–6750.

  • Altun, Y., & Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In G. Lugosi & H. U. Simon (Eds.), Learning theory: 19th annual conference on learning theory (pp. 139–153). Springer.

  • Bianco, A. M., & Yohai, V. J. (1996). Robust estimation in the logistic regression model. In Robust statistics, data analysis, and computer intensive methods (Schloss Thurnau, 1994) (Vol. 109, pp. 17–34), Lecture Notes in Statist., Springer.

  • Boente, G., He, X., & Zhou, J. (2006). Robust estimates in generalized partially linear models. Annals of Statistics, 34, 2856–2878.

  • Brègman, L. M. (1967). A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 620–631.

  • Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when \(p\) is much larger than \(n\). Annals of Statistics, 35, 2313–2351.

  • Cantoni, E., & Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association, 96, 1022–1030.

  • Dupuis, D. J., & Victoria-Feser, M.-P. (2011). Fast robust model selection in large datasets. Journal of the American Statistical Association, 106, 203–212.

  • Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.

  • Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32, 928–961.

  • Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.

  • Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106, 746–762.

  • Gong, P., Zhang, C., Lu, Z., Huang, J., & Ye, J. (2013). A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In The 30th international conference on machine learning (ICML 2013).

  • Grünwald, P. D., & Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 32, 1367–1433.

  • Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. Wiley.

  • Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. Wiley.

  • Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.

  • Kanamori, T., Takenouchi, T., Eguchi, S., & Murata, N. (2007). Robust loss functions for boosting. Neural Computation, 19, 2183–2244.

  • Künsch, H., Stefanski, L., & Carroll, R. (1989). Conditionally unbiased bounded influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84, 460–466.

  • Lafferty, J. D., Della Pietra, S., & Della Pietra, V. (1997). Statistical learning algorithms based on Bregman distances. In Proceedings of the 5th Canadian workshop on information theory.

  • Lafferty, J. (1999). Additive models, boosting and inference for generalized divergences. In Proceedings of the twelfth annual conference on computational learning theory (pp. 125–133). ACM Press.

  • Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group Lasso for logistic regression. Journal of the Royal Statistical Society Series B, 70, 53–71.

  • Stefanski, L., Carroll, R., & Ruppert, D. (1986). Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73, 413–424.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

  • van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.

  • van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: With applications to statistics. Springer.

  • Vapnik, V. (1996). The nature of statistical learning theory. Springer.

  • Vemuri, B. C., Liu, M., Amari, S.-I., & Nielsen, F. (2011). Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 30, 475–483.

  • Wright, S. J. (1997). Primal-dual interior-point methods. SIAM.

  • Wu, Y. C., & Liu, Y. F. (2007). Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102, 974–983.

  • Zhang, C. M., Jiang, Y., & Shang, Z. (2009). New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Canadian Journal of Statistics, 37, 119–139.

  • Zhang, C. M., Jiang, Y., & Chai, Y. (2010). Penalized Bregman divergence for large-dimensional regression and classification. Biometrika, 97, 551–566.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

  • Zou, H., & Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. Annals of Statistics, 36, 1108–1126.

Funding

Zhang’s work was partially supported by the U.S. National Science Foundation grants DMS-2013486 and DMS-1712418, and by support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. The second author gratefully acknowledges grants from the National Natural Science Foundation of China (NSFC 12131006 and 12071038) and a grant from the University Grants Council of Hong Kong.

Author information

Contributions

CM Zhang contributed to the proof, computation, and writing; LXZ contributed to the discussion, proof, and writing; YBS contributed to the implementation of \({\text{SVM}}\) and robust-\({\text{SVM}}\).

Corresponding author

Correspondence to Chunming Zhang.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

All authors of the paper consent to publication.

Additional information

Editors: Krzysztof Dembczynski and Emilie Devijver.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Supplementary Appendix

1. Proofs, figures and tables, algorithm

Notations and symbols.

For a vector \({\varvec{a}}=(a_1,\ldots ,a_d)^T\), \(\Vert {\varvec{a}}\Vert _1={\sum _{j=1}^d |a_j|}\), \(\Vert {\varvec{a}}\Vert _2=(\sum _{j=1}^d a_j^2)^{1/2}\) and \(\Vert {\varvec{a}}\Vert _{\infty } = \max _{1\le j \le d} |a_j |\). Let \(\textbf{I}_{{\textsf {k}}}\) denote a \({\textsf {k}}\times {\textsf {k}}\) identity matrix, and \({\textbf{0}}_{p,q}\) denote a \(p\times q\) matrix of zero entries. For a matrix M, its eigenvalues, minimum eigenvalue, and maximum eigenvalue are denoted by \(\lambda _j(M)\), \(\lambda _{\min }(M)\), and \(\lambda _{\max }(M)\), respectively; \(\text{tr}(M)\) denotes the trace of a square matrix M; let \(\Vert M\Vert =\Vert M\Vert _2=\sup _{\Vert {\varvec{x}}\Vert _2=1}\Vert M{\varvec{x}}\Vert _2=\{\lambda _{\max }(M^TM)\}^{1/2}\) be the matrix \(L_2\) norm, and \(\Vert M\Vert _F = \{\text{tr}(M^T M)\}^{1/2}\) be the Frobenius norm. Throughout the proofs, C is used as a generic finite constant. The sign function \(\,\text{sign}(x)\) equals \(+1\) if \(x>0\), 0 if \(x=0\), and \(-1\) if \(x<0\). For a function g(x), the first-order derivative is \(g'(x)\) or \(g^{(1)}(x)\), the second-order derivative is \(g''(x)\) or \(g^{(2)}(x)\), and the jth-order derivative is \(g^{(j)}(x)\). The chi-squared distribution with k degrees of freedom is denoted by \(\chi _{k}^2\).
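
As a simple numerical illustration of these vector norms, take \({\varvec{a}}=(3,-4,0)^T\):

$$\begin{aligned} \Vert {\varvec{a}}\Vert _1 = 3+4+0 = 7, \qquad \Vert {\varvec{a}}\Vert _2 = (9+16+0)^{1/2} = 5, \qquad \Vert {\varvec{a}}\Vert _{\infty } = 4. \end{aligned}$$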

The conditional expectation and conditional variance of Y given \({\varvec{X}}\) are denoted by \({\text{E}}(Y \mid {\varvec{X}})\) and \({\text{var}}(Y \mid {\varvec{X}})\), respectively. Notation in the asymptotic derivations follows van der Vaart (1998): \({\mathop {\longrightarrow }\limits ^{\text{P}}}\) denotes convergence in probability, \({\mathop {\longrightarrow }\limits ^{\mathcal L}}\) denotes convergence in distribution, \(o_{{\text{P}}}(1)\) denotes a term that converges to zero in probability, and \(O_{{\text{P}}}(1)\) denotes a term that is bounded in probability.
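
For instance, if \(Z_1,\ldots ,Z_n\) are i.i.d. with mean \(\mu\) and finite variance \(\sigma ^2\), then the sample mean \(\bar{Z}_n\) satisfies

$$\begin{aligned} \sqrt{n}\,(\bar{Z}_n-\mu ) {\mathop {\longrightarrow }\limits ^{\mathcal L}}N(0,\sigma ^2), \qquad \bar{Z}_n-\mu = O_{{\text{P}}}(n^{-1/2}) = o_{_{{\text{P}}}}(1). \end{aligned}$$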

1.1 Proofs of Theorems 1–9

We first impose some regularity conditions, which are not the weakest possible but facilitate the technical derivations.

Condition A.

A0.:

\(s_n\ge 1\) and \(p_{_n}-s_n \ge 1\). \(\sup _{n\ge 1} \Vert {\varvec{\beta }}_{0}^{({\text{I}})}\Vert _1<\infty\).

A1.:

\(\Vert {\varvec{X}}\Vert _{\infty } = \max _{1\le j \le p_{_n}} |X_j |\) is bounded almost surely.

A2.:

\({\text{E}}({\widetilde{{\varvec{X}}}}{\widetilde{{\varvec{X}}}}^T)\) exists and is nonsingular in the case of \(p_{_n}+1 \le n\); \({\text{E}}\{{\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}\) exists and is nonsingular in the case of \(p_{_n}+1 > n\).

A4.:

There is a large enough open subset of \(\mathbb {R}^{p_{_n}+1}\) which contains the true parameter point \({\widetilde{{\varvec{\beta }}}}_{0}\), such that \(F^{-1}({\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}})\) is bounded almost surely for all \({\widetilde{{\varvec{\beta }}}}\) in the subset.

A5.:

\(w(\cdot )\ge 0\) is a bounded function. Assume that \(\psi (r)\) is a bounded, odd, and twice differentiable function, such that \(\psi '(r)\), \(\psi '(r)r\), \(\psi ''(r)\), \(\psi ''(r)r\) and \(\psi ''(r)r^2\) are bounded; \(V(\cdot )>0\), \(V^{(2)}(\cdot )\) is continuous. The matrix \(\textbf{H}_n^{({\text{I}})}\) is positive definite, with eigenvalues uniformly bounded away from zero.

A5\('\).:

\(w(\cdot )\ge 0\) is a bounded function.

A6.:

\(q^{(4)}(\cdot )\) is continuous, and \(q^{(2)}(\cdot )<0\). \(g_1^{(2)} (\cdot )\) is continuous.

A7.:

\(F(\cdot )\) is monotone and a bijection, \(F^{(3)}(\cdot )\) is continuous, and \(F^{(1)}(\cdot )\ne 0\).

Condition B.

  1. B3.

    There exists a constant \(C \in (0,\infty )\) such that \(\sup _{n\ge 1}{\text{E}}\{|Y-m({\varvec{X}})|^j \} \le j! C^j\) for all \(j\ge 3\). Also, \(\inf _{n\ge 1,\, 1\le j\le p_{_n}} {\text{E}}\{{\text{var}}(Y\mid {\varvec{X}}) X_j^2 \} > 0\).

  2. B5.

    The matrices \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) are positive definite, with eigenvalues uniformly bounded away from zero. Also, \(\Vert (\textbf{H}_n^{({\text{I}})})^{-1} \Omega _n^{({\text{I}})}\Vert _2\) is bounded away from infinity.

Condition C.

  1. C4.

    There is a large enough open subset of \(\mathbb {R}^{p_{_n}+1}\) which contains the true parameter point \({\widetilde{{\varvec{\beta }}}}_{0}\), such that \(F^{-1}({\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}})\) is bounded almost surely for all \({\widetilde{{\varvec{\beta }}}}\) in the subset. Moreover, the subset contains the origin.

Condition D.

  1. D5.

    The eigenvalues of \(\textbf{H}_n^{({\text{I}})}\) are uniformly bounded away from zero. Also, \(\Vert (\textbf{H}_n^{({\text{I}})})^{-1/2} (\Omega _n^{({\text{I}})})^{1/2}\Vert _2\) is bounded away from infinity.

Condition E.

  1. E1.

    \(\min _{1\le j\le s_n} |{\text{cov}}(X_{j}, Y)|\succeq {\mathcal {A}}_n\) and \(\max _{s_n+1\le j\le p_{_n}}|{\text{cov}}(X_{j}, Y)|= o({\mathcal {B}}_n)\) for some positive sequences \({\mathcal {A}}_n\) and \({\mathcal {B}}_n\), where the symbol \(s_{n} \succeq t_{n}\), for two nonnegative sequences \(s_{n}\) and \(t_{n}\), means that there exists a constant \(c > 0\) such that \(s_{n} \ge c\, t_{n}\) for all \(n\ge 1\).

  2. E2.

    \(\sup _{n \ge 1, 1\le j\le s_n} {\text{E}}\{\text{q}_{_2}(Y; \alpha _0) X_{j}^2\} < \infty\); \(\inf _{n \ge 1, s_n+1\le j\le p_{_n}} {\text{E}}\{\text{q}_{_2}(Y; \alpha _0) X_{j}^2\} = \eta > 0\), where \(\alpha _0=F({\text{E}}(Y))\).

Proof of Theorem 1

We first need to show Lemma 1. \(\square\)

Lemma 1

(existence and consistency: \(p_{_n} \ll n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{A}}6\) and \({\text{A}}7\) in Appendix 1.1, that the matrix \(\textbf{H}_n = {\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}{\widetilde{{\varvec{X}}}}^{T}\}\) is positive definite with eigenvalues uniformly bounded away from zero, and that \(w_{\max }^{({\text{I}})}=O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n} \sqrt{s_n/p_{_n}})\}\). If \(p_{_n}^4/n \rightarrow 0\) as \(n\rightarrow \infty\), then there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{p_{_n}/n})\).

Proof

We follow the idea of the proof of Theorem 1 in Fan and Peng (2004). Let \(r_n = \sqrt{p_{_n}/n}\) and \({\widetilde{{\varvec{u}}}}_n=(u_0,u_1,\ldots ,u_{p_{_n}})^T\in \mathbb {R}^{p_{_n}+1}\). It suffices to show that for any given \({\epsilon }>0\), there exists a sufficiently large constant \(C_{\epsilon }\) such that, for large n we have

$$\begin{aligned} {\text{P}}\Big \{\inf _{\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2=C_{\epsilon }} \ell _{n}({\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n) > \ell _{n}({\widetilde{{\varvec{\beta }}}}_{0})\Big \} \ge 1-{\epsilon }. \end{aligned}$$
(20)

This implies that with probability at least \(1-{\epsilon }\), there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of \(\ell _{n}({\widetilde{{\varvec{\beta }}}})\) in the ball \(\{{\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n: \Vert {\widetilde{{\varvec{u}}}}_n\Vert _2\le C_{\epsilon }\}\) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2=O_{{\text{P}}}(r_n)\). To show (20), consider

$$\begin{aligned} & D_n({\widetilde{{\varvec{u}}}}_n) \nonumber \\ & \quad = \frac{1}{n} \sum _{i=1}^n\{ \rho _{{q}}(Y_{i}, F^{-1}({\widetilde{{\varvec{X}}}}_{i}^T({\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n)))\, w({\varvec{X}}_{i})\nonumber \\ & \qquad -\rho _{{q}}(Y_{i}, F^{-1}({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}_{0}))\, w({\varvec{X}}_{i})\}\nonumber \\ & \qquad +\lambda _n \sum _{j=1}^{p_{_n}} w_{n,j} (|\beta _{j;0} + r_n u_j |-|\beta _{j;0}|)\nonumber \\ & \quad \equiv I_1 + I_2, \end{aligned}$$
(21)

where \(\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2=C_{\epsilon }\).

First, we consider \(I_1\). By Taylor’s expansion, \(I_1\) has the decomposition,

$$\begin{aligned} I_1 =I_{1,1}+I_{1,2}+I_{1,3}, \end{aligned}$$
(22)

where \(I_{1,1} = {r_n } /{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n\), \(I_{1,2} = {r_n^2}/{(2n)}\sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^2\), and

\(I_{1,3} = {r_n^3}/{(6n)}\sum _{i=1}^n \text{p}_{_3}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}^*) \, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^3\) for \({\widetilde{{\varvec{\beta }}}}^*\) located between \({\widetilde{{\varvec{\beta }}}}_{0}\) and \({\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n\). Hence

$$\begin{aligned} |I_{1,1} |\le r_n \bigg \Vert \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}\bigg \Vert _2 \Vert {\widetilde{{\varvec{u}}}}_n\Vert _2 = O_{{\text{P}}}(r_n\sqrt{p_{_n}/n})\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2. \end{aligned}$$
(23)
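
The inequality in (23) is Cauchy–Schwarz; the stated rate then follows from a standard second-moment argument. Assuming, as in the construction of the robust-\({\text{BD}}\), that the score \(\text{p}_{_1}(Y;{\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}\) has mean zero at the true parameter, the boundedness in Condition \({\text{A}}\) gives

$$\begin{aligned} {\text{E}}\bigg \Vert \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}\bigg \Vert _2^2 = \frac{1}{n} \sum _{j=0}^{p_{_n}} {\text{var}}\{\text{p}_{_1}(Y;{\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}) X_{j}\} = O(p_{_n}/n), \end{aligned}$$

where \(X_0 \equiv 1\) denotes the intercept component of \({\widetilde{{\varvec{X}}}}\), so Markov's inequality yields the \(O_{{\text{P}}}(\sqrt{p_{_n}/n})\) rate in (23).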

For the term \(I_{1,2}\) in (22),

$$\begin{aligned} I_{1,2} & = \frac{r_n^2}{2n}\sum _{i=1}^n {\text{E}}\{\text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^2\} \\ & \quad + \frac{r_n^2}{2n}\sum _{i=1}^n \Big [\text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^2 \\ & \ \quad - {\text{E}}\{\text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^2\} \Big ] \\ & \equiv I_{1,2,1}+I_{1,2,2}, \end{aligned}$$

where \(I_{1,2,1} = 2^{-1}{r_n^2}{\widetilde{{\varvec{u}}}}_n^T \textbf{H}_n{\widetilde{{\varvec{u}}}}_n.\) Meanwhile, we have

$$\begin{aligned} |I_{1,2,2} | & \le {r_n^2} \bigg \Vert \frac{1}{n}\sum _{i=1}^n \Big [\text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}){\widetilde{{\varvec{X}}}}_{i}{\widetilde{{\varvec{X}}}}_{i}^T \\ & \qquad - {\text{E}}\{\text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}){\widetilde{{\varvec{X}}}}_{i}{\widetilde{{\varvec{X}}}}_{i}^T\}\Big ]\bigg \Vert _F \Vert {\widetilde{{\varvec{u}}}}_n\Vert _2^2 \\ & =r_n^2 O_{{\text{P}}}(p_{_n}/\sqrt{n})\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2^2. \end{aligned}$$

Thus,

$$\begin{aligned} I_{1,2} = \frac{r_n^2}{2}{\widetilde{{\varvec{u}}}}_n^T \textbf{H}_n {\widetilde{{\varvec{u}}}}_n + O_{{\text{P}}}(r_n^2p_{_n}/\sqrt{n}) \Vert {\widetilde{{\varvec{u}}}}_n\Vert _2^2. \end{aligned}$$
(24)

For the term \(I_{1,3}\) in (22), we observe that

$$\begin{aligned} |I_{1,3} |\le {r_n^3} \frac{1}{n} \sum _{i=1}^n |\text{p}_{_3}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}^*) |\, w({\varvec{X}}_{i}) |{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n |^3 = O_{{\text{P}}}(r_n^3 p_{_n}^{3/2})\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2^3, \end{aligned}$$

which follows from Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}4\) and \({\text{A}}5\).

Next, we consider \(I_2\) in (21). Note \(I_2 = \lambda _n \sum _{j=1}^{s_n} w_{n,j} (|\beta _{j;0} + r_n u_j |-|\beta _{j;0}|) +\lambda _n r_n \sum _{j=s_n+1}^{p_{_n}} w_{n,j} |u_j |\). Clearly, by the triangle inequality,

$$\begin{aligned} I_2 \ge -\lambda _n r_n \sum _{j=1}^{s_n} w_{n,j} |u_j |\equiv I_{2,1}, \end{aligned}$$

in which

$$\begin{aligned} |I_{2,1} |\le \lambda _n r_n w_{\max }^{({\text{I}})} \Vert {\varvec{u}}_n^{({\text{I}})}\Vert _1, \end{aligned}$$
(25)

where \({\varvec{u}}_n^{({\text{I}})}=(u_1,\ldots ,u_{s_n})^T\). By (23)–(25) and \(p_{_n}^4/n \rightarrow 0\), we can choose some large \(C_{\epsilon }\) such that \(I_{1,1}\), \(I_{1,3}\) and \(I_{2,1}\) are all dominated by the first term of \(I_{1,2}\) in (24), which is positive by the eigenvalue assumption. This implies (20). \(\square\)
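
To make this domination step explicit, note that with \(r_n^2=p_{_n}/n\) and \(\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2=C_{\epsilon }\), the bounds (23)–(25) and the assumption on \(w_{\max }^{({\text{I}})}\) give

$$\begin{aligned} \frac{r_n^2}{2}{\widetilde{{\varvec{u}}}}_n^T \textbf{H}_n {\widetilde{{\varvec{u}}}}_n \ge \frac{c}{2}\, r_n^2 C_{\epsilon }^2, \quad |I_{1,1} |\le O_{{\text{P}}}(r_n^2)\, C_{\epsilon }, \quad |I_{1,3} |\le O_{{\text{P}}}(r_n^2\, p_{_n}^2/\sqrt{n})\, C_{\epsilon }^3, \quad |I_{2,1} |\le O_{{\text{P}}}(r_n^2)\, C_{\epsilon }, \end{aligned}$$

where \(c>0\) is the uniform lower bound on the eigenvalues of \(\textbf{H}_n\). The quadratic term grows like \(C_{\epsilon }^2\) while \(I_{1,1}\) and \(I_{2,1}\) grow only linearly in \(C_{\epsilon }\), and the factor \(p_{_n}^2/\sqrt{n}=o(1)\), guaranteed by \(p_{_n}^4/n \rightarrow 0\), handles \(I_{1,3}\) as well as the \(O_{{\text{P}}}(r_n^2 p_{_n}/\sqrt{n})\) remainder in (24).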

We now show Theorem 1. Write \({\widetilde{{\varvec{u}}}}_n= ({\widetilde{{\varvec{u}}}}_n^{({\text{I}})T}, {\varvec{u}}_n^{({\text{II}})T})^T\), where \({\widetilde{{\varvec{u}}}}_n^{({\text{I}})}=(u_0,u_1,\ldots ,u_{s_n})^T\) and \({\varvec{u}}_n^{({\text{II}})}=(u_{s_n+1},\ldots ,u_{p_{_n}})^T\). Following the proof of Lemma 1, it suffices to show (20) for \(r_n=\sqrt{s_n/n}\).

For the term \(I_{1,1}\) in (22),

$$\begin{aligned} I_{1,1} & = \frac{r_n}{n}\sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{u}}}}_n^{({\text{I}})} \\ & \quad +\frac{r_n}{n}\sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\varvec{X}}_{i}^{({\text{II}})T} {\varvec{u}}_n^{({\text{II}})} \equiv I_{1,1}^{({\text{I}})}+I_{1,1}^{({\text{II}})}. \end{aligned}$$

It follows that

$$\begin{aligned} |I_{1,1}^{({\text{I}})} |\le r_n O_{{\text{P}}}(\sqrt{s_n/n}) \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2, \qquad |I_{1,1}^{({\text{II}})} |\le r_n O_{{\text{P}}}({1}/{\sqrt{n}}) \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1. \end{aligned}$$

For the term \(I_{1,2}\) in (22), similar to the proof of Lemma 1, \(I_{1,2}=I_{1,2,1}+I_{1,2,2}\). We observe that

$$\begin{aligned} & I_{1,2,1} \ge \frac{r_n^2}{2} {\widetilde{{\varvec{u}}}}_n^{({\text{I}})T} \textbf{H}_n^{({\text{I}})} {\widetilde{{\varvec{u}}}}_n^{({\text{I}})} \\ & \quad \quad + \frac{r_n^2}{n}\sum _{i=1}^n {\text{E}}[\text{p}_{_2}(Y_i; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_0^{({\text{I}})})\, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{u}}}}_n^{({\text{I}})} \} \{ {\varvec{X}}_{i}^{({\text{II}})T} {\varvec{u}}_n^{({\text{II}})}\}] \\ & \quad \equiv I_{1,2,1}^{({\text{I}})}+I_{1,2,1}^{({\text{cross}})}. \end{aligned}$$

Then there exists a constant \(C>0\) such that

$$\begin{aligned} I_{1,2,1}^{({\text{I}})} \ge C r_n^2 \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2^2, \quad |I_{1,2,1}^{({\text{cross}})} |\le O_{{\text{P}}}(r_n^2 \sqrt{s_n}) \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2\cdot \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1. \end{aligned}$$

For the term \(I_{1,2,2}\),

$$\begin{aligned} & I_{1,2,2} \\ & \quad = \frac{r_n^2}{2n} \sum _{i=1}^n [ \text{p}_{2i}\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{u}}}}_n^{({\text{I}})})^2 -{\text{E}}\{\text{p}_{2i} \, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{u}}}}_n^{({\text{I}})})^2\}] \\ & \qquad +\frac{r_n^2}{2n} \sum _{i=1}^n \Big [ \text{p}_{2i} \, w({\varvec{X}}_{i}) 2({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{u}}}}_n^{({\text{I}})})({\varvec{X}}_{i}^{({\text{II}})T}{\varvec{u}}_n^{({\text{II}})}) \\ & \qquad -{\text{E}}\{\text{p}_{2i} \, w({\varvec{X}}_{i}) 2({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{u}}}}_n^{({\text{I}})})({\varvec{X}}_{i}^{({\text{II}})T}{\varvec{u}}_n^{({\text{II}})}) \} \Big ] \\ & \qquad +\frac{r_n^2}{2n} \sum _{i=1}^n [ \text{p}_{2i} \, w({\varvec{X}}_{i}) ({\varvec{X}}_{i}^{({\text{II}})T}{\varvec{u}}_n^{({\text{II}})})^2 -{\text{E}}\{\text{p}_{2i} \, w({\varvec{X}}_{i}) ({\varvec{X}}_{i}^{({\text{II}})T}{\varvec{u}}_n^{({\text{II}})})^2\}]\\ & \quad \equiv I_{1,2,2}^{({\text{I}})}+I_{1,2,2}^{({\text{cross}})}+I_{1,2,2}^{({\text{II}})}, \end{aligned}$$

where

$$\begin{aligned} |I_{1,2,2}^{({\text{I}})} |\le & r_n^2 O_{{\text{P}}}(s_n/\sqrt{n}) \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2^2, \\ |I_{1,2,2}^{({\text{cross}})} |\le & r_n^2 O_{{\text{P}}}(\sqrt{s_n/n}) \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1, \\ |I_{1,2,2}^{({\text{II}})} |\le & r_n^2 O_{{\text{P}}}(1/{\sqrt{n}}) \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1^2. \end{aligned}$$

For the term \(I_{1,3}\) in (22), since \(s_n p_{_n}=o(n)\), \(\Vert {\widetilde{{\varvec{\beta }}}}^*\Vert _1\) is bounded and thus

$$\begin{aligned} |I_{1,3} |\le O_{{\text{P}}}(r_n^3) \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _1^3+O_{{\text{P}}}(r_n^3) \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1^3 \equiv I_{1,3}^{({\text{I}})}+I_{1,3}^{({\text{II}})}, \end{aligned}$$

where

$$\begin{aligned} |I_{1,3}^{({\text{I}})} |\le O_{{\text{P}}}(r_n^3 {s_n}^{3/2}) \Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2^3, \qquad |I_{1,3}^{({\text{II}})} |\le O_{{\text{P}}}(r_n^3) \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1^3. \end{aligned}$$

For the term \(I_2\) in (21), \(I_2 \ge I_{2,1}^{({\text{I}})}+I_{2,1}^{({\text{II}})},\) where \(I_{2,1}^{({\text{I}})}=-\lambda _n r_n \sum _{j=1}^{s_n} w_{n,j} |u_j |\) and \(I_{2,1}^{({\text{II}})}={\lambda _n r_n \sum _{j=s_n+1}^{p_{_n}} w_{n,j}|u_j |}\). Thus, we have

$$\begin{aligned} |I_{2,1}^{({\text{I}})}|\le \lambda _n r_n w_{\max }^{({\text{I}})} \sqrt{s_n} \Vert {\varvec{u}}_n^{({\text{I}})}\Vert _2, \qquad I_{2,1}^{({\text{II}})} \ge {\lambda _n r_n w_{\min }^{({\text{II}})} \Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1}. \end{aligned}$$

It can be shown that either \(I_{1,2,1}^{({\text{I}})}\) or \(I_{2,1}^{({\text{II}})}\) dominates all other terms in groups \(\mathcal G_1=\{I_{1,2,2}^{({\text{I}})}, I_{1,3}^{({\text{I}})}\}\), \(\mathcal G_2=\{I_{1,1}^{({\text{II}})}, I_{1,2,2}^{({\text{II}})}, I_{1,3}^{({\text{II}})}, I_{1,2,1}^{({\text{cross}})}, I_{1,2,2}^{({\text{cross}})}\}\) and \(\mathcal G_3=\{I_{1,1}^{({\text{I}})}, I_{2,1}^{({\text{I}})}\}\). Namely, \(I_{1,2,1}^{({\text{I}})}\) dominates \(\mathcal G_1\), and \(I_{2,1}^{({\text{II}})}\) dominates \(\mathcal G_2\). For \(\mathcal G_3\), since \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 \le C_{\epsilon }\), we have that

$$\begin{aligned} |I_{1,1}^{({\text{I}})} |\le O_{{\text{P}}}(r_n \sqrt{s_n/n}) C_{\epsilon }, \qquad |I_{2,1}^{({\text{I}})} |\le \lambda _n r_n \sqrt{s_n} w_{\max }^{({\text{I}})} C_{\epsilon }. \end{aligned}$$

Hence, if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 \le C_{\epsilon }/2\), then \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 > C_{\epsilon }/2\), and thus \(\mathcal G_3\) is dominated by \(I_{1,2,1}^{({\text{I}})}\), which is positive; if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 > C_{\epsilon }/2\), then \(\mathcal G_3\) is dominated by \(I_{2,1}^{({\text{II}})}\), which is positive. This completes the proof. \(\square\)

Proof of Theorem 2

We first need to show Lemma 2. \(\square\)

Lemma 2

Assume Condition A in Appendix 1.1. If \(s_n^2/n=O(1)\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), then with probability tending to one, for any given \({\widetilde{{\varvec{\beta }}}}=({\widetilde{{\varvec{\beta }}}}^{({\text{I}})T}, {\varvec{\beta }}^{({\text{II}})T})^T\) satisfying \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\) and any constant \(C>0\), it follows that \(\ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})},{\textbf{0}}) = \min _{\Vert {\varvec{\beta }}^{({\text{II}})}\Vert _2 \le C\sqrt{s_n/n}} \ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})},{\varvec{\beta }}^{({\text{II}})})\).

Proof

It suffices to prove that with probability tending to one, for any \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}\) satisfying \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), the following inequalities hold for \(s_n+1 \le j \le p_{_n}\),

$$\begin{aligned} {\partial }{\ell _{n}({\widetilde{{\varvec{\beta }}}})}/{\partial {\beta _{j}}}< & 0, \quad \text {for}\ \beta _{j}<0, \\ {\partial }{\ell _{n}({\widetilde{{\varvec{\beta }}}})}/{\partial {\beta _{j}}}>& 0, \quad \text {for}\ \beta _{j}>0, \end{aligned}$$

namely, with probability tending to one,

$$\begin{aligned} \max _{s_n+1\le j\le p_{_n}} \sup _{\Vert {\widetilde{{\varvec{\beta }}}} - {\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n});\, \beta _{j}<0} \frac{\partial }{\partial {\beta _{j}}}{\ell _{n}({\widetilde{{\varvec{\beta }}}})}< & 0, \nonumber \\ \min _{s_n+1\le j\le p_{_n}} \inf _{\Vert {\widetilde{{\varvec{\beta }}}} - {\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n});\, \beta _{j}>0} \frac{\partial }{\partial {\beta _{j}}}{\ell _{n}({\widetilde{{\varvec{\beta }}}})}> & 0. \end{aligned}$$
(26)

The proofs of the two inequalities in (26) are similar; we only show the first one. Note that for \(\beta _{j} \ne 0\),

$$\begin{aligned} \frac{\partial }{\partial {\beta _{j}}}{\ell _{n}({\widetilde{{\varvec{\beta }}}})} & = \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}})\, w({\varvec{X}}_{i}) X_{i,j} + \lambda _n w_{n,j} \, \,\text{sign}(\beta _{j}) \\ & = \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}_{0}) \, w({\varvec{X}}_{i}) X_{i,j} \\ & \quad + \frac{1}{n} \sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}^*)\, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^T({\widetilde{{\varvec{\beta }}}}-{\widetilde{{\varvec{\beta }}}}_{0})\} X_{i,j} + \lambda _n w_{n,j} \, \,\text{sign}(\beta _{j}), \end{aligned}$$

where \({\widetilde{{\varvec{\beta }}}}^*\) lies between \({\widetilde{{\varvec{\beta }}}}_{0}\) and \({\widetilde{{\varvec{\beta }}}}\). It follows that

$$\begin{aligned} & \max _{s_n+1\le j\le p_{_n}} \sup _{\Vert {\widetilde{{\varvec{\beta }}}} - {\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n});\, \beta _{j}<0} \frac{\partial }{\partial {\beta _{j}}}{\ell _{n}({\widetilde{{\varvec{\beta }}}})} \\ & \quad \le \max _{s_n+1\le j\le p_{_n}} \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) X_{i,j} \\ & \qquad + \max _{s_n+1\le j\le p_{_n}} \sup _{\Vert {\widetilde{{\varvec{\beta }}}} - {\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})} \frac{1}{n} \sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}^*) \, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^T({\widetilde{{\varvec{\beta }}}}-{\widetilde{{\varvec{\beta }}}}_{0})\} X_{i,j} \\ & \qquad - \min _{s_n+1\le j\le p_{_n}} \{\lambda _n w_{n,j}\}\\ & \quad \equiv I_1+I_2-\lambda _n \min _{s_n+1\le j\le p_{_n}} w_{n,j} = I_1+I_2-\lambda _n w_{\min }^{({\text{II}})}. \end{aligned}$$

The first term \(I_1\) satisfies that

$$\begin{aligned} |I_1 |\le O_{{\text{P}}}(\{\log (p_{_n}-s_n+1)/n\}^{1/2}). \end{aligned}$$
(27)
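
One standard way to obtain the rate in (27) is via an exponential inequality: under Condition \({\text{A}}\) the summands \(\text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) X_{i,j}\) are bounded and, assuming as in the construction of the robust-\({\text{BD}}\) that they have mean zero, Bernstein's inequality combined with a union bound over the indices \(s_n+1\le j\le p_{_n}\) gives, for all sufficiently small \(t>0\) and some constant \(c>0\),

$$\begin{aligned} {\text{P}}\bigg (\max _{s_n+1\le j\le p_{_n}} \bigg |\frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) X_{i,j}\bigg |> t\bigg ) \le 2\, (p_{_n}-s_n) \exp (-c\, n t^2), \end{aligned}$$

so taking \(t = M \{\log (p_{_n}-s_n+1)/n\}^{1/2}\) with \(M\) large yields (27).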

For the term \(I_2\),

$$\begin{aligned} |I_2 |\le O_{{\text{P}}}(\sqrt{s_n p_{_n}/n}). \end{aligned}$$
(28)

Therefore, by (27) and (28), the left-hand side of the first inequality in (26) is

$$\begin{aligned} \le O_{{\text{P}}}(\sqrt{s_n p_{_n}/n}) - \lambda _n w_{\min }^{({\text{II}})} = \sqrt{s_n p_{_n}/n} \{O_{{\text{P}}}(1)-{\lambda _n \sqrt{n} w_{\min }^{({\text{II}})}} / {\sqrt{s_n p_{_n}}} \}. \end{aligned}$$

By \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), (26) is proved. \(\square\)

We now show Theorem 2. By Lemma 2, the first part of Theorem 2 holds: \(\widehat{{\widetilde{{\varvec{\beta }}}}}=(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\textbf{0}}^T)^T\). To verify the second part of Theorem 2, notice that the estimating equations \(\frac{\partial \ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}, {\textbf{0}})}{\partial {\widetilde{{\varvec{\beta }}}}^{({\text{I}})}} |_{{\widetilde{{\varvec{\beta }}}}^{({\text{I}})}=\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}} = {\textbf{0}}\) hold, since \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is a local minimizer of \(\ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}, {\textbf{0}})\). Denote \({\varvec{d}}_n({{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) = \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\}\), which equals \({\varvec{d}}_n\) when \({{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}={\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\). Since \(\min _{1\le j\le s_n}|\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), it follows that

$$\begin{aligned} & {\text{P}}(\,\text{sign}\{\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\} \ne \,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}) \\ & \quad = {\text{P}}(\,\text{sign}({\widehat{\beta }}_{j}) \ne \,\text{sign}(\beta _{j;0})\ \text {for some}\ j\in \{1,\ldots ,s_n\}) \\ & \quad \le {\text{P}}\Big (\max _{1\le j\le s_n}|{\widehat{\beta }}_{j} - \beta _{j;0} |\ge \min _{1\le j\le s_n}|\beta _{j;0}|\Big ) \rightarrow 0. \end{aligned}$$

Thus with probability tending to one, \({\varvec{d}}_n(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) = {\varvec{d}}_n({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) = {\varvec{d}}_n\). Taylor’s expansion applied to the loss part on the left side of the estimating equations yields

$$\begin{aligned} & {\textbf{0}}= \bigg \{\frac{1}{n}\sum _{i=1}^n \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}+{\varvec{d}}_n \bigg \}\nonumber \\ & \qquad + \bigg \{\frac{1}{n}\sum _{i=1}^n\text{p}_{_2}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} \bigg \} (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}-{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\nonumber \\ & \qquad +\frac{1}{2n}\sum _{i=1}^n\text{p}_3(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}^{*({\text{I}})})\, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}-{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\}^2{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})} \nonumber \\ & \equiv \bigg \{\frac{1}{n}\sum _{i=1}^n\text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}+{\varvec{d}}_n \bigg \} +K_2(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}-{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})+K_3, \end{aligned}$$
(29)

where both \({\widetilde{{\varvec{\beta }}}}^{*({\text{I}})}\) and \({\widetilde{{\varvec{\beta }}}}^{**({\text{I}})}\) lie between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\). Below, we will show

$$\begin{aligned} \Vert K_2 - \textbf{H}_n^{({\text{I}})} \Vert _2 & = O_{{\text{P}}}(s_n/\sqrt{n}), \end{aligned}$$
(30)
$$\begin{aligned} \Vert K_3\Vert _2 & = O_{{\text{P}}}(s_n^{5/2}/n). \end{aligned}$$
(31)

First, to show (30), note that \(K_2 - \textbf{H}_n^{({\text{I}})} = K_2-{\text{E}}(K_2) \equiv L_1\). Arguments similar to those in the proof of Lemma 1 give \(\Vert L_1\Vert _2 = O_{{\text{P}}}(s_n/\sqrt{n})\).

Second, an argument similar to that used for \(I_{1,3}\) in (22) yields (31).

Third, by (29)–(31) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), we see that

$$\begin{aligned} \textbf{H}_n^{({\text{I}})} (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) + {\varvec{d}}_n = -\frac{1}{n}\sum _{i=1}^n\text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})} + {\varvec{u}}_n, \end{aligned}$$
(32)

where \(\Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/n)\). Note that by Condition \({\text{B}}5\),

$$\begin{aligned} & \Vert \sqrt{n} A_n (\Omega _n^{({\text{I}})})^{-1/2} {\varvec{u}}_n\Vert _2 \le \sqrt{n}\Vert A_n\Vert _F \lambda _{\max }((\Omega _n^{({\text{I}})})^{-1/2}) \Vert {\varvec{u}}_n\Vert _2 \\ & \qquad = \sqrt{n} \{\text{tr}(A_nA_n^T)\}^{1/2} /{\lambda _{\min }^{1/2} (\Omega _n^{({\text{I}})})} \Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/\sqrt{n}) = o_{_{{\text{P}}}}(1). \end{aligned}$$

Thus

$$\begin{aligned} & \sqrt{n} A_n (\Omega _n^{({\text{I}})})^{-1/2} \{\textbf{H}_n^{({\text{I}})} (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) + {\varvec{d}}_n\} \\ & \qquad = -\frac{1}{\sqrt{n}} A_n (\Omega _n^{({\text{I}})})^{-1/2} \sum _{i=1}^n \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})} + o_{_{{\text{P}}}}(1). \end{aligned}$$

To complete the proof of the second part of Theorem 2, we apply the Lindeberg-Feller central limit theorem (van der Vaart, 1998) to \(\sum _{i=1}^n {\varvec{Z}}_{i}\), where \({\varvec{Z}}_{i} = - n^{-1/2} A_n (\Omega _n^{({\text{I}})})^{-1/2} \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}\). It suffices to check two conditions: (I) \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) \rightarrow \mathbb {G}\); (II) \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = o(1)\) for some \(\delta > 0\). Condition (I) follows from the fact that \({\text{var}}\{\text{p}_{_1}(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}\} = \Omega _n^{({\text{I}})}\). To verify condition (II), note that, using Conditions \({\text{B}}5\) and \({\text{A}}5\),

$$\begin{aligned} & {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \\ & \quad \le n^{-{(2+\delta )}/{2}} {\text{E}}\bigg \{ \Vert A_n\Vert _F^{2+\delta }\bigg [\Vert (\Omega _n^{({\text{I}})})^{-1/2}{\widetilde{{\varvec{X}}}}^{({\text{I}})}\Vert _2 \\ & \quad \bigg |\{\psi (r(Y,m({\varvec{X}}))) - g_1 (m({\varvec{X}}))\} \frac{\{q''(m({\varvec{X}}))\sqrt{V(m({\varvec{X}}))}\}}{F'(m({\varvec{X}}))} \, w({\varvec{X}}) \bigg |\bigg ]^{2+\delta } \bigg \} \\ & \quad \le C n^{-{(2+\delta )}/{2}} {\text{E}}[\{{\lambda _{\min }^{-1/2} (\Omega _n^{({\text{I}})})}\Vert {\widetilde{{\varvec{X}}}}^{({\text{I}})}\Vert _2\}^{2+\delta } |\{\psi (r(Y,m({\varvec{X}}))) - g_1 (m({\varvec{X}}))\}\times \\ & \quad {\{q''(m({\varvec{X}}))\sqrt{V(m({\varvec{X}}))}\}} /{F'(m({\varvec{X}}))}|^{2+\delta }] \\ & \quad \le C s_n^{{(2+\delta )}/{2}} n^{-{(2+\delta )}/{2}} {\text{E}}[|\{\psi (r(Y,m({\varvec{X}}))) - g_1 (m({\varvec{X}}))\} \times \\ & \quad {\{q''(m({\varvec{X}}))\sqrt{V(m({\varvec{X}}))}\}} / {F'(m({\varvec{X}}))}|^{2+\delta }] \\ & \quad \le O\{(s_n/n)^{{(2+\delta )}/{2}}\}. \end{aligned}$$

Thus, we get \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \le O\{n (s_n/n)^{(2+\delta )/2}\} = O\{s_n^{(2+\delta )/2} / n^{\delta /2}\}\), which is o(1). This verifies Condition (II). \(\square\)

Proof of Theorem 3

Before showing Theorem 3, Lemma 3 is needed. \(\square\)

Lemma 3

Assume conditions of Theorem 3. Then

$$\begin{aligned} \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})} = - \frac{1}{n} (\textbf{H}_n^{({\text{I}})})^{-1} \sum _{i=1}^n\text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})} + o_{_{{\text{P}}}}(n^{-1/2}), \\ \sqrt{n} \{A_n ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}{\widehat{\Omega }}_n^{({\text{I}})} ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}A_n^T\}^{-1/2} A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}}). \end{aligned}$$

Proof

Following (32) in the proof of Theorem 2, we observe that \(\Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/n) = o_{_{{\text{P}}}}(n^{-1/2})\). Furthermore, \(\Vert {\varvec{d}}_n\Vert _2 \le \sqrt{s_n} \lambda _n w_{\max }^{({\text{I}})} = o_{_{{\text{P}}}}(n^{-1/2})\). Condition \({\text{B}}5\) completes the proof for the first part.

To show the second part, denote \(U_n = A_n (\textbf{H}_n^{({\text{I}})})^{-1}\Omega _n^{({\text{I}})} (\textbf{H}_n^{({\text{I}})})^{-1}A_n^T\) and \({\widehat{U}}_n = A_n ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}{\widehat{\Omega }}_n^{({\text{I}})} ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}A_n^T\). Notice that the eigenvalues of \((\textbf{H}_n^{({\text{I}})})^{-1} \Omega _n^{({\text{I}})} (\textbf{H}_n^{({\text{I}})})^{-1}\) are uniformly bounded away from zero. So are the eigenvalues of \(U_n\). From the first part, we see that

$$\begin{aligned} A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) = - \frac{1}{n} A_n (\textbf{H}_n^{({\text{I}})})^{-1}\sum _{i=1}^n \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})} + o_{_{{\text{P}}}}(n^{-1/2}). \end{aligned}$$

It follows that

$$\begin{aligned} \sqrt{n}U_n^{-1/2} A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) = \sum _{i=1}^n {\varvec{Z}}_{i} + o_{_{{\text{P}}}}(1), \end{aligned}$$

where \({\varvec{Z}}_{i} = - n^{-1/2} U_n^{-1/2} A_n (\textbf{H}_n^{({\text{I}})})^{-1} \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}\). To show \(\sum _{i=1}^n {\varvec{Z}}_{i} {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}})\), similar to the proof for Theorem 2, we check two conditions: (III) \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) \rightarrow \textbf{I}_{{\textsf {k}}}\); (IV) \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = o(1)\) for some \(\delta > 0\). Condition (III) is straightforward since \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) = U_n^{-1/2} U_n U_n^{-1/2} = \textbf{I}_{{\textsf {k}}}\). To check condition (IV), similar arguments used in the proof of Theorem 2 give that \({\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = O\{(s_n/n)^{{(2+\delta )}/{2}}\}.\) This and the boundedness of the \(\psi\)-function yield \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \le O\{s_n^{(2+\delta )/2}/n^{\delta /2}\} = o(1)\). Hence

$$\begin{aligned} \sqrt{n} U_n^{-1/2}A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}}). \end{aligned}$$
(33)

Also, it can be concluded that \(\Vert {\widehat{U}}_n - U_n \Vert _2=o_{_{{\text{P}}}}(1)\) and that the eigenvalues of \({\widehat{U}}_n\) are uniformly bounded away from zero and infinity with probability tending to one. Consequently,

$$\begin{aligned} \Vert {\widehat{U}}_n^{-1/2} U_n^{1/2} - \textbf{I}_{{\textsf {k}}}\Vert _2=o_{_{{\text{P}}}}(1). \end{aligned}$$
(34)

Combining (33), (34) and Slutsky’s theorem completes the proof that \(\sqrt{n} {\widehat{U}}_n^{-1/2}A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}})\). \(\square\)

We now show Theorem 3, which follows directly from the null hypothesis \(H_0\) in (14) and the second part of Lemma 3. This completes the proof. \(\square\)

Proof of Theorem 4

The proof of Theorem 4 is similar to that of Theorem 7, except that in Part 2, \({\mathcal {C}}_n\) is changed from \(\lambda _n\sqrt{n}/s_n\) to \(\lambda _n\). \(\square\)

Proof of Theorem 5

The proof of Theorem 5 is similar to that of Theorem 8, except that in Part 2, \({\mathcal {B}}_n\) is changed from \(\lambda _n\sqrt{n}/s_n\) to \(\lambda _n\). \(\square\)

Proof of Theorem 6

Assumption (19) implies that \(\ell _{n}({\widetilde{{\varvec{\beta }}}})\) in (3) is convex in \({\widetilde{{\varvec{\beta }}}}\). By the Karush-Kuhn-Tucker conditions (Wright 1997, Theorem A.2), a set of sufficient conditions for an estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}} = ({\widehat{\beta }}_{0},{\widehat{\beta }}_{1},\ldots ,{\widehat{\beta }}_{p_{_n}})^T\) to be a global minimizer of (3) is that

$$\begin{aligned} \begin{array}{l} \frac{1}{n} \sum \limits _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T\widehat{{\widetilde{{\varvec{\beta }}}}})\, w({\varvec{X}}_{i}) = 0, \\ \frac{1}{n} \sum \limits _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T\widehat{{\widetilde{{\varvec{\beta }}}}})\, w({\varvec{X}}_{i})X_{i,j} = -\lambda _n w_{n,j} \, \,\text{sign}({\widehat{\beta }}_{j}), \ \text {for}\, 1\le j\le p_{_n}\ \text {with}\ {\widehat{\beta }}_j\ne 0, \\ \left|\frac{1}{n} \sum \limits _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T\widehat{{\widetilde{{\varvec{\beta }}}}})\, w({\varvec{X}}_{i}) X_{i,j}\right|\le \lambda _n w_{n,j}, \ \text {for}\, 1\le j\le p_{_n}\ \text {with}\ {\widehat{\beta }}_j = 0. \end{array} \end{aligned}$$
(35)
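
As a concrete illustration of how (35) can be checked numerically, the following MATLAB sketch (a minimal example, not the authors' released code) treats the simplest special case of (3): quadratic loss \(\rho _{{q}}(y,\mu )=(y-\mu )^2/2\), identity link \(F(\mu )=\mu\), \(w({\varvec{x}})\equiv 1\) and \(w_{n,j}\equiv 1\), so that (3) reduces to lasso-penalized least squares and \(\text{p}_{_1}(y;\theta )=-(y-\theta )\).

% Hedged sketch: KKT check of (35) for lasso-penalized least squares,
% a special case of (3) with quadratic loss, identity link, w(x) = 1, w_{n,j} = 1.
rng(1);                                   % reproducible synthetic data
n = 200; p = 10; lambda = 0.1;
X = randn(n, p);
beta_true = [2; -1.5; zeros(p-2, 1)];
y = 1 + X*beta_true + 0.5*randn(n, 1);
Xt = [ones(n, 1), X];                     % design matrix with intercept

% Candidate estimate from a plain coordinate-descent lasso
% (any solver for this special case of (3) could be substituted).
beta_hat = zeros(p+1, 1);
for iter = 1:1000
    for j = 1:(p+1)
        r = y - Xt*beta_hat + Xt(:, j)*beta_hat(j);        % partial residual
        z = Xt(:, j)'*r/n;  c = Xt(:, j)'*Xt(:, j)/n;
        if j == 1
            beta_hat(j) = z/c;                             % unpenalized intercept
        else
            beta_hat(j) = sign(z)*max(abs(z) - lambda, 0)/c;  % soft-threshold
        end
    end
end

% KKT check, cf. (35), with p_1(y; theta) = -(y - theta):
score = Xt'*(Xt*beta_hat - y)/n;          % (1/n) sum_i p_1(Y_i; x_i'beta_hat) x_{i,j}
tol = 1e-6;
active   = find(beta_hat(2:end) ~= 0) + 1;
inactive = find(beta_hat(2:end) == 0) + 1;
ok = abs(score(1)) < tol && ...
     all(abs(score(active) + lambda*sign(beta_hat(active))) < tol) && ...
     all(abs(score(inactive)) <= lambda + tol);
fprintf('KKT conditions (35) satisfied: %d\n', ok);

For the general robust-\({\text{BD}}\) criterion, only the score computation changes: \(\text{p}_{_1}\) would be evaluated from the chosen \(\psi\), \(q\), \(V\), \(F\) and the weight \(w\), with \(\lambda _n w_{n,j}\) in place of the constant penalty level.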

Before proving Theorem 6, we first show Lemma 4. \(\square\)

Lemma 4

(existence and consistency: \(p_{_n}\gg n\)) Assume (19) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5'\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Suppose \(s_n^4/n \rightarrow 0\), \(\log (p_{_n}-s_n)/n = O(1)\), \(\log (p_{_n}-s_n)/\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\} = o_{_{{\text{P}}}}(1)\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\). Assume \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\). Then with probability tending to one, there exists a global minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) of \(\ell _n({\widetilde{{\varvec{\beta }}}})\) in (3) which satisfies that

\(\mathrm {(i)}\):

\(\widehat{{\varvec{\beta }}}^{({\text{II}})} = {\textbf{0}}\),

\(\mathrm {(ii)}\):

\(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is the minimizer of the oracle subproblem,

$$\begin{aligned} \ell _{n}^O({\widetilde{{\varvec{\beta }}}}^{({\text{I}})})=\frac{1}{n}\sum _{i=1}^n \rho _{{q}}(Y_{i}, F^{-1}({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})}))\, w({\varvec{X}}_{i})+\lambda _n\sum _{j=1}^{s_n} w_{n,j}|\beta _{j} |. \end{aligned}$$
(36)

Proof

Let \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} = ({\widehat{b}}_{0},{\widehat{b}}_{1},\ldots ,{\widehat{b}}_{s_n})^T\) be the minimizer of the subproblem (36). By Karush-Kuhn-Tucker necessary conditions (Wright 1997, Theorem A.1), \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\) satisfies that

$$\begin{aligned} \begin{array}{l} \frac{1}{n} \sum \limits _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})})\, w({\varvec{X}}_{i}) = 0, \\ \frac{1}{n} \sum \limits _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j} = -\lambda _n w_{n,j} \, \,\text{sign}({\widehat{b}}_{j}), \, \text {for}\, 1\le j\le s_n\ \text {with}\ {\widehat{b}}_{j} \ne 0, \\ \left|\frac{1}{n} \sum \limits _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j} \right|\le \lambda _n w_{n,j}, \,\text {for}\, 1\le j\le s_n \, \text {with}\ {\widehat{b}}_{j} = 0. \end{array} \end{aligned}$$

In the following, we will verify conditions

$$\begin{aligned} {\widehat{b}}_{1}\ne 0, \ldots , {\widehat{b}}_{s_n}\ne 0, \end{aligned}$$
(37)

and

$$\begin{aligned} \Big |\frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j}\Big |\le \lambda _n w_{n,j},\ \text {for}\ s_n+1\le j\le p_{_n}. \end{aligned}$$
(38)

It then follows, from (37), (38) and (35), that \((\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})T}, {\textbf{0}}^T)^T\) is the global minimizer of (3). This will in turn imply Lemma 4.

First, we prove that (37) holds with probability tending to one. Applying Lemma 1 to the subproblem (36), we conclude that \(\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\). Since \({\min _{1\le j\le s_n}|\beta _{j;0}|/ \sqrt{s_n/n}} \rightarrow \infty\) as \(n \rightarrow \infty\), it is seen that

$$\begin{aligned} & {\text{P}}(\,\text{sign}({\widehat{b}}_{j}) \ne \,\text{sign}(\beta _{j;0})\ \text {for some}\ j\in \{1,\ldots ,s_n\}) \\ & \quad \le {\text{P}}\Big (\max _{1\le j\le s_n}|{\widehat{b}}_{j} - \beta _{j;0} |\ge \min _{1\le j\le s_n}|\beta _{j;0}|\Big ) \rightarrow 0. \end{aligned}$$

Hence (37) holds with probability tending to one.

Second, we prove that (38) holds with probability tending to one. It suffices to prove that

$$\begin{aligned} {\text{P}}\bigg (\max _{s_n+1\le j\le p_{_n}}\bigg |\frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j}\bigg |< \lambda _n w_{\min }^{({\text{II}})} \bigg ) \rightarrow 1. \end{aligned}$$
(39)

By Taylor’s expansion, we have that

$$\begin{aligned} & \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j} \\ & \quad = \frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j} \\ & \qquad + \frac{1}{n} \sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})*})\, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\} X_{i,j}, \end{aligned}$$

with \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})*}\) located between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\). Then (39) holds if we can prove

$$\begin{aligned} {\text{P}}\bigg (\max _{s_n+1\le j\le p_{_n}}\bigg |\frac{1}{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) X_{i,j}\bigg |< \frac{\lambda _n}{2} w_{\min }^{({\text{II}})} \bigg ) \rightarrow 1, \end{aligned}$$
(40)

and

$$\begin{aligned} {\text{P}}\bigg (\max _{s_n+1\le j\le p_{_n}}\bigg |\frac{1}{n} \sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})*})\, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\} X_{i,j}\bigg |< \frac{\lambda _n}{2} w_{\min }^{({\text{II}})} \bigg ) \rightarrow 1. \end{aligned}$$
(41)

We first prove (40). Set \(\text{p}_{1i} = \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\). Since \({\log (p_{_n}-s_n)=O(n)}\) and \({\log (p_{_n}-s_n) = o_{_{{\text{P}}}}\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\}}\), we see that

$$\begin{aligned} \max _{s_n+1\le j\le p_{_n}} \bigg |\frac{1}{n}\sum _{i=1}^n \text{p}_{1i} \, w({\varvec{X}}_{i}) X_{i,j}\bigg |=O_{{\text{P}}}\{\sqrt{\log (p_{_n}-s_n+1)/n}\} =o_{_{{\text{P}}}}(\lambda _n w_{\min }^{({\text{II}})}). \end{aligned}$$

This implies (40).

Second, we prove (41). Since \(\Vert {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _1 < \infty\) and \(\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), it follows that

$$\begin{aligned} \Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\Vert _1 \le \Vert {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _1 + \Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _1 \le \Vert {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _1 + \sqrt{s_n}\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert = O_{{\text{P}}}(1), \end{aligned}$$

and then \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})*}\Vert _1 = O_{{\text{P}}}(1)\), thus

$$\begin{aligned} & \max _{s_n+1\le j\le p_{_n}} \bigg |\frac{1}{n}\sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})*})\, w({\varvec{X}}_{i}) \{{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\} X_{i,j}\bigg |\\ & \quad \le C \sqrt{s_n} \Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert \bigg \{\frac{1}{n} \sum _{i=1}^n |\text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})*})|\, w({\varvec{X}}_{i})\bigg \} \\ & \quad = \sqrt{s_n} O_{{\text{P}}}(\sqrt{s_n/n}) O_{{\text{P}}}(1) = O_{{\text{P}}}(s_n/\sqrt{n}) = o_{_{{\text{P}}}}\{\lambda _n w_{\min }^{({\text{II}})}\}. \end{aligned}$$

Here \({w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty }\) is used. Hence (41) is proved. \(\square\)

The first part of Theorem 6 follows from the first part of Lemma 4. The second part of Theorem 6 follows directly from applying Theorem 2 to the oracle subproblem (36). \(\square\)

Proof of Theorem 7

It is easy to see that \({\widehat{\beta }}_j^{{\text{PMR}}} = \arg \min _\beta \ell ^{{\text{PMR}}*}_{n,j}(\beta )\), where \(\ell ^{{\text{PMR}}*}_{n,j}(\beta ) = \ell ^{{\text{PMR}}}_{n,j}({\widehat{\alpha }}_j(\beta ),\beta )\), and \({\widehat{\alpha }}_j(\beta )\) satisfies \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta )=0\) for \(j = 1,\ldots ,p_{_n}\). From (11), \({\widehat{\alpha }}_1(0)=\cdots ={\widehat{\alpha }}_{p_{_n}}(0)\). Let \({\widehat{\alpha }}_0 = {\widehat{\alpha }}_1(0)\). Then \({\widehat{\alpha }}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}\alpha _0\), where \(\alpha _0 = F(\mu _{_0})\) with \(\mu _{_0} = {\text{E}}(Y)\). The rest of the proof contains two parts.

Part 1. For \({\mathcal {A}}_n=\lambda _n \sqrt{n}\), we will show that \({\widehat{w}}_{\max }^{({\text{I}})} {\mathcal {A}}_n = O_{{\text{P}}}(1)\). It suffices to show that there exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \(\lim _{\delta \rightarrow 0+} \inf _{n\ge 1} {\text{P}}(\min _{1\le j \le s_n} |{\widehat{\beta }}_{j}^{{\text{PMR}}} |> {\mathcal {A}}_n \delta ) = 1.\) For this, it suffices to prove that, for \(1\le j\le s_n\), there exist some \(b_{j}\) with \(|b_{j} |= 2\delta\) such that

$$\begin{aligned} \lim _{\delta \rightarrow 0+} \inf _{n\ge 1} {\text{P}}\Big (\min _{1\le j \le s_n}\Big \{ \inf _{|\beta |\le \delta } \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\,\beta ) - \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\, b_{j})\Big \} > 0\Big ) = 1, \end{aligned}$$
(42)

and there exists some large enough \(C_n>0\) such that

$$\begin{aligned} \lim _{\delta \rightarrow 0+} \inf _{n\ge 1} {\text{P}}\Big (\min _{1\le j \le s_n}\Big \{ \inf _{|\beta |\ge C_n} \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\,\beta ) - \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\, b_{j})\Big \} > 0\Big ) = 1. \end{aligned}$$
(43)

Equations (42) and (43) imply that with probability tending to one, there must exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \({\mathcal {A}}_n\, \delta< |{\widehat{\beta }}_{j}^{{\text{PMR}}} |< {\mathcal {A}}_n\, C_n\) for \(1\le j\le s_n\).

First, we prove (43). For every \(n \ge 1\), when \(|\beta |\rightarrow \infty\),

$$\begin{aligned} \min _{1\le j \le s_n} \{ \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\,\beta ) - \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\, b_{j})\} \ge \kappa _n {\mathcal {A}}_n |\beta |- \max _{1\le j \le s_n}\ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\, b_{j}) {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty . \end{aligned}$$

Thus (43) holds.

Second, we prove (42). Since \({\mathcal {A}}_n=O(1)\), we see that \(|{\mathcal {A}}_n\, \beta |\le O(1) \delta \rightarrow 0\) as \(\delta \rightarrow 0+\). For \(1\le j\le s_n\), by Taylor’s expansion,

$$\begin{aligned} \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\,\beta ) & = \frac{1}{n} \sum _{i=1}^n Q_{{q}}(Y_{i}, {\widehat{\mu }}_{_0}) + {\mathcal {A}}_n\,\beta \frac{1}{n} \sum _{i=1}^n \text{q}_{_1}(Y_{i}; {\widehat{\alpha }}_0) \{X_{i,j}-{\text{E}}(X_{j})\} \\ & \quad + \frac{{\mathcal {A}}_n^2\,\beta ^2}{2} \frac{1}{n} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; \theta _{ij}^*) \{{\widehat{\alpha }}_j'({\mathcal {A}}_n\,\beta _j^*)+X_{i,j}\}^2 + {\mathcal {A}}_n\,\kappa _n |\beta |, \end{aligned}$$

where \({\widehat{\mu }}_{_0}=F^{-1}({\widehat{\alpha }}_0)\), \(\theta _{ij}^*=\theta _{ij}({\mathcal {A}}_n\, \beta _{j}^*)\), \(\theta _{ij}(\beta )={\widehat{\alpha }}_j(\beta ) +X_{i,j}\beta\) and \(\beta _{j}^*\) is between 0 and \(\beta\). Thus we have that

$$\begin{aligned} & \min _{1\le j \le s_n} \Big \{ \inf _{|\beta |\le \delta } \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\,\beta ) - \ell _{n,j}^{{\text{PMR}}*}({\mathcal {A}}_n\, b_{j})\Big \} \\ & \quad \ge {\mathcal {A}}_n \min _{1\le j \le s_n} \inf _{|\beta |\le \delta } \bigg [(\beta - b_{j}) \frac{1}{n} \sum _{i=1}^n \text{q}_{_1}(Y_{i}; {\widehat{\alpha }}_0) \{X_{i,j}-{\text{E}}(X_{j})\}\bigg ] \\ & \qquad + \frac{{\mathcal {A}}_n^2}{2} \min _{1\le j \le s_n} \inf _{|\beta |\le \delta } \bigg [\beta ^2 \frac{1}{n} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; \theta _{ij}^*) \{{\widehat{\alpha }}_j'({\mathcal {A}}_n\,\beta _j^*)+X_{i,j}\}^2 \\ & \qquad - b_{j}^2 \frac{1}{n} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; c_{ij}^{*}) \{{\widehat{\alpha }}_j'({\mathcal {A}}_nb_j^*)+X_{i,j}\}^2 \bigg ] \\ & \qquad + {\mathcal {A}}_n \min _{1\le j \le s_n} \inf _{|\beta |\le \delta } \{\kappa _n (|\beta |- |b_{j} |)\} \\ & \quad \equiv I_1 + I_2 + I_3, \end{aligned}$$

where \(c_{ij}^{*}=\theta _{ij}({\mathcal {A}}_n\, b_j^*)\), with \(b_{j}^*\) between 0 and \(b_{j}\). Let \({\widehat{C}}_0 = q''({\widehat{\mu }}_{_0})/F'({\widehat{\mu }}_{_0}) \ne 0\). Then \({\widehat{C}}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}C_0\), where \(C_0=q''(\mu _{_0})/F'(\mu _{_0})\). We obtain

$$\begin{aligned} & I_1 \\ & \quad = {\mathcal {A}}_n \min _{1\le j \le s_n} \inf _{|\beta |\le \delta }\{ {\widehat{C}}_0 (\beta - b_{j}) {\text{cov}}(X_{j}, Y) \} \\ & \qquad + {\mathcal {A}}_n \min _{1\le j \le s_n} \inf _{|\beta |\le \delta } \bigg ( {\widehat{C}}_0 (\beta - b_{j}) \frac{1}{n}\sum _{i=1}^n [(Y_{i}-\mu _{_0}) \{X_{i,j}-{\text{E}}(X_{j})\} - {\text{cov}}(X_{j}, Y)] \bigg ) \\ & \qquad - {\mathcal {A}}_n \max _{1\le j \le s_n} \sup _{|\beta |\le \delta } \Big [ {\widehat{C}}_0 ({\widehat{\mu }}_{_0}-\mu _{_0}) (\beta - b_{j}) \frac{1}{n}\sum _{i=1}^n \{X_{i,j} - {\text{E}}(X_{j})\} \Big ] \\ & \quad \equiv I_{1,1} + I_{1,2} + I_{1,3}. \end{aligned}$$

Choosing \(b_{j} = -2\delta \,\text{sign}\{{\widehat{C}}_0\, {\text{cov}}(X_{j},Y)\}\), which satisfies \(|b_{j} |= 2\delta\), gives

$$\begin{aligned} I_{1,1} & = {\mathcal {A}}_n \min _{1\le j \le s_n} \inf _{|\beta |\le \delta } \{\beta {\widehat{C}}_0\, {\text{cov}}(X_{j},Y) + 2\delta |{\widehat{C}}_0\, {\text{cov}}(X_{j},Y) |\} \\ \ge & {\mathcal {A}}_n\, \delta |{\widehat{C}}_0 |\min _{1\le j \le s_n} |{\text{cov}}(X_{j},Y) |\ge |{\widehat{C}}_0 |c {\mathcal {A}}_n^2\, \delta . \end{aligned}$$

By Bernstein's inequality (van der Vaart and Wellner 1996, Lemma 2.2.11), \(|I_{1,2} |= O_{{\text{P}}}({\mathcal {A}}_n \{\log (s_n)/n\}^{1/2})\delta\). Similarly, \(|I_{1,3} |\le o_{_{{\text{P}}}}({\mathcal {A}}_n \{\log (s_n)/n\}^{1/2})\delta\). For terms \(I_2\) and \(I_3\), we observe that \(|I_2 |\le O_{{\text{P}}}({\mathcal {A}}_n^2)\,\delta ^2\) and \(|I_3 |= O({\mathcal {A}}_n\,\kappa _n)\delta\). The conditions \(\log (p_{_n}) = o(n \kappa _n^2)\) and \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\) imply that \(\{\log (s_n)/n\}^{1/2}/{\mathcal {A}}_n=o(1)\); hence we can choose a small enough \(\delta > 0\) such that, with probability tending to one, \(I_{1,2}\), \(I_{1,3}\), \(I_2\) and \(I_3\) are dominated by \(I_{1,1}\), which is positive. Thus (42) is proved.
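For reference, a standard bounded-increment form of Bernstein's inequality (closely related to the cited Lemma 2.2.11) conveys the bound invoked above: for independent mean-zero random variables \(Z_1,\ldots ,Z_n\) with \(|Z_i |\le M\) and \(\sum _{i=1}^n {\text{E}}(Z_i^2) \le v\),

$$\begin{aligned} {\text{P}}\Big (\Big |\sum _{i=1}^n Z_i\Big |\ge x\Big ) \le 2 \exp \Big \{\frac{-x^2}{2(v + Mx/3)}\Big \}, \qquad x > 0. \end{aligned}$$

Applied coordinatewise together with a union bound over the \(s_n\) indices, it yields the \(O_{{\text{P}}}(\{\log (s_n)/n\}^{1/2})\) rate of the centered sample averages entering \(I_{1,2}\).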

Part 2. For \({\mathcal {C}}_n=\lambda _n\sqrt{n}/s_n\), we will show that \({\widehat{w}}_{\min }^{({\text{II}})} {\mathcal {C}}_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\). It suffices to prove that for any \({\epsilon }> 0\), there exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \(\lim _{n\rightarrow \infty } {\text{P}}(\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{PMR}}} |\le {\mathcal {C}}_n {\epsilon }) = 1.\) Similar to the proof of Lemma 1, we will prove that for any \({\epsilon }> 0\),

$$\begin{aligned} \lim _{n\rightarrow \infty } {\text{P}}\Big (\min _{s_n+1\le j\le p_{_n}}\Big \{ \inf _{|\beta |= {\epsilon }} \ell _{n,j}^{{\text{PMR}}*}({\mathcal {C}}_n \beta ) - \ell _{n,j}^{{\text{PMR}}*}(0)\Big \} > 0 \Big ) = 1. \end{aligned}$$
(44)

Since \({\mathcal {C}}_n\rightarrow 0\) as \(n\rightarrow \infty\), we have that by Taylor’s expansion,

$$\begin{aligned} & \min _{s_n+1\le j\le p_{_n}}\Big \{ \inf _{|\beta |= {\epsilon }} \ell _{n,j}^{{\text{PMR}}*}({\mathcal {C}}_n \beta ) - \ell _{n,j}^{{\text{PMR}}*}(0)\Big \} \\ & \quad \ge {\mathcal {C}}_n \min _{s_n+1\le j\le p_{_n}} \inf _{|\beta |= {\epsilon }} \Big [\beta \frac{1}{n} \sum _{i=1}^n \text{q}_{_1}(Y_{i}; {\widehat{\alpha }}_0) \{X_{i,j}-{\text{E}}(X_j)\}\Big ] \\ & \qquad + \frac{{\mathcal {C}}_n^2}{2}\min _{s_n+1\le j\le p_{_n}} \inf _{|\beta |= {\epsilon }} \Big [ \beta ^2\frac{1}{n} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; \theta _{ij}({\mathcal {C}}_n\beta _{j}^*)) \{{\widehat{\alpha }}_j'({\mathcal {C}}_n\beta _{j}^*)+X_{i,j}\}^2\Big ] \\ & \qquad + {\mathcal {C}}_n \inf _{|\beta |= {\epsilon }} (\kappa _n|\beta |) \\ & \quad \equiv I_1 + I_2 + I_3, \end{aligned}$$

where \(\beta _{j}^*\) is between 0 and \(\beta\). Similar to the proof in Part 1,

$$\begin{aligned} & I_1={\mathcal {C}}_n \min _{s_n+1\le j\le p_{_n}} \inf _{|\beta |={\epsilon }}\{ {\widehat{C}}_0 \beta {\text{cov}}(X_{j}, Y) \} \\ & \qquad + {\mathcal {C}}_n \min _{s_n+1\le j\le p_{_n}} \inf _{|\beta |={\epsilon }} \bigg ( {\widehat{C}}_0 \beta \frac{1}{n}\sum _{i=1}^n [(Y_{i}-\mu _{_0})\{X_{i,j}-{\text{E}}(X_j)\} - {\text{cov}}(X_{j},Y)] \bigg ) \\ & \qquad - {\mathcal {C}}_n \max _{s_n+1\le j\le p_{_n}} \sup _{|\beta |={\epsilon }} \Big [ {\widehat{C}}_0 ({\widehat{\mu }}_{_0}-\mu _{_0}) \beta \frac{1}{n}\sum _{i=1}^n \{X_{i,j}-{\text{E}}(X_j)\} \Big ] \\ & \equiv I_{1,1} + I_{1,2} + I_{1,3}. \end{aligned}$$

Then \(|I_{1,1} |\le o({\mathcal {C}}_n {\mathcal {B}}_n {\epsilon }),\) \(|I_{1,2} |\le O_{{\text{P}}}[{\mathcal {C}}_n \{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }\) and \(|I_{1,3} |\le o_{_{{\text{P}}}}[{\mathcal {C}}_n \{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }\). Hence \(|I_1 |\le O_{{\text{P}}}[{\mathcal {C}}_n\{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }+ o({\mathcal {C}}_n {\mathcal {B}}_n){\epsilon }.\) For the term \(I_2\), we have that \(|I_2 |\le O_{{\text{P}}}({\mathcal {C}}_n^2){\epsilon }^2.\) Note \(I_3 = {\mathcal {C}}_n \kappa _n {\epsilon }.\) Since \(\log (p_{_n}) = o(n \kappa _n^2)\), \({\mathcal {B}}_n = O(\kappa _n)\) and \({\mathcal {C}}_n = o(\kappa _n)\), it follows that with probability tending to one, terms \(I_1\) and \(I_2\) are dominated by \(I_3\), which is positive. So (44) is proved. \(\square\)

Proof of Theorem 8

It is easy to see that \({\widehat{\beta }}_j^{{\text{MR}}} = \arg \min _\beta \ell ^{{\text{MR}}*}_{n,j}(\beta )\), where \(\ell ^{{\text{MR}}*}_{n,j}(\beta ) = \ell ^{{\text{MR}}}_{n,j}({\widehat{\alpha }}_j(\beta ),\beta )\), and \({\widehat{\alpha }}_j(\beta )\) satisfies \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta )=0\) for \(j = 1,\ldots ,p_{_n}\). From (11), \({\widehat{\alpha }}_1(0)=\cdots ={\widehat{\alpha }}_{p_{_n}}(0)\). Let \({\widehat{\alpha }}_0 = {\widehat{\alpha }}_1(0)\). Then \({\widehat{\alpha }}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}\alpha _0\), where \(\alpha _0 = F(\mu _{_0})\) with \(\mu _{_0} = {\text{E}}(Y)\). Let \(h_{n,j}(\beta ) =\frac{{\text{d}}}{{\text{d}}\beta } \ell _{n,j}^{{\text{MR}}*}(\beta )= n^{-1} \sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta ) \{{\widehat{\alpha }}_j'(\beta )+X_{i,j}\}\). Then \(h_{n,j}'(\beta ) = n^{-1} \sum _{i=1}^n \text{q}_{_2}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta ) \{{\widehat{\alpha }}_j'(\beta )+X_{i,j}\}^2\) and \(h_{n,j}''(\beta ) = n^{-1} \sum _{i=1}^n \text{q}_{3i}(\beta )\). The minimizer \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (17) satisfies the estimating equation \(h_{n,j}({\widehat{\beta }}^{{\text{MR}}}_{j})= 0\). The rest of the proof consists of two parts.

Part 1. For \({\mathcal {A}}_n=\lambda _n \sqrt{n}\), we will show that \({\widehat{w}}_{\max }^{({\text{I}})}{\mathcal {A}}_n=O_{{\text{P}}}(1)\), which is \({{\mathcal {A}}_n}/{\min _{1\le j\le s_n} |{\widehat{\beta }}_{j}^{{\text{MR}}} |} =O_{{\text{P}}}(1)\). That is, \(\lim _{\delta \rightarrow 0+} \sup _{n\ge 1} {\text{P}}(\min _{1\le j\le s_n} |{\widehat{\beta }}_{j}^{{\text{MR}}} |< {\mathcal {A}}_n \delta )=0\). Using the Bonferroni inequality, it suffices to show that

$$\begin{aligned} \lim _{\delta \rightarrow 0+} \sup _{n\ge 1} \sum _{j=1}^{s_n} {\text{P}}(|{\widehat{\beta }}_{j}^{{\text{MR}}} |< {\mathcal {A}}_n\, \delta )=0. \end{aligned}$$

With assumption (11) for the convex \({\text{BD}}\), \(h_{n,j}(\cdot )\) is an increasing function. Thus

$$\begin{aligned} {\text{P}}(|{\widehat{\beta }}_{j}^{{\text{MR}}} |< {\mathcal {A}}_n\, \delta ) \le {\text{P}}\{h_{n,j}(-{\mathcal {A}}_n\,\delta ) \le 0 \le h_{n,j}({\mathcal {A}}_n\,\delta )\}. \end{aligned}$$

Note that \({\mathcal {A}}_n=O(1)\) gives \({\mathcal {A}}_n\,\delta \rightarrow 0\) as \(\delta \rightarrow 0+\). By Taylor’s expansion, for \(1\le j\le s_n\), we have that

$$\begin{aligned} h_{n,j}(\pm {\mathcal {A}}_n\,\delta ) & = \frac{1}{n} \sum _{i=1}^n \text{q}_{_1}(Y_{i}; {\widehat{\alpha }}_0) \{X_{i,j}-{\text{E}}(X_j)\} \\ & \quad + (\pm {\mathcal {A}}_n\,\delta ) \frac{1}{n} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; {\widehat{\alpha }}_0) \{{\widehat{\alpha }}_j'(0)+X_{i,j}\}^2 \\ & \quad + \frac{1}{2} ({\mathcal {A}}_n\,\delta )^2 \frac{1}{n} \sum _{i=1}^n \text{q}_{3i}({\mathcal {A}}_n\,\delta _j^*) \equiv I_{1j}+I_{2j}+I_{3j}, \end{aligned}$$

with \(\delta _j^* \in (0, \delta )\). Let \({\widehat{C}}_0 = q''({\widehat{\mu }}_{_0})/F'({\widehat{\mu }}_{_0}) \ne 0\), where \({\widehat{\mu }}_{_0}=F^{-1}({\widehat{\alpha }}_0)\). Then \({\widehat{C}}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}C_0\), where \(C_0=q''(\mu _{_0})/F'(\mu _{_0})\). We obtain

$$\begin{aligned} I_{1j} & = \frac{1}{n} \sum _{i=1}^n (Y_{i} - {\widehat{\mu }}_{_0}) {\widehat{C}}_0 \{X_{i,j}-{\text{E}}(X_j)\} \\ & = {\widehat{C}}_0\, {\text{cov}}(X_j, Y) +{\widehat{C}}_0 \frac{1}{n} \sum _{i=1}^n [(Y_{i} - \mu _{_0}) \{X_{i,j}-{\text{E}}(X_j)\} - {\text{cov}}(X_j, Y)] \\ & \quad -{\widehat{C}}_0({\widehat{\mu }}_{_0}-\mu _{_0})\frac{1}{n} \sum _{i=1}^n \{X_{i,j}-{\text{E}}(X_j)\} \equiv I_{1j, 1}+I_{1j,2}+I_{1j,3}. \end{aligned}$$

Because \({\mathcal {A}}_n = O(1)\), \(|{\text{cov}}(X_j, Y) |\ge c\, {\mathcal {A}}_n\), \(1\le j\le s_n\), and both \(\max _{1\le j\le s_n} {\text{E}}[n^{-1} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; {\widehat{\alpha }}_0) \{{\widehat{\alpha }}_j'(0)+X_{i,j}\}^2]\) and \(\max _{1\le j\le s_n} {\text{E}}\{n^{-1} \sum _{i=1}^n |\text{q}_{3i}({\mathcal {A}}_n\,\delta _j^*) |\}\) are bounded, we can choose \(\delta\) small enough such that, uniformly for all \(1 \le j\le s_n\), the term \(I_{1j, 1}={\widehat{C}}_0\, {\text{cov}}(X_j, Y)\) dominates \(I_{2j}\) and \(I_{3j}\). By assuming \({\widehat{C}}_0\, {\text{cov}}(X_j, Y) < 0\) without loss of generality,

$$\begin{aligned} & {\text{P}}({|{\widehat{\beta }}^{{\text{MR}}}_{j} |} \le {\mathcal {A}}_n\, \delta ) \le {\text{P}}(0 \le h_{n,j}({\mathcal {A}}_n\, \delta )) \nonumber \\ & \qquad \le {\text{P}}(I_{1j, 2} + I_{1j, 3} \ge C {\mathcal {A}}_n) \le 4 \exp \Big (\frac{-n^2 {\mathcal {A}}_n^2}{C_1n + C_2 n {\mathcal {A}}_n}\Big ), \end{aligned}$$
(45)

for some positive constants C, \(C_1\) and \(C_2\), where the last inequality applies the Bernstein inequality. By (45), for a small enough \(\delta > 0\),

$$\begin{aligned} \sum _{j=1}^{s_n} {\text{P}}(|{\widehat{\beta }}_{j}^{{\text{MR}}} |< {\mathcal {A}}_n\, \delta ) \le 4 s_n \exp \Big (\frac{-n^2 {\mathcal {A}}_n^2}{C_1n + C_2 n {\mathcal {A}}_n}\Big ) = o(1). \end{aligned}$$
(46)

The equality in (46) follows from \({\mathcal {A}}_n = O(1)\), \(\lambda _n n \rightarrow \infty\) and \(\log (s_n) = o(\lambda _n^2 n^2)\), where the latter two are implied by the conditions \(\lambda _n n/s_n \rightarrow \infty\) and \(\log (p_{_n}) = o(\lambda _n^2 n^2/s_n^2)\).

Part 2. For \({\mathcal {B}}_n=\lambda _n \sqrt{n}/s_n\), we will prove that \({\widehat{w}}_{\min }^{({\text{II}})} {\mathcal {B}}_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), which is \({\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{MR}}} |} / {{\mathcal {B}}_n} = o_{_{{\text{P}}}}(1)\). Namely, for any \({\epsilon }> 0\), \(\lim _{n\rightarrow \infty } {\text{P}}(\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{MR}}} |\ge {\mathcal {B}}_n {\epsilon }) = 0\). By the Bonferroni inequality, it suffices to show that

$$\begin{aligned} \lim _{n\rightarrow \infty } \sum _{j=s_n+1}^{p_{_n}} {\text{P}}(|{\widehat{\beta }}_{j}^{{\text{MR}}} |\ge {\mathcal {B}}_n {\epsilon }) = 0. \end{aligned}$$

Since \(h_{n,j}(\cdot )\) is increasing, we have that for \(j=s_n+1, \ldots , p_{_n}\),

$$\begin{aligned} {\text{P}}(|{\widehat{\beta }}_{j}^{{\text{MR}}} |\ge {\mathcal {B}}_n {\epsilon }) \le {\text{P}}\{h_{n,j}(-{\mathcal {B}}_n {\epsilon }) \ge 0\} + {\text{P}}\{h_{n,j}({\mathcal {B}}_n {\epsilon }) \le 0\}. \end{aligned}$$
(47)

Similar to Part 1, \({\mathcal {B}}_n=o(1)\) gives that for \(j=s_n+1, \ldots , p_{_n}\),

$$\begin{aligned} h_{n,j}(\pm {\mathcal {B}}_n {\epsilon }) & = \frac{1}{n} \sum _{i=1}^n \text{q}_{_1}(Y_{i}; {\widehat{\alpha }}_0) \{X_{i,j}-{\text{E}}(X_j)\} \\ & \quad + (\pm {\mathcal {B}}_n {\epsilon }) \frac{1}{n} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; {\widehat{\alpha }}_0) \{{\widehat{\alpha }}_j'(0)+X_{i,j}\}^2 \\ & \quad + \frac{1}{2} ({\mathcal {B}}_n {\epsilon })^2 \frac{1}{n} \sum _{i=1}^n \text{q}_{3i}({\mathcal {B}}_n {\epsilon }_j^*) \equiv I_{1j}+J_{2j}+J_{3j}, \end{aligned}$$

with \({\epsilon }_j^* \in (0, {\epsilon })\). Since \({\mathcal {B}}_n=o(1)\), \(|{\text{cov}}(X_{j}, Y)|= o({\mathcal {B}}_n)\) for \(s_n+1\le j\le p_{_n}\), and, from Condition \({{\text{E}}2}\), \(|J_{2j} |\ge {\mathcal {B}}_n{\epsilon }\eta\), it follows that as \(n\rightarrow \infty\), \(J_{2j}\) dominates \(I_{1j,1}\) and \(J_{3j}\). Applying Bernstein's inequality, for large \(n\),

$$\begin{aligned} {\text{P}}\{h_{n,j}({\mathcal {B}}_n{\epsilon }) \le 0\}\le & {\text{P}}(I_{1j,2} + I_{1j,3} \le -C {\mathcal {B}}_n{\epsilon }) \nonumber \\ \le & 4 \exp \Big (\frac{-{\epsilon }^2 n^2 {\mathcal {B}}_n^2}{C_1n + C_2 {\epsilon }n {\mathcal {B}}_n}\Big ), \end{aligned}$$
(48)

for some positive constants C, \(C_1\) and \(C_2\), where \(I_{1j}\), \(I_{1j,2}\) and \(I_{1j,3}\) are as defined in Part 1. Similarly,

$$\begin{aligned} {\text{P}}\{h_{n,j}(-{\mathcal {B}}_n{\epsilon }) \ge 0\}\le & {\text{P}}(I_{1j,2} + I_{1j,3} \ge C {\mathcal {B}}_n{\epsilon }) \nonumber \\\le & 4 \exp \Big (\frac{-{\epsilon }^2 n^2 {\mathcal {B}}_n^2}{C_1n + C_2 {\epsilon }n {\mathcal {B}}_n}\Big ). \end{aligned}$$
(49)

Thus by (47), (48) and (49),

$$\begin{aligned} \sum _{j=s_n+1}^{p_{_n}} {\text{P}}(|{\widehat{\beta }}_{j}^{{\text{MR}}} |\ge {\mathcal {B}}_n {\epsilon }) \le 8(p_{_n} - s_n) \exp \Big (\frac{-{\epsilon }^2 n^2 {\mathcal {B}}_n^2}{C_1n + C_2 {\epsilon }n {\mathcal {B}}_n}\Big ) = o(1). \end{aligned}$$
(50)

The equality in (50) follows from the conditions \({\mathcal {B}}_n = o(1)\), \(\lambda _n n/s_n \rightarrow \infty\) and \(\log (p_{_n}) = o(\lambda _n^2 n^2/s_n^2)\). \(\square\)

Proof of Theorem 9

For part (i), note that for \({\varvec{X}}^o = ({\varvec{X}}^{o({\text{I}})T}, {\varvec{X}}^{o({\text{II}})T})^T\), \({\widetilde{{\varvec{X}}}}^o = (1, {\varvec{X}}^{oT})^T\) and \({\widetilde{{\varvec{X}}}}^{o({\text{I}})} = (1,{\varvec{X}}^{o({\text{I}})T})^T\),

$$\begin{aligned} |{\widehat{m}}({\varvec{X}}^o)- m({\varvec{X}}^o) | & = |F^{-1}({\widetilde{{\varvec{X}}}}^{o({\text{I}})T} \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) - F^{-1}({\widetilde{{\varvec{X}}}}^{o({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) |\\ & \le |(F^{-1})'({\widetilde{{\varvec{X}}}}^{o({\text{I}})T} {\widetilde{{\varvec{\beta }}}}^*) |\Vert {\widetilde{{\varvec{X}}}}^{o({\text{I}})T}\Vert _2 \Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2, \end{aligned}$$

for some \({\widetilde{{\varvec{\beta }}}}^*\) located between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\). By Condition \({\text{A}}4\), we conclude that \((F^{-1})'({\widetilde{{\varvec{X}}}}^{o({\text{I}})T} {\widetilde{{\varvec{\beta }}}}^*) = O_{{\text{P}}}(1)\). This along with \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(r_n)\) and \(\Vert {\widetilde{{\varvec{X}}}}^{o({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n})\) implies that \(|{\widehat{m}}({\varvec{X}}^o)- m({\varvec{X}}^o) |= O_{{\text{P}}}(r_n\sqrt{s_n}) = o_{_{{\text{P}}}}(1).\) The rest of the proof is similar to that of Theorem 9 in Zhang et al. (2010) and is omitted.

For part (ii), using the proof similar to Lemma A1 of Zhang et al. (2010), we obtain that for any \({\text{BD}}\) \(\texttt {Q}\) satisfying (4),

$$\begin{aligned} {\text{E}}\{\texttt {Q}(Y^o,{\widehat{m}}({\varvec{X}}^o))\mid \mathcal T_n, {\varvec{X}}^o\} = {\text{E}}\{\texttt {Q}(Y^o, m({\varvec{X}}^o))\mid {\varvec{X}}^o\}+\texttt {Q}(m({\varvec{X}}^o),{\widehat{m}}({\varvec{X}}^o)). \end{aligned}$$

It follows that

$$\begin{aligned} & {\text{E}}\{\texttt {Q}(Y^o,{\widehat{m}}({\varvec{X}}^o))\mid \mathcal T_n\} \\ & \quad = {\text{E}}[{\text{E}}\{\texttt {Q}(Y^o,{\widehat{m}}({\varvec{X}}^o))\mid \mathcal T_n, {\varvec{X}}^o\} \mid \mathcal T_n] \\ & \quad = {\text{E}}[{\text{E}}\{\texttt {Q}(Y^o, m({\varvec{X}}^o))\mid {\varvec{X}}^o\}+\texttt {Q}(m({\varvec{X}}^o),{\widehat{m}}({\varvec{X}}^o)) \mid \mathcal T_n] \\ & \quad = {\text{E}}[{\text{E}}\{\texttt {Q}(Y^o, m({\varvec{X}}^o))\mid {\varvec{X}}^o\} \mid \mathcal T_n] + {\text{E}}\{\texttt {Q}(m({\varvec{X}}^o),{\widehat{m}}({\varvec{X}}^o)) \mid \mathcal T_n\} \\ & \quad = {\text{E}}[{\text{E}}\{\texttt {Q}(Y^o, m({\varvec{X}}^o))\mid {\varvec{X}}^o\} ] + {\text{E}}\{\texttt {Q}(m({\varvec{X}}^o),{\widehat{m}}({\varvec{X}}^o)) \mid \mathcal T_n\} \\ & \quad = {\text{E}}\{\texttt {Q}(Y^o, m({\varvec{X}}^o))\} + {\text{E}}\{\texttt {Q}(m({\varvec{X}}^o),{\widehat{m}}({\varvec{X}}^o)) \mid \mathcal T_n\}. \end{aligned}$$

Setting \(\texttt {Q}\) to be the misclassification loss implies

$$\begin{aligned} R({\widehat{\phi }} \mid \mathcal T_n) \ge R(\phi _{\mathrm {\scriptscriptstyle B}}), \end{aligned}$$

which combined with part (i) completes the proof. \(\square\)

1.2 Additional numerical studies

1.2.1 Gaussian responses in Sect. 6.3

Random samples \(\{({\varvec{X}}_{i}, Y_i)\}_{i=1}^n\) of size \(n=200\) are generated from the model,

$$\begin{aligned} {\varvec{X}}_{i}=(X_{i,1},\ldots ,X_{i,p_{_n}})^T \sim N({\textbf{0}},\Sigma _{p_{_n}}), \quad Y_i \mid {\varvec{X}}_{i} \sim N(\beta _{0;0} + {\varvec{X}}_{i}^T{\varvec{\beta }}_{0}, \sigma ^2), \end{aligned}$$

where \(\beta _{0;0} = 1\), \({\varvec{\beta }}_{0} = (2, 1.5, 0.8, -1.5, 0.4, 0,\ldots ,0)^T\) with \(\sigma ^2 =1\). Here \(\Sigma _{p_{_n}}(j,k)=\rho ^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}\), with \(\rho =0.1\). The quadratic loss is used as the \({\text{BD}}\).
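The following is a minimal sketch (not the authors' MATLAB code) of this data-generating mechanism, with the AR(1)-type covariance \(\Sigma _{p_{_n}}(j,k)=\rho ^{|j-k|}\) and the sparse linear model; the function name and random seed are illustrative choices.

```python
import numpy as np

def simulate_gaussian(n=200, p=50, rho=0.1, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma(j,k) = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta0 = 1.0                                          # intercept beta_{0;0}
    beta = np.zeros(p)
    beta[:5] = [2, 1.5, 0.8, -1.5, 0.4]                  # sparse true coefficients
    Y = beta0 + X @ beta + sigma * rng.standard_normal(n)
    return X, Y, beta0, beta
```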

Study 1 (raw data without outliers). For simulated data in the non-contaminated case, the results are summarized in Table 7. The robust estimators perform very similarly to their non-robust counterparts.

Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set, where 7 data points \((X_{i,j},Y_i)\) are contaminated as follows: They are replaced by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = Y_i {\text{I}}\{|Y_i-m({\varvec{X}}_{i}) |/\sigma > 2\} + 15 {\text{I}}\{|Y_i-m({\varvec{X}}_{i}) |/\sigma \le 2\}\), \(i=1,\ldots ,7\),

$$\begin{aligned} \begin{matrix} X_{1, 1}^* = 5\, \,\text{sign}(U_1-.5), & X_{2, 1}^* = 5\, \,\text{sign}(U_2-.5), & X_{3, 1}^* = 5\, \,\text{sign}(U_3-.5), \\ X_{4, 3}^* = 5\, \,\text{sign}(U_4-.5), & X_{5, 5}^* = 5\, \,\text{sign}(U_5-.5), & X_{6, 9}^* = 5\, \,\text{sign}(U_6-.5), \\ X_{7, 9}^* = 5\, \,\text{sign}(U_7-.5), & & \end{matrix} \end{aligned}$$

with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 8 summarizes the results over 500 sets of contaminated data. Comparing each estimator across Tables 7 and 8 indicates that contamination substantially increases the estimation errors \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\) and reduces either \({\text {C-Z}}\) or \({\text {C-NZ}}\). Moreover, the non-robust estimates are clearly more sensitive to outliers than their robust counterparts.
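A minimal sketch of the Study 2 contamination scheme is given below, assuming the objects from the previous sketch (`X`, `Y`, `beta0`, `beta`, `sigma`) are available; \(m({\varvec{X}}_i)\) denotes the true mean \(\beta _{0;0} + {\varvec{X}}_i^T{\varvec{\beta }}_0\), and the column indices are 0-based counterparts of the paper's 1-based indices.

```python
import numpy as np

def contaminate(X, Y, beta0, beta, sigma=1.0, seed=1):
    rng = np.random.default_rng(seed)
    Xc, Yc = X.copy(), Y.copy()
    m = beta0 + X @ beta
    # response outliers: the first 7 responses are set to 15 unless they already
    # deviate from m(X_i) by more than 2*sigma
    for i in range(7):
        if abs(Yc[i] - m[i]) / sigma <= 2:
            Yc[i] = 15.0
    # high-leverage covariates at +/-5 in selected coordinates
    cols = [0, 0, 0, 2, 4, 8, 8]          # columns 1,1,1,3,5,9,9 in the paper's indexing
    U = rng.uniform(size=7)
    for i, j in enumerate(cols):
        Xc[i, j] = 5.0 * np.sign(U[i] - 0.5)
    return Xc, Yc
```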

To further assess the impact of the sample size \(n\) on the parameter estimates, we display boxplots of \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,8\), using the \({\text{PMR}}\) selection method for the weighted-\(L_1\) penalty, in Fig. 3 (\(n=200\)) and Fig. 4 (\(n=100\)). The comparison supports the consistency of both the classical and robust estimates of the large dimensional model parameters for clean data as \(n\) increases, as well as the stability of the robust estimates under a small amount of contamination.

Table 7 (Simulation study: Gaussian responses, \(n=200\)) Summary results for Study 1 (raw data without outliers)
Table 8 (Simulation study: Gaussian responses, \(n=200\)) Summary results for Study 2 (contaminated data with outliers)
Fig. 3

(Simulation study: Gaussian responses, \(n=200\), \(p_{_n}=50\)  (left panel) and \(p_{_n}=500\) (right panel)) Boxplots of  \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,8\),  corresponding to results in Tables 7 and 8, using the \({\text{PMR}}\)  selection method for penalty weights in the weighted-\(L_1\) penalty. The first row: raw data and using non-robust method; the second row: raw data and using robust method; the third row: contaminated data and using non-robust method; the fourth row: contaminated data and using robust method

Fig. 4

(Simulation study: Gaussian responses, \(n=100\), \(p_{_n}=50\)  (left panel) and \(p_{_n}=500\) (right panel)) The caption is similar to that in Fig. 3, except \(n=100\)

1.2.2 Real data analysis

We consider the classification of the Colon cancer data discussed in Alon et al. (1999) and available at http://genomics-pubs.princeton.edu/oncology/. The data consist of 2000 genes and 62 samples, of which 22 are from normal colon tissues and 40 from tumor tissues. Similar to the analysis in Sect. 7, the data set is randomly split into two parts, with 45 samples used for training and the remaining 17 for testing. Table 9 summarizes the average test error (\(\text{TE}\)) and the average number of selected genes over 100 random splits. We observe that the robust procedures tend to select fewer genes than the non-robust procedures without much increase in the test errors, which lends further support to the practicality of the proposed penalized robust-\({\text{BD}}\) estimation.
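The evaluation protocol can be sketched as follows (not the authors' code); `fit_penalized_bd` is a hypothetical stand-in for fitting the penalized (robust-)\({\text{BD}}\) classifier, assumed to return a prediction function and the fitted coefficient vector.

```python
import numpy as np

def evaluate_splits(X, y, fit_penalized_bd, n_splits=100, n_train=45, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    test_errors, n_selected = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        predict, coef = fit_penalized_bd(X[tr], y[tr])
        test_errors.append(np.mean(predict(X[te]) != y[te]))   # misclassification rate
        n_selected.append(np.count_nonzero(coef))              # number of selected genes
    return np.mean(test_errors), np.mean(n_selected)
```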

Table 9 (Real data) Classification for the Colon cancer data

1.3 Numerical procedure for penalized robust-\({\text{BD}}\) estimator in (3)

1.3.1 Optimization algorithm

Numerically, the penalized robust-\({\text{BD}}\) estimators in (3), combined with the penalties used in Sects. 6 and 7, are implemented by extending the coordinate descent (\({\text{CD}}\)) iterative algorithm (Friedman et al., 2010). The initial value is \((b,0,\ldots ,0)^T\), where \(b=\log \{(\overline{Y}_n+0.1)/(1-\overline{Y}_n+0.1)\}\) for Bernoulli responses and \(b=\log (\overline{Y}_n+0.1)\) for count responses, with \(\overline{Y}_n\) the sample mean of \(\{Y_i\}_{i=1}^n\). Namely, the loss term

$$\begin{aligned} L({\widetilde{{\varvec{\beta }}}})=n^{-1}\sum _{i=1}^n \rho _{{q}}(Y_{i}, F^{-1}({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}))\, w({\varvec{X}}_{i}) \end{aligned}$$

in (3) is locally approximated by a weighted form of quadratic loss functions, and the optimization solution of (3) is obtained by the \({\text{CD}}\) method. In particular, the gradient vector and Hessian matrix of \(L({\widetilde{{\varvec{\beta }}}})\) are

$$\begin{aligned} L'({\widetilde{{\varvec{\beta }}}}) & = n^{-1}\sum _{i=1}^n \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}, \\ L''({\widetilde{{\varvec{\beta }}}}) & = n^{-1}\sum _{i=1}^n \text{p}_{_2}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i} {\widetilde{{\varvec{X}}}}_{i}^T. \end{aligned}$$
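These two quantities can be sketched as follows; this is a minimal illustration, not the authors' MATLAB implementation, and assumes that the functions `p1`, `p2` (for \(\text{p}_{_1}(y;\theta )\) and \(\text{p}_{_2}(y;\theta )\) in (9)) and the covariate weight `w` are supplied by the user, since their exact form depends on the chosen \({\text{BD}}\) and \(\psi\)-function.

```python
import numpy as np

def loss_derivatives(beta_tilde, X, Y, p1, p2, w):
    n, p = X.shape
    Xt = np.hstack([np.ones((n, 1)), X])          # design with intercept, X~_i
    theta = Xt @ beta_tilde                       # linear predictors X~_i^T beta~
    wts = np.array([w(x) for x in X])             # covariate weights w(X_i)
    g1 = np.array([p1(Y[i], theta[i]) for i in range(n)]) * wts
    g2 = np.array([p2(Y[i], theta[i]) for i in range(n)]) * wts
    grad = Xt.T @ g1 / n                          # gradient L'(beta~)
    hess = (Xt * g2[:, None]).T @ Xt / n          # Hessian L''(beta~)
    return grad, hess
```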

The quadratic approximation is supported by the fact that the Hessian matrix of \(L({\widetilde{{\varvec{\beta }}}})\) evaluated at the true parameter vector \({\widetilde{{\varvec{\beta }}}}_0\) is

$$\begin{aligned} L''({\widetilde{{\varvec{\beta }}}}_0) & = n^{-1}\sum _{i=1}^n \text{p}_{_2}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{\beta }}}}_0)\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i} {\widetilde{{\varvec{X}}}}_{i}^T \\ & = {\text{E}}[{\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0) \mid {\varvec{X}}\}\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}} {\widetilde{{\varvec{X}}}}^T] +o_{_{{\text{P}}}}(1), \end{aligned}$$

which, combined with the property \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0) \mid {\varvec{X}}\}\ge 0\) discussed in part (d) of Sect. 2.2, indicates that with probability tending to one, the matrix \(L''({\widetilde{{\varvec{\beta }}}}_0)\) is positive semidefinite.

Both \(\text{p}_{_1}(y;\theta )\) and \(\text{p}_{_2}(y;\theta )\) in \(L'({\widetilde{{\varvec{\beta }}}})\) and \(L''({\widetilde{{\varvec{\beta }}}})\) are calculated using (9), which incorporates the Huber and Tukey \(\psi\)-functions; where the derivative \(\psi '(r)\) does not exist, it can be replaced by a subgradient or a smooth approximation.
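For concreteness, a minimal sketch of the Huber and Tukey (biweight) \(\psi\)-functions and the (sub)gradients \(\psi '(r)\) is given below; the tuning constants shown are conventional defaults and need not coincide with the values used in the paper's experiments.

```python
import numpy as np

def huber_psi(r, c=1.345):
    return np.clip(r, -c, c)

def huber_psi_prime(r, c=1.345):
    # subgradient 1{|r| <= c}; the value at the kink |r| = c is immaterial
    return (np.abs(r) <= c).astype(float)

def tukey_psi(r, c=4.685):
    u = r / c
    return np.where(np.abs(u) <= 1, r * (1 - u**2) ** 2, 0.0)

def tukey_psi_prime(r, c=4.685):
    u = r / c
    return np.where(np.abs(u) <= 1, (1 - u**2) * (1 - 5 * u**2), 0.0)
```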

1.3.2 Pseudo codes, source codes and computational complexity analysis

Algorithm 1 summarizes the complete procedure for numerically solving the “penalized robust-\({\text{BD}}\)  estimator” in (3).

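To convey the flavor of one iteration, the following is a minimal sketch (not the authors' Algorithm 1) of a single coordinate-descent pass over the locally quadratic surrogate with a weighted-\(L_1\) penalty, in the spirit of Friedman et al. (2010). Here `v` denotes the per-observation curvature weights \(\text{p}_{_2}(Y_i;\theta _i)\,w({\varvec{X}}_i)\), `z` the working responses \(\theta _i - \text{p}_{_1}(Y_i;\theta _i)/\text{p}_{_2}(Y_i;\theta _i)\), and `pen_weights` the penalty weights; the intercept is left unpenalized.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def cd_pass(beta_tilde, Xt, z, v, lam, pen_weights):
    n, p_plus_1 = Xt.shape                 # Xt includes the intercept column first
    for j in range(p_plus_1):
        # partial residual excluding coordinate j
        r_j = z - Xt @ beta_tilde + Xt[:, j] * beta_tilde[j]
        num = np.sum(v * Xt[:, j] * r_j) / n
        den = np.sum(v * Xt[:, j] ** 2) / n
        if j == 0:                         # intercept: unpenalized update
            beta_tilde[j] = num / den
        else:                              # weighted-L1 soft-thresholding update
            beta_tilde[j] = soft_threshold(num, lam * pen_weights[j - 1]) / den
    return beta_tilde
```

The quantities `v` and `z` would be refreshed from the current linear predictors before each pass, so that repeated passes alternate between the quadratic approximation and the coordinatewise updates.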

To illustrate the computational cost, Tables 10 and 11 compare the runtimes of the non-robust and robust procedures. All computations are performed using MATLAB (Version 9.12.0.1956245, R2022a Update 2) on Windows 11 with a 12th Gen Intel(R) Core(TM) i9-12900 processor (2400 MHz, 16 cores, 24 logical processors). MATLAB source codes are available at https://github.com/ChunmingZhangUW/Robust_penalized_BD_high_dim_GLM. For either clean or contaminated data, the computational cost depends on the type of response variable, the dimensionality, and the procedure: Poisson-type responses are more computationally intensive than Gaussian responses; robust procedures are slower than their non-robust counterparts; and higher dimensions demand more computation than lower-dimensional settings.

Table 10 The total CPU time (in seconds) for overdispersed Poisson responses in 500 replications
Table 11 The total CPU time (in seconds) for Gaussian responses in 500 replications

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, C., Zhu, L. & Shen, Y. Robust estimation in regression and classification methods for large dimensional data. Mach Learn 112, 3361–3411 (2023). https://doi.org/10.1007/s10994-023-06349-2
