Abstract
Statistical data analysis and machine learning heavily rely on error measures for regression, classification, and forecasting. Bregman divergence (\({\text{BD}}\)) is a widely used family of error measures, but it is not robust to outlying observations or high leverage points in large- and high-dimensional datasets. In this paper, we propose a new family of robust Bregman divergences, called "robust-\({\text{BD}}\)", that is less sensitive to data outliers. We explore their suitability for sparse large-dimensional regression models with incompletely specified response variable distributions and propose a new estimate, called the "penalized robust-\({\text{BD}}\) estimate", that achieves the same oracle property as ordinary non-robust penalized least-squares and penalized-likelihood estimates. Extensive numerical experiments evaluate the proposed penalized robust-\({\text{BD}}\) estimate and show that it improves on classical non-robust approaches. Finally, we analyze a real dataset to illustrate the practicality of the proposed method. Our findings suggest that it can be a useful tool for robust statistical data analysis and machine learning in the presence of outliers and large-dimensional data.
Data availability
The Lymphoma data studied in Sect. 7 is publicly available from Alizadeh et al. (2000); the Colon cancer dataset in Appendix 1.2.2 is available at http://genomics-pubs.princeton.edu/oncology/.
Code availability
MATLAB codes are available at https://github.com/ChunmingZhangUW/Robust_penalized_BD_high_dim_GLM.
References
Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., … Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the USA, 96, 6745–6750.
Altun, Y., & Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In G. Lugosi & H. U. Simon (Eds.), Learning theory: 19th annual conference on learning theory (pp. 139–153). Springer.
Bianco, A. M., & Yohai, V. J. (1996). Robust estimation in the logistic regression model. In Robust statistics, data analysis, and computer intensive methods (Schloss Thurnau, 1994) (Vol. 109, pp. 17–34), Lecture Notes in Statist., Springer.
Boente, G., He, X., & Zhou, J. (2006). Robust estimates in generalized partially linear models. Annals of Statistics, 34, 2856–2878.
Brègman, L. M. (1967). A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 620–631.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when \(p\) is much larger than \(n\). Annals of Statistics, 35, 2313–2351.
Cantoni, E., & Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association, 96, 1022–1030.
Dupuis, D. J., & Victoria-Feser, M.-P. (2011). Fast robust model selection in large datasets. Journal of the American Statistical Association, 106, 203–212.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.
Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32, 928–961.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106, 746–762.
Gong, P., Zhang, C., Lu, Z., Huang, J., & Ye, J. (2013). A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In The 30th international conference on machine learning (ICML 2013).
Grünwald, P. D., & Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 32, 1367–1433.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. Wiley.
Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. Wiley.
Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.
Kanamori, T., Takenouchi, T., Eguchi, S., & Murata, N. (2007). Robust loss functions for boosting. Neural Computation, 19, 2183–2244.
Künsch, H., Stefanski, L., & Carroll, R. (1989). Conditionally unbiased bounded influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84, 460–466.
Lafferty, J. D., Della Pietra, S., & Della Pietra, V. (1997). Statistical learning algorithms based on Bregman distances. In Proceedings of the 5th Canadian workshop on information theory.
Lafferty, J. (1999). Additive models, boosting and inference for generalized divergences. In: Proceedings of the twelfth annual conference on computational learning theory (pp. 125–133). ACM Press.
Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group Lasso for logistic regression. Journal of the Royal Statistical Society Series B, 70, 53–71.
Stefanski, L., Carroll, R., & Ruppert, D. (1986). Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73, 413–424.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.
van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: With applications to statistics. Springer.
Vapnik, V. (1996). The nature of statistical learning theory. Springer.
Vemuri, B. C., Liu, M., Amari, S.-I., & Nielsen, F. (2011). Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 30, 475–483.
Wright, S. J. (1997). Primal-dual interior-point methods. SIAM.
Wu, Y. C., & Liu, Y. F. (2007). Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102, 974–983.
Zhang, C. M., Jiang, Y., & Shang, Z. (2009). New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Canadian Journal of Statistics, 37, 119–139.
Zhang, C. M., Jiang, Y., & Chai, Y. (2010). Penalized Bregman divergence for large-dimensional regression and classification. Biometrika, 97, 551–566.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., & Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. Annals of Statistics, 36, 1108–1126.
Funding
Zhang’s work was partially supported by U.S. National Science Foundation grants DMS-2013486 and DMS-1712418, and by support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. The second author gratefully acknowledges grants from the National Natural Science Foundation of China (Nos. 12131006 and 12071038) and a grant from the University Grants Council of Hong Kong.
Author information
Authors and Affiliations
Contributions
CM Zhang contributed to the proof, computation, and writing; LXZ contributed to the discussion, proof, and writing; YBS contributed to the implementation of \({\text{SVM}}\) and robust-\({\text{SVM}}\).
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
All authors consent to the publication of this paper.
Additional information
Editors: Krzysztof Dembczynski and Emilie Devijver.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Supplementary Appendix
1. Proofs, figures and tables, algorithm
Notations and symbols.
For a vector \({\varvec{a}}=(a_1,\ldots ,a_d)^T\), \(\Vert {\varvec{a}}\Vert _1={\sum _{j=1}^d |a_j|}\), \(\Vert {\varvec{a}}\Vert _2=(\sum _{j=1}^d a_j^2)^{1/2}\), and \(\Vert {\varvec{a}}\Vert _{\infty } = \max _{1\le j \le d} |a_j |\). Let \(\textbf{I}_{{\textsf {k}}}\) denote a \({\textsf {k}}\times {\textsf {k}}\) identity matrix, and \({\textbf{0}}_{p,q}\) denote a \(p\times q\) matrix of zero entries. For a matrix M, its eigenvalues, minimum eigenvalue, and maximum eigenvalue are denoted by \(\lambda _j(M)\), \(\lambda _{\min }(M)\), and \(\lambda _{\max }(M)\), respectively; \(\text{tr}(M)\) denotes the trace of a square matrix M; let \(\Vert M\Vert =\Vert M\Vert _2=\sup _{\Vert {\varvec{x}}\Vert _2=1}\Vert M{\varvec{x}}\Vert _2=\{\lambda _{\max }(M^TM)\}^{1/2}\) be the matrix \(L_2\) norm, and \(\Vert M\Vert _F = \{\text{tr}(M^T M)\}^{1/2}\) be the Frobenius norm. Throughout the proof, C is used as a generic finite constant. The sign function \(\,\text{sign}(x)\) equals \(+1\) if \(x>0\), 0 if \(x=0\), and \(-1\) if \(x<0\). For a function g(x), the first-order derivative is \(g'(x)\) or \(g^{(1)}(x)\), the second-order derivative is \(g''(x)\) or \(g^{(2)}(x)\), and the jth-order derivative is \(g^{(j)}(x)\). The chi-squared distribution with k degrees of freedom is denoted by \(\chi _{k}^2\).
The conditional expectation and conditional variance of Y given \({\varvec{X}}\) are denoted by \({\text{E}}(Y \mid {\varvec{X}})\) and \({\text{var}}(Y \mid {\varvec{X}})\) respectively. Notations in the asymptotic derivations follow (van der Vaart, 1998), where \({\mathop {\longrightarrow }\limits ^{\text{P}}}\) denotes convergence in probability, \({\mathop {\longrightarrow }\limits ^{\mathcal L}}\) denotes convergence in distribution, \(o_{{\text{P}}}(1)\) denotes a term that converges to zero in probability, and \(O_{{\text{P}}}(1)\) denotes a term that is bounded in probability.
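The vector and matrix norms defined above can be sanity-checked numerically. The snippet below is our illustration in Python/NumPy (the paper's own code is in MATLAB), evaluating each norm on small concrete inputs:

```python
import numpy as np

# Vector norms from the notation section.
a = np.array([3.0, -4.0, 0.0])
l1 = np.abs(a).sum()              # ||a||_1 = |3| + |-4| + |0| = 7
l2 = np.sqrt((a ** 2).sum())      # ||a||_2 = sqrt(9 + 16) = 5
linf = np.abs(a).max()            # ||a||_inf = 4

# Matrix norms: spectral norm ||M||_2 = {lambda_max(M^T M)}^{1/2}
# and Frobenius norm ||M||_F = {tr(M^T M)}^{1/2}.
M = np.array([[2.0, 0.0],
              [0.0, -3.0]])
spectral = np.sqrt(np.linalg.eigvalsh(M.T @ M).max())  # = 3
frobenius = np.sqrt(np.trace(M.T @ M))                 # = sqrt(4 + 9)
```

For a diagonal matrix the spectral norm is simply the largest absolute diagonal entry, which is why \(\Vert M\Vert _2 = 3\) here while \(\Vert M\Vert _F = \sqrt{13}\).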
1.1. Proofs of Theorems 1–9
We first impose some regularity conditions, which are not the weakest possible but facilitate the technical derivations.
Condition A.
- A0. \(s_n\ge 1\) and \(p_{_n}-s_n \ge 1\). \(\sup _{n\ge 1} \Vert {\varvec{\beta }}_{0}^{({\text{I}})}\Vert _1<\infty\).
- A1. \(\Vert {\varvec{X}}\Vert _{\infty } = \max _{1\le j \le p_{_n}} |X_j |\) is bounded almost surely.
- A2. \({\text{E}}({\widetilde{{\varvec{X}}}}{\widetilde{{\varvec{X}}}}^T)\) exists and is nonsingular in the case of \(p_{_n}+1 \le n\); \({\text{E}}\{{\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}\) exists and is nonsingular in the case of \(p_{_n}+1 > n\).
- A4. There is a large enough open subset of \(\mathbb {R}^{p_{_n}+1}\) which contains the true parameter point \({\widetilde{{\varvec{\beta }}}}_{0}\), such that \(F^{-1}({\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}})\) is bounded almost surely for all \({\widetilde{{\varvec{\beta }}}}\) in the subset.
- A5. \(w(\cdot )\ge 0\) is a bounded function. Assume that \(\psi (r)\) is a bounded, odd, twice-differentiable function such that \(\psi '(r)\), \(\psi '(r)r\), \(\psi ''(r)\), \(\psi ''(r)r\) and \(\psi ''(r)r^2\) are bounded; \(V(\cdot )>0\), \(V^{(2)}(\cdot )\) is continuous. The matrix \(\textbf{H}_n^{({\text{I}})}\) is positive definite, with eigenvalues uniformly bounded away from zero.
- A5\('\). \(w(\cdot )\ge 0\) is a bounded function.
- A6. \(q^{(4)}(\cdot )\) is continuous, and \(q^{(2)}(\cdot )<0\). \(g_1^{(2)} (\cdot )\) is continuous.
- A7. \(F(\cdot )\) is a monotone bijection, \(F^{(3)}(\cdot )\) is continuous, and \(F^{(1)}(\cdot )\ne 0\).
Condition B.
- B3. There exists a constant \(C \in (0,\infty )\) such that \(\sup _{n\ge 1}{\text{E}}\{|Y-m({\varvec{X}})|^j \} \le j! C^j\) for all \(j\ge 3\). Also, \(\inf _{n\ge 1,\, 1\le j\le p_{_n}} {\text{E}}\{{\text{var}}(Y\mid {\varvec{X}}) X_j^2 \} > 0\).
- B5. The matrices \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) are positive definite, with eigenvalues uniformly bounded away from zero. Also, \(\Vert (\textbf{H}_n^{({\text{I}})})^{-1} \Omega _n^{({\text{I}})}\Vert _2\) is bounded away from infinity.
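Condition B3 is a Bernstein-type moment condition. As a sanity check (our illustration, not part of the paper), a Gaussian error \(Y-m({\varvec{X}})=\sigma Z\) with \(Z\sim N(0,1)\) satisfies it, since \({\text{E}}|Z|^j \le (j-1)!!\) for all \(j\ge 1\):

```latex
\mathrm{E}\,|\sigma Z|^{j}
  = \sigma^{j}\,\mathrm{E}\,|Z|^{j}
  \le \sigma^{j}\,(j-1)!!
  \le j!\,\sigma^{j}
  \le j!\,C^{j},
\qquad C = \max(\sigma, 1), \quad j \ge 3 .
```

More generally, B3 holds for any error distribution with sub-exponential tails, while heavy-tailed errors (e.g. Cauchy) violate it.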
Condition C.
- C4. There is a large enough open subset of \(\mathbb {R}^{p_{_n}+1}\) which contains the true parameter point \({\widetilde{{\varvec{\beta }}}}_{0}\), such that \(F^{-1}({\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}})\) is bounded almost surely for all \({\widetilde{{\varvec{\beta }}}}\) in the subset. Moreover, the subset contains the origin.
Condition D.
- D5. The eigenvalues of \(\textbf{H}_n^{({\text{I}})}\) are uniformly bounded away from zero. Also, \(\Vert (\textbf{H}_n^{({\text{I}})})^{-1/2} (\Omega _n^{({\text{I}})})^{1/2}\Vert _2\) is bounded away from infinity.
Condition E.
- E1. \(\min _{1\le j\le s_n} |{\text{cov}}(X_{j}, Y)|\succeq {\mathcal {A}}_n\) and \(\max _{s_n+1\le j\le p_{_n}}|{\text{cov}}(X_{j}, Y)|= o({\mathcal {B}}_n)\) for some positive sequences \({\mathcal {A}}_n\) and \({\mathcal {B}}_n\), where the symbol \(s_{n} \succeq t_{n}\), for two nonnegative sequences \(s_{n}\) and \(t_{n}\), means that there exists a constant \(c > 0\) such that \(s_{n} \ge c\, t_{n}\) for all \(n\ge 1\).
- E2. \(\sup _{n \ge 1, 1\le j\le s_n} {\text{E}}\{\text{q}_{_2}(Y; \alpha _0) X_{j}^2\} < \infty\); \(\inf _{n \ge 1, s_n+1\le j\le p_{_n}} {\text{E}}\{\text{q}_{_2}(Y; \alpha _0) X_{j}^2\} = \eta > 0\), where \(\alpha _0=F({\text{E}}(Y))\).
Proof of Theorem 1
We first need to show Lemma 1. \(\square\)
Lemma 1
(existence and consistency: \(p_{_n} \ll n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{A}}6\), and \({\text{A}}7\) in Appendix 1.1, that the matrix \(\textbf{H}_n = {\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}{\widetilde{{\varvec{X}}}}^{T}\}\) is positive definite with eigenvalues uniformly bounded away from zero, and that \(w_{\max }^{({\text{I}})}=O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n} \sqrt{s_n/p_{_n}})\}\). If \(p_{_n}^4/n \rightarrow 0\) as \(n\rightarrow \infty\), then there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{p_{_n}/n})\).
Proof
We follow the idea of the proof of Theorem 1 in Fan and Peng (2004). Let \(r_n = \sqrt{p_{_n}/n}\) and \({\widetilde{{\varvec{u}}}}_n=(u_0,u_1,\ldots ,u_{p_{_n}})^T\in \mathbb {R}^{p_{_n}+1}\). It suffices to show that for any given \({\epsilon }>0\), there exists a sufficiently large constant \(C_{\epsilon }\) such that, for large n we have
This implies that with probability at least \(1-{\epsilon }\), there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of \(\ell _{n}({\widetilde{{\varvec{\beta }}}})\) in the ball \(\{{\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n: \Vert {\widetilde{{\varvec{u}}}}_n\Vert _2\le C_{\epsilon }\}\) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2=O_{{\text{P}}}(r_n)\). To show (20), consider
where \(\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2=C_{\epsilon }\).
First, we consider \(I_1\). By Taylor’s expansion, \(I_1\) has the decomposition,
where \(I_{1,1} = {r_n } /{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n\), \(I_{1,2} = {r_n^2}/{(2n)}\sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^2\), and
\(I_{1,3} = {r_n^3}/{(6n)}\sum _{i=1}^n \text{p}_{_3}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}^*) \, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^3\) for \({\widetilde{{\varvec{\beta }}}}^*\) located between \({\widetilde{{\varvec{\beta }}}}_{0}\) and \({\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n\). Hence
For the term \(I_{1,2}\) in (22),
where \(I_{1,2,1} = 2^{-1}{r_n^2}{\widetilde{{\varvec{u}}}}_n^T \textbf{H}_n{\widetilde{{\varvec{u}}}}_n.\) Meanwhile, we have
Thus,
For the term \(I_{1,3}\) in (22), we observe that
which follows from Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}4\) and \({\text{A}}5\).
Next, we consider \(I_2\) in (21). Note \(I_2 = \lambda _n \sum _{j=1}^{s_n} w_{n,j} (|\beta _{j;0} + r_n u_j |-|\beta _{j;0}|) +\lambda _n r_n \sum _{j=s_n+1}^{p_{_n}} w_{n,j} |u_j |\). Clearly, by the triangle inequality,
in which
where \({\varvec{u}}_n^{({\text{I}})}=(u_1,\ldots ,u_{s_n})^T\). By (23)–(25) and \(p_{_n}^4/n \rightarrow 0\), we can choose some large \(C_{\epsilon }\) such that \(I_{1,1}\), \(I_{1,3}\) and \(I_{2,1}\) are all dominated by the first term of \(I_{1,2}\) in (24), which is positive by the eigenvalue assumption. This implies (20). \(\square\)
We now show Theorem 1. Write \({\widetilde{{\varvec{u}}}}_n= ({\widetilde{{\varvec{u}}}}_n^{({\text{I}})T}, {\varvec{u}}_n^{({\text{II}})T})^T\), where \({\widetilde{{\varvec{u}}}}_n^{({\text{I}})}=(u_0,u_1,\ldots ,u_{s_n})^T\) and \({\varvec{u}}_n^{({\text{II}})}=(u_{s_n+1},\ldots ,u_{p_{_n}})^T\). Following the proof of Lemma 1, it suffices to show (20) for \(r_n=\sqrt{s_n/n}\).
For the term \(I_{1,1}\) in (22),
It follows that
For the term \(I_{1,2}\) in (22), similar to the proof of Lemma 1, \(I_{1,2}=I_{1,2,1}+I_{1,2,2}\). We observe that
Then there exists a constant \(C>0\) such that
For the term \(I_{1,2,2}\),
where
For the term \(I_{1,3}\) in (22), since \(s_n p_{_n}=o(n)\), \(\Vert {\widetilde{{\varvec{\beta }}}}^*\Vert _1\) is bounded and thus
where
For the term \(I_2\) in (21), \(I_2 \ge I_{2,1}^{({\text{I}})}+I_{2,1}^{({\text{II}})},\) where \(I_{2,1}^{({\text{I}})}=-\lambda _n r_n \sum _{j=1}^{s_n} w_{n,j} |u_j |\) and \(I_{2,1}^{({\text{II}})}={\lambda _n r_n \sum _{j=s_n+1}^{p_{_n}} w_{n,j}|u_j |}\). Thus, we have
It can be shown that either \(I_{1,2,1}^{({\text{I}})}\) or \(I_{2,1}^{({\text{II}})}\) dominates all other terms in groups \(\mathcal G_1=\{I_{1,2,2}^{({\text{I}})}, I_{1,3}^{({\text{I}})}\}\), \(\mathcal G_2=\{I_{1,1}^{({\text{II}})}, I_{1,2,2}^{({\text{II}})}, I_{1,3}^{({\text{II}})}, I_{1,2,1}^{({\text{cross}})}, I_{1,2,2}^{({\text{cross}})}\}\) and \(\mathcal G_3=\{I_{1,1}^{({\text{I}})}, I_{2,1}^{({\text{I}})}\}\). Namely, \(I_{1,2,1}^{({\text{I}})}\) dominates \(\mathcal G_1\), and \(I_{2,1}^{({\text{II}})}\) dominates \(\mathcal G_2\). For \(\mathcal G_3\), since \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 \le C_{\epsilon }\), we have that
Hence, if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 \le C_{\epsilon }/2\), then \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 > C_{\epsilon }/2\), and thus \(\mathcal G_3\) is dominated by \(I_{1,2,1}^{({\text{I}})}\), which is positive; if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 > C_{\epsilon }/2\), then \(\mathcal G_3\) is dominated by \(I_{2,1}^{({\text{II}})}\), which is positive. This completes the proof. \(\square\)
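The case split closing the proof rests on the constraint \(\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2=C_{\epsilon }\). Spelled out (our restatement of the implicit step): if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 \le C_{\epsilon }/2\), then since \(\Vert \cdot \Vert _2 \le \Vert \cdot \Vert _1\),

```latex
\Vert\widetilde{\varvec{u}}_n^{(\mathrm{I})}\Vert_2^2
  = C_{\epsilon}^2 - \Vert\varvec{u}_n^{(\mathrm{II})}\Vert_2^2
  \ge C_{\epsilon}^2 - \Vert\varvec{u}_n^{(\mathrm{II})}\Vert_1^2
  \ge C_{\epsilon}^2 - C_{\epsilon}^2/4
  = \tfrac{3}{4}\,C_{\epsilon}^2 ,
```

so that \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 \ge (\sqrt{3}/2)\,C_{\epsilon } > C_{\epsilon }/2\), as used above.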
Proof of Theorem 2
We first need to show Lemma 2. \(\square\)
Lemma 2
Assume Condition A in Appendix 1.1. If \(s_n^2/n=O(1)\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), then with probability tending to one, for any given \({\widetilde{{\varvec{\beta }}}}=({\widetilde{{\varvec{\beta }}}}^{({\text{I}})T}, {\varvec{\beta }}^{({\text{II}})T})^T\) satisfying \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\) and any constant \(C>0\), it follows that \(\ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})},{\textbf{0}}) = \min _{\Vert {\varvec{\beta }}^{({\text{II}})}\Vert _2 \le C\sqrt{s_n/n}} \ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})},{\varvec{\beta }}^{({\text{II}})})\).
Proof
It suffices to prove that with probability tending to one, for any \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}\) satisfying \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), the following inequalities hold for \(s_n+1 \le j \le p_{_n}\),
namely, with probability tending to one,
Proofs for showing both inequalities are similar; we only need to show (26). Note that for \(\beta _{j} \ne 0\),
where \({\widetilde{{\varvec{\beta }}}}^*\) lies between \({\widetilde{{\varvec{\beta }}}}_{0}\) and \({\widetilde{{\varvec{\beta }}}}\). It follows that
The first term \(I_1\) satisfies that
For the term \(I_2\),
Therefore, by (27) and (28), the left side of (26) is
By \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), (26) is proved. \(\square\)
We now show Theorem 2. By Lemma 2, the first part of Theorem 2 holds: \(\widehat{{\widetilde{{\varvec{\beta }}}}}=(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\textbf{0}}^T)^T\). To verify the second part of Theorem 2, notice the estimating equations \(\frac{\partial \ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}, {\textbf{0}})}{\partial {\widetilde{{\varvec{\beta }}}}^{({\text{I}})}} |_{{\widetilde{{\varvec{\beta }}}}^{({\text{I}})}=\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}} = {\textbf{0}}\), since \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is a local minimizer of \(\ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}, {\textbf{0}})\). Denote \({\varvec{d}}_n({{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) = \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\}\), which equals \({\varvec{d}}_n\) when \({{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}={\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\). Since \(\min _{1\le j\le s_n}|\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), it follows that
Thus with probability tending to one, \({\varvec{d}}_n(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) = {\varvec{d}}_n({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) = {\varvec{d}}_n\). Taylor’s expansion applied to the loss part on the left side of the estimating equations yields
where both \({\widetilde{{\varvec{\beta }}}}^{*({\text{I}})}\) and \({\widetilde{{\varvec{\beta }}}}^{**({\text{I}})}\) lie between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\). Below, we will show
First, to show (30), note that \(K_2 - \textbf{H}_n^{({\text{I}})} = K_2-{\text{E}}(K_2) \equiv L_1\). Similar arguments for the proof of Lemma 1 give \(\Vert L_1\Vert _2 = O_{{\text{P}}}(s_n/\sqrt{n})\).
Second, a similar proof used for \(I_{1,3}\) in (22) completes (31).
Third, by (29)–(31) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), we see that
where \(\Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/n)\). Note that by Condition \({\text{B}}5\),
Thus
To complete proving the second part of Theorem 2, we apply the Lindeberg-Feller central limit theorem (van der Vaart, 1998) to \(\sum _{i=1}^n {\varvec{Z}}_{i}\), where \({\varvec{Z}}_{i} = - n^{-1/2} A_n (\Omega _n^{({\text{I}})})^{-1/2} \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}\). It suffices to check two conditions: (I) \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) \rightarrow \mathbb {G}\); (II) \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = o(1)\) for some \(\delta > 0\). Condition (I) follows from the fact that \({\text{var}}\{\text{p}_{_1}(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}\} = \Omega _n^{({\text{I}})}\). To verify condition (II), notice that using Conditions \({\text{B}}5\) and \({\text{A}}5\),
Thus, we get \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \le O\{n (s_n/n)^{(2+\delta )/2}\} = O\{s_n^{(2+\delta )/2} / n^{\delta /2}\}\), which is o(1). This verifies Condition (II). \(\square\)
Proof of Theorem 3
Before showing Theorem 3, Lemma 3 is needed. \(\square\)
Lemma 3
Assume conditions of Theorem 3. Then
Proof
Following (32) in the proof of Theorem 2, we observe that \(\Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/n) = o_{_{{\text{P}}}}(n^{-1/2})\). Furthermore, \(\Vert {\varvec{d}}_n\Vert _2 \le \sqrt{s_n} \lambda _n w_{\max }^{({\text{I}})} = o_{_{{\text{P}}}}(n^{-1/2})\). Condition \({\text{B}}5\) completes the proof for the first part.
To show the second part, denote \(U_n = A_n (\textbf{H}_n^{({\text{I}})})^{-1}\Omega _n^{({\text{I}})} (\textbf{H}_n^{({\text{I}})})^{-1}A_n^T\) and \({\widehat{U}}_n = A_n ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}{\widehat{\Omega }}_n^{({\text{I}})} ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}A_n^T\). Notice that the eigenvalues of \((\textbf{H}_n^{({\text{I}})})^{-1} \Omega _n^{({\text{I}})} (\textbf{H}_n^{({\text{I}})})^{-1}\) are uniformly bounded away from zero. So are the eigenvalues of \(U_n\). From the first part, we see that
It follows that
where \({\varvec{Z}}_{i} = - n^{-1/2} U_n^{-1/2} A_n (\textbf{H}_n^{({\text{I}})})^{-1} \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}\). To show \(\sum _{i=1}^n {\varvec{Z}}_{i} {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}})\), similar to the proof for Theorem 2, we check two conditions: (III) \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) \rightarrow \textbf{I}_{{\textsf {k}}}\); (IV) \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = o(1)\) for some \(\delta > 0\). Condition (III) is straightforward since \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) = U_n^{-1/2} U_n U_n^{-1/2} = \textbf{I}_{{\textsf {k}}}\). To check condition (IV), similar arguments used in the proof of Theorem 2 give that \({\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = O\{(s_n/n)^{{(2+\delta )}/{2}}\}.\) This and the boundedness of the \(\psi\)-function yield \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \le O\{s_n^{(2+\delta )/2}/n^{\delta /2}\} = o(1)\). Hence
Also, it can be concluded that \(\Vert {\widehat{U}}_n - U_n \Vert _2=o_{_{{\text{P}}}}(1)\) and that the eigenvalues of \({\widehat{U}}_n\) are uniformly bounded away from zero and infinity with probability tending to one. Consequently,
Combining (33), (34) and Slutsky’s theorem completes the proof that \(\sqrt{n} {\widehat{U}}_n^{-1/2}A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}})\). \(\square\)
We now show Theorem 3, which follows directly from the null hypothesis \(H_0\) in (14) and the second part of Lemma 3. This completes the proof. \(\square\)
Proof of Theorem 4
The proof of Theorem 4 is similar to that of Theorem 7, except that in Part 2, \({\mathcal {C}}_n\) is changed from \(\lambda _n\sqrt{n}/s_n\) to \(\lambda _n\). \(\square\)
Proof of Theorem 5
The proof of Theorem 5 is similar to that of Theorem 8, except that in Part 2, \({\mathcal {B}}_n\) is changed from \(\lambda _n\sqrt{n}/s_n\) to \(\lambda _n\). \(\square\)
Proof of Theorem 6
Assumption (19) implies that \(\ell _{n}({\widetilde{{\varvec{\beta }}}})\) in (3) is convex in \({\widetilde{{\varvec{\beta }}}}\). By the Karush-Kuhn-Tucker conditions (Wright 1997, Theorem A.2), a set of sufficient conditions for an estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}} = ({\widehat{\beta }}_{0},{\widehat{\beta }}_{1},\ldots ,{\widehat{\beta }}_{p_{_n}})^T\) to be a global minimizer of (3) is that
Before proving Theorem 6, we first show Lemma 4. \(\square\)
Lemma 4
(existence and consistency: \(p_{_n}\gg n\)) Assume (19) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5'\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Suppose \(s_n^4/n \rightarrow 0\), \(\log (p_{_n}-s_n)/n = O(1)\), \(\log (p_{_n}-s_n)/\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\} = o_{_{{\text{P}}}}(1)\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\). Assume \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\). Then with probability tending to one, there exists a global minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) of \(\ell _n({\widetilde{{\varvec{\beta }}}})\) in (3) which satisfies that
- \(\mathrm {(i)}\) \(\widehat{{\varvec{\beta }}}^{({\text{II}})} = {\textbf{0}}\),
- \(\mathrm {(ii)}\) \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is the minimizer of the oracle subproblem,
$$\begin{aligned} \ell _{n}^O({\widetilde{{\varvec{\beta }}}}^{({\text{I}})})=\frac{1}{n}\sum _{i=1}^n \rho _{{q}}(Y_{i}, F^{-1}({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})}))\, w({\varvec{X}}_{i})+\lambda _n\sum _{j=1}^{s_n} w_{n,j}|\beta _{j} |. \end{aligned}$$(36)
Proof
Let \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} = ({\widehat{b}}_{0},{\widehat{b}}_{1},\ldots ,{\widehat{b}}_{s_n})^T\) be the minimizer of the subproblem (36). By Karush-Kuhn-Tucker necessary conditions (Wright 1997, Theorem A.1), \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\) satisfies that
In the following, we will verify conditions
and
It then follows, from (37), (38) and (35), that \((\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})T}, {\textbf{0}}^T)^T\) is the global minimizer of (3). This will in turn imply Lemma 4.
First, we prove that (37) holds with probability tending to one. Applying Lemma 1 to the subproblem (36), we conclude that \(\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\). Since \({\min _{1\le j\le s_n}|\beta _{j;0}|/ \sqrt{s_n/n}} \rightarrow \infty\) as \(n \rightarrow \infty\), it is seen that
Hence (37) holds with probability tending to one.
Second, we prove that (38) holds with probability tending to one. It suffices to prove that
By Taylor’s expansion, we have that
with \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})*}\) located between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\). Then (39) holds if we can prove
and
We first prove (40). Set \(\text{p}_{1i} = \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\). Since \({\log (p_{_n}-s_n)=O(n)}\) and \({\log (p_{_n}-s_n) = o_{_{{\text{P}}}}\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\}}\), we see that
This implies (40).
Second, we prove (41). Since \(\Vert {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _1 < \infty\) and \(\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), it follows that
and then \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})*}\Vert _1 = O_{{\text{P}}}(1)\), thus
Here \({w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty }\) is used. Hence (41) is proved. \(\square\)
The first part of Theorem 6 follows from the first part of Lemma 4. The second part of Theorem 6 follows directly from applying Theorem 2 to the oracle subproblem (36). \(\square\)
Proof of Theorem 7
It is easy to see that \({\widehat{\beta }}_j^{{\text{PMR}}} = \arg \min _\beta \ell ^{{\text{PMR}}*}_{n,j}(\beta )\), where \(\ell ^{{\text{PMR}}*}_{n,j}(\beta ) = \ell ^{{\text{PMR}}}_{n,j}({\widehat{\alpha }}_j(\beta ),\beta )\), and \({\widehat{\alpha }}_j(\beta )\) satisfies \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta )=0\) for \(j = 1,\ldots ,p_{_n}\). From (11), \({\widehat{\alpha }}_1(0)=\cdots ={\widehat{\alpha }}_{p_{_n}}(0)\). Let \({\widehat{\alpha }}_0 = {\widehat{\alpha }}_1(0)\). Then \({\widehat{\alpha }}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}\alpha _0\), where \(\alpha _0 = F(\mu _{_0})\) with \(\mu _{_0} = {\text{E}}(Y)\). The rest of the proof contains two parts.
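The profiled intercept \(\widehat{\alpha }_j(\beta )\) above is defined only implicitly by the estimating equation \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; \alpha +X_{i,j}\beta )=0\). A minimal Python sketch of this profiling step is given below, assuming the quadratic \({\text{BD}}\) so that \(\text{q}_{_1}(y;\theta )=\theta -y\); the function names `profile_intercept` and `q1_quadratic` are ours for illustration, not from the authors' MATLAB code.

```python
import numpy as np

def q1_quadratic(y, theta):
    # Score of the quadratic BD: d/d(theta) of (y - theta)^2 / 2 is theta - y.
    return theta - y

def profile_intercept(q1, y, xj, beta, lo=-50.0, hi=50.0, tol=1e-10):
    """Solve n^{-1} sum_i q1(Y_i; alpha + X_{ij} beta) = 0 for alpha by bisection.

    Assumes the mean score is increasing in alpha (true for a convex BD).
    """
    def mean_score(alpha):
        return np.mean(q1(y, alpha + xj * beta))
    a, b = lo, hi
    assert mean_score(a) < 0 < mean_score(b), "bracket does not contain the root"
    while b - a > tol:
        mid = 0.5 * (a + b)
        if mean_score(mid) < 0:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=200)
xj = rng.normal(size=200)
# For the quadratic BD, the profiled intercept has the closed form mean(y - xj*beta),
# which the bisection solution should match.
alpha_hat = profile_intercept(q1_quadratic, y, xj, beta=0.5)
print(alpha_hat)
```

For a general convex \({\text{BD}}\) the same bisection applies with the corresponding score \(\text{q}_{_1}\), since monotonicity of the mean score in \(\alpha\) is guaranteed by convexity.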
Part 1. For \({\mathcal {A}}_n=\lambda _n \sqrt{n}\), we will show that \({\widehat{w}}_{\max }^{({\text{I}})} {\mathcal {A}}_n = O_{{\text{P}}}(1)\). It suffices to show that there exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \(\lim _{\delta \rightarrow 0+} \inf _{n\ge 1} {\text{P}}(\min _{1\le j \le s_n} |{\widehat{\beta }}_{j}^{{\text{PMR}}} |> {\mathcal {A}}_n \delta ) = 1.\) It suffices to prove that, for \(1\le j\le s_n\), there exist some \(b_{j}\) with \(|b_{j} |= 2\delta\) such that
and there exists some large enough \(C_n>0\) such that
Equations (42) and (43) imply that with probability tending to one, there must exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \({\mathcal {A}}_n\, \delta< |{\widehat{\beta }}_{j}^{{\text{PMR}}} |< {\mathcal {A}}_n\, C_n\) for \(1\le j\le s_n\).
First, we prove (43). For every \(n \ge 1\), when \(|\beta |\rightarrow \infty\),
Thus (43) holds.
Second, we prove (42). Since \({\mathcal {A}}_n=O(1)\), we see that \(|{\mathcal {A}}_n\, \beta |\le O(1) \delta \rightarrow 0\) as \(\delta \rightarrow 0+\). For \(1\le j\le s_n\), by Taylor’s expansion,
where \({\widehat{\mu }}_{_0}=F^{-1}({\widehat{\alpha }}_0)\), \(\theta _{ij}^*=\theta _{ij}({\mathcal {A}}_n\, \beta _{j}^*)\), \(\theta _{ij}(\beta )={\widehat{\alpha }}_j(\beta ) +X_{i,j}\beta\) and \(\beta _{j}^*\) is between 0 and \(\beta\). Thus we have that
where \(c_{ij}^{*}=\theta _{ij}({\mathcal {A}}_n\, b_j^*)\), with \(b_{j}^*\) between 0 and \(b_{j}\). Let \({\widehat{C}}_0 = q''({\widehat{\mu }}_{_0})/F'({\widehat{\mu }}_{_0}) \ne 0\). Then \({\widehat{C}}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}C_0\), where \(C_0=q''(\mu _{_0})/F'(\mu _{_0})\). We obtain
Choosing \(b_{j} = -2\delta \,\text{sign}\{{\widehat{C}}_0\, {\text{cov}}(X_{j},Y)\}\), which satisfies \(|b_{j} |= 2\delta\), gives
We can see that \(|I_{1,2} |= O_{{\text{P}}}({\mathcal {A}}_n \{\log (s_n)/n\}^{1/2})\delta\) by Bernstein's inequality (van der Vaart and Wellner 1996, Lemma 2.2.11). Similarly, \(|I_{1,3} |\le o_{_{{\text{P}}}}({\mathcal {A}}_n \{\log (s_n)/n\}^{1/2})\delta\). For the terms \(I_2\) and \(I_3\), we observe that \(|I_2 |\le O_{{\text{P}}}({\mathcal {A}}_n^2)\,\delta ^2\) and \(|I_3 |= O({\mathcal {A}}_n\,\kappa _n)\delta\). The conditions \(\log (p_{_n}) = o(n \kappa _n^2)\) and \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\) imply that \(\{\log (s_n)/n\}^{1/2}/{\mathcal {A}}_n=o(1)\). Combining this with \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\), we can choose a small enough \(\delta > 0\) such that, with probability tending to one, \(I_{1,2}\), \(I_{1,3}\), \(I_2\) and \(I_3\) are dominated by \(I_{1,1}\), which is positive. Thus (42) is proved.
Part 2. For \({\mathcal {C}}_n=\lambda _n\sqrt{n}/s_n\), we will show that \({\widehat{w}}_{\min }^{({\text{II}})} {\mathcal {C}}_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\). It suffices to prove that for any \({\epsilon }> 0\), there exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \(\lim _{n\rightarrow \infty } {\text{P}}(\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{PMR}}} |\le {\mathcal {C}}_n {\epsilon }) = 1.\) Similar to the proof of Lemma 1, we will prove that for any \({\epsilon }> 0\),
Since \({\mathcal {C}}_n\rightarrow 0\) as \(n\rightarrow \infty\), we have that by Taylor’s expansion,
where \(\beta _{j}^*\) is between 0 and \(\beta\). Similar to the proof in Part 1,
Then \(|I_{1,1} |\le o({\mathcal {C}}_n {\mathcal {B}}_n {\epsilon }),\) \(|I_{1,2} |\le O_{{\text{P}}}[{\mathcal {C}}_n \{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }\) and \(|I_{1,3} |\le o_{_{{\text{P}}}}[{\mathcal {C}}_n \{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }\). Hence \(|I_1 |\le O_{{\text{P}}}[{\mathcal {C}}_n\{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }+ o({\mathcal {C}}_n {\mathcal {B}}_n){\epsilon }.\) For the term \(I_2\), we have that \(|I_2 |\le O_{{\text{P}}}({\mathcal {C}}_n^2){\epsilon }^2.\) Note \(I_3 = {\mathcal {C}}_n \kappa _n {\epsilon }.\) Since \(\log (p_{_n}) = o(n \kappa _n^2)\), \({\mathcal {B}}_n = O(\kappa _n)\) and \({\mathcal {C}}_n = o(\kappa _n)\), it follows that with probability tending to one, terms \(I_1\) and \(I_2\) are dominated by \(I_3\), which is positive. So (44) is proved. \(\square\)
Proof of Theorem 8
It is easy to see that \({\widehat{\beta }}_j^{{\text{MR}}} = \arg \min _\beta \ell ^{{\text{MR}}*}_{n,j}(\beta )\), where \(\ell ^{{\text{MR}}*}_{n,j}(\beta ) = \ell ^{{\text{MR}}}_{n,j}({\widehat{\alpha }}_j(\beta ),\beta )\), and \({\widehat{\alpha }}_j(\beta )\) satisfies \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta )=0\) for \(j = 1,\ldots ,p_{_n}\). From (11), \({\widehat{\alpha }}_1(0)=\cdots ={\widehat{\alpha }}_{p_{_n}}(0)\). Let \({\widehat{\alpha }}_0 = {\widehat{\alpha }}_1(0)\). Then \({\widehat{\alpha }}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}\alpha _0\), where \(\alpha _0 = F(\mu _{_0})\) with \(\mu _{_0} = {\text{E}}(Y)\). Let \(h_{n,j}(\beta ) =\frac{{\text{d}}}{{\text{d}}\beta } \ell _{n,j}^{{\text{MR}}*}(\beta )= n^{-1} \sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta ) \{{\widehat{\alpha }}_j'(\beta )+X_{i,j}\}\). Then \(h_{n,j}'(\beta ) = n^{-1} \sum _{i=1}^n \text{q}_{_2}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta ) \{{\widehat{\alpha }}_j'(\beta )+X_{i,j}\}^2\) and \(h_{n,j}''(\beta ) = n^{-1} \sum _{i=1}^n \text{q}_{3i}(\beta )\). The minimizer \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (17) satisfies the estimating equations, \(h_{n,j}({\widehat{\beta }}^{{\text{MR}}}_{j})= 0\). The rest of the proof consists of two parts.
Part 1. For \({\mathcal {A}}_n=\lambda _n \sqrt{n}\), we will show that \({\widehat{w}}_{\max }^{({\text{I}})}{\mathcal {A}}_n=O_{{\text{P}}}(1)\), which is \({{\mathcal {A}}_n}/{\min _{1\le j\le s_n} |{\widehat{\beta }}_{j}^{{\text{MR}}} |} =O_{{\text{P}}}(1)\). That is, \(\lim _{\delta \rightarrow 0+} \sup _{n\ge 1} {\text{P}}(\min _{1\le j\le s_n} |{\widehat{\beta }}_{j}^{{\text{MR}}} |< {\mathcal {A}}_n \delta )=0\). Using the Bonferroni inequality, it suffices to show that
With assumption (11) for the convex \({\text{BD}}\), \(h_{n,j}(\cdot )\) is an increasing function. Thus
Note that \({\mathcal {A}}_n=O(1)\) gives \({\mathcal {A}}_n\,\delta \rightarrow 0\) as \(\delta \rightarrow 0+\). By Taylor’s expansion, for \(1\le j\le s_n\), we have that
with \(\delta _j^* \in (0, \delta )\). Let \({\widehat{C}}_0 = q''({\widehat{\mu }}_{_0})/F'({\widehat{\mu }}_{_0}) \ne 0\), where \({\widehat{\mu }}_{_0}=F^{-1}({\widehat{\alpha }}_0)\). Then \({\widehat{C}}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}C_0\), where \(C_0=q''(\mu _{_0})/F'(\mu _{_0})\). We obtain
Because \({\mathcal {A}}_n = O(1)\), \(|{\text{cov}}(X_j, Y) |\ge c\, {\mathcal {A}}_n\), \(1\le j\le s_n\), and both
\(\max _{1\le j\le s_n} {\text{E}}[n^{-1} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; {\widehat{\alpha }}_0) \{{\widehat{\alpha }}_j'(0)+X_{i,j}\}^2]\) and \(\max _{1\le j\le s_n} {\text{E}}\{n^{-1} \sum _{i=1}^n |\text{q}_{3i}({\mathcal {A}}_n\,\delta _j^*) |\}\) are bounded, we can choose \(\delta\) small enough such that, uniformly for all \(1 \le j\le s_n\), the term \(I_{1j, 1}={\widehat{C}}_0\, {\text{cov}}(X_j, Y)\) dominates \(I_{2j}\) and \(I_{3j}\). By assuming \({\widehat{C}}_0\, {\text{cov}}(X_j, Y) < 0\) without loss of generality,
for some positive constants C, \(C_1\) and \(C_2\), where the last inequality applies Bernstein's inequality. By (45), for a small enough \(\delta > 0\),
The equality in (46) follows from \({\mathcal {A}}_n = O(1)\), \(\lambda _n n \rightarrow \infty\) and \(\log (s_n) = o(\lambda _n^2 n^2)\), where the latter two are implied by the conditions \(\lambda _n n/s_n \rightarrow \infty\) and \(\log (p_{_n}) = o(\lambda _n^2 n^2/s_n^2)\).
Part 2. For \({\mathcal {B}}_n=\lambda _n \sqrt{n}/s_n\), we will prove that \({\widehat{w}}_{\min }^{({\text{II}})} {\mathcal {B}}_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), which is \({\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{MR}}} |} / {{\mathcal {B}}_n} = o_{_{{\text{P}}}}(1)\). Namely, for any \({\epsilon }> 0\), \(\lim _{n\rightarrow \infty } {\text{P}}(\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{MR}}} |\ge {\mathcal {B}}_n {\epsilon }) = 0\). By the Bonferroni inequality, it suffices to show that
Since \(h_{n,j}(\cdot )\) is increasing, we have that for \(j=s_n+1, \ldots , p_{_n}\),
Similar to Part 1, \({\mathcal {B}}_n=o(1)\) gives that for \(j=s_n+1, \ldots , p_{_n}\),
with \({\epsilon }_j^* \in (0, {\epsilon })\). Since \({\mathcal {B}}_n=o(1)\) and \(|{\text{cov}}(X_{j}, Y)|= o({\mathcal {B}}_n)\) for \(s_n+1\le j\le p_{_n}\), while Condition \({{\text{E}}2}\) gives \(|J_{2j} |\ge {\mathcal {B}}_n{\epsilon }\eta\), the term \(J_{2j}\) dominates \(I_{1j,1}\) and \(J_{3j}\) as \(n\rightarrow \infty\). Applying Bernstein's inequality, for large n,
for some positive constants C, \(C_1\) and \(C_2\), where \(I_{1j}\), \(I_{1j,2}\) and \(I_{1j,3}\) are as defined in Part 1. Similarly,
The equality in (50) follows from the conditions \({\mathcal {B}}_n = o(1)\), \(\lambda _n n/s_n \rightarrow \infty\) and \(\log (p_{_n}) = o(\lambda _n^2 n^2/s_n^2)\). \(\square\)
Proof of Theorem 9
For part (i), note that for \({\varvec{X}}^o = ({\varvec{X}}^{o({\text{I}})T}, {\varvec{X}}^{o({\text{II}})T})^T\), \({\widetilde{{\varvec{X}}}}^o = (1, {\varvec{X}}^{oT})^T\) and \({\widetilde{{\varvec{X}}}}^{o({\text{I}})} = (1,{\varvec{X}}^{o({\text{I}})T})^T\),
for some \({\widetilde{{\varvec{\beta }}}}^*\) located between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\). By Condition \({\text{A}}4\), we conclude that \((F^{-1})'({\widetilde{{\varvec{X}}}}^{o({\text{I}})T} {\widetilde{{\varvec{\beta }}}}^*) = O_{{\text{P}}}(1)\). This along with \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(r_n)\) and \(\Vert {\widetilde{{\varvec{X}}}}^{o({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n})\) implies that \(|{\widehat{m}}({\varvec{X}}^o)- m({\varvec{X}}^o) |= O_{{\text{P}}}(r_n\sqrt{s_n}) = o_{_{{\text{P}}}}(1).\) The rest of the proof is similar to that of Theorem 9 in Zhang et al. (2010) and is omitted.
For part (ii), using the proof similar to Lemma A1 of Zhang et al. (2010), we obtain that for any \({\text{BD}}\) \(\texttt {Q}\) satisfying (4),
It follows that
Setting \(\texttt {Q}\) to be the misclassification loss implies
which combined with part (i) completes the proof. \(\square\)
1.2 Additional numerical studies
1.2.1 Gaussian responses in Sect. 6.3
Random samples \(\{({\varvec{X}}_{i}, Y_i)\}_{i=1}^n\) of size \(n=200\) are generated from the model,
where \(\beta _{0;0} = 1\), \({\varvec{\beta }}_{0} = (2, 1.5, 0.8, -1.5, 0.4, 0,\ldots ,0)^T\) and \(\sigma ^2 =1\). Here \(\Sigma _{p_{_n}}(j,k)=\rho ^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}\), with \(\rho =0.1\). The quadratic loss is used as the \({\text{BD}}\).
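The simulation design above can be reproduced as in the following Python sketch (the authors' code is in MATLAB); the dimension `p = 50` is our arbitrary choice for illustration, since \(p_{_n}\) varies across the studies.

```python
import numpy as np

def simulate_gaussian_model(n=200, p=50, rho=0.1, sigma=1.0, seed=0):
    """Draw one sample from the Gaussian simulation model of this section:
    X_i ~ N(0, Sigma) with Sigma(j, k) = rho^|j - k|, and
    Y_i = beta_{0;0} + X_i^T beta_0 + sigma * eps_i, with sparse beta_0.
    """
    rng = np.random.default_rng(seed)
    beta0_intercept = 1.0
    beta0 = np.zeros(p)
    beta0[:5] = [2.0, 1.5, 0.8, -1.5, 0.4]  # remaining entries are zero
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR-type correlation
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = beta0_intercept + X @ beta0 + sigma * rng.normal(size=n)
    return X, Y, beta0_intercept, beta0

X, Y, b0, beta0 = simulate_gaussian_model()
print(X.shape, Y.shape)  # (200, 50) (200,)
```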
Study 1 (raw data without outliers). For simulated data in the non-contaminated case, the results are summarized in Table 7. The robust estimators perform very similarly to their non-robust counterparts.
Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set, where 7 data points \((X_{i,j},Y_i)\) are contaminated as follows: They are replaced by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = Y_i {\text{I}}\{|Y_i-m({\varvec{X}}_{i}) |/\sigma > 2\} + 15 {\text{I}}\{|Y_i-m({\varvec{X}}_{i}) |/\sigma \le 2\}\), \(i=1,\ldots ,7\),
with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 8 summarizes the results over 500 sets of contaminated data. Comparing each estimator across Tables 7 and 8 indicates that the presence of contamination substantially increases the estimation errors \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\) and reduces either \({\text {C-Z}}\) or \({\text {C-NZ}}\). Moreover, the non-robust estimates are clearly more sensitive to outliers than their robust counterparts.
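The response-contamination rule of Study 2 can be sketched in Python as below. Only the \(Y_i^*\) rule quoted above is coded; the companion display defining \(X_{i,j}^*\) (which involves the \(U_i\)) is not reproduced in this excerpt, so it is omitted rather than guessed at.

```python
import numpy as np

def contaminate_responses(Y, m_X, sigma=1.0, n_out=7, shift=15.0):
    """Apply the Study 2 rule to the first n_out cases:
    Y_i is kept if it is already far from the regression surface
    (|Y_i - m(X_i)| / sigma > 2), otherwise replaced by the outlying value 15.
    (The companion rule producing X_{i,j}^* is omitted here.)
    """
    Y_star = Y.copy()
    for i in range(n_out):
        if abs(Y[i] - m_X[i]) / sigma <= 2:
            Y_star[i] = shift
    return Y_star

Y = np.array([1.2, 0.5, 9.0, -0.3, 2.1, 0.0, 1.0, 0.7])
m_X = np.zeros(8)
Y_star = contaminate_responses(Y, m_X)
print(Y_star)  # entries 0..6 become 15 unless already > 2*sigma from m(X)
```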
To further assess the impact of the sample size n on the parameter estimates, we display boxplots of \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,8\), using the \({\text{PMR}}\) selection method for the weighted-\(L_1\) penalty, in Fig. 3 using \(n=200\) and Fig. 4 using \(n=100\), respectively. The comparison supports the consistency of both the classical and robust estimates of large dimensional model parameters for clean data as n increases, in addition to the stability of the robust estimates under a small amount of contaminated outliers.
1.2.2 Real data analysis
We consider the classification of the Colon cancer data discussed in Alon et al. (1999) and available at http://genomics-pubs.princeton.edu/oncology/. It consists of 2000 genes and 62 samples, where 22 samples are from normal colon tissues and 40 samples are from tumor tissues. Similar to the analysis in Sect. 7, the data set is randomly split into two parts, with 45 samples used for training and the remaining 17 for testing. Table 9 summarizes the average test error (\(\text{TE}\)) and the average number of selected genes over 100 random splits. We observe that the robust procedures tend to select fewer genes than the non-robust procedures, with little increase in test error. This lends further support to the practicality of the proposed penalized robust-\({\text{BD}}\) estimation.
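The repeated random-split protocol can be sketched as follows. This is an illustrative Python harness, not the authors' MATLAB pipeline: `fit_predict` is a hypothetical placeholder for any classifier (here a trivial majority-vote rule stands in for the penalized robust-\({\text{BD}}\) fit), and the toy data only mimic the 22-vs-40 class sizes of the Colon data.

```python
import numpy as np

def repeated_split_test_error(X, y, fit_predict, n_train=45, n_splits=100, seed=0):
    """Average test misclassification error over random train/test splits,
    mirroring the 45/17 splitting protocol used for the Colon data.
    fit_predict(X_tr, y_tr, X_te) is any user-supplied classifier.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        y_hat = fit_predict(X[tr], y[tr], X[te])
        errors.append(np.mean(y_hat != y[te]))
    return float(np.mean(errors))

def majority_vote(X_tr, y_tr, X_te):
    # Trivial stand-in classifier: predict the majority training label.
    label = int(np.mean(y_tr) >= 0.5)
    return np.full(len(X_te), label)

rng = np.random.default_rng(1)
X = rng.normal(size=(62, 10))
y = np.array([0] * 22 + [1] * 40)  # 22 normal vs 40 tumor, as in the Colon data
err = repeated_split_test_error(X, y, majority_vote)
print(err)
```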
1.3 Numerical procedure for penalized robust-\({\text{BD}}\) estimator in (3)
1.3.1 Optimization algorithm
Numerically, the penalized robust-\({\text{BD}}\) estimators in (3) combined with penalties used in Sects. 6 and 7 are implemented by extending the coordinate descent (\({\text{CD}}\)) iterative algorithm (Friedman et al., 2010), with the initial value \((b,0,\ldots ,0)^T\), where \(b=\log \{(\overline{Y}_n+0.1)/(1-\overline{Y}_n+0.1)\}\) and \(b=\log (\overline{Y}_n+0.1)\) for Bernoulli and count responses, respectively, using the sample mean \(\overline{Y}_n\) of \(\{Y_i\}_{i=1}^n\). Namely, the loss term
in (3) is locally approximated by a weighted form of quadratic loss functions, and the optimization solution of (3) is obtained by the \({\text{CD}}\) method. Particularly, the gradient vector and Hessian matrix of \(L({\widetilde{{\varvec{\beta }}}})\) are
The quadratic approximation is supported by the fact that the Hessian matrix of \(L({\widetilde{{\varvec{\beta }}}})\) evaluated at the true parameter vector \({\widetilde{{\varvec{\beta }}}}_0\) is
which, combined with the property \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0) \mid {\varvec{X}}\}\ge 0\) discussed in part (d) of Sect. 2.2, indicates that with probability tending to one, the matrix \(L''({\widetilde{{\varvec{\beta }}}}_0)\) is positive semidefinite.
Both \(\text{p}_{_1}(y;\theta )\) and \(\text{p}_{_2}(y;\theta )\) in \(L'({\widetilde{{\varvec{\beta }}}})\) and \(L''({\widetilde{{\varvec{\beta }}}})\) are calculated using (9), which incorporates the Huber and Tukey \(\psi\)-functions, whose derivative \(\psi '(r)\) can be replaced by a subgradient or a smooth approximation.
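For concreteness, the Huber and Tukey \(\psi\)-functions and their (sub)derivatives can be coded as below. This is an illustrative Python sketch; the tuning constants 1.345 and 4.685 are the conventional 95%-efficiency defaults and need not match the paper's choices.

```python
import numpy as np

def huber_psi(r, c=1.345):
    # Huber psi: identity on [-c, c], clipped at +/- c outside.
    return np.clip(r, -c, c)

def huber_psi_prime(r, c=1.345):
    # Subgradient of Huber psi: 1 inside [-c, c], 0 outside.
    return (np.abs(r) <= c).astype(float)

def tukey_psi(r, c=4.685):
    # Tukey biweight psi: r * (1 - (r/c)^2)^2 on [-c, c], 0 outside (redescending).
    inside = np.abs(r) <= c
    return np.where(inside, r * (1 - (r / c) ** 2) ** 2, 0.0)

def tukey_psi_prime(r, c=4.685):
    # Derivative of Tukey psi: (1 - u)(1 - 5u) with u = (r/c)^2 inside, 0 outside.
    inside = np.abs(r) <= c
    u = (r / c) ** 2
    return np.where(inside, (1 - u) * (1 - 5 * u), 0.0)

r = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(huber_psi(r))   # clipped at +/- 1.345 for the two extreme residuals
print(tukey_psi(r))   # extreme residuals map to exactly 0
```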
1.3.2 Pseudo codes, source codes and computational complexity analysis
Algorithm 1 summarizes the complete procedure for numerically solving the “penalized robust-\({\text{BD}}\) estimator” in (3).
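Although Algorithm 1 itself is not reproduced in this excerpt, its inner \({\text{CD}}\) step can be illustrated on the quadratic-loss special case, where the local weighted quadratic approximation is exact; the general robust-\({\text{BD}}\) case iterates such weighted subproblems. The following Python sketch (the authors provide MATLAB code) uses function names of our choosing and solves the weighted-\(L_1\) penalized quadratic loss by coordinate-wise soft-thresholding.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * |.|: shrink z toward 0 by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_weighted_lasso(X, y, lam, w, n_iter=200):
    """Coordinate descent for the quadratic-loss special case of (3):
    min over (b0, b) of (2n)^{-1} ||y - b0 - X b||^2 + lam * sum_j w_j |b_j|.
    """
    n, p = X.shape
    b0, b = np.mean(y), np.zeros(p)      # intercept starts at the sample mean
    col_sq = np.sum(X ** 2, axis=0) / n  # n^{-1} x_j^T x_j for each column
    for _ in range(n_iter):
        r = y - b0 - X @ b
        b0 += np.mean(r)                 # unpenalized intercept update
        r = y - b0 - X @ b
        for j in range(p):
            r += X[:, j] * b[j]          # partial residual without coordinate j
            zj = X[:, j] @ r / n
            b[j] = soft_threshold(zj, lam * w[j]) / col_sq[j]
            r -= X[:, j] * b[j]
    return b0, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)
b0, b = cd_weighted_lasso(X, y, lam=0.1, w=np.ones(10))
print(b0, b[0])  # close to the true intercept 1 and slope 2 (slightly shrunk)
```

The per-coordinate update is the standard soft-thresholding step; the extension in Algorithm 1 replaces the quadratic loss by the robust-\({\text{BD}}\) local quadratic approximation with observation weights built from \(\text{p}_{_1}\) and \(\text{p}_{_2}\).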
To illustrate the computational complexity, Tables 10 and 11 compare the runtimes of the non-robust and robust procedures. All computations are performed using MATLAB (Version 9.12.0.1956245 (R2022a) Update 2) on Windows 11 with a 12th Gen Intel(R) Core(TM) i9-12900 (2400 MHz, 16 cores, 24 logical processors). MATLAB source codes are available at GitHub https://github.com/ChunmingZhangUW/Robust_penalized_BD_high_dim_GLM. For either clean or contaminated data, the computational cost depends on the type of response variable, the dimensionality and the procedure: Poisson-type responses are more computationally intensive than Gaussian responses; robust procedures are slower than their non-robust counterparts; and higher-dimensional settings demand more computation than lower-dimensional ones.
Zhang, C., Zhu, L. & Shen, Y. Robust estimation in regression and classification methods for large dimensional data. Mach Learn 112, 3361–3411 (2023). https://doi.org/10.1007/s10994-023-06349-2