Penalised robust estimators for sparse and high-dimensional linear models

Abstract

We introduce a new class of robust M-estimators for simultaneous parameter estimation and variable selection in high-dimensional regression models. We first explain the motivation for the key ingredient of our procedures, which is inspired by regularization methods used in wavelet thresholding for noisy signal processing. The derived penalized estimation procedures are shown to enjoy the oracle property, both in the classical finite-dimensional case and in the high-dimensional case where the number of variables p is not fixed but may grow with the sample size n, and to achieve optimal asymptotic rates of convergence. A fast accelerated proximal gradient algorithm of coordinate descent type is proposed and implemented for computing the estimates; it turns out to be remarkably efficient in solving the corresponding regularization problems, including ultra high-dimensional settings where \(p \gg n\). Finally, an extensive simulation study and some real data analyses compare several recent M-estimation procedures with the ones proposed in the paper, and demonstrate their utility and advantages.

References

  1. Alfons A (2014) perryExamples: examples for integrating prediction error estimation into regression models. R package version 0.1.0

  2. Alfons A (2016) robustHD: robust methods for high-dimensional data. R package version 0.5.1

  3. Alfons A, Croux C, Gelper S (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 7(1):226–248

  4. Antoniadis A (2007) Wavelet methods in statistics: some recent developments and their applications. Stat Surv 1:16–55

  5. Antoniadis A (2010) Comments on: $\ell _1$-penalization for mixture regression models [mr2677722]. TEST 19(2):257–258

  6. Antoniadis A, Fan J (2001) Regularization of wavelet approximations. J Am Stat Assoc 96(455):939–967 with discussion and a rejoinder by the authors

  7. Antoniadis A, Gijbels I, Nikolova M (2011) Penalized likelihood regression for generalized linear models with nonquadratic penalties. Ann Inst Stat Math 63(3):585–615

  8. Arslan O (2012) Weighted lad-Lasso method for robust parameter estimation and variable selection in regression. Comput Stat Data Anal 56(6):1952–1965

  9. Avella Medina MA, Ronchetti E (2014) Robust and consistent variable selection for generalized linear and additive models. Technical report 310, University of Geneva

  10. Belloni A, Chernozhukov V (2011) $\ell _1$-penalized quantile regression in high-dimensional sparse models. Ann Stat 39(1):82–130

  11. Belloni A, Chernozhukov V, Wang L (2011) Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4):791–806

  12. Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732

  13. Bradic J, Fan J, Wang W (2011) Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. J R Stat Soc Ser B Stat Methodol 73(3):325–349

  14. Breheny P (2018) ncvreg: regularization paths for SCAD and MCP penalized regression models. R package version 3.11-0

  15. Bunea F (2008) Consistent selection via the Lasso for high dimensional approximating regression models. In: Clarke B, Ghosal S (eds) Pushing the limits of contemporary statistics: contributions in Honor of Jayanta K. Ghosh, Collections, vol 3. Institute of Mathematical Statistics, Beachwood, pp 122–137. https://doi.org/10.1214/074921708000000101

  16. Bunea F, Tsybakov A, Wegkamp M (2007) Sparsity oracle inequalities for the Lasso. Electron J Stat 1:169–194

  17. Cerioli A, Riani M, Atkinson AC, Corbellini A (2018) The power of monitoring: how to make the most of a contaminated multivariate sample. Stat Methods Appl 27:589–594

  18. Chang X, Qu L (2004) Wavelet estimation of partially linear models. Comput Stat Data Anal 47(1):31–48

  19. Chen Z, Tang ML, Gao W, Shi NZ (2014) New robust variable selection methods for linear regression models. Scand J Stat 41(3):725–741

  20. Cohen Freue GV, Kepplinger D, Salibian-Barrera M, Smucler E (2018) Proteomic biomarker study using novel robust penalized elastic net estimators. Ann Appl Stat (submitted)

  21. Dennis JEJ, Welsch RE (1978) Techniques for nonlinear least squares and robust regression. Commun Stat Simul Comput 7(4):345–359

  22. Donoho D, Huber PJ (1983) The notion of breakdown point. In: A Festschrift for Erich L. Lehmann, Wadsworth Statist./Probab. Ser. Wadsworth, Belmont, pp 157–184

  23. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  24. Fadili J, Bullmore E (2005) Penalized partially linear models using sparse representation with an application to FMRI time series. IEEE Trans Signal Process 53(9):3436–3448

  25. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  26. Fan J, Li Q, Wang Y (2017) Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J R Stat Soc Ser B Stat Methodol 79(1):247–265

  27. Fu A, Narasimhan B, Diamond S, Miller J (2019) CVXR: disciplined convex optimization. J Stat Softw (to appear)

  28. Gannaz I (2007) Robust estimation and wavelet thresholding in partially linear models. Stat Comput 17(4):293–310

  29. Gijbels I, Vrinssen I (2015) Robust nonnegative garrote variable selection in linear regression. Comput Stat Data Anal 85:1–22

  30. Gijbels I, Verhasselt A, Vrissen I (2017) Consistency and robustness properties of the s-nonnegative Garrote estimator. Statistics 51(4):921–947

  31. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. Wiley, New York, the approach based on influence functions

  32. Harrison D, Rubinfeld D (1978) Hedonic prices and the demand for clean air. J Environ Econ Manag 5:81–102

  33. Huang J, Ma S, Zhang CH (2008) Adaptive Lasso for sparse high-dimensional regression models. Stat Sin 18(4):1603–1618

  34. Huber PJ (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101

  35. Huber PJ (1981) Robust statistics. Wiley Series in Probability and Mathematical Statistics. Wiley, New York

  36. Janssens KH, Deraedt I, Schalm O, Veeckman J (1998) Composition of 15–17th century archaeological glass vessels excavated in Antwerp, Belgium. Mikrochim Acta [Suppl] 15:253–267

  37. Javanmard A, Montanari A (2014) Confidence intervals and hypothesis testing for high-dimensional regression. J Mach Learn Res 15(1):2869–2909

  38. Khan JA, Van Aelst S, Zamar RH (2007) Robust linear model selection based on least angle regression. J Am Stat Assoc 102(480):1289–1299

  39. Knight K, Fu W (2000) Asymptotics for Lasso-type estimators. Ann Stat 28(5):1356–1378

  40. Kong D, Bondell H, Wu Y (2018) Fully efficient robust estimation, outlier detection, and variable selection via penalized regression. Stat Sin 28:1031–1062

  41. Kraemer N, Schaefer J (2014) parcor: regularized estimation of partial correlation matrices. R package version 0.2-6

  42. Lambert-Lacroix S, Zwald L (2011) Robust regression through the Huber’s criterion and adaptive Lasso penalty. Electron J Stat 5:1015–1053

  43. Loh PL (2017) Statistical consistency and asymptotic normality for high-dimensional robust $M$-estimators. Ann Stat 45(2):866–896

  44. Loh PL, Wainwright MJ (2012) High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann Stat 40(3):1637–1664

  45. Loh PL, Wainwright MJ (2015) Regularized m-estimators with nonconvexity: statistical and algorithmic theory for local optima. J Mach Learn Res 16:559–616

  46. Maechler M et al (2017) robustbase: basic robust statistics. R package version 0.92-8

  47. Maronna RA (2011) Robust ridge regression for high-dimensional data. Technometrics 53(1):44–53 supplementary materials available online

  48. Maronna RA, Yohai VJ (1981) Asymptotic behavior of general $M$-estimates for regression and scale with random carriers. Z Wahrsch Verw Gebiete 58(1):7–20

  49. Meinshausen N, Yu B (2009) Lasso-type recovery of sparse representations for high-dimensional data. Ann Stat 37(1):246–270

  50. Negahban SN, Ravikumar P, Wainwright MJ, Yu B (2012) A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Stat Sci 27(4):538–557

  51. Nesterov Y (2007) Gradient methods for minimizing composite objective function. Discussion Paper 2007076, Center for Operations Research and Econometrics (CORE). Université Catholique de Louvain

  52. Pace RK, Gilley OW (1997) Using the spatial configuration of the data to improve estimation. J Real Estate Finance Econ 14:333–340

  53. Rey W (1983) Introduction to robust and quasi-robust statistical methods. Springer, Berlin

  54. Rodriguez P (2017) A two-term penalty function for inverse problems with sparsity constrains. EUSIPCO 17:2185–2189

  55. Rosset S, Zhu J (2004) Discussion on least angle regression. Ann Stat 32(2):459–475

  56. Rousseeuw P, Yohai V (1984) Robust regression by means of S-estimators. In: Robust and nonlinear time series analysis (Heidelberg, 1983), Lect. Notes Stat., vol 26, Springer, New York, pp 256–272

  57. She Y, Owen AB (2011) Outlier detection using nonconvex penalized regression. J Am Stat Assoc 106(494):626–639

  58. Smucler E, Yohai VJ (2017) Robust and sparse estimators for linear regression models. Comput Stat Data Anal 111(C):116–130

  59. Städler N, Bühlmann P, van de Geer S (2010) $\ell _1$-penalization for mixture regression models. TEST 19(2):209–256

  60. Sun T, Zhang CH (2010) Comments on: $\ell _1$-penalization for mixture regression models [mr2677722]. TEST 19(2):270–275

  61. Sun T, Zhang CH (2012) Scaled sparse linear regression. Biometrika 99(4):879–898

  62. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58(1):267–288

  63. Tukey JW (1960) A survey of sampling from contaminated distributions. Contributions to probability and statistics. Stanford University Press, Stanford, pp 448–485

  64. van de Geer SA (2008) High-dimensional generalized linear models and the Lasso. Ann Stat 36(2):614–645

  65. Wainwright MJ (2009) Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans Inf Theory 55(12):5728–5741

  66. Wang H, Leng C (2007) Unified Lasso estimation by least squares approximation. J Am Stat Assoc 102:1039–1048

  67. Wang H, Li G, Jiang G (2007) Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J Bus Econom Stat 25(3):347–355

  68. Wang L (2013) The $L_1$ penalized LAD estimator for high dimensional linear regression. J Multivar Anal 120:135–151

  69. Wang Z, Liu H, Zhang T (2014) Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Ann Stat 42(6):2164–2201

  70. Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942

  71. Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429

  72. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320

Acknowledgements

The authors thank the Editor and a referee for their constructive comments and helpful suggestions, which improved the paper. They would also like to thank E. Smucler for sharing the archaeological dataset used in the examples. Part of this work was completed while A. Antoniadis and I. Gijbels were visiting the Istituto per le Applicazioni del Calcolo “M. Picone”, National Research Council, Naples, Italy. I. Gijbels gratefully acknowledges financial support from the GOA/12/014 project of the Research Fund KU Leuven, Belgium.

Author information

Corresponding author

Correspondence to Italia De Feis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (ZIP 123 kb)

Appendices

Appendix 1: Definitions of \(\rho\), \(\psi\) and thresholding \(\delta\) functions (used in this paper)

Preamble

Unless otherwise stated, most of the following definitions are standard, but a few differ, sometimes only slightly, in the way the functions are standardised. To avoid confusion, we first define \(\psi\)- and \(\rho\)-functions.

Definition 1

A \(\psi\)-function is a piecewise continuous function \(\psi : \mathbb {R}\rightarrow \mathbb {R}\) such that

  1. \(\psi\) is odd, i.e., \(\psi (-x) = -\psi (x) \ \forall x\),

  2. \(\psi (x) \ge 0\) for \(x \ge 0\), and \(\psi (x) > 0\) for \(0< x < x_r := \sup \{\tilde{x} : \psi (\tilde{x}) > 0\}\) (\(x_r > 0\), possibly \(x_r = \infty\)),

  3. its slope is 1 at 0, i.e., \(\displaystyle \psi '(0) = 1\).

Note that (3) is not strictly required mathematically, but we use it as a standardisation in those cases where \(\psi\) is continuous at 0. It then also follows (from (1)) that \(\psi (0) = 0\), and we require \(\psi (0)=0\) also in the case where \(\psi\) is discontinuous at 0, as it is, e.g., for the M-estimator defining the median.

Definition 2

A \(\rho\)-function can be represented by the following integral of a \(\psi\)-function,

$$\begin{aligned} \rho (x) = \int _0^x \psi (u) du\; \end{aligned}$$
(8.1)

which entails that \(\rho (0) = 0\) and \(\rho\) is an even function.

A \(\psi\)-function is called redescending if \(\psi (x) = 0\) for all \(x \ge x_r\), with \(x_r < \infty\); \(x_r\) is often called the rejection point. To a redescending \(\psi\)-function one may associate a loss function \(\tilde{\rho }\), a version of \(\rho\) standardised so as to attain maximum value one. Formally,

$$\begin{aligned} \tilde{\rho }(x) = \rho (x)/\rho (\infty ). \end{aligned}$$
(8.2)

Note that \(\rho (\infty ) = \rho (x_r) \equiv \rho (x), \ \forall \left| x \right| \ge x_r\). \(\tilde{\rho }\) is a \(\rho\)-function as defined in Maronna (2011) and has been called \(\chi\) function in other contexts. For example, in package robustbase (see Maechler et al. 2017) Mchi(x, *) computes \(\tilde{\rho }(x)\), whereas Mpsi(x, *, deriv=-1) (“(-1)-st derivative” is the primitive or antiderivative) computes \(\rho (x)\), both according to the above definitions.
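As a quick numerical illustration of (8.2), the following R sketch (assuming the robustbase functions Mpsi, Mchi and MrhoInf behave as documented in that package) verifies that Mchi(x, *) coincides with Mpsi(x, *, deriv=-1) rescaled by \(\rho (\infty )\) for the bisquare family.

```r
## Sketch: numerical check of relation (8.2), assuming the documented robustbase API
library(robustbase)
x  <- seq(-8, 8, length.out = 401)
cc <- 4.685                                  # bisquare tuning constant (95% efficiency)
rho       <- Mpsi(x, cc = cc, psi = "bisquare", deriv = -1)  # unstandardised rho(x)
rho_tilde <- Mchi(x, cc = cc, psi = "bisquare")              # rho standardised to max 1
rho_inf   <- MrhoInf(cc = cc, psi = "bisquare")              # rho(Inf)
all.equal(rho_tilde, rho / rho_inf)          # should be TRUE up to numerical tolerance
```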

Weakly redescending \(\psi\) functions. Note that the above definition requires a finite rejection point \(x_r\). However, there exist functions \(\psi (\cdot )\) with \(x_r=\infty\), e.g. \(\psi _C(x) := s(x)/2\), where \(s(x) = 2x/(1+x^2)\) is the score function of the Cauchy (\(= t_1\)) distribution; hence \(\psi _C(\cdot )\) is not a redescending \(\psi\)-function in the above sense. For this reason we call \(\psi\)-functions fulfilling \(\lim _{x\rightarrow \infty }\psi (x) = 0\) weakly redescending. They naturally fall into two subcategories: those with a finite \(\rho\)-limit \(\rho (\infty ) := \lim _{x\rightarrow \infty }\rho (x)\), and those for which \(\rho (x)\) is unbounded even though \(\rho ' = \psi\) tends to zero.

Note An alternative slightly more general definition of redescending would only require \(\rho (\infty ) := \lim _{x\rightarrow \infty }\rho (x)\) to be finite.

Monotone \(\psi\)-functions

Monotone \(\psi\)-functions lead to convex \(\rho\)-functions, so that the corresponding M-estimators are uniquely defined. Historically, the "Huber function", proposed by Huber (1964), was the first \(\psi\)-function.

Huber

The family of Huber functions is defined as

$$\begin{aligned} \rho _M(x) = {}&\left\{ \begin{array}{ll} \frac{1}{2} x^2 &{} \text{ if } \left| x \right| \le M \\ M \left(\left| x \right| - \frac{M}{2}\right)&{} \text{ if } \left| x \right|> M, \end{array} \right. \\ \psi _M(x) = {}&\left\{ \begin{array}{ll} x &{} \text{ if } \left| x \right| \le M \\ M \ {{\,\mathrm{sign}\,}}(x)&{} \text{ if } \left| x \right| > M. \end{array} \right. \end{aligned}$$

The constant M for \(95\%\) efficiency of the regression estimator is 1.345 (Fig.  8).
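As a concrete illustration, the following minimal R sketch (the helper names rho_huber and psi_huber are ours, not taken from any package used in the paper) implements the two formulas above and checks numerically that \(\psi _M\) is the derivative of \(\rho _M\).

```r
## Minimal sketch of the Huber rho- and psi-functions defined above
rho_huber <- function(x, M = 1.345) ifelse(abs(x) <= M, 0.5 * x^2, M * (abs(x) - M / 2))
psi_huber <- function(x, M = 1.345) ifelse(abs(x) <= M, x, M * sign(x))

## finite-difference check that psi_M = rho_M' at a few points
x <- c(-3, -1, 0.5, 2)
max(abs(psi_huber(x) - (rho_huber(x + 1e-6) - rho_huber(x - 1e-6)) / 2e-6))  # ~ 0
```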

Fig. 8 Huber family of functions using tuning parameter \(M=\lambda = 1.345\)

Redescenders

All the \(\psi\)-functions below, unless stated otherwise, are redescending, i.e. they have a finite “rejection point” \(x_r = \sup \{t; \psi (t) > 0\} < \infty\). We recall their definitions and visualize them in the following subsections.

Tukey’s bisquare

Tukey’s bisquare (aka “biweight”) family of functions (see Tukey 1960) is defined as

$$\begin{aligned} \tilde{\rho }_M(x) = \left\{ \begin{array}{cl} 1 - \bigl (1 - (x/M )^2 \bigr )^3 &{} \text{ if } \left| x \right| \le M \\ 1 &{} \text{ if } \left| x \right| > M, \end{array} \right. \end{aligned}$$

with derivative \({\tilde{\rho }_M }'(x) = 6\psi _M (x)/M ^2\), where

$$\begin{aligned} \psi _M (x) = x \left( 1 - \left( \frac{x}{M }\right) ^2\right) ^2 I_{\{\left| x \right| \le M \}}. \end{aligned}$$

The constant M for \(95\%\) efficiency of the regression estimator is 4.685 and the constant for a breakdown point of 0.5 of the S-estimator is 1.548 (Fig. 9).
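The following R sketch (the helper names rho_tilde_bisq and psi_bisq are ours) encodes the two formulas above and checks numerically the stated relation \({\tilde{\rho }_M }'(x) = 6\psi _M (x)/M ^2\).

```r
## Sketch: bisquare rho-tilde and psi, and a check of rho_tilde'(x) = 6 psi(x) / M^2
rho_tilde_bisq <- function(x, M = 4.685) ifelse(abs(x) <= M, 1 - (1 - (x / M)^2)^3, 1)
psi_bisq       <- function(x, M = 4.685) x * (1 - (x / M)^2)^2 * (abs(x) <= M)

x <- seq(-6, 6, by = 0.25)
num_deriv <- (rho_tilde_bisq(x + 1e-6) - rho_tilde_bisq(x - 1e-6)) / 2e-6
max(abs(num_deriv - 6 * psi_bisq(x) / 4.685^2))  # ~ 0
```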

Fig. 9 Bisquare family of functions using tuning parameter \(M=\lambda\)

Hampel

The Hampel family of functions (see Hampel et al. 1986) is defined as

$$\begin{aligned}\tilde{\rho }_{a, b, r}(x) ={}&\left\{ \begin{array}{ll} \frac{1}{2} x^2/C &{} \left| x \right| \le a \\ \left( \frac{1}{2}a^2 + a(\left| x \right| -a)\right) /C &{} a< \left| x \right| \le b \\ \frac{a}{2}\left( 2b - a + (\left| x \right| - b) \left( 1 + \frac{r - \left| x \right| }{r-b}\right) \right) /C &{} b< \left| x \right| \le r \\ 1 &{} r< \left| x \right|, \end{array} \right. \\ \psi _{a, b, r}(x) ={}&\left\{ \begin{array}{ll} x &{} \left| x \right| \le a \\ a \ {{\,\mathrm{sign}\,}}(x) &{} a< \left| x \right| \le b \\ a \ {{\,\mathrm{sign}\,}}(x) \frac{r - \left| x \right| }{r - b}&{} b< \left| x \right| \le r \\ 0 &{} r < \left| x \right|, \end{array} \right. \end{aligned}$$

where \(C := \rho (\infty ) = \rho (r) = \frac{a}{2}\left( 2b - a + (r - b) \right) = \frac{a}{2}(b-a + r)\).

By this standardization, \(\psi\) has slope 1 in the center. The slope of the redescending part (\(x\in [b,r]\)) is \(-a/(r-b)\). If it is set to \(-\frac{1}{2}\), as recommended sometimes, one has

$$\begin{aligned} r = 2a + b. \end{aligned}$$

When restricting ourselves to a two-parameter family of Hampel functions with \(a = b=M\) and \(r = \gamma M\), where \(\gamma > 1\) (so that the redescending slope is \(-1/(\gamma -1)\)), and varying M to get the desired efficiency or breakdown point, the resulting functions are those associated with the MCP penalties of Zhang (2010).

The constant M for \(95\%\) efficiency of the regression estimator is 0.9016085 and the one for a breakdown point of 0.5 of the S-estimator is 0.2119163 (Fig. 10).
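To make this connection concrete, the following R sketch (the helper name psi_hampel is ours, and \(\gamma = 4\) is only an illustrative value) implements the Hampel \(\psi\) above with \(a=b=M\), \(r=\gamma M\) and checks that the slope on the redescending part equals \(-1/(\gamma -1)\).

```r
## Sketch: two-parameter Hampel psi (a = b = M, r = gamma * M) as described above
psi_hampel <- function(x, a, b, r) {
  ax <- abs(x)
  sign(x) * ifelse(ax <= a, ax,
            ifelse(ax <= b, a,
            ifelse(ax <= r, a * (r - ax) / (r - b), 0)))
}
M <- 0.9016085; gamma <- 4                        # gamma chosen only for illustration
x <- seq(M, gamma * M, length.out = 50)           # the redescending part [b, r]
range(diff(psi_hampel(x, a = M, b = M, r = gamma * M)) / diff(x))  # ~ -1/(gamma - 1) = -1/3
```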

Fig. 10 MCP family of functions using tuning parameters \(M=\lambda\) and \(\gamma\)

Weak redescenders

Cauchy loss

The Cauchy loss has also been promoted as the "Lorentzian merit function" in regression for outlier detection. We have

$$\begin{aligned} \rho _M(u) = \frac{M^2}{2} \log \left( 1 + \frac{u^2}{M^2}\right) . \end{aligned}$$

Note that \(\rho _M\) is nonconvex. When \(M= 1\), the function \(\rho _M(u)\) is proportional to the negative log-likelihood of the t-distribution with one degree of freedom (a heavy-tailed distribution). This suggests that for heavy-tailed error distributions nonconvex loss functions may be preferable from the point of view of statistical efficiency, although optimisation becomes more difficult. For the Cauchy loss, we have

$$\begin{aligned} \psi _M(u) = \frac{u}{1 + u^2/M^2}, \quad {\text {and}} \quad \psi '_M(u) = \frac{1 - u^2/M^2}{(1 + u^2/M^2)^2}. \end{aligned}$$

In particular, \(|\psi _M(u)|\) is maximized when \(u^2 = M^2\), so \(\Vert \psi _M\Vert _\infty \le \frac{M}{2}\). We may also check that \(\Vert \psi '_M\Vert _\infty \le 1\) and \(\Vert \psi ''_M\Vert _\infty \le \frac{3}{2M}\). The Cauchy \(\psi\)-functions fulfill \(\lim _{x\rightarrow \infty }\psi (x) = 0\) so they are weakly redescending. The constant M for \(95\%\) efficiency of the regression estimator is 2.3849 (see Rey 1983; Fig. 11).
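The bounds just stated are easy to check numerically; the sketch below (the helper names are ours) does so on a fine grid.

```r
## Sketch: numerical check of the stated bounds for the Cauchy psi-function
psi_cauchy  <- function(u, M) u / (1 + u^2 / M^2)
dpsi_cauchy <- function(u, M) (1 - u^2 / M^2) / (1 + u^2 / M^2)^2
M <- 2.3849
u <- seq(-50, 50, length.out = 1e5)
c(max(abs(psi_cauchy(u, M))),  M / 2)   # sup |psi_M| is attained at u = +/- M and equals M/2
c(max(abs(dpsi_cauchy(u, M))), 1)       # sup |psi_M'| equals 1 (attained at u = 0)
```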

Fig. 11 Cauchy family of functions using tuning parameter \(M=\lambda\)

Nonnegative garrote

The NNG functions are defined as (see Rodriguez 2017),

$$\rho _M(x) = \left\{ \begin{array}{ll} 0.5 x^2 &{} \text{ if } \left| x \right| \le M \\ 0.5 M^2 \left(1 + 2 \log \left( \frac{|x|}{M}\right)\right) &{} \text{ if } \left| x \right| > M, \end{array} \right.$$
$$\psi _M(x) = \left\{ \begin{array}{ll} x &{} \text{ if } \left| x \right| \le M \\ \frac{M^2}{x} &{} \text{ if } \left| x \right| > M. \end{array} \right.$$

The constant M for \(95\%\) efficiency of the regression estimator is 2.0 and the constant for a breakdown point of 0.5 of the S-estimator is 0.199 (Fig. 12).

Fig. 12 Nonnegative garrote family of functions using tuning parameter \(M=\lambda\)

In particular, \(|\psi _M(x)|\) is maximised at \(|x| = M\), so \(\Vert \psi _M\Vert _{\infty } \le M\). We may also check that \(\Vert \psi ^\prime _M\Vert _{\infty } \le 1\) and that \(\Vert \psi ^{\prime \prime }_M\Vert _{\infty }\) is finite. The NNG \(\psi\)-functions also fulfill \(\lim _{x\rightarrow \infty }\psi (x) = 0\), so they are weakly redescending.
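The following R sketch (the helper names rho_nng and psi_nng are ours) encodes the NNG formulas above, with \(\psi _M\) taken odd as required by Definition 1, and checks continuity at \(|x| = M\) together with the bound \(\Vert \psi _M\Vert _{\infty } = M\).

```r
## Sketch: NNG rho and psi as defined above (psi taken odd: M^2 / x for |x| > M)
rho_nng <- function(x, M = 2) ifelse(abs(x) <= M, 0.5 * x^2,
                                     0.5 * M^2 * (1 + 2 * log(abs(x) / M)))
psi_nng <- function(x, M = 2) ifelse(abs(x) <= M, x, M^2 / x)

M <- 2
c(rho_nng(M - 1e-8), rho_nng(M + 1e-8))            # both ~ 0.5 * M^2: rho continuous at M
c(psi_nng(M - 1e-8), psi_nng(M + 1e-8))            # both ~ M: psi continuous at M
max(abs(psi_nng(seq(-50, 50, length.out = 1e5))))  # ~ M
```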

Welsh

The Welsh functions (see Dennis and Welsch 1978) are defined as,

$$\begin{aligned} \rho _M(x) ={}&1 - \exp \bigl (- \left( x/M\right) ^2/2\bigr ) \\ \psi _M(x) ={}&M^2\rho '_M(x) = x\exp \bigl (- \left( x/M\right) ^2/2\bigr ) \\ \psi '_M(x) ={}&\bigl (1 - \bigl (x/M\bigr )^2\bigr ) \exp \bigl (- \left( x/M\right) ^2/2\bigr ) \end{aligned}$$

The constant M for \(95\%\) efficiency of the regression estimator is 2.9846 (see Rey 1983) and the constant for a breakdown point of 0.5 of the S-estimator is 0.577 (Fig. 13).

Fig. 13 Welsh family of functions using tuning parameter \(M=\lambda\)

The Welsh family does not have a finite rejection point, but its \(\rho\) is bounded, and hence \(\rho (\infty )\) is well defined.
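A short R sketch (the helper names rho_welsh and psi_welsh are ours) illustrates this: \(\rho _M\) levels off at \(\rho (\infty ) = 1\) while \(\psi _M\) only decays towards zero, without vanishing at any finite rejection point.

```r
## Sketch: Welsh rho and psi as defined above
rho_welsh <- function(x, M = 2.9846) 1 - exp(-(x / M)^2 / 2)
psi_welsh <- function(x, M = 2.9846) x * exp(-(x / M)^2 / 2)

x <- c(5, 10, 50, 100)
rho_welsh(x)   # approaches 1 = rho(Inf)
psi_welsh(x)   # decays towards 0 but is nonzero for every finite x
```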

Appendix 2: Technical proofs

Proof of Proposition 1

By definition, for any estimate \((\hat{{\varvec{\beta }}}, \hat{{\varvec{\gamma }}})\) minimising the criterion (1.4) with a penalty associated with a redescending or weakly redescending function, \(\hat{{\varvec{\gamma }}}\) is a fixed point of \({\varvec{\gamma }}=\delta _M(\mathbf {H}{\varvec{\gamma }}+ ({\varvec{I}}-\mathbf {H}){\varvec{y}})\), and \(\hat{{\varvec{\beta }}}=({\varvec{X}}^T{\varvec{X}})^{-1} {\varvec{X}}^T ({\varvec{y}}-\hat{{\varvec{\gamma }}})\), where \(\mathbf {H}= {\varvec{X}}({\varvec{X}}^T{\varvec{X}})^{-1} {\varvec{X}}^T\) is the hat matrix associated with \({\varvec{X}}\). It follows that

$$\begin{aligned} {\varvec{X}}^T\psi _M({\varvec{y}}- {\varvec{X}}\hat{{\varvec{\beta }}})&= {\varvec{X}}^T\psi _M\bigl ({\varvec{y}}-\mathbf {H}({\varvec{y}}-\hat{{\varvec{\gamma }}})\bigr )\\&={\varvec{X}}^T \bigl ((({\varvec{I}}-\mathbf {H}){\varvec{y}}+\mathbf {H}\hat{{\varvec{\gamma }}}) - \delta _M(({\varvec{I}}-\mathbf {H}){\varvec{y}}+\mathbf {H}\hat{{\varvec{\gamma }}})\bigr )\\&= {\varvec{X}}^T \bigl ( ({\varvec{I}}-\mathbf {H}){\varvec{y}}+ \mathbf {H}\hat{{\varvec{\gamma }}} -\hat{{\varvec{\gamma }}}\bigr ) \\&= {\varvec{X}}^T ({\varvec{I}}-\mathbf {H}) ({\varvec{y}}- \hat{{\varvec{\gamma }}}) = {\mathbf {0}}, \end{aligned}$$

and so \(\hat{{\varvec{\beta }}}\) is an M-estimate associated with \(\psi _M\). \(\square\)

Proof of Theorem 1

For any \({\varvec{\beta }}\in \mathbb {R}^p\) and any \(\sigma > 0\), we write \({\varvec{\beta }}={\varvec{\beta }}^* + \mathbf {u}_n /\sqrt{n}\) and \(\sigma =\sigma ^* + \delta _n/\sqrt{n}\), where \({\varvec{\beta }}^*\) and \(\sigma ^*\) are the true regression and scale parameters of the linear regression model (2.1). To avoid complicated notation we suppress the index n hereafter. Note that, by our assumptions, the sequences \(\mathbf {u}\) and \(\delta\) are bounded. The loss function involved in Theorem 1 is

$$\begin{aligned} J_n({{\varvec{\beta }}} ) = \frac{1}{n} \sum _{i=1}^n \rho _\lambda \left( \frac{(Y_i-{\varvec{X}}_i^T{\varvec{\beta }})}{\sigma } \right) + \mu _n \sum _{j=1}^p \hat{w}_j |\beta _j|, \end{aligned}$$

and may be replaced for the optimisation by a function of \(\mathbf {u}\) and \(\delta\) defined by

$$\begin{aligned} \varPsi _n(\mathbf {u},\delta ) = \sum _{i=1}^n \rho _\lambda \left( \frac{(Y_i-{\varvec{X}}_i^T({\varvec{\beta }}^* + \mathbf {u}/\sqrt{n}))}{\sigma ^* + \delta /\sqrt{n}} \right) + \sqrt{n} \mu _n \sum _{j=1}^p \hat{w}_j \sqrt{n} |\beta _j^* + u_j/ \sqrt{n} | . \end{aligned}$$
(8.3)

Let

$$\begin{aligned}G_n(\mathbf {u},\delta ) : = \varPsi _n(\mathbf {u},\delta )- \varPsi _n({\mathbf {0}},\delta ) =A_n(\mathbf {u},\delta ) + B_n(\mathbf {u},\delta ),\end{aligned}$$

where the first term \(A_n(\mathbf {u},\delta )\) collects the summation terms in \(G_n\) related to \(\rho _\lambda\), while the second term \(B_n\) collects those related to the weighted differences of absolute values. Since both \(\mathbf {u}\) and \(\delta\) are bounded, the properties of the robust losses involved allow a Taylor expansion with remainder up to order 2 (the remainders are denoted \(R_2\) and R below), which gives

$$\begin{aligned} A_n(\mathbf {u},\delta )& {}= -\frac{2\sqrt{n}}{\sigma ^*} \left( \frac{1}{n} \sum _{i=1}^n \psi _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i^T\right) \mathbf {u}+ \frac{1}{\sigma ^{*2}} \mathbf {u}^T \left( \frac{1}{n} \sum _{i=1}^n \psi ^\prime _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i{\varvec{X}}_i^T\right) \mathbf {u}\\&\quad + \frac{2}{\sigma ^{*2}} \left( \frac{1}{n} \sum _{i=1}^n \psi _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i^T + \frac{1}{n} \sum _{i=1}^n \frac{\epsilon _i}{\sigma ^*} \psi ^\prime _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i{\varvec{X}}_i^T\right) \mathbf {u}\, \delta \\&\quad + 2 \sum _{i=1}^n R_2 \left( \left(\frac{{\varvec{X}}_i \mathbf {u}}{\sqrt{n}} , \frac{\delta }{\sqrt{n}}\right) \right) - 2 \sum _{i=1}^n R \left( \frac{\delta }{\sqrt{n}}\right) .\\ \end{aligned}$$

Since \({{\mathbb {E}}}_{F_\epsilon } (\psi _\lambda (\epsilon ) ) = 0\) and \(\hbox {var}_{F_\epsilon } (\psi _\lambda (\epsilon ) ) < \infty\) the central limit theorem yields

$$\begin{aligned}\sqrt{n} \left( \frac{1}{n} \sum _{i=1}^n \psi _\lambda (\epsilon _i /\sigma ^*) {\varvec{X}}_i^T \right) \overset{d}{\rightarrow }\mathcal {N} ( {\mathbf {0}}, a(\psi _\lambda ,F_{\epsilon }) V).\end{aligned}$$

For all robust losses considered in the theorem, it is easy to see that \(\hbox {var}_{F_\epsilon } \left(\psi ^\prime _\lambda (\epsilon ) \right)\) is also finite, and therefore by assumption (A2), \(\hbox {var}_{F_\epsilon }\left( \frac{1}{n} \sum _{i=1}^n \psi ^\prime _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i{\varvec{X}}_i^T\right) \rightarrow {\mathbf {0}}\). It follows by the law of large numbers that

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \psi ^\prime _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i{\varvec{X}}_i^T \overset{p}{\rightarrow }V \, {{\mathbb {E}}}_{F_\epsilon } \left( \psi ^\prime _\lambda (\epsilon ) \right) \quad \hbox {and} \quad \frac{1}{n} \sum _{i=1}^n \psi _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i^T \overset{p}{\rightarrow }\,{\mathbf {0}}. \end{aligned}$$

For all robust losses considered in the theorem the derivatives \(\psi _\lambda ^\prime\) of the corresponding influence functions are even, and by symmetry around 0 of the distribution \(F_\epsilon\) we have \({{\mathbb {E}}}_{F_\epsilon }(\epsilon \, \psi ^\prime _\lambda (\epsilon /\sigma ^*))=0\) and \(\hbox {var}_{F_\epsilon } ( \epsilon \psi ^\prime _\lambda (\epsilon ) ) = {{\mathbb {E}}}_{F_\epsilon }(\epsilon ^2 \, \psi ^{\prime 2}_\lambda (\epsilon /\sigma ^*)) \le T\), where T is a finite number since \(\psi ^\prime _\lambda\) is bounded. It follows that

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \epsilon _i \psi ^\prime _\lambda (\epsilon _i/\sigma ^*) {\varvec{X}}_i^T \overset{p}{\rightarrow }\,{\mathbf {0}}\end{aligned}$$

and therefore the term involving the multiplication \(\mathbf {u}\delta\) in the expression of \(A_n(\mathbf {u},\delta )\) converges to 0 in probability as n goes to \(\infty\). By the remainder theorem on Taylor approximations, by Assumption (A2) and by the fact that both \(\mathbf {u}\) and \(\delta\) are bounded it follows easily that both \(\sum _{i=1}^n R_2 \left( \frac{{\varvec{X}}_i \mathbf {u}}{\sqrt{n}} , \frac{\delta }{\sqrt{n}} \right)\) and \(\sum _{i=1}^n R \left( \frac{\delta }{\sqrt{n}}\right)\) tend to 0 as n goes to \(\infty\). The above imply that

$$\begin{aligned}A_n(\mathbf {u},\delta ) \overset{d}{\rightarrow }\mathcal {N} \left( \frac{1}{\sigma ^{*2}} b(\psi _\lambda , F_\epsilon ) \mathbf {u}^T V \mathbf {u}, \frac{4}{\sigma ^{*2}} a(\psi _\lambda ,F_{\epsilon }) \mathbf {u}^T V \mathbf {u}\right) . \end{aligned}$$

Let us now consider the asymptotic behaviour of \(B_n(\mathbf {u},\delta )\). The analysis of this term closely follows the one by Zou (2006) for the adaptive Lasso. The only difference to be noted here is that we use weights based on MM-estimators of \({\varvec{\beta }}\) and a \(\sqrt{n}\)-consistent S-estimator \(\hat{s}_n\) of \(\sigma ^*\). It follows that \(\sqrt{n} (\hat{s}_n - \sigma ^*) = \delta _n \overset{p}{\rightarrow }0\) as n goes to \(\infty\) and also that \(\hat{w}_j = 1/ | \hat{\beta }_j^\mathrm{MM}| \overset{p}{\rightarrow }1/ | \beta ^*_j|\) for any \(\beta _j^* \ne 0\), by the consistency of the MM-estimates and preliminary S-estimates as discussed in Smucler and Yohai (2017). When \(\beta ^*_j = 0\), the corresponding term in \(B_n(\mathbf {u},\delta )\) is \(\sqrt{n} ( | \beta _j^* + \frac{u_j}{\sqrt{n}} | - | \beta _j^*|) = |u_j|\) and, since \(\sqrt{n} \hat{\beta }_j^\mathrm{MM} = \mathcal {O}_p(1)\), \(n \mu _n ( \sqrt{n} \hat{\beta }_j^\mathrm{MM})^{-1} |u_j| \overset{d}{\rightarrow }\infty\). We may then follow the analysis of Zou (2006) (see also Lambert-Lacroix and Zwald 2011) step by step for the term \(B_n(\mathbf {u},\delta )\) to obtain the asymptotic normality result (i) of the theorem.

Assertion (ii) follows similarly from KKT conditions satisfied by \(\hat{{\varvec{\beta }}}_{II}\), the above asymptotic normality result of \(\hat{{\varvec{\beta }}}\), the \(\sqrt{n}\)-consistency rates of the MM-estimators of \({\varvec{\beta }}^*\) and \(\sigma ^*\) and the behaviour of the influence functions of the robust losses used in the theorem. \(\square\)

About this article

Cite this article

Amato, U., Antoniadis, A., De Feis, I. et al. Penalised robust estimators for sparse and high-dimensional linear models. Stat Methods Appl 30, 1–48 (2021). https://doi.org/10.1007/s10260-020-00511-z

Keywords

  • Contamination
  • Outliers
  • High-dimensional regression
  • Variable selection
  • Wavelet thresholding
  • Nonconvex penalties
  • Regularization

Mathematics Subject Classification

  • Primary 62H12
  • 62G08
  • Secondary 62G10