
Degrees of freedom for regularized regression with Huber loss and linear constraints


Abstract

The ordinary least squares estimate for linear regression is sensitive to errors with large variance. It is not robust to heavy-tailed errors or outliers, which are commonly encountered in applications. In this paper, we propose to use a Huber loss function with a generalized penalty to achieve robustness in estimation and variable selection. The performance of estimation and variable selection can be further improved by incorporating any prior knowledge as constraints on parameters. A formula of degrees of freedom of the fit is derived, which is utilized in information criteria for model selection. Simulation studies and real examples are used to demonstrate the application of degrees of freedom and the performance of the model selection methods.
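For orientation, the estimator described above combines a Huber loss with a generalized \(\ell _{1}\) penalty and linear constraints on the coefficients. The display below is a generic sketch of such a formulation, written only to fix ideas; the Huber threshold \(M\), the penalty matrix \(D\), and the constraint pair \((C,d)\) are generic symbols and are not quoted from the paper's own equations.

$$\begin{aligned} \hat{\beta }=\arg \min _{\beta }\sum _{i=1}^{n}\rho _{M}(y_{i}-x_{i}^{T}\beta )+\lambda \Vert D\beta \Vert _{1} \quad \text {subject to}\ C\beta \ge d, \qquad \rho _{M}(r)= {\left\{ \begin{array}{ll} \tfrac{1}{2}r^{2}, &{} |r|\le M,\\ M|r|-\tfrac{1}{2}M^{2}, &{} |r|>M. \end{array}\right. } \end{aligned}$$

The proofs in the Appendix work with the KKT conditions of a problem of this type.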



Acknowledgements

The authors thank the Editor, the Associate Editor, and anonymous referees for their helpful comments and suggestions. This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 11971265, 11971235 and 11831008).

Author information


Corresponding author

Correspondence to Peng Zeng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

All technical proofs of the theorems and lemmas are presented below.

1.1 Proof of Theorem 1

Proof

Let \(P_{null}\) denote the projection onto \(null(G_{-{\mathcal {A}},{\mathcal {B}}})\), so that \(P_{null}G_{-{\mathcal {A}},{\mathcal {B}}}^{T}=0\). Multiplying both sides of (9) on the left by \(P_{null}\) yields

$$\begin{aligned} P_{null}X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}}\hat{\beta }&=P_{null}X_{-{\mathcal {V}}}^{T}y_{-{\mathcal {V}}} -P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}}-P_{null}G_{-{\mathcal {A}},{\mathcal {B}}}^{T}\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}} + P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}} \nonumber \\&=P_{null}X_{-{\mathcal {V}}}^{T}y_{-{\mathcal {V}}} -P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} + P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}} \end{aligned}$$
(A.1)

Note that \(\hat{\beta }\) can be decomposed into the sum of two components as follows:

$$\begin{aligned} \hat{\beta }&= P_{null}\hat{\beta }+ P_{col(G_{-{\mathcal {A}},{\mathcal {B}}}^{T})}\hat{\beta }\\&=P_{null}\hat{\beta }+G_{-{\mathcal {A}},{\mathcal {B}}}^{T}(G_{-{\mathcal {A}},{\mathcal {B}}}G_{-{\mathcal {A}},{\mathcal {B}}}^{T})^{+}G_{-{\mathcal {A}},{\mathcal {B}}}\hat{\beta }\\&=P_{null}\hat{\beta }-Ag_{-{\mathcal {A}},{\mathcal {B}}}, \end{aligned}$$

where the last equality holds because \(G_{-{\mathcal {A}},{\mathcal {B}}}\hat{\beta }=g_{-{\mathcal {A}},{\mathcal {B}}}\). Plugging this expression of \(\hat{\beta }\) into (A.1) and simplifying, we obtain

$$\begin{aligned}&P_{null}X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}}P_{null}\hat{\beta }\\&=P_{null}X_{-{\mathcal {V}}}^{T}y_{-{\mathcal {V}}} -P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}}+ P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}}+P_{null}X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\\&=P_{null}X_{-{\mathcal {V}}}^{T}\big [y_{-{\mathcal {V}}} -(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}X_{-{\mathcal {V}}}P_{null}(D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} - X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}})+X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\big ], \end{aligned}$$

where the last equality holds because (A.1) implies \(P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} -P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}} \in col(P_{null}X_{-{\mathcal {V}}}^{T})\), which further implies

$$\begin{aligned}&P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} -P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}} \\&\quad =(P_{null}X_{-{\mathcal {V}}}^{T})(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}(X_{-{\mathcal {V}}}P_{null}) (P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} -P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}}). \end{aligned}$$

Therefore,

$$\begin{aligned}&X_{-{\mathcal {V}}}P_{null}\hat{\beta }\\&=X_{-{\mathcal {V}}}P_{null}(P_{null}X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}}P_{null})^{+} P_{null}X_{-{\mathcal {V}}}^{T}\big [y_{-{\mathcal {V}}}\\&\quad -(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}X_{-{\mathcal {V}}}P_{null}(D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} - X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}})+X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\big ]\\&=P_{X_{-{\mathcal {V}}}P_{null}}\big [y_{-{\mathcal {V}}}-(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}X_{-{\mathcal {V}}}P_{null} (D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} - X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}})+X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\big ]. \end{aligned}$$

Hence,

$$\begin{aligned} X_{-{\mathcal {V}}}\hat{\beta }&=X_{-{\mathcal {V}}}P_{null}\hat{\beta }-X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\\&=P_{X_{-{\mathcal {V}}}P_{null}}\big [y_{-{\mathcal {V}}}-(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}X_{-{\mathcal {V}}}P_{null} (D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} - X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}})\\&\qquad +X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\big ]-X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}\\&=P_{X_{-{\mathcal {V}}}P_{null}}\big [y_{-{\mathcal {V}}}-(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}X_{-{\mathcal {V}}}P_{null} (D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} - X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}})\big ]\\&\qquad -(I-P_{X_{-{\mathcal {V}}}P_{null}})X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}. \end{aligned}$$

\(\square \)
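The algebra in this proof relies on two projection matrices: \(P_{null}\), the projection onto \(null(G_{-{\mathcal {A}},{\mathcal {B}}})\), and \(P_{X_{-{\mathcal {V}}}P_{null}}\), the projection onto \(col(X_{-{\mathcal {V}}}P_{null})\). The following minimal numerical sketch shows how such projections can be formed with pseudoinverses; the matrices `X` and `G` below are random stand-ins for \(X_{-{\mathcal {V}}}\) and \(G_{-{\mathcal {A}},{\mathcal {B}}}\), not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 20, 6, 2           # rows of X_{-V}, number of coefficients, rows of G_{-A,B}
X = rng.normal(size=(n, p))  # stand-in for X_{-V}
G = rng.normal(size=(r, p))  # stand-in for G_{-A,B}

# Projection onto null(G): P_null = I - G^T (G G^T)^+ G, so that P_null @ G.T = 0.
P_null = np.eye(p) - G.T @ np.linalg.pinv(G @ G.T) @ G

# Projection onto col(X P_null): (X P_null)(P_null X^T X P_null)^+ (X P_null)^T.
XP = X @ P_null
P_XP = XP @ np.linalg.pinv(XP.T @ XP) @ XP.T

# Sanity checks: P_null annihilates G^T and both matrices are idempotent projections.
assert np.allclose(P_null @ G.T, 0)
assert np.allclose(P_null @ P_null, P_null)
assert np.allclose(P_XP @ P_XP, P_XP)
```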

1.2 Proof of Lemma 1

Proof

To prove this lemma, we first define a function of y using fixed sets \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\), and then show that it is indeed a solution by verifying that it satisfies the KKT conditions in a neighborhood of y. Note that \(\hat{\beta }\), \({\hat{v}}\) and \(\hat{\theta }\) need to be treated together.

Following the discussion at the beginning of Sect. 2, after introducing the sets \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\), we know that \({\hat{u}}_{{\mathcal {A}}}=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{{\mathcal {B}}^c}=0\), \(|{\hat{u}}_{k}|\le \lambda \) for \(k \in {\mathcal {A}}^{c}\), and \(\hat{\xi }_{j}\ge 0\) for \(j \in {\mathcal {B}}\). We can verify whether \({\mathcal {A}}\) and \({\mathcal {B}}\) are indeed the active sets by checking the values of \({\hat{u}}_{k}\) and \(\hat{\xi }_{j}\). However, there may exist \(k \in {\mathcal {A}}^{c}\) such that \(|{\hat{u}}_{k}|=\lambda \) and \(j \in {\mathcal {B}}\) such that \(\hat{\xi }_{j}=0\). We need to exclude those y’s that may yield such scenarios, which is exactly the purpose of \({\mathcal {N}}_{\lambda }\) in this lemma. For any \(y \notin {\mathcal {N}}_{\lambda }\), \(|{\hat{u}}_{k}|=\lambda \) and \(|{\hat{u}}_{k}|< \lambda \) correspond to \(k \in {\mathcal {A}}\) and \(k \in {\mathcal {A}}^{c}\), respectively, and \(\hat{\xi }_{j}=0\) and \(\hat{\xi }_{j}>0\) correspond to \(j \in {\mathcal {B}}^{c}\) and \(j \in {\mathcal {B}}\), respectively.

Define the set \({\mathcal {N}}_{\lambda }\) as

$$\begin{aligned} {\mathcal {N}}_{\lambda }=\bigcup _{{\mathcal {A}},{\mathcal {B}},s_{{\mathcal {A}}}}\bigcup _{k \in {\mathcal {A}}^{c}\setminus {\mathcal {Z}}({\mathcal {A}}), j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})} \{y: {\hat{u}}_{k}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=\pm \lambda \ \text {or} \ \hat{\xi }_{j}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=0\}, \end{aligned}$$

where the first union is taken over all possible subsets \({\mathcal {A}}\subset \{1, \ldots ,m \}\), \({\mathcal {B}}\subset \{1, \ldots ,q \}\) and all sign vectors \(s_{{\mathcal {A}}} \in \{-1, 1 \}^{|{\mathcal {A}}|}\), and in the second union \({\mathcal {Z}}({\mathcal {A}})\) and \({\mathcal {Z}}({\mathcal {B}})\) are defined by

$$\begin{aligned}&{\mathcal {Z}}({\mathcal {A}})=\{k \in {\mathcal {A}}^{c}: {\hat{u}}_{k}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})\equiv \lambda \ \text {or} \ {\hat{u}}_{k}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})\equiv - \lambda \},\\&{\mathcal {Z}}({\mathcal {B}})=\{j \in {\mathcal {B}}: \hat{\xi }_{j}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})\equiv 0\}. \end{aligned}$$

Note that the purpose of \({\mathcal {Z}}({\mathcal {A}})\) and \({\mathcal {Z}}({\mathcal {B}})\) is to exclude the scenarios in which \({\hat{u}}_{k}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=\pm \lambda \) or \(\hat{\xi }_{j}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=0\) holds identically in y. Because \({\mathcal {N}}_{\lambda }\) is a finite union of affine subsets of dimension \(n-1\), it has zero Lebesgue measure.

Recall that the solution \(\hat{\beta }\) is assumed to be unique; equations (4)-(8) then lead to an expression of \(\hat{\beta }\), \({\hat{v}}_{{\mathcal {V}}}\) and \(\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}\) as follows:

$$\begin{aligned} \left( \begin{array}{l} \hat{\beta }\\ {\hat{v}}_{{\mathcal {V}}}\\ \hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}} \end{array} \right)&= \left( \begin{array}{lll} X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}} &{} 0 &{} G_{-{\mathcal {A}},{\mathcal {B}}}^{T}\\ X_{{\mathcal {V}}} &{} I_{{\mathcal {V}}} &{} 0 \\ G_{-{\mathcal {A}},{\mathcal {B}}} &{} 0 &{}0 \end{array} \right) ^{-1} \left( \begin{array}{l} X_{-{\mathcal {V}}}^{T}y_{-{\mathcal {V}}}-\lambda D_{{\mathcal {A}}}^{T}s_{{\mathcal {A}}}+ MX_{{\mathcal {V}}}^{T}s_{{\mathcal {V}}}\\ y_{{\mathcal {V}}}-Ms_{{\mathcal {V}}}\\ -g_{-{\mathcal {A}},{\mathcal {B}}} \end{array} \right) ,\\&=\left( \begin{array}{lll} X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}} &{} 0 &{} G_{-{\mathcal {A}},{\mathcal {B}}}^{T}\\ X_{{\mathcal {V}}} &{} I_{{\mathcal {V}}} &{} 0 \\ G_{-{\mathcal {A}},{\mathcal {B}}} &{} 0 &{}0 \end{array} \right) ^{-1} \left( \begin{array}{l} X_{-{\mathcal {V}}}^{T}I_{-{\mathcal {V}}}\\ I_{{\mathcal {V}}}\\ 0 \end{array} \right) y \\&\quad {} -\left( \begin{array}{lll} X_{-{\mathcal {V}}}^{T}X_{-{\mathcal {V}}} &{} 0 &{} G_{-{\mathcal {A}},{\mathcal {B}}}^{T} \\ X_{{\mathcal {V}}} &{} I_{{\mathcal {V}}} &{} 0 \\ G_{-{\mathcal {A}},{\mathcal {B}}} &{} 0 &{}0 \end{array} \right) ^{-1} \left( \begin{array}{l} \lambda D_{{\mathcal {A}}}^{T}s_{{\mathcal {A}}}- MX_{{\mathcal {V}}}^{T}s_{{\mathcal {V}}} \\ Ms_{{\mathcal {V}}}\\ g_{-{\mathcal {A}},{\mathcal {B}}} \end{array} \right) \\&\triangleq H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})y - h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}), \end{aligned}$$

where \(I_{{\mathcal {V}}}\) is the \(|{\mathcal {V}}|\times n\) matrix such that \(I_{{\mathcal {V}}}y=y_{{\mathcal {V}}}\) and \(I_{-{\mathcal {V}}}\) is the \((n-|{\mathcal {V}}|)\times n\) matrix such that \(I_{-{\mathcal {V}}}y=y_{-{\mathcal {V}}}\). The matrix \(H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and vector \(h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\) are introduced to ease the presentation; they depend on y only through the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\).
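As a concrete illustration of the block system above, the sketch below assembles the coefficient matrix and right-hand side for randomly generated stand-ins and solves for \((\hat{\beta }, {\hat{v}}_{{\mathcal {V}}}, \hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}})\). All dimensions, the values of \(\lambda \) and the Huber constant M, and the matrices themselves are illustrative assumptions; the sketch only solves the linear system for given sets and signs and does not check the sign and feasibility conditions that determine those sets.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_good, n_V, n_A, r = 5, 15, 3, 2, 1    # sizes of beta, y_{-V}, y_V, active penalty rows, rows of G
lam, M = 0.5, 1.345                        # penalty level and Huber constant (illustrative values)

X_mV = rng.normal(size=(n_good, p))        # stand-in for X_{-V}
X_V  = rng.normal(size=(n_V, p))           # stand-in for X_{V}
D_A  = rng.normal(size=(n_A, p))           # stand-in for D_{A}
G    = rng.normal(size=(r, p))             # stand-in for G_{-A,B}
g    = rng.normal(size=r)
y_mV, y_V = rng.normal(size=n_good), rng.normal(size=n_V)
s_A = rng.choice([-1.0, 1.0], size=n_A)    # sign vector on the active penalty rows
s_V = rng.choice([-1.0, 1.0], size=n_V)    # sign vector on the outlying residuals

# Block coefficient matrix of the system displayed above.
K = np.block([
    [X_mV.T @ X_mV, np.zeros((p, n_V)), G.T],
    [X_V,           np.eye(n_V),        np.zeros((n_V, r))],
    [G,             np.zeros((r, n_V)), np.zeros((r, r))],
])

# Right-hand side, following the first display above.
rhs = np.concatenate([
    X_mV.T @ y_mV - lam * D_A.T @ s_A + M * X_V.T @ s_V,
    y_V - M * s_V,
    -g,
])

sol = np.linalg.pinv(K) @ rhs              # pseudoinverse guards against singular K
beta_hat, v_V_hat, theta_hat = sol[:p], sol[p:p + n_V], sol[p + n_V:]
print(beta_hat)
```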

In the following, we show that for any \(y \notin {\mathcal {N}}_{\lambda }\), the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and signs \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\) are locally constant. For any \(y_{0} \notin {\mathcal {N}}_{\lambda }\), let \(\hat{\beta }(y_{0})\), \({\hat{v}}(y_{0})\) and \(\hat{\theta }(y_{0})\) be the solution with boundary sets \({\mathcal {A}}(y_{0})\), \({\mathcal {B}}(y_{0})\), \({\mathcal {V}}(y_{0})\) and signs \(s_{{\mathcal {A}}(y_{0})}\), \(s_{{\mathcal {V}}(y_{0})}\). We have \({\hat{u}}_{{\mathcal {A}}}(y_{0})=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{-{\mathcal {B}}}(y_{0})=0\), \({\hat{v}}_{-{\mathcal {V}}}(y_{0})=0\), and

$$\begin{aligned} \left( \begin{array}{l} \hat{\beta }(y_{0})\\ {\hat{v}}_{{\mathcal {V}}}(y_{0})\\ \hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}(y_{0}) \end{array} \right) = H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})y_{0} - h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}), \end{aligned}$$
(A.2)

where \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) are written instead of \({\mathcal {A}}(y_{0})\), \({\mathcal {B}}(y_{0})\), \({\mathcal {V}}(y_{0})\) in the above expression for simplicity of notation. By the definition of the boundary sets, \(|{\hat{u}}_{k}(y_{0})|< \lambda \) for \(k \in {\mathcal {A}}^{c} \setminus {\mathcal {Z}}({\mathcal {A}})\) and \(|\hat{\xi }_{j}(y_{0})|> 0\) for \(j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})\). We also have \(D_{k}\hat{\beta }(y_{0}) \ne 0\) for \(k \in {\mathcal {A}}\), \(C_{j}\hat{\beta }(y_{0})>d_{j}\) for \(j \in {\mathcal {B}}^{c}\), and \({\hat{v}}_{i}(y_{0}) \ne 0\) for \(i \in {\mathcal {V}}\).

For any new point y, we first define a functional form for the solution and then show that it is indeed a solution. Specifically, define \({\hat{u}}_{{\mathcal {A}}}(y)=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{-{\mathcal {B}}}(y)=0\), \({\hat{v}}_{-{\mathcal {V}}}(y)=0\), and

$$\begin{aligned} \left( \begin{array}{l} \hat{\beta }(y)\\ {\hat{v}}_{{\mathcal {V}}}(y)\\ \hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}(y) \end{array} \right) = H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})y - h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}), \end{aligned}$$

where \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\) are the same as in (A.2). It is clear that \((\hat{\beta }(y),{\hat{v}}_{{\mathcal {V}}}(y),\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}(y))\) satisfies (4)-(8). Because of the continuity of the affine mapping

$$\begin{aligned} y\mapsto H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})y - h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}), \end{aligned}$$

there exists a neighborhood \({\mathcal {U}}\) of \(y_{0}\) such that for any \(y \in {\mathcal {U}}\) we have \(|{\hat{u}}_{k}(y)|< \lambda \) for \(k \in {\mathcal {A}}^{c} \setminus {\mathcal {Z}}({\mathcal {A}})\), \(|\hat{\xi }_{j}(y)|> 0\) for \(j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})\), \(D_{k}\hat{\beta }(y) \ne 0\) with unchanged sign for \(k \in {\mathcal {A}}\), \(C_{j}\hat{\beta }(y)>d_{j}\) for \(j \in {\mathcal {B}}^{c}\), and \({\hat{v}}_{i}(y) \ne 0\) with unchanged sign for \(i \in {\mathcal {V}}\). This implies that \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}\) are indeed the active sets and sign vectors corresponding to any \(y \in {\mathcal {U}}\). Therefore, \(\{\hat{\beta }(y),{\hat{v}}_{{\mathcal {V}}}(y),\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}(y)\}\) is indeed a solution at every \(y \in {\mathcal {U}}\), with the same boundary sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\). This completes the proof of the local constancy of the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and signs \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\). \(\square \)

1.3 Proof of Lemma 2

Proof

For any \(y \notin \mathcal {N}_{\lambda }\) and the corresponding sets \({\mathcal {V}}\), \({\mathcal {A}}\) and \({\mathcal {B}}\), Lemma 1 implies that \(\hat{\mu }(y)={\hat{y}}\) can be written as

$$\begin{aligned} \hat{\mu }(y)&=XH({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})y - Xh(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\\&=W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})y-w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}), \end{aligned}$$

where the matrix \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and vector \(w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\) depend on y only via sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\). Similarly, for any \(\Delta y\), we have

$$\begin{aligned} \hat{\mu }(y+\Delta y)=W({\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}})(y+\Delta y)-w(\lambda ,{\widetilde{{\mathcal {A}}}}, {\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}},s_{{\widetilde{{\mathcal {A}}}}},s_{{\widetilde{{\mathcal {V}}}}}), \end{aligned}$$

where \({\widetilde{{\mathcal {V}}}}\), \({\widetilde{{\mathcal {A}}}}\), \({\widetilde{{\mathcal {B}}}}\) and signs \(s_{{\widetilde{{\mathcal {A}}}}},s_{{\widetilde{{\mathcal {V}}}}}\) correspond to \(y+\Delta y\). Because of Lemma 1, for \(y \notin \mathcal {N}_{\lambda }\), we can select \(\Delta y\) small enough such that \(y+\Delta y \in {\mathcal {U}}\), where \({\mathcal {U}}\) is defined as in Lemma 1. In this case, \({\widetilde{{\mathcal {V}}}}={\mathcal {V}}\), \({\widetilde{{\mathcal {A}}}}={\mathcal {A}}\), \({\widetilde{{\mathcal {B}}}}={\mathcal {B}}\), \(s_{{\widetilde{{\mathcal {A}}}}}=s_{{\mathcal {A}}}\), \(s_{{\widetilde{{\mathcal {V}}}}}=s_{{\mathcal {V}}}\), and therefore \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})=W({\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}})\) and \(w(\lambda ,{\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}},s_{{\widetilde{{\mathcal {A}}}}}, s_{{\widetilde{{\mathcal {V}}}}})=w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\). Hence,

$$\begin{aligned}&\Vert \hat{\mu }(y+\Delta y)-\hat{\mu }(y)\Vert _{2}=\Vert W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\Delta y\Vert _{2}\\&\quad \le |||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2} \Vert \Delta y\Vert _{2}, \end{aligned}$$

where \(|||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}\) is the induced matrix norm of \( W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) \) defined by

$$\begin{aligned} |||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}:=\max _{\Vert \varvec{x}\Vert _{2}=1}\Vert W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\varvec{x}\Vert _{2}. \end{aligned}$$

The matrix norm induced by the Euclidean norm (the \(\ell _{2}\)-norm) has the explicit form

$$\begin{aligned} |||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}=\sqrt{\rho _{\max }(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})^{*}W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}))}, \end{aligned}$$

where \(\rho _{\max }(\varvec{A})=\max \{|\lambda |: \lambda \ \text {is an eigenvalue of} \ \varvec{A}\}\) for a matrix \(\varvec{A}\) and \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})^{*}\) is the conjugate transpose of \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\). We can see that \(|||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}\) is a constant that depends on y only through the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\). Thus it can be written as \(C_{y}\), and we have

$$\begin{aligned} \Vert \hat{\mu }(y+\Delta y)-\hat{\mu }(y)\Vert _{2} \le C_{y}\Vert \Delta y\Vert _{2}. \end{aligned}$$

\(\square \)
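The induced-norm identity used above, \(|||W |||_{2}=\sqrt{\rho _{\max }(W^{*}W)}\), and the resulting Lipschitz bound are easy to check numerically. In the sketch below, a random matrix stands in for \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\).

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8))   # stand-in for W(A, B, V)

# Induced 2-norm (largest singular value) ...
op_norm = np.linalg.norm(W, ord=2)

# ... equals the square root of the largest eigenvalue of W^T W (W is real here).
spec = np.sqrt(np.max(np.linalg.eigvalsh(W.T @ W)))
assert np.isclose(op_norm, spec)

# Hence the map y -> W y is Lipschitz with constant C_y = op_norm.
dy = rng.normal(size=8)
assert np.linalg.norm(W @ dy) <= op_norm * np.linalg.norm(dy) + 1e-12
```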

1.4 Proof of Theorem 2

Proof

Stein’s lemma is applicable to the Huber lcg-lasso fit because Lemmas 2 and 3 imply that \(\hat{\mu }(y)={\hat{y}}\) is continuous and almost differentiable with respect to y. Therefore,

$$\begin{aligned} \text {df}(\hat{\mu })={\mathbb {E}}\bigg [\sum _{i=1}^{n}\frac{\partial {\hat{y}}_{i}}{\partial y_{i}}\bigg ]={\mathbb {E}}\bigg [\sum _{i \in {\mathcal {V}}}\frac{\partial {\hat{y}}_{i}}{\partial y_{i}}+\sum _{i \in {\mathcal {V}}^{c}}\frac{\partial {\hat{y}}_{i}}{\partial y_{i}}\bigg ]. \end{aligned}$$

Recall that \(\hat{\beta }\) depends explicitly only on \(y_{-{\mathcal {V}}}\), which implies that the derivatives of the fits \({\hat{y}}_{i}\) for \(i \in {\mathcal {V}}\) are 0, that is,

$$\begin{aligned} \frac{\partial {\hat{y}}_{i}}{\partial y_{i}}=\frac{\partial x_{i}^{T}\hat{\beta }(y_{-{\mathcal {V}}})}{\partial y_{i}}=0, \quad \text {for}\ i \in {\mathcal {V}}. \end{aligned}$$

This leads to the following expression of the degrees of freedom:

$$\begin{aligned} \text {df}(\hat{\mu })={\mathbb {E}}\bigg [\sum _{i \in {\mathcal {V}}^{c}}\frac{\partial {\hat{y}}_{i}}{\partial y_{i}}\bigg ]={\mathbb {E}}\big [ (\nabla \cdot {\hat{y}}_{-{\mathcal {V}}}) (y_{-{\mathcal {V}}})\big ]. \end{aligned}$$

Now consider the expression of \({\hat{y}}_{-{\mathcal {V}}}\) as in Theorem 1; we have

$$\begin{aligned} {\hat{y}}_{-{\mathcal {V}}}&= P_{X_{-{\mathcal {V}}}P_{null}}y_{-{\mathcal {V}}}-P_{X_{-{\mathcal {V}}}P_{null}}(X_{-{\mathcal {V}}}P_{null}X_{-{\mathcal {V}}}^{T})^{+}X_{-{\mathcal {V}}}P_{null} (D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} - X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}}) \\&\quad {} -(I-P_{X_{-{\mathcal {V}}}P_{null}})X_{-{\mathcal {V}}}Ag_{-{\mathcal {A}},{\mathcal {B}}}. \end{aligned}$$

The first term on the right-hand side of the previous equation depends on y explicitly, and the remaining terms depend on y only via \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\). According to Lemma 1, \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\) are constant in a neighborhood of y, which implies that their derivatives with respect to y are 0. Therefore,

$$\begin{aligned} \text {df}(\hat{\mu })={\mathbb {E}}\big [ (\nabla \cdot {\hat{y}}_{-{\mathcal {V}}}) (y_{-{\mathcal {V}}})\big ]=\text {tr}(P_{X_{-{\mathcal {V}}}P_{null}}). \end{aligned}$$

Because the trace of a projection matrix equals the dimension of the space onto which it projects, the theorem follows. \(\square \)
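Once the sets \({\mathcal {V}}\), \({\mathcal {A}}\), \({\mathcal {B}}\) have been identified for a given fit, Theorem 2 reduces the degrees of freedom to the trace of a projection, i.e., the dimension of \(col(X_{-{\mathcal {V}}}P_{null})\). The sketch below computes this quantity for random stand-in matrices and cross-checks it against a finite-difference evaluation of the divergence in Stein's formula for the linear part of the fit; the dimensions and matrices are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
n_good, p, r = 25, 8, 3
X_mV = rng.normal(size=(n_good, p))   # stand-in for X_{-V} (rows not treated as outliers)
G    = rng.normal(size=(r, p))        # stand-in for G_{-A,B} (active penalty/constraint rows)

# P_null projects onto null(G_{-A,B}); P_XP projects onto col(X_{-V} P_null).
P_null = np.eye(p) - G.T @ np.linalg.pinv(G @ G.T) @ G
XP = X_mV @ P_null
P_XP = XP @ np.linalg.pinv(XP.T @ XP) @ XP.T

# Theorem 2: df = tr(P_{X_{-V} P_null}) = dim col(X_{-V} P_null) = rank(X_{-V} P_null).
df_trace = np.trace(P_XP)
df_rank = np.linalg.matrix_rank(XP)

# Cross-check with the divergence in Stein's formula for the linear map y_{-V} -> P_XP y_{-V}:
# sum_i d(yhat_i)/d(y_i) equals tr(P_XP).
y = rng.normal(size=n_good)
eps = 1e-6
div = sum((P_XP @ (y + eps * np.eye(n_good)[i]) - P_XP @ y)[i] / eps
          for i in range(n_good))

print(round(df_trace, 6), df_rank, round(div, 4))  # all approximately p - r = 5 in this generic case
```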


About this article


Cite this article

Liu, Y., Zeng, P. & Lin, L. Degrees of freedom for regularized regression with Huber loss and linear constraints. Stat Papers 62, 2383–2405 (2021). https://doi.org/10.1007/s00362-020-01192-2
