Variable selection and collinearity processing for multivariate data via row-elastic-net regularization

  • Original Paper
  • AStA Advances in Statistical Analysis

Abstract

Multivariate data are collected in many fields, such as chemometrics, econometrics, financial engineering and genetics. In multivariate data, heteroscedasticity and collinearity occur frequently, and selecting relevant predictors is another key issue in the analysis. To accomplish these tasks, a multivariate linear regression model is often constructed. We therefore propose a row-sparse elastic-net regularized multivariate Huber regression model in this paper. For this new model, we prove its grouping-effect property and its robustness against sample outliers. Based on the KKT condition, an accelerated proximal subgradient algorithm is designed to solve the proposed model, and its convergence is also established. To demonstrate its accuracy and efficiency, simulation and real-data experiments are carried out. The numerical results show that the new model handles heteroscedasticity and collinearity well.


References

  • Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  • Breiman, L., Friedman, J.H.: Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 59, 3–54 (1997)

  • Chen, B.Z., Kong, L.C.: High-dimensional least square matrix regression via elastic net penalty. Pac. J. Optim. 13(2), 185–196 (2017)

  • Chen, B.Z., Zhai, W.J., Huang, Z.Y.: Low-rank elastic-net regularized multivariate Huber regression model. Appl. Math. Model. 87, 571–583 (2020)

  • Das, J., Gayvert, K., Bunea, F., Wegkamp, M., Yu, H.: Encapp: elastic-net-based prognosis prediction and biomarker discovery for human cancers. BMC Genom. 16, 1–13 (2015)

  • Hastie, T., Tibshirani, R., et al.: ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1, 1–21 (2000)

  • Huber, P.: Robust estimation of a location parameter. Ann. Math. Stat. 35, 73–101 (1964)

  • Huber, P.: Robust Statistics. Wiley, New York (1981)

  • Koenker, R., Bassett, G.: Regression quantiles. Econometrica 46, 33–50 (1978)

  • Mukherjee, A., Zhu, J.: Reduced rank ridge regression and its kernel extensions. Stat. Anal. Data Min. ASA Data Sci J. 4, 612–622 (2011)

  • Negahban, S., Wainwright, M.: Simultaneous support recovery in high dimensions: benefits and perils of block \(l_1/l_{\infty }\)-regularization. IEEE Trans. Inform. Theory 57, 3841–3863 (2011)

  • Obozinski, G., Wainwright, M., Jordan, M.: Support union recovery in high-dimensional multivariate regression. Ann. Stat. 39(1), 1–47 (2011)

  • Rodolà, E., Torsello, A., Harada, T., Kuniyoshi, Y., Cremers, D.: Elastic net constraints for shape matching. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1169–1176 (2013)

  • Similä, T., Tikka, J.: Input selection and shrinkage in multiresponse linear regression. Comput. Stat. Data Anal. 52, 406–422 (2007)

  • Skagerberg, S., MacGregor, J.F., Kiparissides, C.: Multivariate data analysis applied to low-density polyethylene reactors. Chemom. Intell. Lab. Syst. 14, 341–356 (1992)

  • Stransky, N.: The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483(7391), 603–607 (2012)

  • Toh, K., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim. 6, 615–640 (2010)

  • Tropp, J.A.: Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Process. 86, 589–602 (2006)

  • Turlach, B., Venables, W., Wright, S.: Simultaneous variable selection. Technometrics 47, 350–363 (2005)

  • Xin, X., Hu, J., Liu, L.: On the oracle property of a generalized adaptive elastic-net for multivariate linear regression with a diverging number of parameters. J. Multivar. Anal. 162, 16–31 (2017)

  • Yi, C., Huang, J.: Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. J. Comput. Graph. Stat. 26, 547–557 (2017)

  • Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005)

  • Zou, H., Zhang, H.: On the adaptive elastic-net with a diverging number of parameters. Ann. Statist. 37, 1733–1751 (2009)


Acknowledgements

The authors are very grateful to the two anonymous reviewers and the associate editor for their insightful remarks and comments, which considerably improved the presentation of our paper.

Author information

Corresponding author

Correspondence to Bingzhen Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the Key Program of Cangzhou Jiaotong College (HB202001002) and the National Natural Science Foundation of China (12071022).

Appendix

1.1 Proof of Theorem 2

As shown in Chen et al. (2020), the gradient of \(H_\alpha ^n(B)\) is

$$\begin{aligned} \nabla _B H_\alpha ^n(B)= -\tfrac{1}{n}X^\mathrm{T}\varPsi (B), \end{aligned}$$

where \(\varPsi (B)=\left( \psi ^\mathrm{T}\left( {{\varvec{y}}}_1-B^\mathrm{T}{{\varvec{x}}}_1\right) , \cdots , \psi ^\mathrm{T}\left( {{\varvec{y}}}_n-B^\mathrm{T}{{\varvec{x}}}_n\right) \right) ^\mathrm{T}\),

$$\psi ({{\varvec{z}}})={\left\{ \begin{array}{ll} {{\varvec{z}}},&{}\Vert {{\varvec{z}}}\Vert _\mathrm{2} \le \alpha ,\\ \tfrac{\alpha }{\Vert {{\varvec{z}}}\Vert _\mathrm{2} }{{\varvec{z}}} ,&{}\Vert {{\varvec{z}}}\Vert _\mathrm{2} > \alpha , \\ \end{array}\right. }$$

and \(\varPsi (B)\) has the following upper bound

$$\begin{aligned} \Vert \varPsi (B)\Vert _\mathrm{F}=\sqrt{\sum \nolimits _{i=1}^{n}\sum \nolimits _{j=1}^{q}\varvec{\psi }_{ij}^2} =\sqrt{\sum \nolimits _{i=1}^{n}\Vert \varvec{\psi }_i\Vert _2^2}\le \sqrt{n\alpha ^2} = \sqrt{n}\alpha . \end{aligned}$$
(10)
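
As a numerical aside, the score function \(\psi\), the gradient \(\nabla _B H_\alpha ^n(B)\) and the bound (10) can be illustrated with a short NumPy sketch. The sketch below is not the authors' implementation; the array shapes (\(X\in \mathbb {R}^{n\times m}\), \(Y\in \mathbb {R}^{n\times q}\), \(B\in \mathbb {R}^{m\times q}\)) follow the notation of the paper, while the sample sizes and the value of \(\alpha\) are arbitrary choices made only for this example.

```python
import numpy as np

# Vector Huber score psi(z): identity inside the ball ||z||_2 <= alpha,
# radial shrinkage onto the ball of radius alpha outside it.
def psi(z, alpha):
    norm_z = np.linalg.norm(z)
    return z if norm_z <= alpha else (alpha / norm_z) * z

# Gradient of H_alpha^n(B) = (1/n) * sum_i h_alpha(y_i - B^T x_i),
# i.e. -(1/n) * X^T Psi(B), where the i-th row of Psi(B) is psi(y_i - B^T x_i).
def grad_huber(B, X, Y, alpha):
    n = X.shape[0]
    R = Y - X @ B                                   # i-th row equals y_i - B^T x_i
    Psi = np.vstack([psi(r, alpha) for r in R])
    return -(1.0 / n) * X.T @ Psi

# Quick check of bound (10): ||Psi(B)||_F <= sqrt(n) * alpha.
rng = np.random.default_rng(0)
n, m, q, alpha = 50, 10, 3, 1.345                   # illustrative sizes and alpha
X = rng.normal(size=(n, m))
Y = rng.normal(size=(n, q))
B = rng.normal(size=(m, q))
G = grad_huber(B, X, Y, alpha)                      # m x q gradient matrix
Psi = np.vstack([psi(r, alpha) for r in Y - X @ B])
assert np.linalg.norm(Psi, "fro") <= np.sqrt(n) * alpha + 1e-12
```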

Let \(L(\lambda _1,\lambda _2, B) = \tfrac{1}{n}\sum \nolimits _{i=1}^{n} h_\alpha \left( {{\varvec{y}}}_i-B^\mathrm{T}{{\varvec{x}}}_i\right) +\lambda _1\sum \nolimits _{k=1}^{m}\Vert {{\varvec{b}}}_k\Vert _2+\tfrac{\lambda _2}{2}\sum \nolimits _{k=1}^{m}\Vert {{\varvec{b}}}_k\Vert _2^2\). Then, the Karush–Kuhn–Tucker condition of optimization problem (2) is

$$\begin{aligned} 0\in \nabla _B L\left( \lambda _1,\lambda _2, \widehat{B}\right) = -\tfrac{1}{n}X^\mathrm{T}\varPsi \left( \widehat{B}\right) + \lambda _2\widehat{B} + \lambda _1S, \end{aligned}$$
(11)

where \(S=\left( {{\varvec{s}}}_1,\cdots ,{{\varvec{s}}}_m\right) ^\mathrm{T}\) and \({{\varvec{s}}}_k\) satisfies

$${\left\{ \begin{array}{ll}{{\varvec{s}}}_k=\tfrac{\hat{{{\varvec{b}}}}_k}{\Vert \hat{{{\varvec{b}}}}_k\Vert _2}, &{} \hat{{{\varvec{b}}}}_k\ne 0,\\ \Vert {{\varvec{s}}}_k\Vert _2\le 1, &{}\hat{{{\varvec{b}}}}_k=0.\end{array}\right. }$$

Let \({{\varvec{e}}}^{(k)}= (0,\cdots , 0,1, 0,\cdots , 0)^\mathrm{T}\in \mathbb {R}^m\), where “1” is the kth component of \({{\varvec{e}}}^{(k)}\). Multiplying both sides of equation (11) by \({{\varvec{e}}}^{(k)}\), it follows that

$$0\in -\tfrac{1}{n}{{\varvec{e}}}^{(k)}X^\mathrm{T}\varPsi \left( \widehat{B}\right) + \lambda _2{{\varvec{e}}}^{(k)}\widehat{B} + \lambda _1{{\varvec{e}}}^{(k)}S,$$

i.e.,

$$\begin{aligned} -\tfrac{1}{n}\varPsi \left( \widehat{B}\right) ^\mathrm{T}\dot{{{\varvec{x}}}}_k + \lambda _2\hat{{{\varvec{b}}}}_k + \lambda _1 {{\varvec{s}}}_k\ni 0. \end{aligned}$$
(12)

To derive the upper bound for \(\Vert \hat{{{\varvec{b}}}}_i-\hat{{{\varvec{b}}}}_j\Vert _2\), we consider the case where \(\hat{{{\varvec{b}}}}_i\ne 0\) and \(\hat{{{\varvec{b}}}}_j\ne 0\). By letting \(k=i\) and \(k=j\) in (12), we obtain

$$\begin{aligned} {\left\{ \begin{array}{ll} -\tfrac{1}{n}\varPsi (\widehat{B})^\mathrm{T}\dot{{{\varvec{x}}}}_i + \lambda _2\hat{{{\varvec{b}}}}_i + \lambda _1 \tfrac{\hat{{{\varvec{b}}}}_i}{\Vert \hat{{{\varvec{b}}}}_i\Vert _2}=0, \\ {-\tfrac{1}{n}}\varPsi (\widehat{B})^\mathrm{T}\dot{{{\varvec{x}}}}_j + \lambda _2\hat{{{\varvec{b}}}}_j + \lambda _1 \tfrac{\hat{{{\varvec{b}}}}_j}{\Vert \hat{{{\varvec{b}}}}_j\Vert _2}=0. \end{array}\right. } \end{aligned}$$

Then, we have

$$\begin{aligned} \lambda _2(\hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j)+ \lambda _1 \left( \tfrac{\hat{{{\varvec{b}}}}_i}{\Vert \hat{{{\varvec{b}}}}_i\Vert _2}- \tfrac{\hat{{{\varvec{b}}}}_j}{\Vert \hat{{{\varvec{b}}}}_j\Vert _2}\right) = \tfrac{1}{n}\varPsi (\widehat{B})^\mathrm{T}\left( \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\right) . \end{aligned}$$

It follows that

$$\begin{aligned} \left\| \lambda _2(\hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j) + \lambda _1 \left( \tfrac{\hat{{{\varvec{b}}}}_i}{\Vert \hat{{{\varvec{b}}}}_i\Vert _2} - \tfrac{\hat{{{\varvec{b}}}}_j}{\Vert \hat{{{\varvec{b}}}}_j\Vert _2}\right) \right\| _2= \tfrac{1}{n}\left\| \varPsi (\widehat{B})^\mathrm{T}\left( \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\right) \right\| _2. \end{aligned}$$
(13)

On the one hand,

$$\begin{aligned} \left\| \varPsi (\widehat{B})^\mathrm{T}\left( \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\right) \right\| _2&\le \left\| \varPsi (\widehat{B})\right\| _2\cdot \left\| \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\right\| _2 \nonumber \\ {}&\le \left\| \varPsi (\widehat{B})\right\| _\mathrm{F}\cdot \left\| \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\right\| _2\nonumber \\ {}&\le \sqrt{n}\alpha \cdot \left\| \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\right\| _2\nonumber \\ {}&\le n\alpha \cdot \sqrt{2(1-\rho )}. \end{aligned}$$
(14)

Here the last inequality uses the standardization of the predictors: since \(\Vert \dot{{{\varvec{x}}}}_i\Vert _2^2=\Vert \dot{{{\varvec{x}}}}_j\Vert _2^2=n\) and \(\dot{{{\varvec{x}}}}_i^\mathrm{T}\dot{{{\varvec{x}}}}_j=n\rho\), we have \(\Vert \dot{{{\varvec{x}}}}_i-\dot{{{\varvec{x}}}}_j\Vert _2=\sqrt{2n(1-\rho )}\).

On the other hand,

$$\begin{aligned} (\hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j)^\mathrm{T}\left( \tfrac{\hat{{{\varvec{b}}}}_i}{\Vert \hat{{{\varvec{b}}}}_i\Vert _2}- \tfrac{\hat{{{\varvec{b}}}}_j}{\Vert \hat{{{\varvec{b}}}}_j\Vert _2}\right) =\left( \Vert \hat{{{\varvec{b}}}}_i \Vert _2 + \Vert \hat{{{\varvec{b}}}}_j \Vert _2 \right) \left( 1 -\cos (\theta )\right) \ge 0, \end{aligned}$$

where \(\theta\) is the angle between \(\hat{{{\varvec{b}}}}_i\) and \(\hat{{{\varvec{b}}}}_j\). Thus,

$$\begin{aligned} \left\| \lambda _2(\hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j)+ \lambda _1 \left( \tfrac{\hat{{{\varvec{b}}}}_i}{\Vert \hat{{{\varvec{b}}}}_i\Vert _2}- \tfrac{\hat{{{\varvec{b}}}}_j}{\Vert \hat{{{\varvec{b}}}}_j\Vert _2}\right) \right\| _2 \ge \lambda _2\Vert \hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j\Vert _2. \end{aligned}$$
(15)

Combining (13), (14) and (15), we obtain \(\lambda _2\Vert \hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j\Vert _2\le \alpha \sqrt{2(1-\rho )}\), which yields the desired result. \(\square\)

1.2 Proof of Corollary 1

If \(\dot{{{\varvec{x}}}}_{i}=\dot{{{\varvec{x}}}}_{j}\), then the sample correlation coefficient \(\rho =\tfrac{1}{n}\dot{{{\varvec{x}}}}^\mathrm{T}_{i} \dot{{{\varvec{x}}}}_{j}=1\). Considering the upper bound (5), we have \(\Vert \hat{{{\varvec{b}}}}_i -\hat{{{\varvec{b}}}}_j\Vert _2\le 0\). It follows that \(\hat{{{\varvec{b}}}}_i=\hat{{{\varvec{b}}}}_j\). \(\square\)

1.3 Proof of Theorem 3

For the optimization problem (6), the Karush–Kuhn–Tucker conditions are

$$\begin{aligned} \left( L_H+\lambda _2\right) \left( {{\varvec{b}}}_k-{{\varvec{g}}}_k\right) +\lambda _1{{\varvec{s}}}_k=0,~\forall ~k\in \{1,2,\cdots ,m\}, \end{aligned}$$
(16)

where \({{\varvec{g}}}_k^{\text {T}}\) is the kth row of G and

$$\begin{aligned} {\left\{ \begin{array}{ll}{{\varvec{s}}}_k=\tfrac{{{\varvec{b}}}_k}{\Vert {{\varvec{b}}}_k\Vert _2}, &{} {{\varvec{b}}}_k\ne 0,\\ \Vert {{\varvec{s}}}_k\Vert _2\le 1, &{}{{\varvec{b}}}_k=0.\end{array}\right. } \end{aligned}$$
(17)

If \({{\varvec{b}}}_k=0\), equality (16) becomes

$$\begin{aligned} -\left( L_H+\lambda _2\right) {{\varvec{g}}}_k+\lambda _1{{\varvec{s}}}_k=0. \end{aligned}$$

It follows that

$${{\varvec{s}}}_k=\tfrac{L_H+\lambda _2}{\lambda _1}{{\varvec{g}}}_k.$$

Considering the second inequality in (17), we use

$$\begin{aligned} \Vert {{\varvec{g}}}_k\Vert _2\le \tfrac{\lambda _1}{L_H+\lambda _2} \end{aligned}$$
(18)

to determine whether \({{\varvec{b}}}_k=0\).

If \({{\varvec{b}}}_k\ne 0\), (16) takes the following form

$$\begin{aligned} \left( L_H+\lambda _2\right) \left( {{\varvec{b}}}_k-{{\varvec{g}}}_k\right) +\lambda _1\tfrac{{{\varvec{b}}}_k}{\Vert {{\varvec{b}}}_k\Vert _2}=0. \end{aligned}$$
(19)

It is equivalent to

$$\begin{aligned} \left( L_H+\lambda _2+\tfrac{\lambda _1}{\Vert {{\varvec{b}}}_k\Vert _2}\right) {{\varvec{b}}}_k=\left( L_H+\lambda _2\right) {{\varvec{g}}}_k. \end{aligned}$$

Taking the \(\ell _2\)-norm on both sides, we obtain that

$$\Vert {{\varvec{b}}}_k\Vert _2=\Vert {{\varvec{g}}}_k\Vert _2-\tfrac{\lambda _1}{L_H+\lambda _2}.$$

Inserting this expression into (19), we obtain

$$\begin{aligned} {{\varvec{b}}}_k=\left( 1-\tfrac{\lambda _1}{\left( L_H+\lambda _2\right) \Vert {{\varvec{g}}}_k\Vert _2} \right) {{\varvec{g}}}_k. \end{aligned}$$
(20)

Combining (18) and (20), the desired result (7) can be obtained.\({\square }\)
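
The closed-form solution combining (18) and (20) is a row-wise (block) soft-thresholding of \(G\). The following minimal sketch, written under the assumption that \(G\) is stored as an \(m\times q\) NumPy array and that \(L_H\), \(\lambda _1\) and \(\lambda _2\) are given scalars, illustrates how the two cases are applied row by row; it is an illustration of the result, not the authors' code.

```python
import numpy as np

def row_soft_threshold(G, lam1, lam2, L_H):
    """Row-wise solution of subproblem (6) via (18) and (20):
    rows with ||g_k||_2 <= lam1 / (L_H + lam2) are set to zero,
    all other rows are shrunk by 1 - lam1 / ((L_H + lam2) * ||g_k||_2)."""
    B = np.zeros_like(G)
    thresh = lam1 / (L_H + lam2)
    for k, g_k in enumerate(G):
        norm_g = np.linalg.norm(g_k)
        if norm_g > thresh:                       # otherwise b_k = 0 by (18)
            B[k] = (1.0 - thresh / norm_g) * g_k  # shrinkage factor from (20)
    return B
```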

1.4 Proof of Theorem 4

Following the procedure in Beck and Teboulle (2009) or Toh and Yun (2010), inequality (8) can be easily obtained.

Considering the triangle inequality \(\Vert \hat{B}-B^0\Vert _\mathrm{F}\le \Vert \hat{B}\Vert _\mathrm{F}+\Vert B^0\Vert _\mathrm{F}\) and (8), we have

$$\begin{aligned} F(B^k)- F(\hat{B})\le \tfrac{2L_H\Vert \hat{B}-B^0\Vert _\mathrm{F} ^2}{(k+1)^2}\le \tfrac{2L_H\left( \Vert \hat{B}\Vert _\mathrm{F}+\Vert B^0\Vert _\mathrm{F}\right) ^2}{(k+1)^2}. \end{aligned}$$
(21)

Note that \(\hat{B}\) is the solution to (2). It follows that

$$\begin{aligned} H_{\alpha }^n(\hat{B})+\lambda _1\sum \limits _{j=1}^{m}\Vert \hat{{{\varvec{b}}}}_j\Vert _2 + \tfrac{\lambda _2}{2} \Vert \hat{B}\Vert _\mathrm{F}^2&< F(0)=H_{\alpha }^n(0) =\tfrac{1}{n}\sum _{i=1}^{n}h_{\alpha }({{\varvec{y}}}_i)\nonumber \\ {}&=\tfrac{1}{n}\sum _{i=1}^{n} {\left\{ \begin{array}{ll} \tfrac{1}{2}\Vert {{\varvec{y}}}_i\Vert _{2}^{2}, &{} \Vert {{\varvec{y}}}_i\Vert _{2}\le \alpha ,\\ \alpha \left( \Vert {{\varvec{y}}}_i\Vert _{2}-\tfrac{1}{2}\alpha \right) , &{} \Vert {{\varvec{y}}}_i\Vert _{2}>\alpha , \end{array}\right. }\nonumber \\&\le \tfrac{1}{n}\sum _{i=1}^{n}\max \left\{ \tfrac{1}{2}\Vert {{\varvec{y}}}_i\Vert _{2}^{2},~\alpha \left( \Vert {{\varvec{y}}}_i\Vert _{2}-\tfrac{1}{2}\alpha \right) \right\} \nonumber \\&\le \tfrac{1}{2n}\Vert Y\Vert _\mathrm{F}^{2}. \end{aligned}$$
(22)

It is easy to obtain

$$\begin{aligned} {\left\{ \begin{array}{ll} H_{\alpha }^n(\hat{B})+ \lambda _1\sum \limits _{j=1}^{m}\Vert \hat{{{\varvec{b}}}}_j\Vert _2 + \tfrac{\lambda _2}{2} \Vert \hat{B}\Vert _\mathrm{F}^2 \ge \lambda _1\sum \limits _{j=1}^{m}\Vert \hat{{{\varvec{b}}}}_j\Vert _2 \ge \lambda _1\Vert \hat{B}\Vert _\mathrm{F}, \\ H_{\alpha }^n(\hat{B})+ \lambda _1\sum \limits _{j=1}^{m}\Vert \hat{{{\varvec{b}}}}_j\Vert _2 + \tfrac{\lambda _2}{2}\Vert \hat{B}\Vert _\mathrm{F}^2\ge \tfrac{\lambda _2}{2} \Vert \hat{B}\Vert _\mathrm{F}^2. \end{array}\right. } \end{aligned}$$
(23)

Combining (22) and (23), we can obtain the following upper bound of \(\Vert \hat{B}\Vert _\mathrm{F}\)

$$\Vert \hat{B}\Vert _\mathrm{F}<\min \left\{ \Vert Y\Vert _\mathrm{F}^{2}/(2n\lambda _1),~\Vert Y\Vert _\mathrm{F}\sqrt{1/(n\lambda _2)}\right\} .$$

Inserting this inequality into (21), it follows that

$$\begin{aligned} F(B^k)- F(\hat{B})\le \tfrac{2L_H\Vert \hat{B}-B^0\Vert _\mathrm{F} ^2}{(k+1)^2}<\tfrac{2L_H\left( C+\Vert B^0\Vert _\mathrm{F}\right) ^2}{(k+1)^2}, \end{aligned}$$

where \(C=\min \left\{ \Vert Y\Vert _\mathrm{F}^{2}/(2n\lambda _1),~\Vert Y\Vert _\mathrm{F}\sqrt{1/(n\lambda _2)}\right\}\). For \(B^k\) to be an \(\epsilon\)-optimal solution, i.e., \(F(B^k)- F(\hat{B})\le \epsilon\), it suffices to terminate the algorithm when

$$\begin{aligned} \tfrac{2L_H\left( C+\Vert B^0\Vert _\mathrm{F}\right) ^2}{(k+1)^2}<\epsilon . \end{aligned}$$

It follows that

$$\begin{aligned} k\ge \sqrt{2L_H/\epsilon }\left( C+\Vert B^0\Vert _\mathrm{F}\right) -1. \end{aligned}$$

\({\square }\)
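
To make the stopping rule concrete, the following sketch computes the iteration budget implied by the last inequality, with \(C=\min \left\{ \Vert Y\Vert _\mathrm{F}^{2}/(2n\lambda _1),~\Vert Y\Vert _\mathrm{F}\sqrt{1/(n\lambda _2)}\right\}\). It is only a sketch under the assumption that the constant \(L_H\) is supplied by the user, and the function name is illustrative.

```python
import numpy as np

def iteration_budget(L_H, eps, Y, B0, n, lam1, lam2):
    """Smallest integer k with 2 * L_H * (C + ||B0||_F)^2 / (k + 1)^2 <= eps,
    which guarantees F(B^k) - F(B_hat) <= eps by the inequality above."""
    normY = np.linalg.norm(Y, "fro")
    C = min(normY**2 / (2 * n * lam1), normY / np.sqrt(n * lam2))
    k = np.sqrt(2 * L_H / eps) * (C + np.linalg.norm(B0, "fro")) - 1
    return max(0, int(np.ceil(k)))
```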

About this article

Cite this article

Chen, B., Zhai, W. & Kong, L. Variable selection and collinearity processing for multivariate data via row-elastic-net regularization. AStA Adv Stat Anal 106, 79–96 (2022). https://doi.org/10.1007/s10182-021-00403-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10182-021-00403-x

Keywords

Mathematics Subject Classification

Navigation