LASSO regularization within the LocalGLMnet architecture

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

Deep learning models have been very successful in the application of machine learning methods, often outperforming classical statistical models such as linear regression models or generalized linear models. On the other hand, deep learning models are often criticized for not being explainable and for not allowing for variable selection. There are two ways of dealing with this problem: either we use post-hoc model interpretability methods, or we design specific deep learning architectures that allow for easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet architecture, which gives an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture so that we obtain feature sparsity for variable selection. We benchmark our approach against the recently developed LassoNet of Lemhadri et al. (LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29, 2021).
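In a schematic form (our notation, simplified relative to the paper), the LocalGLMnet of Richman and Wüthrich (2022) uses a GLM-type skip connection with feature-dependent regression attentions,

\[ g\bigl(\mu(\boldsymbol{x})\bigr) \;=\; \beta_0 + \sum_{j=1}^{q} \beta_j(\boldsymbol{x})\, x_j, \]

where \(\boldsymbol{\beta}(\cdot)\) is a feed-forward neural network. Feature sparsity is then obtained by adding a group LASSO penalty to the objective,

\[ \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, \mu(\boldsymbol{x}_i)\bigr) \;+\; \lambda \sum_{j=1}^{q} \bigl\lVert \boldsymbol{w}_j \bigr\rVert_2, \]

where \(\boldsymbol{w}_j\) denotes the group of network weights that generates the attention \(\beta_j(\cdot)\); a group norm shrunk to zero removes feature \(x_j\) from the model. The precise grouping, the loss \(L\) and the tuning of the regularization parameter \(\lambda \) are as discussed in the paper.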

Notes

  1. We call our proposal LASSO regularization of the LocalGLMnet. While the LASSO was originally proposed for the linear regression model, it has since been extended to GLMs; see Sect. 3.4 in Hastie et al. (2015).

  2. The dataset is available at http://lib.stat.cmu.edu/datasets/boston and the code for this example is available on GitHub at https://github.com/RonRichman/Regularized-LocalGLMnet.

  3. The dataset is available at this link: http://www2.math.uconn.edu/~valdez/telematics_syn-032021.csv

  4. Note that, due to privacy concerns, these 100,000 records were generated synthetically based on real data; see So et al. (2021) for a detailed description.

  5. The grouped version of the model was applied in accordance with the instructions at https://github.com/lasso-net/lassonet/issues/7.

References

  • Agarwal R, Frosst N, Zhang X, Caruana R, Hinton GE (2020) Neural additive models: interpretable machine learning with neural nets. arXiv:2004.13912v1

  • Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box supervised learning models. J R Stat Soc Ser B 82(4):1059–1086

  • Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232

  • Gneiting T (2011) Making and evaluating point forecasts. J Am Stat Assoc 106(494):746–762

  • Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 102(477):359–378

  • Harrison D, Rubinfeld DL (1978) Hedonic prices and the demand for clean air. J Environ Econ Manag 5:81–102

  • Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the Lasso and generalizations. CRC Press

  • Hoerl A, Kennard R (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67

  • Lee JD, Sun DL, Sun Y, Taylor JE (2016) Exact post-selection inference, with application to the LASSO. Ann Stat 44(3):907–927

  • Lemhadri I, Ruan F, Abraham L, Tibshirani R (2021) LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29

  • Lindholm M, Richman R, Tsanakas A, Wüthrich MV (2022) Discrimination-free insurance pricing. ASTIN Bull J IAA 52:55–89

  • Merity S, McCann B, Socher R (2017) Revisiting activation regularization for language RNNs. arXiv:1708.01009v1

  • Merz M, Richman R, Tsanakas A, Wüthrich MV (2022) Interpreting deep learning models with marginal attribution by conditioning on quantiles. Data Min Knowl Discov 36:1335–1370

  • Oelker M-R, Tutz G (2017) A uniform framework for the combination of penalties in generalized structured models. Adv Data Anal Classif 11:97–120

  • Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231

  • Richman R (2021) Mind the gap—safely incorporating deep learning models into the actuarial toolkit. SSRN Manuscript ID 3857693

  • Richman R, Wüthrich MV (2022) LocalGLMnet: interpretable deep learning for tabular data. Scand Actuar J, in press

  • So B, Boucher JP, Valdez EA (2021) Synthetic dataset generation of driver telematics. Risks 9(4):58

  • Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B Stat Methodol 58:267–288

  • Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused LASSO. J R Stat Soc Ser B Stat Methodol 67:91–108

  • Tikhonov AN (1943) On the stability of inverse problems. Dokl Akad Nauk SSSR 39(5):195–198

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. arXiv:1706.03762v5

  • Vaughan J, Sudjianto A, Brahimi E, Chen J, Nair VN (2018) Explainable neural networks based on additive index models. arXiv:1806.01933v1

  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol 68:49–67

  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67:301–320

Acknowledgements

The authors wish to thank the editor, assistant editor and reviewers of an earlier version of this manuscript for their comments, which helped to improve the manuscript significantly.

Author information

Corresponding author

Correspondence to Ronald Richman.

Ethics declarations

Conflict of interest

Both authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Appendix: R code

R code listings (figures a–c in the published article).
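Since figures a–c are not reproduced here, the following is a minimal Python/Keras sketch, not the authors' R implementation, of a LocalGLMnet whose attention output layer carries a group LASSO penalty. The hidden-layer sizes, the identity link, the penalty strength lam and the exact grouping of the weights are assumptions made only for this sketch.

import tensorflow as tf
from tensorflow import keras

def group_lasso(lam):
    # Group LASSO penalty on a kernel matrix: one group per column, i.e. per
    # attention component beta_j(x); lam is the regularization strength.
    def penalty(kernel):
        return lam * tf.reduce_sum(tf.norm(kernel, axis=0))
    return penalty

def local_glm_net(q, lam=1e-3, hidden=(20, 15, 10)):
    # q = number of (standardized) input features.
    inputs = keras.Input(shape=(q,), name="features")
    z = inputs
    for k, units in enumerate(hidden):
        z = keras.layers.Dense(units, activation="tanh", name=f"hidden_{k + 1}")(z)
    # Attention (regression) weights beta(x); the group LASSO acts on this kernel
    # so that an entire column, and hence beta_j(.), can be shrunk to zero.
    # use_bias=False is a simplification: with a bias, the bias would also have
    # to vanish for exact feature sparsity.
    beta = keras.layers.Dense(q, activation="linear", use_bias=False,
                              kernel_regularizer=group_lasso(lam),
                              name="attention")(z)
    # Skip connection <beta(x), x> plus an intercept; identity link for a
    # Gaussian/MSE example, to be replaced by a sigmoid or exponential output
    # for binary or Poisson responses.
    dot = keras.layers.Dot(axes=1, name="skip_connection")([beta, inputs])
    out = keras.layers.Dense(1, activation="linear", kernel_initializer="ones",
                             name="response")(dot)
    return keras.Model(inputs=inputs, outputs=out)

model = local_glm_net(q=13)   # e.g. 13 standardized Boston housing feature components
model.compile(optimizer="adam", loss="mse")

Note that, in this sketch, plain gradient descent on the penalized loss only shrinks the group norms towards zero without setting them exactly to zero; exact sparsity requires a proximal update or thresholding of small fitted group norms (cf. Parikh and Boyd 2013).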

B LassoNet-training details

The LassoNet models were based on the Python code provided for the group LASSO version of the LassoNet at https://github.com/lasso-net/lassonet/tree/group-lasso (see Footnote 5). The dimensions of the hidden layers of the LassoNet were set to the same dimensions as those of the corresponding LocalGLMnet so that the model capacities are roughly comparable, i.e., any differences in performance will be mainly attributable to the way in which regularization is applied within each of the models. The main hyperparameter tested for the LassoNet was the budget parameter M; for each value of M, a range of LassoNet models is fit automatically for different values of the regularization parameter \(\eta \). Values of \(M \in \{1, 10, 100\}\) were tested for each example.
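As an illustration of this search, the following Python sketch loops over the budget values using the public lassonet package; the exact interface of the group-lasso branch used for the paper may differ, and the toy data, the hidden dimension and the train/validation split are assumptions made only for the sketch (in the paper, the Boston example is selected on the learning set and the telematics example on the validation set).

import numpy as np
from lassonet import LassoNetRegressor  # public lassonet package

# Toy stand-in data; in the paper, the standardized Boston housing features are used.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13)).astype("float32")
y = (X[:, 0] - 0.5 * X[:, 5] + rng.normal(scale=0.1, size=506)).astype("float32")
X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]

best = None
for M in (1, 10, 100):                                  # budget values tested
    model = LassoNetRegressor(hidden_dims=(20,), M=M)   # hidden size is an assumption
    path = model.path(X_train, y_train)                 # fits a sequence of models over increasing regularization
    for checkpoint in path:
        model.load(checkpoint.state_dict)
        pred = np.asarray(model.predict(X_valid)).ravel()
        mse = float(np.mean((pred - y_valid) ** 2))
        if best is None or mse < best[0]:
            best = (mse, M, checkpoint.lambda_, int(checkpoint.selected.sum()))

print("best MSE %.4f with M=%d, lambda=%.3f, %d features selected" % best)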

For the Boston housing dataset, the best LassoNet model as indicated by the MSE on the learning set was selected (since there are no validation or test sets used in that example). For the telematics data, the LassoNet producing the lowest values of the binary cross-entropy loss on the validation set was selected.

Only a single run of the LassoNet model was used for these results; nonetheless, it was observed that the results varied quite significantly between training runs (see the last line in Table 8, which shows that the LassoNet has the highest standard deviation over training runs among the models considered), indicating that better results could perhaps be achieved by performing multiple runs and averaging over these.

Table 9 Boston housing data - feature components used
Table 10 Telematics data - feature components used
Fig. 12 Telematics example: importance measures produced by the group LASSO regularized LocalGLMnet; validation set. Territory variable only

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Richman, R., Wüthrich, M.V. LASSO regularization within the LocalGLMnet architecture. Adv Data Anal Classif 17, 951–981 (2023). https://doi.org/10.1007/s11634-022-00529-z
