Regularization and Model Selection with Categorical Covariates

  • Conference paper
Algorithms from and for Nature and Life

Abstract

The main challenge in regression problems with categorical covariates is the large number of parameters involved. Common regularization methods such as the Lasso, which allow for selection of predictors, are typically designed for metric predictors. If the independent variables are categorical, selection strategies should be based on modified penalties. For categorical predictors with many categories, a useful strategy is to search for clusters of categories with similar effects. We focus on generalized linear models and present L1-penalty approaches for factor selection and clustering of categories. The proposed methods are investigated in simulation studies and applied to a real-world classification problem.
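The abstract does not spell out the penalty itself. As a minimal sketch of what an L1 penalty for clustering categories can look like, the Python snippet below penalizes all pairwise differences between the dummy coefficients of one nominal factor, in the spirit of the approaches cited in the references (Bondell and Reich 2009; Gertheiss and Tutz 2010). The function names, the unit default weights, and the reference-category coding are assumptions made for illustration; this is not the authors' implementation.

```python
from itertools import combinations


def fusion_penalty(coefs, weights=None):
    """L1 penalty on all pairwise differences of one factor's coefficients.

    `coefs` holds the coefficients of the non-reference categories; the
    reference category is fixed at 0 (an assumed coding for this sketch).
    """
    full = [0.0] + list(coefs)  # prepend the reference category
    pairs = list(combinations(range(len(full)), 2))
    if weights is None:
        # Unit weights are an assumption; weighted variants are possible.
        weights = {pair: 1.0 for pair in pairs}
    return sum(weights[(i, j)] * abs(full[i] - full[j]) for i, j in pairs)


def clusters(coefs, tol=1e-8):
    """Group categories whose coefficients coincide up to `tol`."""
    full = [0.0] + list(coefs)
    groups = []
    for idx, value in enumerate(full):
        for group in groups:
            if abs(full[group[0]] - value) <= tol:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


# Hypothetical coefficients for a factor with four categories (index 0 = reference).
beta = [0.5, 0.5, 0.0]
penalty = fusion_penalty(beta)
groups = clusters(beta)
print(penalty)
print(groups)
```

A pair of categories with identical coefficients contributes nothing to the penalty, so a sufficiently strong penalty fuses categories into clusters. If all coefficients of a factor are fused with the reference category at zero, the factor drops out of the model, which corresponds to factor selection.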

Notes

  1. The original dataset (Wolberg and Mangasarian 1990) was of size 369 (reported January 1989). Two instances were removed later, and additional groups totaling 332 samples were collected between October 1989 and November 1991.

References

  • Bondell, H. D., & Reich, B. J. (2009). Simultaneous factor selection and collapsing levels in ANOVA. Biometrics, 65, 169–177.

  • Fahrmeir, L., & Tutz, G. (2001). Multivariate statistical modelling based on generalized linear models (2nd ed.). New York: Springer.

  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

  • Gertheiss, J. (2011). ordPens: Selection and/or Smoothing of Ordinal Predictors. R package version 0.1-7.

  • Gertheiss, J., & Tutz, G. (2009). Penalized regression with ordinal predictors. International Statistical Review, 77, 345–365.

  • Gertheiss, J., & Tutz, G. (2010). Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics, 4, 2150–2180.

  • Leisch, F., & Dimitriadou, E. (2010). mlbench: Machine Learning Benchmark Problems. R package version 2.0-0.

  • McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). New York: Chapman & Hall.

  • Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI Repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA, URL http://www.ics.uci.edu/~mlearn/MLRepository.html

  • Park, M. Y., & Hastie, T. (2007). L1 regularization-path algorithm for generalized linear models. Journal of the Royal Statistical Society B, 69, 659–677.

  • Stelz, V. (2010). L1-Regularisierung bei kategorialen Prädiktoren in generalisierten linearen Modellen. Master's thesis, Ludwig-Maximilians-University Munich.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.

  • Ulbricht, J. (2010). lqa: Penalized Likelihood Inference for GLMs. R package version 1.0-3.

  • Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193–9196.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

Acknowledgements

This work was supported in part by DFG project GE2353/1-1.

Author information

Corresponding author

Correspondence to Jan Gertheiss.

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Gertheiss, J., Stelz, V., Tutz, G. (2013). Regularization and Model Selection with Categorical Covariates. In: Lausen, B., Van den Poel, D., Ultsch, A. (eds) Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_21
