
Variable selection for linear regression in large databases: exact methods


Abstract

This paper analyzes the variable selection problem in the context of linear regression for large databases. The problem consists of selecting a small subset of independent variables that can perform the prediction task optimally. This problem has a wide range of applications. One important type of application is the design of composite indicators in areas such as sociology and economics. Other important applications of variable selection in linear regression are found in fields such as chemometrics, genetics, and climate prediction, among many others. For this problem, we propose a Branch & Bound method. This is an exact method and therefore guarantees optimal solutions. We also provide strategies that enable this method to be applied to very large databases (with hundreds of thousands of cases) in moderate computation time. A series of computational experiments shows that our method performs well compared to well-known methods from the literature and to commercial software.





Acknowledgments

This work was partially supported by FEDER funds and the Spanish Ministry of Economy and Competitiveness (Projects ECO2016-76567-C4-2-R and PID2019-104263RB-C44), the Regional Government of "Castilla y León", Spain (Projects BU329U14 and BU071G19), and the Regional Government of "Castilla y León" and FEDER funds (Projects BU062U16 and COV2000375).

Author information


Corresponding author

Correspondence to Joaquín Pacheco.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Corollary used by our Branch & Bound methods

1.1 Corollary

For all p, p′ ∈ {1, …, n}: if p < p′, then g(p) ≤ g(p′).

Proof:

It suffices to consider p′ = p + 1. Relabel the variables so that the optimal subset of size p is \( {S}_p^{\ast }=\{1,\dots ,p\} \), that is, \( g(p)=f\left({S}_p^{\ast}\right) \).

Let us also define S′ = {1, …, p, p + 1}. Clearly \( {S}_p^{\ast}\subset {S}^{\prime } \), and adding a variable to a regression model cannot decrease the explained variance, so \( f\left({S}_p^{\ast}\right)\le f\left(S'\right) \). Moreover, S′ is one of the subsets of size p + 1 over which g(p + 1) maximizes. Therefore

$$ g(p)=f\left({S}_p^{\ast}\right)\le f\left(S'\right)\le \max \left\{f(S):S\subset V,\left|S\right|=p+1\right\}=g\left(p+1\right). $$

Appendix 2. Calculation of the objective function

1.1 Pre-process for calculating the objective function

To facilitate the calculation of the objective function f(S), a pre-calculation is performed before the algorithms are executed. The matrix of independent variables X and the vector of the dependent variable Y are considered, as defined in Section 2. In other words,

$$ X={\left({x}_{ij}\right)}_{i=1,\dots ,m;\ j=1,\dots ,n}\quad \text{and}\quad Y={\left({y}_i\right)}_{i=1,\dots ,m}. $$

The pre-process consists of the following steps:

- The matrix \( {X}^{\ast }={\left({x}_{ij}^{\ast}\right)}_{i=1,\dots ,m;\ j=1,\dots ,n} \) is calculated, where \( {x}_{ij}^{\ast }=\frac{x_{ij}-{\overline{x}}_j}{\sqrt{m}\cdot {s}_j} \), i = 1, …, m; j = 1, …, n, and \( {\overline{x}}_j \) and \( {s}_j^2 \) are, respectively, the sample mean and the sample variance of variable j, j = 1, …, n.

- The vector \( {Y}^{\ast }={\left({y}_i^{\ast}\right)}_{i=1,\dots ,m} \) is calculated, where \( {y}_i^{\ast }=\frac{y_i-\overline{y}}{\sqrt{m}\cdot {s}_y} \), i = 1, …, m, and \( \overline{y} \) and \( {s}_y^2 \) are, respectively, the sample mean and the sample variance of the variable Y.

- The matrix \( R={X}^{\ast \prime}\cdot {X}^{\ast } \) and the vector \( H={X}^{\ast \prime}\cdot {Y}^{\ast } \) are calculated. Note that R is the correlation matrix of the independent variables. We denote the elements of R as \( {r}_{jj'} \), j, j′ = 1, …, n, and the elements of H as \( {h}_j \), j = 1, …, n.

The matrix R and the vector H will be used in calculating f(S) for the various sets S in the algorithms proposed in this paper. This pre-process requires \( \Theta \left({n}^2\cdot m\right) \) operations. However, it is executed only once.
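For illustration, this pre-process can be written as the following numpy sketch. The function name is our own, and we assume the population standard deviation (ddof = 0), so that \( {X}^{\ast \prime}\cdot {X}^{\ast } \) is exactly the correlation matrix; the paper's actual implementation is not shown in this excerpt.

```python
import numpy as np

def preprocess(X, Y):
    """Pre-process sketch (hypothetical helper, not the authors' code):
    standardize X and Y as described above and build R = X*' X* and H = X*' Y*."""
    m = X.shape[0]
    # Division by sqrt(m) * s_j so that X*' X* becomes the correlation matrix.
    X_star = (X - X.mean(axis=0)) / (np.sqrt(m) * X.std(axis=0))
    Y_star = (Y - Y.mean()) / (np.sqrt(m) * Y.std())
    R = X_star.T @ X_star   # n x n correlation matrix of the independent variables
    H = X_star.T @ Y_star   # length-n vector of correlations with Y
    return R, H
```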

1.2 Calculation of the objective function f(S)

Let S ⊂ V be a set of size p, denoted S = {s(1), s(2), …, s(p)}. The calculation of f(S) consists of the following steps:

- The matrix \( {R}^S=\left({r}_{jj'}^S\right) \) is constructed, where \( {r}_{jj'}^S={r}_{s(j)s\left({j}^{\prime}\right)} \), j, j′ = 1, …, p.

- The vector \( {H}^S=\left({h}_j^S\right) \) is constructed, where \( {h}_j^S={h}_{s(j)} \), j = 1, …, p.

- The inverse of the matrix RS is computed: \( {\left({R}^S\right)}^{-1} \).

- The vector of coefficients \( B={\left({R}^S\right)}^{-1}\cdot {H}^S \) is calculated. We denote the elements of B as \( {\beta}_j \), j = 1, …, p.

- The value \( f(S)={\sum}_{j=1}^p{\sum}_{j^{\prime }=1}^p{\beta}_j\cdot {\beta}_{j^{\prime }}\cdot {r}_{jj'}^S \) is calculated.

This calculation requires \( \Theta \left({p}^2\right) \) operations and is therefore independent of the number of cases m and of the initial number of variables n.
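A short numpy sketch of this evaluation follows. The helper name f_value and the use of 0-based indices are our choices, not taken from the paper.

```python
import numpy as np

def f_value(S, R, H):
    """Sketch of the evaluation of f(S) from the pre-computed R and H.
    S is a list of 0-based variable indices (hypothetical helper)."""
    R_S = R[np.ix_(S, S)]              # p x p submatrix R^S
    H_S = H[S]                         # corresponding entries H^S
    beta = np.linalg.solve(R_S, H_S)   # B = (R^S)^{-1} H^S, without forming the inverse
    return float(beta @ R_S @ beta)    # f(S) = sum_j sum_j' beta_j beta_j' r^S_{jj'}
```

Solving the linear system instead of explicitly inverting RS is a standard numerical choice; the resulting vector of coefficients B is the same as described above.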

Appendix 3. Analysis of the complexity of our Branch & Bound methods

As described in Section 3, the Branch & Bound methods are based on a recursive exploration of the set of solutions. This set is represented by a search tree, and each node of the tree corresponds to a subset of solutions. To explore the set of solutions corresponding to a node (Pseudocode 1), a variable a is selected and the corresponding set is divided into two subsets: one with the variable a fixed and another with the variable a forbidden. The process is then repeated with each subset. Since at most p0 variables are selected (p0 fixed variables), there are at most p0 divisions until a solution of maximum size p0 is reached. Therefore, \( \Theta \left({2}^{p_0}\right) \) nodes are explored. On the other hand, the number of candidate variables examined for selection is bounded by n, so determining the variable a requires \( \Theta (n) \) evaluations of the function f (line 5 of Pseudocode 1). Therefore, the Branch & Bound methods evaluate the objective function on \( \Theta \left(n\cdot {2}^{p_0}\right) \) solutions S, with |S| ≤ p0. As the calculation of each f(S) requires \( \Theta \left({\left|S\right|}^2\right) \) operations, the complexity of our methods is \( \Theta \left({p_0}^2\cdot n\cdot {2}^{p_0}\right) \).
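The following is a minimal Python sketch of this fix/forbid branching, reusing the f_value helper sketched in Appendix 2. The bounding rules of Pseudocode 1 are not reproduced in this excerpt, so they are omitted here; the sketch therefore enumerates the whole tree and only illustrates the branching structure, not the authors' full method.

```python
import numpy as np

def branch_and_bound(R, H, p0):
    """Sketch of the fix/forbid branching (no bounding step)."""
    n = len(H)
    best_value, best_set = -np.inf, None

    def explore(fixed, forbidden):
        nonlocal best_value, best_set
        if len(fixed) == p0:                       # a complete solution of size p0
            value = f_value(fixed, R, H)
            if value > best_value:
                best_value, best_set = value, list(fixed)
            return
        candidates = [i for i in range(n) if i not in fixed and i not in forbidden]
        if len(fixed) + len(candidates) < p0:      # not enough variables left
            return
        # Choose the branching variable a by evaluating f on every candidate,
        # mirroring the Theta(n) evaluations mentioned above.
        a = max(candidates, key=lambda i: f_value(fixed + [i], R, H))
        explore(fixed + [a], forbidden)            # branch with a fixed
        explore(fixed, forbidden | {a})            # branch with a forbidden

    explore([], set())
    return best_set, best_value
```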

Appendix 4. Basic ideas of traditional methods and our Branch & Bound methods

- Our Branch & Bound methods find the globally optimal solution to the problem defined by (1)–(3) in Section 2.

- The Forward Method addresses the same problem defined by (1)–(3), but its solution is only approximate (not necessarily the global optimum). The method works as follows (a small Python sketch is given after the pseudocode):

1. Set S = ∅

Repeat

2. Determine i∗ = argmax { f(S ∪ {i}) : i ∈ V − S }

3. Set S = S ∪ {i∗}

Until |S| = p
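The sketch below transcribes these steps in Python, reusing the f_value helper from the Appendix 2 sketch; it is an assumed implementation for illustration, not the authors' code.

```python
def forward_selection(R, H, p):
    """Sketch of the Forward Method above."""
    n = len(H)
    S = []                                   # step 1: S = empty set
    while len(S) < p:                        # repeat ... until |S| = p
        # step 2: the variable whose addition maximizes f(S u {i})
        i_star = max((i for i in range(n) if i not in S),
                     key=lambda i: f_value(S + [i], R, H))
        S.append(i_star)                     # step 3: S = S u {i_star}
    return S
```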

Note that the problem defined by (1)–(3) can also be written compactly as

$$ {\min}_{\beta }\ {\left\Vert Y-{X}^{\ast}\beta \right\Vert}_2 $$

subject to:

$$ {\left\Vert \beta \right\Vert}_0=p $$

where

\( {X}^{\ast }=\left(\mathbf{1}\mid X\right) \) and \( {\mathbf{1}}^t=\left(1,1,\dots ,1\right)\in {\mathbb{R}}^m \); that is, X∗ is the matrix X extended with the vector 1;

\( {\beta}^t=\left({\beta}_0,{\beta}_1,\cdots ,{\beta}_n\right) \) is the vector formed by the coefficients of the variables and the intercept;

\( {\left\Vert \cdot \right\Vert}_r \) denotes the r-norm; in particular, \( {\left\Vert \beta \right\Vert}_0 \) is the number of nonzero components of β.

- The LASSO Method solves the following model:

$$ {\min}_{\beta}\kern0.5em {\left\Vert Y-{X}^{\ast}\beta \right\Vert}_2+\lambda \cdot {\left\Vert \beta \right\Vert}_1 $$

where λ is a previously established positive parameter. The larger the value of λ, the fewer variables have a coefficient βi ≠ 0, that is, the fewer variables are selected. The model is usually solved with methods based on the Coordinate Descent algorithm (an illustrative sketch is given below).
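As an illustration only (not the software used in the paper's experiments), scikit-learn's Lasso estimator, which relies internally on coordinate descent, shows this behaviour. The data below are synthetic and the alpha values arbitrary; alpha plays the role of λ up to the library's own scaling conventions.

```python
import numpy as np
from sklearn.linear_model import Lasso  # coordinate-descent solver

# Toy data: only variables 0 and 3 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# Larger alpha -> fewer nonzero coefficients -> fewer selected variables.
for alpha in (0.01, 0.1, 0.5):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.flatnonzero(coef))
```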

- The LARS method solves the same model as LASSO, but uses strategies analogous to the Forward method, that is, it selects at each step one variable to enter the solution.

- The GARROTE method calculates its coefficients \( {\beta}_i^g \) as follows (a small numeric sketch follows):

$$ {\beta}_i^g={\left[1-\frac{\lambda }{{\left({\beta}_i^r\right)}^2}\right]}_{+}\cdot {\beta}_i^r $$

where the values \( {\beta}_i^r \) are the coefficients of the linear regression model with all the variables, [z]+ is the positive part of a real number z, and λ is a previously established positive parameter. As in the previous models, the higher the value of λ, the fewer variables have a coefficient βi ≠ 0, that is, the fewer variables are selected.
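A direct transcription of this shrinkage formula as a small numpy sketch (the function and argument names are illustrative, not from the paper):

```python
import numpy as np

def garrote_coefficients(beta_r, lam):
    """Sketch of the GARROTE formula above: beta_r holds the coefficients of
    the full linear regression model and lam is the positive parameter lambda."""
    shrink = np.maximum(0.0, 1.0 - lam / beta_r**2)  # [z]_+ = positive part of z
    return shrink * beta_r                           # beta_i^g
```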


Cite this article

Pacheco, J., Casado, S. Variable selection for linear regression in large databases: exact methods. Appl Intell 51, 3736–3756 (2021). https://doi.org/10.1007/s10489-020-01927-6

