Abstract
This paper analyzes the variable selection problem in the context of linear regression for large databases. The problem consists of selecting a small subset of independent variables that can perform the prediction task optimally. It has a wide range of applications: one important type is the design of composite indicators in areas such as sociology and economics; others can be found in fields such as chemometrics, genetics, and climate prediction, among many others. For this problem we propose a Branch & Bound method. This is an exact method and therefore guarantees optimal solutions. We also provide strategies that enable the method to be applied to very large databases (with hundreds of thousands of cases) in moderate computation time. A series of computational experiments shows that our method compares well with well-known methods from the literature and with commercial software.
Acknowledgments
This work was partially supported by FEDER funds and the Spanish Ministry of Economy and Competitiveness (Projects ECO2016-76567-C4-2-R and PID2019-104263RB-C44), the Regional Government of "Castilla y León", Spain (Projects BU329U14 and BU071G19), and the Regional Government of "Castilla y León" and FEDER funds (Projects BU062U16 and COV2000375).
Appendices
Appendix 1. Corollary used by our Branch & Bound methods
1.1 Corollary
∀ p, p′ ∈ {1, …, n}: if p < p′ then g(p) ≤ g(p′).
Proof:
Without loss of generality we can take p′ = p + 1 and, relabeling the variables, assume that S_p^∗ = {1, …, p} is an optimal subset of size p, so that f(S_p^∗) = g(p).
Let us also define S′ = {1, …, p, p + 1}. Obviously S_p^∗ ⊂ S′, and adding a variable to a set cannot decrease the explained variability, so f(S′) ≥ f(S_p^∗). Therefore g(p + 1) ≥ f(S′) ≥ f(S_p^∗) = g(p). ∎
Appendix 2. Calculation of the objective function
1.1 Pre-process for calculating the objective function
To facilitate the calculation of the objective function f(S), a pre-calculation is performed before the algorithms begin to execute. The matrix of independent variables X and the vector of the dependent variable Y are considered, as defined in Section 2. In other words,
X = (x_ij), i = 1, …, m; j = 1, …, n, and Y = (y_i), i = 1, …, m.
The pre-process consists of the following steps:
- The matrix X∗ = (x∗_ij), i = 1, …, m; j = 1, …, n, is calculated, where x∗_ij = (x_ij − x̄_j) / (√m · s_j), and x̄_j and s_j² are respectively the mean and the sample variance of variable j, j = 1, …, n.
- The vector Y∗ = (y∗_i), i = 1, …, m, is calculated, where y∗_i = (y_i − ȳ) / (√m · s_y), and ȳ and s_y² are respectively the mean and the sample variance of the variable Y.
- The matrix R = X∗′ · X∗ and the vector H = X∗′ · Y∗ are calculated. Note that R is the correlation matrix of the independent variables. We denote the elements of R as r_jj′, j, j′ = 1, …, n, and the elements of H as h_j, j = 1, …, n.
The matrix R and the vector H will be used in calculating f(S) for the various sets S in the algorithms proposed in this paper. This pre-process requires Θ(n2 · m) operations. However, it is executed only once.
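As an illustration, the pre-process above can be sketched in a few lines of Python. The function name `preprocess` is ours, and we use the population form of the variance in the standardization (so that, with the 1/√m factor, R comes out exactly as the correlation matrix, as stated above):

```python
import numpy as np

def preprocess(X, Y):
    """Standardize X and Y with the 1/sqrt(m) scaling described above and
    precompute R = X*' X* (correlation matrix) and H = X*' Y*."""
    m = X.shape[0]
    # x*_ij = (x_ij - mean_j) / (sqrt(m) * s_j); columns of Xs have unit norm
    Xs = (X - X.mean(axis=0)) / (np.sqrt(m) * X.std(axis=0))
    Ys = (Y - Y.mean()) / (np.sqrt(m) * Y.std())
    R = Xs.T @ Xs   # correlation matrix of the independent variables
    H = Xs.T @ Ys   # correlation of each variable with Y
    return R, H
```

This step costs Θ(n² · m), consistent with the count given above, but it is executed only once.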
1.2 Calculation of the objective function f(S)
Let S ⊂ V be a set of size p, and denote S = {s(1), s(2), …, s(p)}. The calculation of f(S) consists of the following steps:
- Construct the matrix R^S = (r^S_jj′), where r^S_jj′ = r_s(j)s(j′), j, j′ = 1, …, p;
- Construct the vector H^S = (h^S_j), where h^S_j = h_s(j), j = 1, …, p;
- Find the inverse of the matrix R^S: (R^S)⁻¹;
- Calculate the vector of coefficients B = (R^S)⁻¹ · H^S. We denote the elements of B, j = 1, …, p, as β_j;
- Calculate the value of f(S) = ∑_{j=1}^p ∑_{j′=1}^p β_j · β_j′ · r^S_jj′.
This calculation requires Θ(p2) operations and is therefore independent of the number of cases m and of the initial number of variables n.
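The steps above can be sketched as follows (the function name and signature are ours; we use a linear solve rather than forming the inverse explicitly, which is numerically preferable and equivalent for this computation):

```python
import numpy as np

def f(S, R, H):
    """Objective value f(S) computed from the precomputed R and H.
    S is a sequence of variable indices (0-based)."""
    idx = list(S)
    RS = R[np.ix_(idx, idx)]        # submatrix R^S of correlations for S
    HS = H[idx]                     # corresponding entries H^S
    beta = np.linalg.solve(RS, HS)  # B = (R^S)^{-1} H^S
    # f(S) = sum_j sum_j' beta_j * beta_j' * r^S_jj'
    return float(beta @ RS @ beta)
```

With the standardized data of the pre-process, f(S) is the explained variability of the regression restricted to S, so it lies in [0, 1] and cannot decrease when a variable is added.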
Appendix 3. Analysis of the complexity of our Branch & Bound methods
As described in Section 3, the Branch & Bound methods are based on a recursive exploration of the set of solutions. This set is represented by a search tree, each node of which corresponds to a subset of solutions. When exploring the set of solutions corresponding to a node (Pseudocode 1), a variable a is selected and the corresponding set is divided into two subsets: one with the variable a fixed and another with the variable a forbidden. The process is then repeated with each subset. Since at most p₀ variables are selected (p₀ fixed variables), there are p₀ divisions until a solution of maximum size p₀ is reached; therefore Θ(2^{p₀}) nodes are explored. On the other hand, the number of candidate variables examined is bounded by n, so determining the variable a requires Θ(n) evaluations of the function f (line 5 of Pseudocode 1). The Branch & Bound methods therefore evaluate the objective function on Θ(n · 2^{p₀}) solutions S with |S| ≤ p₀. As each evaluation of f(S) requires Θ(|S|²) operations, the complexity of our methods is Θ(p₀² · n · 2^{p₀}).
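The branching scheme analyzed here can be illustrated with a minimal, exhaustive sketch. This omits the bounding and pruning rules of the actual methods in the paper (it only shows the fix/forbid search structure), and the function name and signature are ours:

```python
def branch_and_bound(f, n, p0):
    """Sketch of the fix/forbid branching: each node selects a variable a
    and splits into a branch with a fixed and a branch with a forbidden.
    `f` evaluates a list of variable indices; no pruning is applied."""
    best = {"S": [], "val": float("-inf")}

    def explore(fixed, free):
        if len(fixed) == p0 or not free:
            if fixed:
                v = f(fixed)
                if v > best["val"]:
                    best["val"], best["S"] = v, list(fixed)
            return
        # select the branching variable a: the free variable that most
        # improves f when added to the fixed set
        a = max(free, key=lambda i: f(fixed + [i]))
        rest = [i for i in free if i != a]
        explore(fixed + [a], rest)   # subtree with a fixed
        explore(fixed, rest)         # subtree with a forbidden

    explore([], list(range(n)))
    return best["S"], best["val"]
```

Without pruning this visits the full tree; the paper's bounds (Appendix 1) are what make the exploration tractable in practice.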
Appendix 4. Basic ideas of traditional methods and our Branch & Bound methods
- Our Branch & Bound methods find the global optimum solution to the problem defined by (1)–(3) in Section 2.
- The Forward Method finds a solution to the same problem defined by (1)–(3), but this solution is an approximate one (not necessarily the global optimum). The method works as follows:
1. Set S = ∅.
Repeat:
2. Determine i∗ = argmax { f(S ∪ {i}) : i ∈ V − S }
3. Set S = S ∪ {i∗}
until the required number of variables has been selected.
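The steps above can be sketched directly (the function name and signature are illustrative; `f` evaluates a candidate subset of variable indices):

```python
def forward_select(f, n, p0):
    """Greedy Forward Method: start from S = {} and at each step add the
    variable i* maximizing f(S ∪ {i}), until p0 variables are selected."""
    S = []
    for _ in range(p0):
        candidates = [i for i in range(n) if i not in S]
        i_star = max(candidates, key=lambda i: f(S + [i]))
        S.append(i_star)
    return S
```

Each step keeps every previously selected variable, which is why the method is fast but can miss the global optimum found by the exact Branch & Bound.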
Note that the problem defined by (1)–(3) can also be defined compactly as:
subject to:
where
X∗ = (1 | X), where 1ᵗ = (1, 1, …, 1) ∈ ℝᵐ; that is, X∗ is the matrix X extended with the vector 1;
βᵗ = (β₀, β₁, ⋯, βₙ) is the vector of the coefficients of the variables together with the intercept β₀;
‖·‖_r denotes the r-norm.
- The LASSO Method solves the following model:
where λ is a previously established positive parameter. The higher the value of λ, the fewer variables have a coefficient βi ≠ 0 – that is, the fewer variables are selected. To find the solution, methods based on the Coordinate Descent algorithm are usually used.
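As an illustration, a minimal coordinate-descent update for this model might look as follows. This is a generic textbook sketch (not the specific implementation compared in the experiments); the soft-threshold step is what drives coefficients exactly to zero and thus performs the selection:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent sketch for: min (1/2m)||y - X b||^2 + lam ||b||_1.
    Each coordinate update has a closed-form soft-threshold solution."""
    m, n = X.shape
    b = np.zeros(n)
    col_sq = (X ** 2).sum(axis=0) / m
    for _ in range(n_iter):
        for j in range(n):
            # residual ignoring variable j's current contribution
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / m
            # soft-thresholding: coordinates with |rho| <= lam are set to 0
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b
```

Increasing `lam` shrinks more coefficients to zero, matching the selection behavior described above.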
- The LARS method solves the same model as LASSO, but using strategies analogous to those of the Forward method – that is, at each step it selects one variable to enter the solution.
- The GARROTE method calculates its coefficients \( {\beta}_i^g \) as follows:
where the values \( {\beta}_i^r \) are the coefficients of the linear regression model with all the variables, [z]+ is the positive part of a real number z, and λ is a previously established positive parameter. As in the previous model, the higher the value of λ, the fewer variables have a coefficient \( {\beta}_i^g \) ≠ 0 – that is, the fewer variables are selected.
Pacheco, J., Casado, S. Variable selection for linear regression in large databases: exact methods. Appl Intell 51, 3736–3756 (2021). https://doi.org/10.1007/s10489-020-01927-6