Abstract
This paper analyzes the variable selection problem in the context of linear regression for large databases. The problem consists of selecting a small subset of independent variables that can perform the prediction task optimally. It has a wide range of applications: one important type is the design of composite indicators in areas such as sociology and economics; others can be found in fields such as chemometrics, genetics, and climate prediction, among many others. For this problem we propose a Branch & Bound method. This is an exact method and therefore guarantees optimal solutions. We also provide strategies that enable the method to be applied to very large databases (with hundreds of thousands of cases) in moderate computation time. A series of computational experiments shows that our method compares well with well-known methods from the literature and with commercial software.
Acknowledgments
This work was partially supported by FEDER funds and the Spanish Ministry of Economy and Competitiveness (Projects ECO2016-76567-C4-2-R and PID2019-104263RB-C44), the Regional Government of "Castilla y León", Spain (Projects BU329U14 and BU071G19), and the Regional Government of "Castilla y León" and FEDER funds (Projects BU062U16 and COV2000375).
Appendices
Appendix 1. Corollary used by our Branch & Bound methods
1.1 Corollary
∀ p, p′ ∈ {1, …, n}: if p < p′ then g(p) ≤ g(p′).
Proof:
Without loss of generality we can take p′ = p + 1 and, relabeling the variables, assume that S_p^∗ = {1, …, p} is an optimal subset of size p, so that f(S_p^∗) = g(p).
Let us also define S′ = {1, …, p, p + 1}. Obviously S_p^∗ ⊂ S′, and adding a variable to a set cannot decrease the explained variability, so f(S′) ≥ f(S_p^∗). Therefore g(p + 1) ≥ f(S′) ≥ f(S_p^∗) = g(p). ∎
Appendix 2. Calculation of the objective function
1.1 Pre-process for calculating the objective function
To facilitate the calculation of the objective function f(S), a pre-calculation is performed before the algorithms begin to execute. The matrix of independent variables X and the vector of the dependent variable Y are considered, as defined in Section 2. In other words,
X = (x_ij), i = 1, …, m; j = 1, …, n, and Y = (y_i), i = 1, …, m.
The pre-process consists of the following steps:
- The matrix X∗ = (x∗_ij), i = 1, …, m; j = 1, …, n, is calculated, where x∗_ij = (x_ij − x̄_j) / (√m · s_j), and x̄_j and s_j² are respectively the mean and the sample variance of variable j, j = 1, …, n.
- The vector Y∗ = (y∗_i), i = 1, …, m, is calculated, where y∗_i = (y_i − ȳ) / (√m · s_y), and ȳ and s_y² are respectively the mean and the sample variance of the variable Y.
- The matrix R = X∗′ · X∗ and the vector H = X∗′ · Y∗ are calculated. Note that R is the correlation matrix of the independent variables. We denote the elements of R as r_jj′, j, j′ = 1, …, n, and the elements of H as h_j, j = 1, …, n.
The matrix R and the vector H will be used in calculating f(S) for the various sets S in the algorithms proposed in this paper. This pre-process requires Θ(n2 · m) operations. However, it is executed only once.
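As an illustration, the pre-process above can be sketched in a few lines of Python. The function name `preprocess` is ours, and we use the population form of the variance in the standardization (so that, with the 1/√m factor, R comes out exactly as the correlation matrix, as stated above):

```python
import numpy as np

def preprocess(X, Y):
    """Standardize X and Y with the 1/sqrt(m) scaling described above and
    precompute R = X*' X* (correlation matrix) and H = X*' Y*."""
    m = X.shape[0]
    # x*_ij = (x_ij - mean_j) / (sqrt(m) * s_j); columns of Xs have unit norm
    Xs = (X - X.mean(axis=0)) / (np.sqrt(m) * X.std(axis=0))
    Ys = (Y - Y.mean()) / (np.sqrt(m) * Y.std())
    R = Xs.T @ Xs   # correlation matrix of the independent variables
    H = Xs.T @ Ys   # correlation of each variable with Y
    return R, H
```

This step costs Θ(n² · m), consistent with the count given above, but it is executed only once.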
1.2 Calculation of the objective function f(S)
Let S ⊂ V be a set of size p, and denote S = {s(1), s(2), …, s(p)}. The calculation of f(S) consists of the following steps:
- Construct the matrix R^S = (r^S_jj′), where r^S_jj′ = r_s(j)s(j′), j, j′ = 1, …, p;
- Construct the vector H^S = (h^S_j), where h^S_j = h_s(j), j = 1, …, p;
- Find the inverse of the matrix R^S: (R^S)⁻¹;
- Calculate the vector of coefficients B = (R^S)⁻¹ · H^S. We denote the elements of B, j = 1, …, p, as β_j;
- Calculate the value of f(S) = ∑_{j=1}^p ∑_{j′=1}^p β_j · β_j′ · r^S_jj′.
This calculation requires Θ(p2) operations and is therefore independent of the number of cases m and of the initial number of variables n.
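The steps above can be sketched as follows (the function name and signature are ours; we use a linear solve rather than forming the inverse explicitly, which is numerically preferable and equivalent for this computation):

```python
import numpy as np

def f(S, R, H):
    """Objective value f(S) computed from the precomputed R and H.
    S is a sequence of variable indices (0-based)."""
    idx = list(S)
    RS = R[np.ix_(idx, idx)]        # submatrix R^S of correlations for S
    HS = H[idx]                     # corresponding entries H^S
    beta = np.linalg.solve(RS, HS)  # B = (R^S)^{-1} H^S
    # f(S) = sum_j sum_j' beta_j * beta_j' * r^S_jj'
    return float(beta @ RS @ beta)
```

With the standardized data of the pre-process, f(S) is the explained variability of the regression restricted to S, so it lies in [0, 1] and cannot decrease when a variable is added.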
Appendix 3. Analysis of the complexity of our Branch & Bound methods
As described in Section 3, the Branch & Bound methods are based on a recursive exploration of the set of solutions. This set is represented by a search tree, each node of which corresponds to a subset of solutions. When exploring the set of solutions corresponding to a node (Pseudocode 1), a variable a is selected and the corresponding set is divided into two subsets: one with the variable a fixed and another with the variable a forbidden. The process is then repeated with each subset. Since at most p₀ variables are selected (p₀ fixed variables), there are p₀ divisions until a solution of maximum size p₀ is reached; therefore Θ(2^{p₀}) nodes are explored. On the other hand, the number of candidate variables examined is bounded by n, so determining the variable a requires Θ(n) evaluations of the function f (line 5 of Pseudocode 1). The Branch & Bound methods therefore evaluate the objective function on Θ(n · 2^{p₀}) solutions S with |S| ≤ p₀. As each evaluation of f(S) requires Θ(|S|²) operations, the complexity of our methods is Θ(p₀² · n · 2^{p₀}).
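The branching scheme analyzed here can be illustrated with a minimal, exhaustive sketch. This omits the bounding and pruning rules of the actual methods in the paper (it only shows the fix/forbid search structure), and the function name and signature are ours:

```python
def branch_and_bound(f, n, p0):
    """Sketch of the fix/forbid branching: each node selects a variable a
    and splits into a branch with a fixed and a branch with a forbidden.
    `f` evaluates a list of variable indices; no pruning is applied."""
    best = {"S": [], "val": float("-inf")}

    def explore(fixed, free):
        if len(fixed) == p0 or not free:
            if fixed:
                v = f(fixed)
                if v > best["val"]:
                    best["val"], best["S"] = v, list(fixed)
            return
        # select the branching variable a: the free variable that most
        # improves f when added to the fixed set
        a = max(free, key=lambda i: f(fixed + [i]))
        rest = [i for i in free if i != a]
        explore(fixed + [a], rest)   # subtree with a fixed
        explore(fixed, rest)         # subtree with a forbidden

    explore([], list(range(n)))
    return best["S"], best["val"]
```

Without pruning this visits the full tree; the paper's bounds (Appendix 1) are what make the exploration tractable in practice.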
Appendix 4. Basic ideas of traditional methods and our Branch & Bound methods
- Our Branch & Bound methods find the global optimum solution to the problem defined by (1)–(3) in Section 2.
- The Forward Method finds a solution to the same problem defined by (1)–(3), but this solution is an approximate one (not necessarily the global optimum). The method works as follows:
1. Set S = ∅.
Repeat:
2. Determine i∗ = argmax { f(S ∪ {i}) : i ∈ V − S }
3. Set S = S ∪ {i∗}
until the required number of variables has been selected.
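The steps above can be sketched directly (the function name and signature are illustrative; `f` evaluates a candidate subset of variable indices):

```python
def forward_select(f, n, p0):
    """Greedy Forward Method: start from S = {} and at each step add the
    variable i* maximizing f(S ∪ {i}), until p0 variables are selected."""
    S = []
    for _ in range(p0):
        candidates = [i for i in range(n) if i not in S]
        i_star = max(candidates, key=lambda i: f(S + [i]))
        S.append(i_star)
    return S
```

Each step keeps every previously selected variable, which is why the method is fast but can miss the global optimum found by the exact Branch & Bound.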
Note that the problem defined by (1)–(3) can also be defined compactly as:
subject to:
where
X∗ = (1 | X), where 1ᵗ = (1, 1, …, 1) ∈ ℝᵐ; that is, X∗ is the matrix X extended with the vector 1;
βᵗ = (β₀, β₁, ⋯, βₙ) is the vector of the coefficients of the variables together with the intercept β₀;
‖·‖_r denotes the r-norm.
- The LASSO Method solves the following model:
where λ is a previously established positive parameter. The higher the value of λ, the fewer variables have a coefficient βi ≠ 0 – that is, the fewer variables are selected. To find the solution, methods based on the Coordinate Descent algorithm are usually used.
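As an illustration, a minimal coordinate-descent update for this model might look as follows. This is a generic textbook sketch (not the specific implementation compared in the experiments); the soft-threshold step is what drives coefficients exactly to zero and thus performs the selection:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent sketch for: min (1/2m)||y - X b||^2 + lam ||b||_1.
    Each coordinate update has a closed-form soft-threshold solution."""
    m, n = X.shape
    b = np.zeros(n)
    col_sq = (X ** 2).sum(axis=0) / m
    for _ in range(n_iter):
        for j in range(n):
            # residual ignoring variable j's current contribution
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / m
            # soft-thresholding: coordinates with |rho| <= lam are set to 0
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b
```

Increasing `lam` shrinks more coefficients to zero, matching the selection behavior described above.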
- The LARS method solves the same model as LASSO, but using strategies analogous to those of the Forward method – that is, at each step it selects one variable to enter the solution.
- The GARROTE method calculates its coefficients \( {\beta}_i^g \) as follows:
where the values \( {\beta}_i^r \) are the coefficients of the linear regression model with all the variables, [z]+ is the positive part of a real number z, and λ is a previously established positive parameter. As in the previous model, the higher the value of λ, the fewer variables have a coefficient \( {\beta}_i^g \) ≠ 0 – that is, the fewer variables are selected.
Pacheco, J., Casado, S. Variable selection for linear regression in large databases: exact methods. Appl Intell 51, 3736–3756 (2021). https://doi.org/10.1007/s10489-020-01927-6