Abstract
Subset selection in multiple linear regression aims to choose a subset of candidate explanatory variables that trades off fitting error (explanatory power) against model complexity (number of variables selected). We build mathematical programming models for regression subset selection based on mean square and absolute errors, and on minimal-redundancy–maximal-relevance criteria. The proposed models are tested using a linear-program-based branch-and-bound algorithm with tailored valid inequalities and big M values, and are compared against algorithms in the literature. For high-dimensional cases, an iterative heuristic algorithm is proposed based on the mathematical programming models and a core set concept, and a randomized version of the algorithm is derived that guarantees convergence to the global optimum. From the computational experiments, we find that our models quickly find a quality solution while the remaining time is spent proving optimality; the iterative algorithms find solutions in a relatively short time and are competitive with state-of-the-art algorithms; and using ad-hoc big M values is not recommended.
Acknowledgements
The authors thank the editors and reviewers for their constructive comments and suggestions, which strengthened the paper.
Appendices
A Proofs of Lemmas and Propositions
Proof of Proposition 1
The proof is based on the fact that feasible solutions to (4) and (5) map to each other. Hence, we consider the following two cases.
- 1.
Case: (4) \(\Rightarrow \) (5)
Let \(S = \{ j | z_j = 1 \}\) be the column index set of a solution to (4). We set \(v_j = u\) for \(j \in S\) and \(v_j = 0\) for \(j \notin S\). Then,
$$\begin{aligned} \sum _{i \in I} |t_i|&=(n-1)u- \sum _{j \in J} u z_j&(\hbox {from }(4\hbox {b}))\\&= (n-1)u-\sum _{j \in S} u \\&= (n-1)u - \sum _{j \in S} v_j&(\hbox {by definition of }v_j)\\&= (n-1)u-\sum _{j \in J} v_j, \end{aligned}$$which satisfies (5b). Further, the remaining constraints are satisfied as follows.
- (a)
Constraint (5e): We have \(v_j = u \le u\) for \(j \in S\) and \(v_j = 0 \le u\) for \(j \notin S\). Hence, \(v_j \le u\) for all \(j \in J\).
- (b)
Constraint (5f): We have \(u - M(1-z_j) = u \le v_j = u \le M z_j = M\) for \(j \in S\) and \(u-M(1-z_j) = u-M \le v_j = 0 \le M z_j = 0\) for \(j \notin S\). Hence, we satisfy (5f).
- (c)
Constraint (5g): Since \(v_j \in \{0,u\}\) and \(u \ge 0\), we have \(v_j \ge 0\) for all \(j \in J\).
Note that (5c) is automatically satisfied since it is equal to (4c). Hence, we obtain a feasible solution to (5).
- 2.
Case: (5) \(\Rightarrow \) (4)
Let \(S = \{ j | z_j = 1 \}\) be the column index set of a solution to (5). Since we are minimizing u, (5e) is equivalent to \(\max _{j} v_j = u\). Note that, in an optimal solution, we must have \(v_j = u\) for all \(j \in S\). Hence, starting from (5b), we derive
$$\begin{aligned} \sum _{i \in I} |t_i|&= (n-1)u-\sum _{j \in J} v_j&(\hbox {from }(5\hbox {b}))\\&= (n-1)u-\sum _{j \in S} v_j =(n-1)u-\sum _{j \in S} u&(v_j = u\hbox { for all }j \in S)\\&= (n-1)u-\sum _{j \in S} u z_j = (n-1)u-\sum _{j \in J} u z_j, \end{aligned}$$which satisfies (4b).
This completes the proof. \(\square \)
Proof of Proposition 2
Let \({\bar{X}} = ({\bar{x}}, {\bar{y}}, {\bar{v}}, {\bar{u}}, {\bar{t}}, {\bar{z}})\) be an optimal solution to (6) and let \({\bar{p}} = \sum _{j \in J} {\bar{z}}_j\) be the number of optimal regression variables. For a contradiction, let us assume that there exists an index k such that \({\bar{t}}_k^+ >0\) and \({\bar{t}}_k^- >0\). Without loss of generality, let us also assume \({\bar{t}}_k^+ \ge {\bar{t}}_k^-\). For simplicity, let \(\delta = {\bar{t}}_k^-\). Let us generate \({\tilde{X}}\) that is equal to \({\bar{X}}\) except \({\tilde{t}}_k^+ = {\bar{t}}_k^+ - \delta \), \({\tilde{t}}_k^- = {\bar{t}}_k^- - \delta = 0\), \({\tilde{u}} = {\bar{u}} - \frac{2\delta }{n-1-{\bar{p}}}\), and \({\tilde{v}}_j = {\tilde{u}}\) if \({\bar{z}}_j = 1\). We show that \({\tilde{X}}\) is a feasible solution to (6) with strictly lower cost than \({\bar{X}}\).
- 1.
\({\tilde{X}}\) has lower cost than \({\bar{X}}\) since \({\tilde{u}} < {\bar{u}}\) by definition.
- 2.
\({\tilde{X}}\) satisfies (6b) because \(\sum _{i \in I} ( {\tilde{t}}_i^+ + {\tilde{t}}_i^-)= \sum _{i \in I} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) - 2 \delta = (n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j - 2 \delta = (n-1 - {\bar{p}}){\bar{u}} - 2 \delta = (n-1 - {\bar{p}}) ({\bar{u}} - \frac{2 \delta }{n-1 - {\bar{p}}}) = (n-1 - {\bar{p}}) {\tilde{u}} = (n-1){\tilde{u}} - \sum _{j \in J} {\tilde{v}}_j \), in which the second equality holds because \({\bar{X}}\) satisfies (6b) and the third because \({\bar{v}}_j = {\bar{u}}\) whenever \({\bar{z}}_j = 1\).
- 3.
Observe that (6c), (6d), and (6e) are automatically satisfied. Further, since we set \({\tilde{v}}_j = {\tilde{u}}\) for j such that \({\tilde{z}}_j = 1\), (6f) and (6g) are satisfied.
- 4.
Finally, (6h) is automatically satisfied for all variables except possibly \({\tilde{t}}_k^+\), \({\tilde{t}}_k^-\), and \({\tilde{u}}\). Note that \({\tilde{t}}_k^+ = {\bar{t}}_k^+ - \delta = {\bar{t}}_k^+ - {\bar{t}}_k^- \ge 0\) and \({\tilde{t}}_k^- = 0\). Also, we have
$$\begin{aligned} {\tilde{u}}&= {\bar{u}} - \frac{2\delta }{n-1-{\bar{p}}} = \frac{\sum _{i \in I} ( {\bar{t}}_i^+ + {\bar{t}}_i^-)}{n-1-{\bar{p}}} - \frac{2\delta }{n-1-{\bar{p}}}\\&= \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) + ({\bar{t}}_k^+ + {\bar{t}}_k^-) - 2 \delta }{n-1-{\bar{p}}}\\&\ge \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) + 2 {\bar{t}}_k^- - 2 \delta }{n-1-{\bar{p}}}&(\hbox {since }{\bar{t}}_k^+ \ge {\bar{t}}_k^-)\\&= \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-)}{n-1-{\bar{p}}}&(\hbox {since }\delta = {\bar{t}}_k^-)\\&\ge 0. \end{aligned}$$Hence, \({\tilde{X}}\) satisfies (6h).
Hence, \({\bar{X}}\) is not an optimal solution to (6), which is a contradiction. \(\square \)
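The perturbation at the heart of this proof can be checked numerically: subtracting \(\delta = \min ({\bar{t}}_k^+, {\bar{t}}_k^-)\) from both parts of a split residual preserves the residual \(t_k^+ - t_k^-\) while strictly decreasing \(t_k^+ + t_k^-\) and restoring complementarity. A minimal illustration in plain Python (the function names are ours, not from the paper):

```python
def split_residual(r):
    # Complementary split of residual r into nonnegative parts.
    return (max(r, 0.0), max(-r, 0.0))

def tighten(t_plus, t_minus):
    # The proof's perturbation: subtract delta = min(t+, t-) from both parts.
    delta = min(t_plus, t_minus)
    return t_plus - delta, t_minus - delta

# A non-complementary representation of the residual r = 1.5:
tp, tm = 4.0, 2.5
assert tp - tm == 1.5                 # residual value ...
ntp, ntm = tighten(tp, tm)
assert ntp - ntm == 1.5               # ... is preserved by the perturbation,
assert ntp + ntm < tp + tm            # the absolute-error term strictly shrinks,
assert min(ntp, ntm) == 0.0           # and complementarity now holds.
```

This is why an optimal solution to (6) never carries both \(t_k^+ > 0\) and \(t_k^- > 0\): the tightened pair is feasible and strictly cheaper.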
Proof of Proposition 3
Let \({\bar{X}} = ({\bar{x}}, {\bar{y}}, {\bar{v}}, {\bar{u}}, {\bar{t}}, {\bar{z}})\) be an optimal solution to (8) with \({\bar{p}} = \sum _{j \in J} {\bar{z}}_j\). For a contradiction, let us assume that \({\bar{X}}\) does not satisfy (7) at equality. Let \(\delta = (n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j - \sum _{i \in I} ({\bar{t}}_i^+ - {\bar{t}}_i^-)^2 > 0\). Let us generate \({\tilde{X}}\) that is equal to \({\bar{X}}\) except that \({\tilde{u}} = {\bar{u}} - \frac{2 \delta }{n-1-{\bar{p}}}\) and \({\tilde{v}}_j = {\tilde{u}}\) if \({\bar{z}}_j = 1\). We first show that \({\tilde{u}} \ge 0\) since
in which the second equality is obtained from the definition of \(\delta \). For the remaining part, using a technique similar to that in the proof of Proposition 2, it can be seen that \({\tilde{X}}\) is a feasible solution to (8) with a strictly lower objective function value than \({\bar{X}}\). This is a contradiction. \(\square \)
Lemma 6
Let c be the vector that has 1 in the entries of the \(t_i^+\)'s and \(t_i^-\)'s and 0 for all other variables of (10). Then, for every extreme ray r in the recession cone of (10), we must have \(c^{\top } r > 0\).
Proof
Suppose that there exists an extreme ray r in the recession cone of (10) with \(c^{\top } r \le 0\). Consider the linear program \(\min \{ c^{\top } Y \mid (10\hbox {a}) - (10\hbox {e})\}\). We have two cases.
- 1.
Suppose that \(c^{\top } r < 0\). For any feasible solution \({\bar{Y}}\) and any \(\delta \ge 0\), \({\bar{Y}} + \delta r\) is feasible, since r is a ray of the recession cone. Then, \(c^{\top } ( {\bar{Y}} + \delta r) = c^{\top } {\bar{Y}} + \delta c^{\top } r\) goes to negative infinity as \(\delta \rightarrow \infty \), and thus the LP is unbounded from below. However, from the definition of the LP, the objective value is always non-negative. This is a contradiction.
- 2.
Suppose that \(c^{\top } r = 0\). This implies that the LP has an optimal objective value of 0. This contradicts Assumption 1, since \(c^{\top } Y=0\) implies \(\sum _{i=1}^n (t_i^+ + t_i^-) = 0\).
By the above two cases, we must have \(c^{\top } r > 0\). \(\square \)
Proof of Proposition 4
From Lemma 6, we know that there are no extreme rays with non-positive \(\sum _{i=1}^n (t_i^+ + t_i^-)\). For a contradiction, assume that (11) is unbounded; then there is an extreme ray r such that \({\bar{c}}^{\top } r < 0\), where \({\bar{c}}\) is the objective vector of (11). By Lemma 6, such an extreme ray must satisfy \(c^{\top } r > 0\), where c is the vector that has 1 in the entries of the \(t_i^+\)'s and \(t_i^-\)'s and 0 for all other variables of (10). For a feasible solution \({\bar{Y}}\) to (11) and any \(\delta \ge 0\), \({\bar{Y}} + \delta r\) is also feasible, and \(\delta \) must go to infinity for (11) to be an unbounded LP. However, \(c^{\top } r > 0\) implies that \(\sum _{i \in I} (t_i^+ + t_i^-)\) increases as \(\delta \) increases, so \(\delta \) is bounded by (10a). Hence (11) cannot be unbounded, a contradiction. \(\square \)
Proof of Lemma 1
With \({\bar{z}}_j\) fixed, \({\bar{v}}_j\) and \({\bar{u}}\) are fixed by (6f). Note that, since \({\bar{Y}}\) has \(\textit{SSE}\) less than or equal to \(T_{max}\), we have \((n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j = \sum _{i \in I} (t_i^+ + t_i^-) \le T_{max}\), which satisfies (10a). Observe that the \(v_j\)'s and u can be ignored in (10). Observe also that (10c) and (10d) cover (6d) and (6e) regardless of \({\bar{z}}_j\). Finally, (6c) and (10b) are the same. Therefore, \({\tilde{Y}} = ({\bar{x}}^+, {\bar{x}}^-, {\bar{y}}^+,{\bar{y}}^-, {\bar{t}}^+, {\bar{t}}^-, {\hat{M}})\) is feasible for (10). \(\square \)
B Alternative approach for big M
In this section, we derive an approximate value of the big M for the \(x_j\)'s in (21) and (36).
Instead of trying to compute a provably valid value of M, we use a statistical approach to approximate it for each \(x_j\). In Algorithm 4, we estimate a valid value of M for each k. In Steps 2–5, we obtain 30 i.i.d. sample values of M when explanatory variable k is included in the regression model. Then, in Step 6, we take the upper tail of the resulting confidence interval. With 95% confidence, the true valid value of M is less than the \({\hat{M}}_k\) obtained in Step 6. Hence, we set \(M_k := {\hat{M}}_k\) for \(x_k\) in (21) and (36) in the fat case (\(m > n\)).
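A minimal sketch of the sampling idea behind Algorithm 4, in Python with NumPy. The sample count of 30 and the one-sided 95% upper bound come from the description above; the random column subsets, the least-squares fit, the subset size, and the normal quantile 1.645 (the paper may use a t quantile instead) are our assumptions about details not shown here:

```python
import numpy as np

def estimate_big_M(A, b, k, subset_size, n_samples=30, z=1.645, seed=0):
    """Estimate a bound M_k on |x_k| over regression models containing
    variable k: sample subsets and refit (cf. Steps 2-5), then take the
    upper tail of the confidence interval (cf. Step 6)."""
    rng = np.random.default_rng(seed)
    others = [j for j in range(A.shape[1]) if j != k]
    samples = []
    for _ in range(n_samples):
        # Random column subset that always contains variable k.
        S = [k] + list(rng.choice(others, size=subset_size - 1, replace=False))
        x, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        samples.append(abs(x[0]))  # fitted coefficient of variable k
    samples = np.array(samples)
    # One-sided upper confidence bound on the mean of |x_k|.
    return samples.mean() + z * samples.std(ddof=1) / np.sqrt(n_samples)
```

With 95% confidence the true valid value lies below the returned \({\hat{M}}_k\), which would then be used as \(M_k\) as described above.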
C New objective function and modified formulations for fat case \((m \ge n)\)
Before we derive the objective function, let us temporarily assume \(|J| = n-2\) so that any subset S of J automatically satisfies \(|S| = p \le n-2 = |J|\). We will relax this assumption later to consider \(|J| > n-2\). Suppose that we want to penalize large p in a way that the best model with \(n-2\) explanatory variables is as bad as a regression model with no explanatory variables. Hence, we want the objective function to give the same value for models with \(p=0\) and \(p=n-2\). With this in mind, we propose (20), which we call the adjusted \(\textit{MAE}\) .
Let us now assume that \(\textit{SAE}\) is near zero when \(p = n-2\), which happens often. Then we have \(MAE_a = \frac{SAE + \frac{n-2}{n-2}mae_0}{n-1-(n-2)} = SAE + mae_0 \approx mae_0\). Hence, instead of near-zero \(\textit{MAE}\), the new objective has almost the same value as \(mae_0\) when \(p=n-2\). Recall that \(u = MAE\) and u is the objective function in the previous thin case model. Hence, we need to modify the definitions and constraints. First we rewrite constraint (6b) as \(\sum _{i \in I} (t_i^+ + t_i^-) = (n-1)u - \sum _{j \in J} z_j \Big ( u + \frac{mae_0}{n-2}\Big ) \). Let \(v_j = (u + \frac{mae_0}{n-2})z_j\). Then, (6f) and (6g) are modified to
Finally, we remove the assumption (\(|J| = n-2\)) made at the beginning of this section by adding the cardinality constraint (30), \(\sum _{j \in J} z_j \le n-2\),
and obtain the following final formulations,
which is presented in (21). In fact, without (30), \(MAE_a\) cannot be well-defined since it becomes negative for \(p > n-1\) and the denominator becomes 0 for \(p=n-1\). Observe that (21) is an MIP with \(2n+4m+3\) variables (including m binary variables) and \(n+5m+2\) constraints. Observe also that (6) with the additional constraint (30) can be used for the fat case. However, using \(n-2\) explanatory variables out of m candidate explanatory variables can lead to an extremely small \(\textit{SAE}\) as we explained at the beginning of this section.
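The adjusted MAE described above can be sanity-checked directly. The formula below, \(MAE_a = (SAE + \frac{p}{n-2}\,mae_0)/(n-1-p)\), is our reconstruction from the two boundary cases in the text (it reduces to the ordinary MAE at \(p=0\) and to \(mae_0\) at \(p=n-2\) with zero \(\textit{SAE}\)); consult (20) in the paper for the authoritative form:

```python
def adjusted_mae(sae, p, n, mae0):
    # Penalize the subset size p so that a model with p = n - 2
    # variables scores no better than the empty model.
    assert p <= n - 2, "cardinality constraint (30): p <= n - 2"
    return (sae + p * mae0 / (n - 2)) / (n - 1 - p)

n, mae0 = 12, 3.0
# p = 0 recovers the ordinary MAE = SAE / (n - 1):
assert adjusted_mae(5.5, 0, n, mae0) == 5.5 / (n - 1)
# p = n - 2 with zero SAE is exactly as bad as the empty model:
assert adjusted_mae(0.0, n - 2, n, mae0) == mae0
```

Without constraint (30) the denominator \(n-1-p\) would vanish at \(p = n-1\), which is the degeneracy noted above.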
To obtain a valid value of M for \(v_j\)’s in (21), we can use a similar concept used in Sect. 2. In detail, we set
for \(v_j\)’s to consider regression models that are better than having no regression variables. Given a heuristic solution with objective function value \(mae_a^{heur}\), we can strengthen M by making solutions worse than the heuristic solution infeasible. Hence, we set \(M := mae^{heur}_a + \frac{mae_0}{n-2}\) for \(v_j\)’s in (29).
However, obtaining a valid value of M for the \(x_j\)'s in (21) is not trivial. Note that (12), which we used for the thin case, is not applicable here because LP (10) can easily be unbounded in the fat case. One valid procedure is to (i) generate all possible combinations of \(n-2\) explanatory variables out of the m candidates using all n observations, (ii) compute M for each combination using the procedure in Sect. 2.1.3, and (iii) pick the maximum value over all combinations. However, this procedure is combinatorial: its computational complexity is as high as that of solving (1) for all possible subsets. Hence, enumerating all possible subsets just to obtain a valid big M is intractable.
Instead, we can use a heuristic approach to obtain a good estimate of a valid value of M. In “Appendix B”, we propose a statistics-based procedure that yields a value of M that is valid with a given confidence level, here \(95\%\). However, for the instances considered in this paper, this procedure gives values of M that are too large because many columns can be strongly correlated with each other. Note that a large value of M can cause numerical errors when solving the MIPs.
Hence, for the computational experiments, we use a simple heuristic approach instead. Let us assume that we are given a feasible solution to (21) from a heuristic, where the \(x^{heur}_j\)'s are the coefficients of the regression model. Then, we set
Note that we cannot claim that (32) is valid, or even valid with \(95\%\) confidence. Hence, if we use (21) with this M, the overall procedure is a heuristic (even if (21) itself is solved to optimality).
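One simple instantiation of such a heuristic bound (the inflation factor below is a hypothetical safety margin of our choosing; (32) in the paper defines the actual rule):

```python
def heuristic_big_M(x_heur, factor=2.0):
    # Bound |x_j| using the largest coefficient magnitude of a known
    # feasible (heuristic) solution, inflated by a safety factor.
    return factor * max(abs(x) for x in x_heur)

M = heuristic_big_M([0.5, -3.0, 1.2])  # -> 6.0
```

As the text warns, any M obtained this way is not provably valid, so (21) solved under it remains a heuristic.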
Similar to \(MAE_a\) in (20), \(MSE_a\) can be defined as
where \(mse_0 = \frac{\sum _{i \in I}(b_i - {\bar{b}})^2}{n-1}\) is the mean squared error of an optimal regression model when \(p=0\). Next, similar to (28) and (29), we define
while (7) remains the same. Finally, we obtain
for the \(MSE_a\) objective. Note that (36) is a mixed integer quadratically constrained program with \(2n+4m+3\) variables and \(n+5m+2\) constraints.
For the core set algorithm, similar to (23), we have
Cite this article
Park, Y.W., Klabjan, D.: Subset selection for multiple linear regression via optimization. J. Glob. Optim. 77, 543–574 (2020). https://doi.org/10.1007/s10898-020-00876-1