Abstract
Subset selection in multiple linear regression aims to choose a subset of candidate explanatory variables that trades off fitting error (explanatory power) against model complexity (number of variables selected). We build mathematical programming models for regression subset selection based on mean square and absolute errors, and on minimal-redundancy–maximal-relevance criteria. The proposed models are tested using a linear-program-based branch-and-bound algorithm with tailored valid inequalities and big M values, and are compared against algorithms in the literature. For high-dimensional cases, an iterative heuristic algorithm is proposed based on the mathematical programming models and a core set concept, and a randomized version of the algorithm is derived that guarantees convergence to the global optimum. From the computational experiments, we find that our models quickly find a quality solution while the remaining time is spent proving optimality; the iterative algorithms find solutions in a relatively short time and are competitive with state-of-the-art algorithms; and using ad-hoc big M values is not recommended.
Acknowledgements
The authors thank the editors and reviewers for their constructive comments and suggestions, which strengthened the paper.
Appendices
A Proofs of Lemmas and Propositions
Proof of Proposition 1
The proof is based on the fact that feasible solutions to (4) and (5) map to each other. Hence, we consider the following two cases.
- 1.
Case: (4) \(\Rightarrow \) (5)
Let \(S = \{ j | z_j = 1 \}\) be the column index set of a solution to (4). We set \(v_j = u\) for \(j \in S\) and \(v_j = 0\) for \(j \notin S\). Then,
$$\begin{aligned} \sum _{i \in I} |t_i|&=(n-1)u- \sum _{j \in J} u z_j&(\hbox {from }(4\hbox {b}))\\&= (n-1)u-\sum _{j \in S} u \\&= (n-1)u - \sum _{j \in S} v_j&(\hbox {by definition of }v_j)\\&= (n-1)u-\sum _{j \in J} v_j, \end{aligned}$$which satisfies (5b). Further, the remaining constraints are satisfied as follows.
- (a)
Constraint (5e): We have \(v_j = u \le u\) for \(j \in S\) and \(v_j = 0 \le u\) for \(j \notin S\). Hence, \(v_j \le u\) for all \(j \in J\).
- (b)
Constraint (5f): We have \(u - M(1-z_j) = u \le v_j = u \le M z_j = M\) for \(j \in S\) and \(u-M(1-z_j) = u-M \le v_j = 0 \le M z_j = 0\) for \(j \notin S\). Hence, we satisfy (5f).
- (c)
Constraint (5g): Since \(v_j \in \{0,u\}\) and \(u \ge 0\), we have \(v_j \ge 0\) for all \(j \in J\).
Note that (5c) is automatically satisfied since it is equal to (4c). Hence, we obtain a feasible solution to (5).
- 2.
Case: (5) \(\Rightarrow \) (4)
Let \(S = \{ j | z_j = 1 \}\) be the column index set of a solution to (5). Since we are minimizing u, (5e) is equivalent to \(\max _{j} v_j = u\). Note that, in an optimal solution, we must have \(v_j = u\) for all \(j \in S\). Hence, starting from (5b), we derive
$$\begin{aligned} \sum _{i \in I} |t_i|&= (n-1)u-\sum _{j \in J} v_j&(\hbox {from }(5\hbox {b}))\\&= (n-1)u-\sum _{j \in S} v_j =(n-1)u-\sum _{j \in S} u&(v_j = u\hbox { for all }j \in S)\\&= (n-1)u-\sum _{j \in S} u z_j = (n-1)u-\sum _{j \in J} u z_j, \end{aligned}$$which satisfies (4b).
This completes the proof. \(\square \)
Proof of Proposition 2
Let \({\bar{X}} = ({\bar{x}}, {\bar{y}}, {\bar{v}}, {\bar{u}}, {\bar{t}}, {\bar{z}})\) be an optimal solution to (6) and let \({\bar{p}} = \sum _{j \in J} {\bar{z}}_j\) be the number of optimal regression variables. For a contradiction, let us assume that there exists an index k such that \({\bar{t}}_k^+ >0\) and \({\bar{t}}_k^- >0\). Without loss of generality, let us also assume \({\bar{t}}_k^+ \ge {\bar{t}}_k^-\). For simplicity, let \(\delta = {\bar{t}}_k^-\). Let us generate \({\tilde{X}}\) that is equal to \({\bar{X}}\) except \({\tilde{t}}_k^+ = {\bar{t}}_k^+ - \delta \), \({\tilde{t}}_k^- = {\bar{t}}_k^- - \delta = 0\), \({\tilde{u}} = {\bar{u}} - \frac{2\delta }{n-1-{\bar{p}}}\), and \({\tilde{v}}_j = {\tilde{u}}\) if \({\bar{z}}_j = 1\). We show that \({\tilde{X}}\) is a feasible solution to (6) with strictly lower cost than \({\bar{X}}\).
- 1.
\({\tilde{X}}\) has lower cost than \({\bar{X}}\) since \({\tilde{u}} < {\bar{u}}\) by definition.
- 2.
\({\tilde{X}}\) satisfies (6b) because \(\sum _{i \in I} ( {\tilde{t}}_i^+ + {\tilde{t}}_i^-)= \sum _{i \in I} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) - 2 \delta = (n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j - 2 \delta = (n-1 - {\bar{p}}){\bar{u}} - 2 \delta = (n-1 - {\bar{p}}) ({\bar{u}} - \frac{2 \delta }{n-1 - {\bar{p}}}) = (n-1 - {\bar{p}}) {\tilde{u}} = (n-1){\tilde{u}} - \sum _{j \in J} {\tilde{v}}_j \), in which the second equality holds because \({\bar{X}}\) satisfies (6b) and the third because \({\bar{v}}_j = {\bar{u}}\) whenever \({\bar{z}}_j = 1\).
- 3.
Observe that (6c), (6d), and (6e) are automatically satisfied. Further, since we set \({\tilde{v}}_j = {\tilde{u}}\) for j such that \({\tilde{z}}_j = 1\), (6f) and (6g) are satisfied.
- 4.
Finally, (6h) is automatically satisfied for all variables except possibly \({\tilde{t}}_k^+\), \({\tilde{t}}_k^-\), and \({\tilde{u}}\). Note that \({\tilde{t}}_k^+ = {\bar{t}}_k^+ - \delta = {\bar{t}}_k^+ - {\bar{t}}_k^- \ge 0\) and \({\tilde{t}}_k^- = 0\). Also, we have
$$\begin{aligned} {\tilde{u}}&= {\bar{u}} - \frac{2\delta }{n-1-{\bar{p}}} = \frac{\sum _{i \in I} ( {\bar{t}}_i^+ + {\bar{t}}_i^-)}{n-1-{\bar{p}}} - \frac{2\delta }{n-1-{\bar{p}}}\\&= \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) + ({\bar{t}}_k^+ + {\bar{t}}_k^-) - 2 \delta }{n-1-{\bar{p}}}\\&\ge \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) + 2 {\bar{t}}_k^- - 2 \delta }{n-1-{\bar{p}}}&(\hbox {since }{\bar{t}}_k^+ \ge {\bar{t}}_k^-)\\&= \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-)}{n-1-{\bar{p}}}&(\hbox {since }\delta = {\bar{t}}_k^-)\\&\ge 0. \end{aligned}$$Hence, \({\tilde{X}}\) satisfies (6h).
Hence, \({\bar{X}}\) is not an optimal solution to (6), which is a contradiction. \(\square \)
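The perturbation at the heart of this proof can be checked numerically: subtracting \(\delta = \min ({\bar{t}}_k^+, {\bar{t}}_k^-)\) from both parts of a split residual preserves the residual \(t_k^+ - t_k^-\) while strictly decreasing \(t_k^+ + t_k^-\) and restoring complementarity. A minimal illustration in plain Python (the function names are ours, not from the paper):

```python
def split_residual(r):
    # Complementary split of residual r into nonnegative parts.
    return (max(r, 0.0), max(-r, 0.0))

def tighten(t_plus, t_minus):
    # The proof's perturbation: subtract delta = min(t+, t-) from both parts.
    delta = min(t_plus, t_minus)
    return t_plus - delta, t_minus - delta

# A non-complementary representation of the residual r = 1.5:
tp, tm = 4.0, 2.5
assert tp - tm == 1.5                 # residual value ...
ntp, ntm = tighten(tp, tm)
assert ntp - ntm == 1.5               # ... is preserved by the perturbation,
assert ntp + ntm < tp + tm            # the absolute-error term strictly shrinks,
assert min(ntp, ntm) == 0.0           # and complementarity now holds.
```

This is why an optimal solution to (6) never carries both \(t_k^+ > 0\) and \(t_k^- > 0\): the tightened pair is feasible and strictly cheaper.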
Proof of Proposition 3
Let \({\bar{X}} = ({\bar{x}}, {\bar{y}}, {\bar{v}}, {\bar{u}}, {\bar{t}}, {\bar{z}})\) be an optimal solution to (8) with \({\bar{p}} = \sum _{j \in J} {\bar{z}}_j\). For a contradiction, let us assume that \({\bar{X}}\) does not satisfy (7) at equality. Let \(\delta = (n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j - \sum _{i \in I} ({\bar{t}}_i^+ - {\bar{t}}_i^-)^2 > 0\). Let us generate \({\tilde{X}}\) that is equal to \({\bar{X}}\) except that \({\tilde{u}} = {\bar{u}} - \frac{2 \delta }{n-1-{\bar{p}}}\) and \({\tilde{v}}_j = {\tilde{u}}\) if \({\bar{z}}_j = 1\). We first show that \({\tilde{u}} \ge 0\) since
in which the second equality is obtained from the definition of \(\delta \). For the remaining part, using a technique similar to that in the proof of Proposition 2, it can be seen that \({\tilde{X}}\) is a feasible solution to (8) with a strictly lower objective function value than \({\bar{X}}\). This is a contradiction. \(\square \)
Lemma 6
Let c be the vector that has 1 in the entries of the \(t_i^+\)'s and \(t_i^-\)'s and 0 for all other variables of (10). Then, for every extreme ray r in the recession cone of (10), we must have \(c^{\top } r > 0\).
Proof
Suppose that there exists an extreme ray r in the recession cone of (10) with \(c^{\top } r \le 0\). Consider the linear program \(\min \{ c^{\top } Y \mid (10\hbox {a}) - (10\hbox {e})\}\). We have two cases.
- 1.
Suppose that \(c^{\top } r < 0\). For any feasible solution \({\bar{Y}}\) and any \(\delta \ge 0\), \({\bar{Y}} + \delta r\) is feasible, since r is a ray of the recession cone. Then, \(c^{\top } ( {\bar{Y}} + \delta r) = c^{\top } {\bar{Y}} + \delta c^{\top } r\) goes to negative infinity as \(\delta \rightarrow \infty \), and thus the LP is unbounded from below. However, from the definition of the LP, the objective value is always non-negative. This is a contradiction.
- 2.
Suppose that \(c^{\top } r = 0\). This implies that the LP has an optimal objective value of 0. This contradicts Assumption 1, since \(c^{\top } Y=0\) implies \(\sum _{i=1}^n (t_i^+ + t_i^-) = 0\).
By the above two cases, we must have \(c^{\top } r > 0\). \(\square \)
Proof of Proposition 4
From Lemma 6, we know that there are no extreme rays with non-positive \(\sum _{i=1}^n (t_i^+ + t_i^-)\). For a contradiction, assume that (11) is unbounded; then there is an extreme ray r such that \({\bar{c}}^{\top } r < 0\), where \({\bar{c}}\) is the objective vector of (11). By Lemma 6, such an extreme ray must satisfy \(c^{\top } r > 0\), where c is the vector that has 1 in the entries of the \(t_i^+\)'s and \(t_i^-\)'s and 0 for all other variables of (10). For a feasible solution \({\bar{Y}}\) to (11) and any \(\delta \ge 0\), \({\bar{Y}} + \delta r\) is also feasible, and \(\delta \) must go to infinity for (11) to be an unbounded LP. However, \(c^{\top } r > 0\) implies that \(\sum _{i \in I} (t_i^+ + t_i^-)\) increases as \(\delta \) increases, so \(\delta \) is bounded by (10a). Hence (11) cannot be unbounded, a contradiction. \(\square \)
Proof of Lemma 1
With \({\bar{z}}_j\) fixed, \({\bar{v}}_j\) and \({\bar{u}}\) are fixed by (6f). Note that, since \({\bar{Y}}\) has \(\textit{SSE}\) less than or equal to \(T_{max}\), we have \((n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j = \sum _{i \in I} (t_i^+ + t_i^-) \le T_{max}\), which satisfies (10a). Observe that the \(v_j\)'s and u can be ignored in (10). Observe also that (10c) and (10d) cover (6d) and (6e) regardless of \({\bar{z}}_j\). Finally, (6c) and (10b) are the same. Therefore, \({\tilde{Y}} = ({\bar{x}}^+, {\bar{x}}^-, {\bar{y}}^+,{\bar{y}}^-, {\bar{t}}^+, {\bar{t}}^-, {\hat{M}})\) is feasible for (10). \(\square \)
B Alternative approach for big M
In this section, we derive an approximate value of the big M for the \(x_j\)'s in (21) and (36).
Instead of trying to compute a provably valid value of M, we use a statistical approach to approximate it for each \(x_j\). In Algorithm 4, we estimate a valid value of M for each k. In Steps 2–5, we obtain 30 i.i.d. sample values of M when explanatory variable k is included in the regression model. Then, in Step 6, we take the upper tail of the resulting confidence interval. With 95% confidence, the true valid value of M is less than the \({\hat{M}}_k\) obtained in Step 6. Hence, we set \(M_k := {\hat{M}}_k\) for \(x_k\) in (21) and (36) in the fat case (\(m > n\)).
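A minimal sketch of the sampling idea behind Algorithm 4, in Python with NumPy. The sample count of 30 and the one-sided 95% upper bound come from the description above; the random column subsets, the least-squares fit, the subset size, and the normal quantile 1.645 (the paper may use a t quantile instead) are our assumptions about details not shown here:

```python
import numpy as np

def estimate_big_M(A, b, k, subset_size, n_samples=30, z=1.645, seed=0):
    """Estimate a bound M_k on |x_k| over regression models containing
    variable k: sample subsets and refit (cf. Steps 2-5), then take the
    upper tail of the confidence interval (cf. Step 6)."""
    rng = np.random.default_rng(seed)
    others = [j for j in range(A.shape[1]) if j != k]
    samples = []
    for _ in range(n_samples):
        # Random column subset that always contains variable k.
        S = [k] + list(rng.choice(others, size=subset_size - 1, replace=False))
        x, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        samples.append(abs(x[0]))  # fitted coefficient of variable k
    samples = np.array(samples)
    # One-sided upper confidence bound on the mean of |x_k|.
    return samples.mean() + z * samples.std(ddof=1) / np.sqrt(n_samples)
```

With 95% confidence the true valid value lies below the returned \({\hat{M}}_k\), which would then be used as \(M_k\) as described above.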
C New objective function and modified formulations for fat case \((m \ge n)\)
Before we derive the objective function, let us temporarily assume \(|J| = n-2\) so that any subset S of J automatically satisfies \(|S| = p \le n-2 = |J|\). We will relax this assumption later to consider \(|J| > n-2\). Suppose that we want to penalize large p in a way that the best model with \(n-2\) explanatory variables is as bad as a regression model with no explanatory variables. Hence, we want the objective function to give the same value for models with \(p=0\) and \(p=n-2\). With this in mind, we propose (20), which we call the adjusted \(\textit{MAE}\) .
Let us now assume that \(\textit{SAE}\) is near zero when \(p = n-2\), which happens often. Then we have \(MAE_a = \frac{SAE + \frac{n-2}{n-2}mae_0}{n-1-(n-2)} = SAE + mae_0 \approx mae_0\). Hence, instead of near-zero \(\textit{MAE}\), the new objective has almost the same value as \(mae_0\) when \(p=n-2\). Recall that \(u = MAE\) and u is the objective function in the previous thin case model. Hence, we need to modify the definitions and constraints. First we rewrite constraint (6b) as \(\sum _{i \in I} (t_i^+ + t_i^-) = (n-1)u - \sum _{j \in J} z_j \Big ( u + \frac{mae_0}{n-2}\Big ) \). Let \(v_j = (u + \frac{mae_0}{n-2})z_j\). Then, (6f) and (6g) are modified to
Finally, we remove the assumption (\(|J| = n-2\)) made at the beginning of this section by adding the cardinality constraint (30), \(\sum _{j \in J} z_j \le n-2\),
and obtain the following final formulations,
which is presented in (21). In fact, without (30), \(MAE_a\) cannot be well-defined since it becomes negative for \(p > n-1\) and the denominator becomes 0 for \(p=n-1\). Observe that (21) is an MIP with \(2n+4m+3\) variables (including m binary variables) and \(n+5m+2\) constraints. Observe also that (6) with the additional constraint (30) can be used for the fat case. However, using \(n-2\) explanatory variables out of m candidate explanatory variables can lead to an extremely small \(\textit{SAE}\) as we explained at the beginning of this section.
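The adjusted MAE described above can be sanity-checked directly. The formula below, \(MAE_a = (SAE + \frac{p}{n-2}\,mae_0)/(n-1-p)\), is our reconstruction from the two boundary cases in the text (it reduces to the ordinary MAE at \(p=0\) and to \(mae_0\) at \(p=n-2\) with zero \(\textit{SAE}\)); consult (20) in the paper for the authoritative form:

```python
def adjusted_mae(sae, p, n, mae0):
    # Penalize the subset size p so that a model with p = n - 2
    # variables scores no better than the empty model.
    assert p <= n - 2, "cardinality constraint (30): p <= n - 2"
    return (sae + p * mae0 / (n - 2)) / (n - 1 - p)

n, mae0 = 12, 3.0
# p = 0 recovers the ordinary MAE = SAE / (n - 1):
assert adjusted_mae(5.5, 0, n, mae0) == 5.5 / (n - 1)
# p = n - 2 with zero SAE is exactly as bad as the empty model:
assert adjusted_mae(0.0, n - 2, n, mae0) == mae0
```

Without constraint (30) the denominator \(n-1-p\) would vanish at \(p = n-1\), which is the degeneracy noted above.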
To obtain a valid value of M for \(v_j\)’s in (21), we can use a similar concept used in Sect. 2. In detail, we set
for \(v_j\)’s to consider regression models that are better than having no regression variables. Given a heuristic solution with objective function value \(mae_a^{heur}\), we can strengthen M by making solutions worse than the heuristic solution infeasible. Hence, we set \(M := mae^{heur}_a + \frac{mae_0}{n-2}\) for \(v_j\)’s in (29).
However, obtaining a valid value of M for the \(x_j\)'s in (21) is not trivial. Note that (12), which we used for the thin case, is not applicable here because LP (10) can easily be unbounded in the fat case. One valid procedure is to (i) generate all possible combinations of \(n-2\) explanatory variables out of the m candidates using all n observations, (ii) compute M for each combination using the procedure in Sect. 2.1.3, and (iii) pick the maximum value over all combinations. However, this procedure is combinatorial: its computational complexity is as high as that of solving (1) for all possible subsets. Hence, enumerating all possible subsets just to obtain a valid big M is intractable.
Instead, we can use a heuristic approach to obtain a good estimate of a valid value of M. In “Appendix B”, we propose a statistics-based procedure that yields a value of M that is valid with a given confidence level, here \(95\%\). However, for the instances considered in this paper, this procedure gives values of M that are too large because many columns can be strongly correlated with each other. Note that a large value of M can cause numerical errors when solving the MIPs.
Hence, for the computational experiments, we use a simple heuristic approach instead. Let us assume that we are given a feasible solution to (21) from a heuristic, where the \(x^{heur}_j\)'s are the coefficients of the regression model. Then, we set
Note that we cannot claim that (32) is valid, or even valid with \(95\%\) confidence. Hence, if we use (21) with this M, the overall procedure is a heuristic (even if (21) itself is solved to optimality).
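One simple instantiation of such a heuristic bound (the inflation factor below is a hypothetical safety margin of our choosing; (32) in the paper defines the actual rule):

```python
def heuristic_big_M(x_heur, factor=2.0):
    # Bound |x_j| using the largest coefficient magnitude of a known
    # feasible (heuristic) solution, inflated by a safety factor.
    return factor * max(abs(x) for x in x_heur)

M = heuristic_big_M([0.5, -3.0, 1.2])  # -> 6.0
```

As the text warns, any M obtained this way is not provably valid, so (21) solved under it remains a heuristic.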
Similar to \(MAE_a\) in (20), \(MSE_a\) can be defined as
where \(mse_0 = \frac{\sum _{i \in I}(b_i - {\bar{b}})^2}{n-1}\) is the mean squared error of an optimal regression model when \(p=0\). Next, similar to (28) and (29), we define
while (7) remains the same. Finally, we obtain
for the \(MSE_a\) objective. Note that (36) is a mixed integer quadratically constrained program with \(2n+4m+3\) variables and \(n+5m+2\) constraints.
For the core set algorithm, similar to (23), we have
Cite this article
Park, Y.W., Klabjan, D.: Subset selection for multiple linear regression via optimization. J. Glob. Optim. 77, 543–574 (2020). https://doi.org/10.1007/s10898-020-00876-1