Subset selection for multiple linear regression via optimization


Abstract

Subset selection in multiple linear regression aims to choose a subset of candidate explanatory variables that trades off fitting error (explanatory power) and model complexity (number of variables selected). We build mathematical programming models for regression subset selection based on mean square and absolute errors, and on minimal-redundancy–maximal-relevance criteria. The proposed models are tested using a linear-program-based branch-and-bound algorithm with tailored valid inequalities and big M values, and are compared against algorithms in the literature. For high-dimensional cases, an iterative heuristic algorithm is proposed based on the mathematical programming models and a core set concept, and a randomized version of the algorithm is derived to guarantee convergence to the global optimum. From the computational experiments, we find that our models quickly find a high-quality solution while the rest of the time is spent proving optimality; the iterative algorithms find solutions in a relatively short time and are competitive with state-of-the-art algorithms; and using ad hoc big M values is not recommended.



Acknowledgements

The authors thank the editors and reviewers for their constructive comments and suggestions, which strengthened the paper.

Author information


Corresponding author

Correspondence to Young Woong Park.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 970 KB)

Appendices

Appendix

A Proof of Lemmas and Propositions

Proof of Proposition 1

The proof is based on the fact that feasible solutions to (4) and (5) map to each other. Hence, we consider the following two cases.

  1. Case: (4) \(\Rightarrow \) (5)

    Let \(S = \{ j | z_j = 1 \}\) be the column index set of a solution to (4). We set \(v_j = u\) for \(j \in S\) and \(v_j = 0\) for \(j \notin S\). Then,

    $$\begin{aligned} \sum _{i \in I} |t_i|&=(n-1)u- \sum _{j \in J} u z_j&(\hbox {from }(4\hbox {b}))\\&= (n-1)u-\sum _{j \in S} u \\&= (n-1)u - \sum _{j \in S} v_j&(\hbox {by definition of }v_j)\\&= (n-1)u-\sum _{j \in J} v_j, \end{aligned}$$

    which satisfies (5b). Further, we verify the remaining constraints as follows.

    (a) Constraint (5e): We have \(v_j = u \le u\) for \(j \in S\) and \(v_j = 0 \le u\) for \(j \notin S\). Hence, \(v_j \le u\) for all \(j \in J\).

    (b) Constraint (5f): We have \(u - M(1-z_j) = u \le v_j = u \le M z_j = M\) for \(j \in S\) and \(u-M(1-z_j) = u-M \le v_j = 0 \le M z_j = 0\) for \(j \notin S\). Hence, we satisfy (5f).

    (c) Constraint (5g): We have \(v_j \in \{0,u\}\), and hence \(v_j \ge 0\), for all \(j \in J\).

    Note that (5c) is automatically satisfied since it is equal to (4c). Hence, we obtain a feasible solution to (5).

  2. Case: (5) \(\Rightarrow \) (4)

    Let \(S = \{ j | z_j = 1 \}\) be the column index set of a solution to (5). Since we are minimizing u, (5e) is equivalent to \(\max _{j} v_j = u\). Note that, in an optimal solution, we must have \(v_j = u\) for all \(j \in S\). Hence, starting from (5b), we derive

    $$\begin{aligned} \sum _{i \in I} |t_i|= & {} (n-1)u-\sum _{j \in J} v_j \qquad \qquad (\hbox {from }(5\hbox {b}))\\&= (n-1)u-\sum _{j \in S} v_j =(n-1)u-\sum _{j \in S} u \qquad \qquad (v_j = u\hbox { for all }j \in S)\\&= (n-1)u-\sum _{j \in S} u z_j = (n-1)u-\sum _{j \in J} u z_j, \end{aligned}$$

    which satisfies (4b). Hence, we obtain a feasible solution to (4).

This ends the proof. \(\square \)

Proof of Proposition 2

Let \({\bar{X}} = ({\bar{x}}, {\bar{y}}, {\bar{v}}, {\bar{u}}, {\bar{t}}, {\bar{z}})\) be an optimal solution to (6) and let \({\bar{p}} = \sum _{j \in J} {\bar{z}}_j\) be the number of selected regression variables. For a contradiction, assume that there exists an index k such that \({\bar{t}}_k^+ >0\) and \({\bar{t}}_k^- >0\). Without loss of generality, assume \({\bar{t}}_k^+ \ge {\bar{t}}_k^-\) and, for simplicity, let \(\delta = {\bar{t}}_k^-\). Let us construct \({\tilde{X}}\), equal to \({\bar{X}}\) except that \({\tilde{t}}_k^+ = {\bar{t}}_k^+ - \delta \), \({\tilde{t}}_k^- = {\bar{t}}_k^- - \delta = 0\), \({\tilde{u}} = {\bar{u}} - \frac{2\delta }{n-1-{\bar{p}}}\), and \({\tilde{v}}_j = {\tilde{u}}\) if \({\bar{z}}_j = 1\). We show that \({\tilde{X}}\) is a feasible solution to (6) with a strictly lower cost than \({\bar{X}}\).

  1. \({\tilde{X}}\) has a lower cost than \({\bar{X}}\) since \({\tilde{u}} < {\bar{u}}\) by definition.

  2. \({\tilde{X}}\) satisfies (6b) because \(\sum _{i \in I} ( {\tilde{t}}_i^+ + {\tilde{t}}_i^-)= \sum _{i \in I} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) - 2 \delta = (n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j - 2 \delta = (n-1 - {\bar{p}}){\bar{u}} - 2 \delta = (n-1 - {\bar{p}}) ({\bar{u}} - \frac{2 \delta }{n-1 - {\bar{p}}}) = (n-1 - {\bar{p}}) {\tilde{u}} = (n-1){\tilde{u}} - \sum _{j \in J} {\tilde{v}}_j \), in which the second equality holds because \({\bar{X}}\) satisfies (6b).

  3. Observe that (6c), (6d), and (6e) are automatically satisfied. Further, since we set \({\tilde{v}}_j = {\tilde{u}}\) for j such that \({\tilde{z}}_j = 1\), (6f) and (6g) are satisfied.

  4. Finally, (6h) holds automatically for all variables other than \({\tilde{t}}_k^+\), \({\tilde{t}}_k^-\), and \({\tilde{u}}\). Note that \({\tilde{t}}_k^+ = {\bar{t}}_k^+ - \delta = {\bar{t}}_k^+ - {\bar{t}}_k^- \ge 0\) and \({\tilde{t}}_k^- = 0\). Also, we have

    $$\begin{aligned} {\tilde{u}} = {\bar{u}} - \frac{2\delta }{n-1-{\bar{p}}}&= \frac{\sum _{i \in I} ( {\bar{t}}_i^+ + {\bar{t}}_i^-)}{n-1-{\bar{p}}} - \frac{2\delta }{n-1-{\bar{p}}}\\&= \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) + ({\bar{t}}_k^+ + {\bar{t}}_k^-) - 2 \delta }{n-1-{\bar{p}}}\\&\ge \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-) + 2 {\bar{t}}_k^- - 2 \delta }{n-1-{\bar{p}}} \qquad (\hbox {since }{\bar{t}}_k^+ \ge {\bar{t}}_k^-)\\&= \frac{\sum _{i \in I \setminus \{k\}} ( {\bar{t}}_i^+ + {\bar{t}}_i^-)}{n-1-{\bar{p}}} \qquad (\hbox {by the definition of }\delta )\\&\ge 0. \end{aligned}$$

    Hence, \({\tilde{X}}\) satisfies (6h).

Hence, \({\bar{X}}\) is not an optimal solution to (6), which is a contradiction. \(\square \)

Proof of Proposition 3

Let \({\bar{X}} = ({\bar{x}}, {\bar{y}}, {\bar{v}}, {\bar{u}}, {\bar{t}}, {\bar{z}})\) be an optimal solution to (8) with \({\bar{p}} = \sum _{j \in J} {\bar{z}}_j\). For a contradiction, assume that \({\bar{X}}\) does not satisfy (7) at equality, and let \(\delta = (n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j - \sum _{i \in I} ({\bar{t}}_i^+ - {\bar{t}}_i^-)^2 > 0\). Let us construct \({\tilde{X}}\), equal to \({\bar{X}}\) except that \({\tilde{u}} = {\bar{u}} - \frac{\delta }{n-1-{\bar{p}}}\) and \({\tilde{v}}_j = {\tilde{u}}\) if \({\bar{z}}_j = 1\); this choice closes the slack of (7) exactly. We first show that \({\tilde{u}} \ge 0\), since

$$\begin{aligned} {\tilde{u}}&= \frac{{\bar{u}}(n-1-{\bar{p}}) - \delta }{n-1-{\bar{p}}} \\&= \frac{{\bar{u}}(n-1)-{\bar{u}}{\bar{p}} - (n-1) {\bar{u}} + \sum _{j \in J} {\bar{v}}_j + \sum _{i \in I} ({\bar{t}}_i^+ - {\bar{t}}_i^-)^2}{n-1-{\bar{p}}} \\&= \frac{\sum _{j \in J} {\bar{v}}_j - {\bar{u}}{\bar{p}} + \sum _{i \in I} ({\bar{t}}_i^+ - {\bar{t}}_i^-)^2}{n-1-{\bar{p}}} \\&= \frac{\sum _{i \in I} ({\bar{t}}_i^+ - {\bar{t}}_i^-)^2}{n-1-{\bar{p}}} \ \ge \ 0, \end{aligned}$$

in which the second equality follows from the definition of \(\delta \) and the last equality holds because \({\bar{v}}_j = {\bar{u}}\) for every j with \({\bar{z}}_j = 1\), so that \(\sum _{j \in J} {\bar{v}}_j = {\bar{u}}{\bar{p}}\). For the remaining part, using a technique similar to that in the proof of Proposition 2, it can be seen that \({\tilde{X}}\) is a feasible solution to (8) with a strictly lower objective function value than \({\bar{X}}\). This is a contradiction. \(\square \)

Lemma 6

Let c be a vector that has 1 for \(t_i^+\)’s and \(t_i^-\)’s and 0 for all other variables of (10). Then, for every extreme ray r in the recession cone of (10), we must have \(c^{\top } r > 0\).

Proof

Suppose that there exists an extreme ray r in the recession cone of (10) with \(c^{\top } r \le 0\). Consider the linear program min \(\{ c^{\top } Y |(10\hbox {a}) - (10\hbox {e})\}\). We have two cases.

  1. Suppose that \(c^{\top } r < 0\). Note that, for any feasible solution \({\bar{Y}}\) and any \(\delta \ge 0\), \({\bar{Y}} + \delta r\) is feasible since r is an extreme ray of the recession cone. Then, \(c^{\top } ( {\bar{Y}} + \delta r) = c^{\top } {\bar{Y}} + \delta c^{\top } r\) goes to negative infinity as \(\delta \) increases, and thus the LP is unbounded from below. However, by the definition of the LP, the objective value is always non-negative. This is a contradiction.

  2. Suppose that \(c^{\top } r = 0\). This implies that the LP has an optimal objective value of 0, which contradicts Assumption 1 since \(c^{\top } Y=0\) implies \(\sum _{i=1}^n (t_i^+ + t_i^-) = 0\).

By the above two cases, we must have \(c^{\top } r > 0\). \(\square \)

Proof of Proposition 4

From Lemma 6, we know that there is no extreme ray with non-positive \(\sum _{i=1}^n (t_i^+ + t_i^-)\). For the proof of the proposition, assume that (11) is unbounded, so that there is an extreme ray r such that \({\bar{c}}^{\top } r < 0\), where \({\bar{c}}\) is the objective vector of (11). For such an extreme ray r, we must have \(c^{\top } r >0\) by Lemma 6, where c is the vector that has 1 for the \(t_i^+\)’s and \(t_i^-\)’s and 0 for all other variables of (10). For any feasible solution \({\bar{Y}}\) to (11) and any \(\delta \ge 0\), \({\bar{Y}} + \delta r\) is also feasible, and \(\delta \) must go to infinity for (11) to be an unbounded LP. However, \(c^{\top } r >0\) implies that \(\sum _{i \in I} (t_i^+ + t_i^-)\) increases as \(\delta \) increases, and hence \(\delta \) is bounded by (10a). This is a contradiction, so (11) cannot be unbounded. \(\square \)

Proof of Lemma 1

With \({\bar{z}}_j\) fixed, \({\bar{v}}_j\) and \({\bar{u}}\) are fixed by (6f). Note that, since \({\bar{Y}}\) has \(\textit{SAE}\) less than or equal to \(T_{max}\), we have \((n-1) {\bar{u}} - \sum _{j \in J} {\bar{v}}_j = \sum _{i \in I} (t_i^+ + t_i^-) \le T_{max}\), which satisfies (10a). Observe that the \(v_j\)’s and u can be ignored in (10). Observe also that (10c) and (10d) cover (6d) and (6e) regardless of \({\bar{z}}_j\). Finally, (6c) and (10b) are the same. Therefore, \({\tilde{Y}} = ({\bar{x}}^+, {\bar{x}}^-, {\bar{y}}^+,{\bar{y}}^-, {\bar{t}}^+, {\bar{t}}^-, {\hat{M}})\) is feasible for (10). \(\square \)

B Alternative approach for big M

In this section, we derive an approximate value of big M for the \(x_j\)'s in (21) and (36).

[Algorithm 4: statistical estimation of a valid big M value; see description below]

Instead of trying to obtain a valid value of M, we use a statistical approach to obtain an approximate value of M for \(x_j\). In Algorithm 4, we estimate a valid value of M for each k. In Steps 2–5, we obtain 30 i.i.d. sample values of M when explanatory variable k is included in the regression model. Then, in Step 6, we obtain the upper tail of the confidence interval. With 95% confidence, the true valid value of M is less than \({\hat{M}}_k\) obtained in Step 6. Hence, we set \(M_k := {\hat{M}}_k\) for \(x_k\) in (21) and (36) for the fat case (\(m > n\)). A sketch of this sampling idea is given below.
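For illustration only, the following sketch mimics the sampling idea just described, under assumptions of our own: each sample value of M is taken as the largest absolute least-squares coefficient of a model that includes variable k together with a random subset of the remaining columns, and the 95% upper limit is the usual one-sided t-interval over the 30 samples. The subset size and the use of ordinary least squares are illustrative choices, not prescriptions of Algorithm 4.

```python
import numpy as np
from scipy import stats

def estimate_big_M_for_variable(A, b, k, n_samples=30, subset_size=None, seed=0):
    """Statistically estimate a big-M value for the coefficient of variable k.

    Assumption (for illustration): each sample fits an ordinary least-squares
    model on variable k plus a random subset of the other columns and records
    the largest absolute coefficient; the returned value is the one-sided 95%
    upper confidence limit over those samples.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    if subset_size is None:
        subset_size = min(m - 1, max(1, n - 2))  # keep each sampled model thin

    others = [j for j in range(m) if j != k]
    samples = []
    for _ in range(n_samples):
        cols = [k] + list(rng.choice(others, size=subset_size - 1, replace=False))
        X = np.column_stack([np.ones(n), A[:, cols]])     # add intercept column
        coef, *_ = np.linalg.lstsq(X, b, rcond=None)      # least-squares fit
        samples.append(np.max(np.abs(coef[1:])))          # ignore the intercept

    samples = np.asarray(samples)
    mean = samples.mean()
    se = samples.std(ddof=1) / np.sqrt(n_samples)
    t_crit = stats.t.ppf(0.95, df=n_samples - 1)          # one-sided 95% level
    return mean + t_crit * se                             # upper tail of the CI
```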

C New objective function and modified formulations for fat case \((m \ge n)\)

Before we derive the objective function, let us temporarily assume \(|J| = n-2\) so that any subset S of J automatically satisfies \(|S| = p \le n-2 = |J|\). We will relax this assumption later to consider \(|J| > n-2\). Suppose that we want to penalize large p in a way that the best model with \(n-2\) explanatory variables is as bad as a regression model with no explanatory variables. Hence, we want the objective function to give the same value for models with \(p=0\) and \(p=n-2\). With this in mind, we propose (20), which we call the adjusted \(\textit{MAE}\).

Let us now assume that \(\textit{SAE}\) is near zero when \(p = n-2\), which happens often. Then we have \(MAE_a = \frac{SAE + \frac{n-2}{n-2}mae_0}{n-1-(n-2)} = SAE + mae_0 \approx mae_0\). Hence, instead of near-zero \(\textit{MAE}\), the new objective has almost the same value as \(mae_0\) when \(p=n-2\). Recall that \(u = MAE\) and u is the objective function in the previous thin case model. Hence, we need to modify the definitions and constraints. First we rewrite constraint (6b) as \(\sum _{i \in I} (t_i^+ + t_i^-) = (n-1)u - \sum _{j \in J} z_j \Big ( u + \frac{mae_0}{n-2}\Big ) \). Let \(v_j = (u + \frac{mae_0}{n-2})z_j\). Then, (6f) and (6g) are modified to

$$\begin{aligned}&v_j \le u + \frac{mae_0}{n-2} \end{aligned}$$
(28)
$$\begin{aligned}&u + \frac{mae_0}{n-2} - M(1-z_j) \le v_j \le M z_j . \end{aligned}$$
(29)
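As a worked check (our own, combining the rewritten constraint above with the form of (33) and the \(p=n-2\) case discussed earlier): for a solution that selects p variables we have \(\sum _{j \in J} z_j = p\) and \(\sum _{i \in I} (t_i^+ + t_i^-) = \textit{SAE}\), so the rewritten (6b) yields

$$\begin{aligned} \textit{SAE} = (n-1)u - p\Big ( u + \frac{mae_0}{n-2}\Big ) \quad \Longrightarrow \quad u = \frac{\textit{SAE} + \frac{p}{n-2}\,mae_0}{n-1-p} = MAE_a, \end{aligned}$$

so minimizing u is exactly minimizing the adjusted \(\textit{MAE}\).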

Finally, we remove the assumption we made (\(|J| = n-2\)) at the beginning of this section by adding cardinality constraint

$$\begin{aligned} \textstyle \sum _{j \in J} z_j \le n-2 \end{aligned}$$
(30)

and obtain the following final formulations,

$$\begin{aligned} \min \{u | (6\hbox {b}) - (6\hbox {e}), (6\hbox {h}), (28),(29), (30) \}, \end{aligned}$$

which is presented in (21). In fact, without (30), \(MAE_a\) is not well defined, since it becomes negative for \(p > n-1\) and the denominator becomes 0 for \(p=n-1\). Observe that (21) is an MIP with \(2n+4m+3\) variables (including m binary variables) and \(n+5m+2\) constraints. Observe also that (6) with the additional constraint (30) could be used for the fat case. However, using \(n-2\) explanatory variables out of m candidate explanatory variables can lead to an extremely small \(\textit{SAE}\), as we explained at the beginning of this section.

To obtain a valid value of M for the \(v_j\)'s in (21), we can use a concept similar to the one used in Sect. 2. In detail, we set

$$\begin{aligned} M:= mae_0 + \frac{mae_0}{n-2} = \frac{n-1}{n-2} mae_0 \end{aligned}$$
(31)

for the \(v_j\)'s, so that we only consider regression models that are better than the model with no regression variables. Given a heuristic solution with objective function value \(mae_a^{heur}\), we can strengthen M by making solutions worse than the heuristic solution infeasible. Hence, we set \(M := mae^{heur}_a + \frac{mae_0}{n-2}\) for the \(v_j\)'s in (29).
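To spell out the validity argument behind (31) (our reading of the reasoning above): any regression model that improves on the model with no explanatory variables has objective value \(u = MAE_a \le mae_0\), and therefore

$$\begin{aligned} v_j = \Big ( u + \frac{mae_0}{n-2}\Big ) z_j \ \le \ u + \frac{mae_0}{n-2} \ \le \ mae_0 + \frac{mae_0}{n-2} = \frac{n-1}{n-2}\, mae_0 . \end{aligned}$$

The strengthened value \(M := mae_a^{heur} + \frac{mae_0}{n-2}\) is obtained by replacing the bound \(mae_0\) on u with the heuristic objective value \(mae_a^{heur}\).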

However, obtaining a valid value of M for the \(x_j\)'s in (21) is not trivial. Note that (12), which we used for the thin case, is not applicable here because LP (10) can easily be unbounded in the fat case. One valid procedure is to (i) generate all possible combinations of \(n-2\) explanatory variables and all n observations, (ii) compute M for each combination using the procedure in Sect. 2.1.3, and (iii) pick the maximum value over all combinations. However, this is a combinatorial procedure: its computational effort is comparable to that of solving (1) for all possible subsets. Hence, enumerating all possible subsets just to obtain a valid big M is not tractable.

Instead, we can use a heuristic approach to obtain a good estimate of a valid value of M. In “Appendix B”, we propose a statistics-based procedure that yields a value of M that is valid with a certain confidence level; in particular, it can give an M value that is valid with \(95\%\) confidence. However, for the instances considered in this paper, this procedure gives values of M that are too large because many columns can be strongly correlated with each other. Note that a large value of M can cause numerical errors when solving the MIPs.

Hence, for the computational experiments, we use a simple heuristic approach instead. Let us assume that we are given a feasible solution to (21) from a heuristic and that the \(x^{heur}_j\)'s are the coefficients of the corresponding regression model. Then, we set

$$\begin{aligned} M := \max _{j \in J} |x^{heur}_j |. \end{aligned}$$
(32)

Note that we cannot claim that (32) is valid, or even valid with \(95\%\) confidence. Hence, if we use (21) with this M, we obtain a heuristic (even if (21) is solved to optimality). One illustrative way to obtain such a heuristic fit is sketched below.
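As a purely illustrative sketch of how such a heuristic M could be computed: take the coefficients of any heuristic fit, here a lasso fit (our choice for illustration; the paper does not prescribe lasso for this step), and apply (32).

```python
import numpy as np
from sklearn.linear_model import Lasso

def heuristic_big_M(A, b, alpha=0.1):
    """Heuristic big M via (32): the largest absolute coefficient of a
    heuristic regression fit (here a lasso fit, an illustrative choice)."""
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000).fit(A, b)
    x_heur = model.coef_                  # heuristic coefficients x_j^heur
    return float(np.max(np.abs(x_heur)))  # M := max_j |x_j^heur|
```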

Similar to \(MAE_a\) in (20), \(MSE_a\) can be defined as

$$\begin{aligned} MSE_a = \frac{SSE + \frac{p}{n-2}mse_0}{n-1-p}, \end{aligned}$$
(33)

where \(mse_0 = \frac{\sum _{i \in I}(b_i - {\bar{b}})^2}{n-1}\) is the mean squared error of an optimal regression model when \(p=0\). Next, similar to (28) and (29), we define

$$\begin{aligned}&v_j \le u + \frac{mse_0}{n-2}, \end{aligned}$$
(34)
$$\begin{aligned}&u + \frac{mse_0}{n-2} - M(1-z_j) \le v_j \le M z_j, \end{aligned}$$
(35)

while (7) remains the same. Finally, we obtain

$$\begin{aligned} \min \{u | (7),(6\hbox {c}) - (6\hbox {e}), (6\hbox {h}), (34),(35),(30) \} \end{aligned}$$
(36)

for the \(MSE_a\) objective. Note that (36) is a mixed-integer quadratically constrained program with \(2n+4m+3\) variables and \(n+5m+2\) constraints.
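For reference, the adjusted criteria themselves are straightforward to evaluate for a candidate subset. The minimal sketch below simply applies (20) and (33); the fitting errors \(\textit{SAE}\)/\(\textit{SSE}\) and the null-model errors \(mae_0\)/\(mse_0\) are taken as inputs (how they are computed is described in the text), and the expression for \(MAE_a\) is the one inferred above from the \(p=n-2\) case and (33).

```python
def adjusted_mae(sae, p, n, mae_0):
    """Adjusted MAE (20): penalizes the number of selected variables p so that
    a model with p = n-2 variables scores like the model with no variables."""
    assert p <= n - 2, "cardinality constraint (30): p <= n-2"
    return (sae + p * mae_0 / (n - 2)) / (n - 1 - p)

def adjusted_mse(sse, p, n, mse_0):
    """Adjusted MSE (33), defined analogously to the adjusted MAE."""
    assert p <= n - 2, "cardinality constraint (30): p <= n-2"
    return (sse + p * mse_0 / (n - 2)) / (n - 1 - p)
```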

For the core set algorithm, similar to (23), we have

$$\begin{aligned} \min \{u | (7),(6\hbox {c}) - (6\hbox {e}), (6\hbox {h}), (34),(35) \}. \end{aligned}$$
(37)


About this article


Cite this article

Park, Y.W., Klabjan, D. Subset selection for multiple linear regression via optimization. J Glob Optim 77, 543–574 (2020). https://doi.org/10.1007/s10898-020-00876-1

