
Robust Grouped Variable Selection Using Distributionally Robust Optimization

Journal of Optimization Theory and Applications (2022)

Abstract

We propose a distributionally robust optimization formulation with a Wasserstein-based uncertainty set for selecting grouped variables under perturbations on the data for both linear regression and classification problems. The resulting model offers robustness explanations for grouped least absolute shrinkage and selection operator algorithms and highlights the connection between robustness and regularization. We prove probabilistic bounds on the out-of-sample loss and the estimation bias, and establish the grouping effect of our estimator, showing that coefficients in the same group converge to the same value as the sample correlation between covariates approaches 1. Based on this result, we propose to use the spectral clustering algorithm with the Gaussian similarity function to perform grouping on the predictors, which makes our approach applicable without knowing the grouping structure a priori. We compare our approach to an array of alternatives and provide extensive numerical results on both synthetic data and a large real dataset of surgery-related medical records, showing that our formulation produces an interpretable and parsimonious model that encourages sparsity at a group level and achieves better prediction and estimation performance in the presence of outliers.

References

  1. Bakin, S.: Adaptive regression and model selection in data mining problems, Doctoral dissertation, Australian National University (1999). https://openresearch-repository.anu.edu.au/handle/1885/9449

  2. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002)


  3. Bertsimas, D., Copenhaver, M.S.: Characterization of the equivalence of robustification and regularization in linear and matrix regression. Eur. J. Oper. Res. (2017)

  4. Bertsimas, D., Gupta, V., Paschalidis, I.C.: Data-driven estimation in equilibrium using inverse optimization. Math. Program. 153(2), 595–633 (2015)


  5. Blanchet, J., Kang, Y.: Distributionally robust groupwise regularization estimator. arXiv preprint arXiv:1705.04241 (2017)

  6. Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.-H.: Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Inference 143(11), 1835–1858 (2013)


  7. Bunea, F., Lederer, J., She, Y.: The group square-root LASSO: theoretical properties and fast algorithms. IEEE Trans. Inf. Theory 60(2), 1313–1325 (2014)


  8. Chen, R., Paschalidis, I.C.: A robust learning approach for regression models based on distributionally robust optimization. J. Mach. Learn. Res. 19(1), 517–564 (2018)


  9. Chen, R., Paschalidis, I.C.: Distributionally robust learning. Found. Trends Optim. 4(1–2), 1–243 (2020)

  10. Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3), 595–612 (2010)


  11. Duchi, J.C., Namkoong, H.: Learning models with uniform performance via distributionally robust optimization. Ann. Stat. 49(3), 1378–1406 (2021)


  12. Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Available at Optimization Online (2015)

  13. Gao, R., Chen, X., Kleywegt, A.J.: Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050 (2017)

  14. Gao, R., Kleywegt, A.J.: Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199 (2016)

  15. Goh, J., Sim, M.: Distributionally robust optimization and its tractable approximations. Oper. Res. 58(4, Part 1), 902–917 (2010)


  16. Hastie, T., Tibshirani, R., Tibshirani, R.J.: Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692 (2017)

  17. Huang, J., Zhang, T.: The benefit of group sparsity. Ann. Stat. 38(4), 1978–2004 (2010)


  18. Jacob, L., Obozinski, G., Vert, J.-P.: Group LASSO with overlap and graph LASSO. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 433–440. ACM (2009)

  19. Jenatton, R., Audibert, J.-Y., Bach, F.: Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res. 12(Oct), 2777–2824 (2011)


  20. Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A.B.: Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 39(4), 2164–2204 (2011)


  21. Meier, L., Van De Geer, S., Bühlmann, P.: The group LASSO for logistic regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70(1), 53–71 (2008)


  22. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)

  23. Obozinski, G., Jacob, L., Vert, J.-P.: Group LASSO with overlaps: the latent group LASSO approach. arXiv preprint arXiv:1110.0413 (2011)

  24. Roth, V., Fischer, B.: The group-LASSO for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th International Conference on Machine Learning, pp. 848–855. ACM (2008)

  25. Shafieezadeh-Abadeh, S., Esfahani, P.M., Kuhn, D.: Distributionally robust logistic regression. In: Advances in Neural Information Processing Systems, pp. 1576–1584 (2015)

  26. Shafieezadeh-Abadeh, S., Kuhn, D., Esfahani, P.M.: Regularization via mass transportation. arXiv preprint arXiv:1710.10016 (2017)

  27. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)


  28. Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: A sparse-group LASSO. J. Comput. Graph. Stat. 22(2), 231–245 (2013)


  29. Tibshirani, R.: Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)

  30. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)


  31. Xu, H., Caramanis, C., Mannor, S.: Robust regression and LASSO. In: Advances in Neural Information Processing Systems, pp. 1801–1808 (2009)

  32. Yang, W., Xu, H.: A unified robust regression model for LASSO-like algorithms. In: International Conference on Machine Learning, pp. 585–593 (2013)

  33. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)


  34. Zhao, P., Rocha, G., Yu, B.: The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat. 37(6A), 3468–3497 (2009)

  35. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)


  36. Zymler, S., Kuhn, D., Rustem, B.: Distributionally robust joint chance constraints with second-order moment information. Math. Program. 137(1–2), 167–198 (2013)



Acknowledgements

We thank George Kasotakis, MD, MPH, for providing access to the surgery dataset. We also thank Taiyao Wang for help in processing this dataset. The research was partially supported by the NSF under Grants DMS-1664644, CNS-1645681, and IIS-1914792, by the ONR under Grants N00014-19-1-2571 and N00014-21-1-2844, by the NIH under Grants R01 GM135930 and UL54 TR004130, by the DOE under grants DE-AR-0001282 and NETL-EE0009696, and by the Boston University Kilachand Fund for Integrated Life Science and Engineering.

Author information

Corresponding author

Correspondence to Ioannis Ch. Paschalidis.


Appendix A: Omitted Theoretical Results and Proofs

This section contains the theoretical statements and proofs that are omitted in Sects. 2 and 3.

Proof of Theorem 2.1

From the definition of the Wasserstein distance, \(W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}})\) is the optimal value of the following optimization problem:

$$\begin{aligned}&\min \limits _{\varPi \in \mathcal {P}(\mathcal {Z}\times \mathcal {Z})} \quad \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ \varPi \bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ) \nonumber \\&\quad \text {s.t.} \quad \int _{\mathcal {Z}}\varPi \bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_2 = \mathbb {P}_{\text {out}}({\mathbf {z}}_1), \forall {\mathbf {z}}_1 \in \mathcal {Z}, \\&\quad \quad \int _{\mathcal {Z}}\varPi \bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_1 = q \mathbb {P}_{\text {out}}({\mathbf {z}}_2) + (1-q)\mathbb {P}({\mathbf {z}}_2), \forall {\mathbf {z}}_2 \in \mathcal {Z}.\nonumber \end{aligned}$$
(16)

Similarly, \(W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}})\) is the optimal value of the following optimization problem:

$$\begin{aligned} \begin{aligned} \min \limits _{\varPi \in \mathcal {P}(\mathcal {Z}\times \mathcal {Z})}&\quad \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ \varPi \bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ) \\ \text {s.t.}&\quad \int _{\mathcal {Z}}\varPi \bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_2 = \mathbb {P}({\mathbf {z}}_1), \forall {\mathbf {z}}_1 \in \mathcal {Z}, \\&\quad \int _{\mathcal {Z}}\varPi \bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_1 = q \mathbb {P}_{\text {out}}({\mathbf {z}}_2) + (1-q)\mathbb {P}({\mathbf {z}}_2), \forall {\mathbf {z}}_2 \in \mathcal {Z}. \end{aligned} \end{aligned}$$
(17)

We propose a decomposition strategy. For Problem (16), decompose the joint distribution \(\varPi \) as \(\varPi = (1-q)S + q T\), where S and T are two joint distributions of \({\mathbf {z}}_1\) and \({\mathbf {z}}_2\). The first set of constraints in Problem (16) can be equivalently expressed as:

$$\begin{aligned} (1-q)\int _{\mathcal {Z}}S\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_2 + q \int _{\mathcal {Z}}T\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_2 = (1-q)\mathbb {P}_{\text {out}}({\mathbf {z}}_1) + q \mathbb {P}_{\text {out}}({\mathbf {z}}_1), \forall {\mathbf {z}}_1 \in \mathcal {Z}, \end{aligned}$$

which is satisfied if

$$\begin{aligned} \int _{\mathcal {Z}}S\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_2 = \mathbb {P}_{\text {out}}({\mathbf {z}}_1), \quad \int _{\mathcal {Z}}T\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_2 = \mathbb {P}_{\text {out}}({\mathbf {z}}_1), \forall {\mathbf {z}}_1 \in \mathcal {Z}. \end{aligned}$$

The second set of constraints can be expressed as:

$$\begin{aligned} (1-q)\int _{\mathcal {Z}}S\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_1 + q \int _{\mathcal {Z}}T\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_1 = q \mathbb {P}_{\text {out}}({\mathbf {z}}_2) + (1-q)\mathbb {P}({\mathbf {z}}_2), \forall {\mathbf {z}}_2 \in \mathcal {Z}, \end{aligned}$$

which is satisfied if

$$\begin{aligned} \int _{\mathcal {Z}}S\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_1 = \mathbb {P}({\mathbf {z}}_2), \quad \int _{\mathcal {Z}}T\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr ) d{\mathbf {z}}_1 = \mathbb {P}_{\text {out}}({\mathbf {z}}_2), \forall {\mathbf {z}}_2 \in \mathcal {Z}. \end{aligned}$$

The objective function can be decomposed as:

$$\begin{aligned} \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ \varPi \bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr )= & {} \ (1-q)\int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ S\bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ) \nonumber \\&+ q \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2)T \bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ). \end{aligned}$$

Therefore, Problem (16) can be decomposed into the following two subproblems.

$$\begin{aligned}&\text {Subproblem 1:} \qquad \begin{array}{rl} \min \limits _{S \in \mathcal {P}(\mathcal {Z}\times \mathcal {Z})} &{} \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ S\bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ) \\ \text {s.t.} &{} \int _{\mathcal {Z}}S\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_2 = \mathbb {P}_{\text {out}}({\mathbf {z}}_1), \forall {\mathbf {z}}_1 \in \mathcal {Z},\\ &{} \int _{\mathcal {Z}}S\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_1 = \mathbb {P}({\mathbf {z}}_2), \forall {\mathbf {z}}_2 \in \mathcal {Z}. \end{array}\\&\text {Subproblem 2:} \qquad \begin{array}{rl} \min \limits _{T \in \mathcal {P}(\mathcal {Z}\times \mathcal {Z})} &{} \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ T\bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ) \\ \text {s.t.} &{} \int _{\mathcal {Z}}T\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_2 = \mathbb {P}_{\text {out}}({\mathbf {z}}_1), \forall {\mathbf {z}}_1 \in \mathcal {Z},\\ &{} \int _{\mathcal {Z}}T\bigl ({\mathbf {z}}_1, {\mathbf {z}}_2\bigr )d{\mathbf {z}}_1 = \mathbb {P}_{\text {out}}({\mathbf {z}}_2), \forall {\mathbf {z}}_2 \in \mathcal {Z}. \end{array} \end{aligned}$$

Let \(S^*\) and \(T^*\) denote the optimal solutions to the two subproblems, respectively; then \(\varPi _0 = (1-q)S^* + q T^*\) is a feasible solution to Problem (16). Therefore,

$$\begin{aligned} \begin{aligned} W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}})&\le \int _{\mathcal {Z}\times \mathcal {Z}} s({\mathbf {z}}_1, {\mathbf {z}}_2) \ \varPi _0 \bigl (d{\mathbf {z}}_1, d{\mathbf {z}}_2\bigr ) \\&= (1-q)W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}) + q W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {out}})\\&= (1-q)W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}). \end{aligned} \end{aligned}$$
(18)

Similarly,

$$\begin{aligned} W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}}) \le q W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}). \end{aligned}$$
(19)

(18) and (19) imply that

$$\begin{aligned} W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}}) + W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}}) \le W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}). \end{aligned}$$
(20)

On the other hand, by the triangle inequality for the Wasserstein metric, we have

$$\begin{aligned} W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}}) + W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}}) \ge W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}). \end{aligned}$$

We thus conclude that

$$\begin{aligned} W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}}) + W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}}) = W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}). \end{aligned}$$
(21)

For equality to hold in (21), both (18) and (19) must hold with equality, i.e.,

$$\begin{aligned} W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}}) = (1-q)W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}), \end{aligned}$$

and,

$$\begin{aligned} W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}}) = q W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}). \end{aligned}$$

To see this, notice that if either (18) or (19) is a strict inequality, then (20) becomes a strict inequality, which contradicts (21). Thus,

$$\begin{aligned} \frac{W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}})}{W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}})} = \frac{(1-q)W_1 (\mathbb {P}_{\text {out}}, \mathbb {P})}{q W_1 (\mathbb {P}_{\text {out}}, \mathbb {P})} = \frac{1-q}{q}. \end{aligned}$$

\(\square \)
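The identity established in this proof is easy to verify numerically in one dimension, where the 1-Wasserstein distance between discrete distributions is available directly. The following illustrative sketch (not part of the paper's code; it assumes SciPy is available, and the two Gaussian-shaped distributions are arbitrary choices) checks that \(W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}})/W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}}) = (1-q)/q\).

```python
import numpy as np
from scipy.stats import wasserstein_distance

q = 0.3                                    # mixture weight of the outlier distribution
support = np.linspace(-5.0, 5.0, 200)      # common support for the discretized distributions

p_nom = np.exp(-0.5 * support**2);               p_nom /= p_nom.sum()   # P: discretized N(0, 1)
p_out = np.exp(-0.5 * ((support - 3.0) / 2)**2); p_out /= p_out.sum()   # P_out: discretized N(3, 4)
p_mix = q * p_out + (1 - q) * p_nom                                     # P_mix = q P_out + (1 - q) P

w_out_mix = wasserstein_distance(support, support, p_out, p_mix)
w_nom_mix = wasserstein_distance(support, support, p_nom, p_mix)
w_out_nom = wasserstein_distance(support, support, p_out, p_nom)

print(f"W1(P_out, P_mix) = {w_out_mix:.4f}   (1-q) W1(P_out, P) = {(1 - q) * w_out_nom:.4f}")
print(f"W1(P, P_mix)     = {w_nom_mix:.4f}    q    W1(P_out, P) = {q * w_out_nom:.4f}")
print(f"ratio = {w_out_mix / w_nom_mix:.4f}   (1-q)/q = {(1 - q) / q:.4f}")
```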

Proof of Theorem 2.2

We will use Hölder’s inequality, which we state for convenience.

Hölder’s inequality: Suppose we have two scalars \(p, q \ge 1\) and \(1/p + 1/q =1\). For any two vectors \({\mathbf {a}}= (a_1, \ldots , a_n)\) and \({\mathbf {b}}= (b_1, \ldots , b_n)\),

$$\begin{aligned} \sum _{i=1}^n |a_i b_i| \le \Bigl (\sum _{i=1}^n |a_i|^p\Bigr )^{1/p} \Bigl (\sum _{i=1}^n |b_i|^q\Bigr )^{1/q}. \end{aligned}$$

The dual norm of the weighted norm \({\mathbf {x}} \mapsto \Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{r, s}\), evaluated at some vector \(\varvec{\beta }\), is the optimal value of Problem (22):

$$\begin{aligned} \begin{aligned} \max \limits _{{\mathbf {x}}}&\quad {\mathbf {x}}' \varvec{\beta }\\ \text {s.t.}&\quad \Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{r, s} \le 1. \end{aligned} \end{aligned}$$
(22)

We assume that \(\varvec{\beta }\) has the same group structure as \({\mathbf {x}}\), i.e., \(\varvec{\beta }= (\varvec{\beta }^1, \ldots , \varvec{\beta }^L)\). Using Hölder’s inequality, we can write

$$\begin{aligned} {\mathbf {x}}'\varvec{\beta }= \sum _{l=1}^L (w_l{\mathbf {x}}^l)'\Bigl (\frac{1}{w_l}\varvec{\beta }^l\Bigr ) \le \sum _{l=1}^L \Vert w_l{\mathbf {x}}^l\Vert _r \left\| \frac{1}{w_l}\varvec{\beta }^l\right\| _q. \end{aligned}$$

Define two new vectors in \(\mathbb {R}^L\)

$$\begin{aligned} {\mathbf {x}}_{new} = (\Vert w_1{\mathbf {x}}^1\Vert _r, \ldots , \Vert w_L{\mathbf {x}}^L\Vert _r), \quad \varvec{\beta }_{new} = \left( \left\| \frac{1}{w_1}\varvec{\beta }^1\right\| _q, \ldots , \left\| \frac{1}{w_L}\varvec{\beta }^L\right\| _q\right) . \end{aligned}$$

Applying Hölder’s inequality again to \({\mathbf {x}}_{new}\) and \(\varvec{\beta }_{new}\), we obtain:

$$\begin{aligned} \begin{aligned} {\mathbf {x}}'\varvec{\beta }&\le {\mathbf {x}}_{new}'\varvec{\beta }_{new} \\&\le \Vert {\mathbf {x}}_{new}\Vert _s \Vert \varvec{\beta }_{new}\Vert _t \\&= \Bigl (\sum _{l=1}^L \bigl (\Vert w_l{\mathbf {x}}^l\Vert _r\bigr )^s\Bigr )^{1/s} \left( \sum _{l=1}^L \left( \left\| \frac{1}{w_l}\varvec{\beta }^l\right\| _q\right) ^t\right) ^{1/t}. \end{aligned} \end{aligned}$$

Therefore,

$$\begin{aligned} {\mathbf {x}}'\varvec{\beta }\le \Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{r, s} \Vert \varvec{\beta }_{{\mathbf {w}}^{-1}}\Vert _{q, t} \le \Vert \varvec{\beta }_{{\mathbf {w}}^{-1}}\Vert _{q, t}, \end{aligned}$$

due to the constraint \(\Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{r, s} \le 1\). The result then follows. \(\square \)
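As a concrete illustration of Theorem 2.2 for the case used by GWGL (the weighted \((2, \infty )\)-norm, whose dual is the weighted (2, 1)-norm with inverted weights), the following sketch (our own illustration, with arbitrary group sizes and NumPy assumed) verifies numerically that no feasible \({\mathbf {x}}\) exceeds the closed-form dual norm and that the Hölder-tight choice attains it.

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [np.arange(0, 3), np.arange(3, 5), np.arange(5, 9)]   # three groups of sizes 3, 2, 4
w = np.array([1.0 / np.sqrt(len(g)) for g in groups])          # weights 1/sqrt(p_l), as in GWGL
beta = rng.normal(size=9)

def weighted_2inf(x):    # ||x_w||_{2, infinity} = max_l ||w_l x^l||_2
    return max(w[l] * np.linalg.norm(x[g]) for l, g in enumerate(groups))

def weighted_21_inv(b):  # ||b_{w^{-1}}||_{2, 1} = sum_l ||b^l||_2 / w_l
    return sum(np.linalg.norm(b[g]) / w[l] for l, g in enumerate(groups))

# Random feasible points, scaled onto the unit sphere of the weighted (2, inf) norm,
# never exceed the closed-form dual norm ...
samples = rng.normal(size=(20000, 9))
samples /= np.array([weighted_2inf(x) for x in samples])[:, None]
print("max sampled  x'beta :", (samples @ beta).max())

# ... and the Hoelder-tight choice x^l = beta^l / (w_l ||beta^l||_2) attains it.
x_star = np.concatenate([beta[g] / (w[l] * np.linalg.norm(beta[g])) for l, g in enumerate(groups)])
print("tight choice x'beta :", float(x_star @ beta))
print("closed-form dual    :", weighted_21_inv(beta))
```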

Proof of Theorem 2.3

To derive a tractable reformulation of the DRO-LG problem (9), we borrow an idea from [8] and [14]: for any \(\mathbb {Q} \in \varOmega \),

$$\begin{aligned} \begin{aligned}&\Bigl |\mathbb {E}^{\mathbb {Q}}\big [ l_{\varvec{\beta }}({\mathbf {x}}, y)\big ] - \mathbb {E}^{\hat{\mathbb {P}}_N}\big [ l_{\varvec{\beta }}({\mathbf {x}}, y)\big ]\Bigr | \\&\quad = \biggl |\int _{\mathcal {Z}} l_{\varvec{\beta }}({\mathbf {x}}_1, y_1) \mathbb {Q}(d({\mathbf {x}}_1, y_1)) - \int _{\mathcal {Z}} l_{\varvec{\beta }}({\mathbf {x}}_2, y_2) \hat{\mathbb {P}}_N(d({\mathbf {x}}_2, y_2)) \biggr | \\&\quad = \biggl |\int _{\mathcal {Z}} l_{\varvec{\beta }}({\mathbf {x}}_1, y_1) \int _{\mathcal {Z}}\varPi _0(d({\mathbf {x}}_1, y_1), d({\mathbf {x}}_2, y_2)) \\&\quad \quad - \int _{\mathcal {Z}} l_{\varvec{\beta }}({\mathbf {x}}_2, y_2) \int _{\mathcal {Z}}\varPi _0(d({\mathbf {x}}_1, y_1), d({\mathbf {x}}_2, y_2)) \biggr | \\&\quad \le \int _{\mathcal {Z}\times \mathcal {Z}} \bigl |l_{\varvec{\beta }}({\mathbf {x}}_1, y_1)-l_{\varvec{\beta }}({\mathbf {x}}_2, y_2)\bigr | \varPi _0(d({\mathbf {x}}_1, y_1), d({\mathbf {x}}_2, y_2)), \end{aligned} \end{aligned}$$
(23)

where \(\varPi _0\) is the optimal solution in the definition of the Wasserstein metric, i.e., it is the joint distribution of \(({\mathbf {x}}_1, y_1)\) and \(({\mathbf {x}}_2, y_2)\) with marginals \(\mathbb {Q}\) and \(\hat{\mathbb {P}}_N\) that achieves the minimum mass transportation cost. Comparing (23) with the definition of the Wasserstein distance, we wish to bound the following growth rate of \(l_{\varvec{\beta }}({\mathbf {x}}, y)\):

$$\begin{aligned} \frac{\bigl |l_{\varvec{\beta }}({\mathbf {x}}_1, y_1) - l_{\varvec{\beta }}({\mathbf {x}}_2, y_2)\bigr |}{s(({\mathbf {x}}_1, y_1), ({\mathbf {x}}_2, y_2))}, \ \forall ({\mathbf {x}}_1, y_1), ({\mathbf {x}}_2, y_2), \end{aligned}$$

in order to relate \(\big |\mathbb {E}^{\mathbb {Q}}[ l_{\varvec{\beta }}({\mathbf {x}}, y)] - \mathbb {E}^{\hat{\mathbb {P}}_N}[ l_{\varvec{\beta }}({\mathbf {x}}, y)]\big |\) with \(W_1 (\mathbb {Q}, \ \hat{\mathbb {P}}_N)\). To this end, we define a continuous and differentiable univariate function \(h(a) \triangleq \log (1+\exp (-a))\), and apply the mean value theorem to it, which yields that for any \(a, b\in \mathbb {R}\), \(\exists c \in (a,b)\) such that:

$$\begin{aligned} \biggl |\frac{h(b) - h(a)}{b-a}\biggr | = \bigl |\bigtriangledown h(c)\bigr | = \frac{e^{-c}}{1+e^{-c}} \le 1. \end{aligned}$$

By noting that \(l_{\varvec{\beta }}({\mathbf {x}}, y) = h(y\varvec{\beta }'{\mathbf {x}})\), we immediately have:

$$\begin{aligned} \begin{aligned} \bigl |l_{\varvec{\beta }}({\mathbf {x}}_1, y_1) - l_{\varvec{\beta }}({\mathbf {x}}_2, y_2)\bigr |&\le \bigl |y_1\varvec{\beta }'{\mathbf {x}}_1 - y_2\varvec{\beta }'{\mathbf {x}}_2\bigr | \\&\le \Vert y_1{\mathbf {x}}_1 - y_2 {\mathbf {x}}_2\Vert \Vert \varvec{\beta }\Vert _* \\&\le s(({\mathbf {x}}_1, y_1), ({\mathbf {x}}_2, y_2)) \Vert \varvec{\beta }\Vert _*, \ \forall ({\mathbf {x}}_1, y_1), ({\mathbf {x}}_2, y_2), \end{aligned} \end{aligned}$$
(24)

where the second step uses the generalized Cauchy–Schwarz (dual norm) inequality, and the last step is due to the definition of the metric s in (8). Combining (24) with (23), it follows that for any \(\mathbb {Q} \in \varOmega \),

$$\begin{aligned} \begin{aligned} \Bigl |\mathbb {E}^{\mathbb {Q}}\big [ l_{\varvec{\beta }}({\mathbf {x}}, y)\big ] - \mathbb {E}^{\hat{\mathbb {P}}_N}\big [ l_{\varvec{\beta }}({\mathbf {x}}, y)\big ]\Bigr |&\le \Vert \varvec{\beta }\Vert _*\int _{\mathcal {Z}\times \mathcal {Z}} s(({\mathbf {x}}_1, y_1), ({\mathbf {x}}_2, y_2))\\ {}&\quad \times \varPi _0(d({\mathbf {x}}_1, y_1), d({\mathbf {x}}_2, y_2)) \\&= \Vert \varvec{\beta }\Vert _* W_1 (\mathbb {Q}, \ \hat{\mathbb {P}}_N) \le \epsilon \Vert \varvec{\beta }\Vert _*. \end{aligned} \end{aligned}$$

Therefore, the DRO-LG problem can be reformulated as:

$$\begin{aligned} \inf \limits _{\varvec{\beta }} \mathbb {E}^{\hat{\mathbb {P}}_N}\big [ l_{\varvec{\beta }}({\mathbf {x}}, y)\big ] + \epsilon \Vert \varvec{\beta }\Vert _* = \inf \limits _{\varvec{\beta }} \frac{1}{N} \sum _{i=1}^N \log \bigl (1+\exp (-y_i \varvec{\beta }'{\mathbf {x}}_i)\bigr ) + \epsilon \Vert \varvec{\beta }\Vert _*. \end{aligned}$$

\(\square \)
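The reformulation above says that the DRO-LG problem collapses to the empirical logloss plus \(\epsilon \) times the dual norm of \(\varvec{\beta }\). The sketch below is an illustration of this reformulation, not the paper's code: it assumes cvxpy is installed, uses synthetic data, and takes the dual norm to be the weighted (2, 1) group norm used by GWGL-LG.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
N, p = 200, 10
groups = [np.arange(0, 4), np.arange(4, 6), np.arange(6, 10)]   # assumed group structure
X = rng.normal(size=(N, p))
beta_true = np.concatenate([np.ones(4), np.zeros(2), -np.ones(4)])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = np.where(rng.uniform(size=N) < prob, 1.0, -1.0)             # labels in {-1, +1}

eps = 0.05                                                      # Wasserstein radius = regularization strength
beta = cp.Variable(p)
logloss = cp.sum(cp.logistic(-cp.multiply(y, X @ beta))) / N    # (1/N) sum log(1 + exp(-y_i beta' x_i))
penalty = sum(np.sqrt(len(g)) * cp.norm(beta[g], 2) for g in groups)   # weighted (2, 1) dual norm
cp.Problem(cp.Minimize(logloss + eps * penalty)).solve()

print("estimated group norms:", [round(float(np.linalg.norm(beta.value[g])), 3) for g in groups])
```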

A.1 Prediction and Estimation Performance of the GWGL-LR Estimator

We are interested in two types of performance criteria: (1) prediction quality, or out-of-sample performance, which measures the predictive power of the GWGL solutions on new, unseen samples, and (2) estimation quality, which measures the discrepancy between the GWGL solutions and the underlying unknown true coefficients.

We note that GWGL-LR is a special case of the Wasserstein DRO formulation derived in [8, Eq. 10], and thus, the two types of performance guarantees derived in [8], one for generalization ability (prediction error) and the other for the discrepancy between the estimated and the true regression coefficients (estimation error), still apply to our GWGL-LR formulation.

We first establish a bound for the prediction bias of the solution to the GWGL-LR formulation, where the Wasserstein metric is induced by the weighted \((2, \infty )\)-norm with weight \({\mathbf {w}}= (\frac{1}{\sqrt{p_1}}, \ldots , \frac{1}{\sqrt{p_L}}, M)\). The dual norm in this case is just the weighted (2, 1)-norm with weight \({\mathbf {w}}^{-1}= (\sqrt{p_1}, \ldots , \sqrt{p_L}, 1/M)\). Throughout this section, we use \(\varvec{\beta }^*\) and \(\hat{\varvec{\beta }}\) to denote the true and estimated regression coefficient vectors, respectively. We first state several assumptions that are needed to establish the results.

Assumption A

The weighted \((2, \infty )\)-norm of the uncertainty parameter \(({\mathbf {x}}, y)\) with weight \({\mathbf {w}}= (\frac{1}{\sqrt{p_1}}, \ldots , \frac{1}{\sqrt{p_L}}, M)\) is bounded above by R almost surely.

Assumption B

For every feasible \(\varvec{\beta }\), \(\Vert (-\varvec{\beta }^1, \ldots , -\varvec{\beta }^L, 1)_{{\mathbf {w}}^{-1}}\Vert _{2, 1} \le \bar{B}\), where \({\mathbf {w}}^{-1} = (\sqrt{p_1}, \ldots , \sqrt{p_L}, 1/M)\).

Let \(\hat{\varvec{\beta }}\) be an optimal solution to (7), obtained using the samples \(({\mathbf {x}}_i, y_i)\), \(i=1,\ldots ,N\). Suppose we draw a new i.i.d. sample \(({\mathbf {x}},y)\). Using Theorem 3.3 in [8], Theorem A.1 establishes bounds on the error \(|y - {\mathbf {x}}'\hat{\varvec{\beta }}|\).

Theorem A.1

Under Assumptions A and B, for any \(0<\delta <1\), with probability at least \(1-\delta \) with respect to the sampling,

$$\begin{aligned} \mathbb {E}[|y - {\mathbf {x}}'\hat{\varvec{\beta }}|]\le \frac{1}{N}\sum _{i=1}^N |y_i - {\mathbf {x}}_i'\hat{\varvec{\beta }}|+\frac{2\bar{B}R}{\sqrt{N}}+ \bar{B}R\sqrt{\frac{8\log (2/\delta )}{N}}\ , \end{aligned}$$

and for any \(\zeta >(2\bar{B}R/\sqrt{N})+ \bar{B}R\sqrt{8\log (2/\delta )/N}\),

$$\begin{aligned} \mathbb {P}\biggl (|y - {\mathbf {x}}'\hat{\varvec{\beta }}| \ge \frac{1}{N}\sum _{i=1}^N |y_i - {\mathbf {x}}_i'\hat{\varvec{\beta }}|+\zeta \biggr ) \le \frac{\frac{1}{N}\sum _{i=1}^N |y_i - {\mathbf {x}}_i'\hat{\varvec{\beta }}|+\frac{2\bar{B}R}{\sqrt{N}}+ \bar{B}R\sqrt{\frac{8\log (2/\delta )}{N}}}{\frac{1}{N}\sum _{i=1}^N |y_i - {\mathbf {x}}_i'\hat{\varvec{\beta }}|+\zeta }. \end{aligned}$$

Theorem A.1 essentially says that, with high probability, the expected loss of our GWGL-LR estimator on new test samples is upper bounded by the average loss on the training samples plus two terms that depend on the magnitude of the regularizer \(\bar{B}\), the uncertainty level R, and the confidence level \(\delta \), and that converge to zero at rate \(O(1/\sqrt{N})\). This result justifies the form of the regularizer used in (7) and guarantees a small generalization error for the GWGL-LR solution.
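For concreteness, the two \(O(1/\sqrt{N})\) slack terms in the bound of Theorem A.1 can be evaluated directly; the short snippet below (an illustration only, with placeholder values for N, \(\delta \), and the product \(\bar{B}R\)) does so.

```python
import numpy as np

def prediction_slack(N, delta, BR):
    """The two O(1/sqrt(N)) terms in Theorem A.1: 2*B*R/sqrt(N) + B*R*sqrt(8 log(2/delta)/N)."""
    return 2 * BR / np.sqrt(N) + BR * np.sqrt(8 * np.log(2 / delta) / N)

print(prediction_slack(N=1000, delta=0.05, BR=5.0))   # placeholder values for N, delta, B_bar * R
```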

We next discuss the estimation performance of the GWGL-LR solution. Theorem A.2, a specialization of Theorem 3.11 in [8], provides a bound on the estimation bias of GWGL-LR. We first state the assumptions that are needed to establish the result.

Assumption C

The \(\ell _2\) norm of \((-\varvec{\beta }, 1)\) is bounded above by \(\bar{B}_2\).

Assumption D

For some set

$$\begin{aligned} \mathcal {A}(\varvec{\beta }^*) :=\text {cone}\{{\mathbf {v}}|\ \Vert (-\varvec{\beta }^*, 1)_{{\mathbf {w}}^{-1}}+{\mathbf {v}}_{{\mathbf {w}}^{-1}}\Vert _{2, 1}\le \Vert (-\varvec{\beta }^*, 1)_{{\mathbf {w}}^{-1}}\Vert _{2, 1}\} \cap \mathbb {S}^{p+1} \end{aligned}$$

and some positive scalar \(\underline{\alpha }\), the following holds,

$$\begin{aligned} \inf \limits _{{\mathbf {v}}\in \mathcal {A}(\varvec{\beta }^*)}{\mathbf {v}}'{\mathbf {Z}}{\mathbf {Z}}'{\mathbf {v}}\ge \underline{\alpha }, \end{aligned}$$

where \({\mathbf {Z}}=[({\mathbf {x}}_1, y_1), \ldots , ({\mathbf {x}}_N, y_N)]\) is the matrix with columns \(({\mathbf {x}}_i, y_i), i = 1, \ldots , N\), and \(\mathbb {S}^{p+1}\) is the unit sphere in the \((p+1)\)-dimensional Euclidean space.

Assumption E

\(({\mathbf {x}}, y)\) is a centered sub-Gaussian random vector, i.e., it has zero mean and satisfies the following condition:

$$\begin{aligned} {\left| \left| \left| ({\mathbf {x}}, y) \right| \right| \right| }_{\psi _2}=\sup \limits _{{\mathbf {u}}\in \mathbb {S}^{p+1}}{\left| \left| \left| ({\mathbf {x}}, y)'{\mathbf {u}} \right| \right| \right| }_{\psi _2}\le \mu . \end{aligned}$$

Assumption F

The covariance matrix of \(({\mathbf {x}}, y)\) has bounded positive eigenvalues. Set \(\varvec{\varGamma }=\mathbb {E}[({\mathbf {x}}, y)({\mathbf {x}}, y)']\); then,

$$\begin{aligned} 0<\lambda _{\text {min}} \triangleq \lambda _{\text {min}}(\varvec{\varGamma })\le \lambda _{\text {max}}(\varvec{\varGamma })\triangleq \lambda _{\text {max}}<\infty . \end{aligned}$$

Definition A.1

(Sub-Gaussian random variable) A random variable z is sub-Gaussian if it is zero mean, and the \(\psi _2\)-norm defined below is finite, i.e.,

$$\begin{aligned} {\left| \left| \left| z \right| \right| \right| }_{\psi _2}\triangleq \sup _{q \ge 1} \frac{(\mathbb {E}|z|^q)^{1/q}}{\sqrt{q}} < +\infty . \end{aligned}$$

An equivalent property for sub-Gaussian random variables is that their tail distribution decays at least as fast as a Gaussian, i.e.,

$$\begin{aligned} \mathbb {P}(|z|\ge t) \le 2 \exp \{-t^2/C^2\},\quad \forall t \ge 0, \end{aligned}$$

for some constant C. A random vector \({\mathbf {z}}\in \mathbb {R}^{p+1}\) is sub-Gaussian if \({\mathbf {z}}'{\mathbf {u}}\) is sub-Gaussian for any \({\mathbf {u}}\in \mathbb {R}^{p+1}\). The \(\psi _2\)-norm of a vector \({\mathbf {z}}\) is defined as:

$$\begin{aligned} {\left| \left| \left| {\mathbf {z}} \right| \right| \right| }_{\psi _2} \triangleq \sup \limits _{{\mathbf {u}}\in \mathbb {S}^{p+1}}{\left| \left| \left| {\mathbf {z}}'{\mathbf {u}} \right| \right| \right| }_{\psi _2}, \end{aligned}$$

where \(\mathbb {S}^{p+1}\) denotes the unit sphere in the \((p+1)\)-dimensional Euclidean space.

Definition A.2

(Gaussian width) For any set \(\mathcal {A}\subseteq \mathbb {R}^{p+1}\), its Gaussian width is defined as:

$$\begin{aligned} w(\mathcal {A}) \triangleq \mathbb {E}\Bigl [\sup _{{\mathbf {u}}\in \mathcal {A}} {\mathbf {u}}'{\mathbf {g}}\Bigr ], \end{aligned}$$

where \({\mathbf {g}}\sim \mathcal{N}({\mathbf {0}},{\mathbf {I}})\) is a \((p+1)\)-dimensional standard Gaussian random vector.
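Gaussian widths are easy to estimate by Monte Carlo. The sketch below (an illustration, not the paper's code) estimates \(w(\mathcal {B}_u)\) for the unit \(\ell _\infty \) ball, which appears in the estimation bound of Theorem A.2, and compares it with the closed form \(\mathbb {E}\Vert {\mathbf {g}}\Vert _1 = (p+1)\sqrt{2/\pi }\).

```python
import numpy as np

rng = np.random.default_rng(3)
p_plus_1 = 20
g = rng.normal(size=(100_000, p_plus_1))      # draws of the standard Gaussian vector g
width_mc = np.abs(g).sum(axis=1).mean()       # E[sup_{||u||_inf <= 1} u'g] = E ||g||_1
width_exact = p_plus_1 * np.sqrt(2 / np.pi)
print(f"Monte Carlo estimate: {width_mc:.3f}   closed form: {width_exact:.3f}")
```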

Theorem A.2

Suppose the true regression coefficient vector is \(\varvec{\beta }^*\) and the solution to GWGL-LR is \(\hat{\varvec{\beta }}\). Under Assumptions A, C, D, E, and F, when the sample size \(N\ge \bar{C_1}\bar{\mu }^4 \mu _0^2 (\lambda _{\text {max}}/\lambda _{\text {min}})\cdot (w(\mathcal {A}(\varvec{\beta }^*))+3)^2\), with probability at least \(1-\exp (-C_2N/\bar{\mu }^4)-C_4\exp (-C_5^2 (w(\mathcal {B}_u))^2/(4\rho ^2))\),

$$\begin{aligned} \Vert \hat{\varvec{\beta }}-\varvec{\beta }^*\Vert _2\le \frac{\bar{C}R\bar{B}_2\mu }{N\lambda _{\text {min}}} w(\mathcal {B}_u)\varPsi (\varvec{\beta }^*), \end{aligned}$$

where \(\bar{\mu }=\mu \sqrt{(1/\lambda _{\text {min}})}\); \(\mu _0\) is the \(\psi _2\)-norm of a standard Gaussian random vector \({\mathbf {g}}\in \mathbb {R}^{p+1}\); \(w(\mathcal {A}(\varvec{\beta }^*))\) is the Gaussian width (see Definition A.2) of \(\mathcal {A}(\varvec{\beta }^*)\) (cf. Assumption D); \(w(\mathcal {B}_u)\) is the Gaussian width of \(\mathcal {B}_u\), where \(\mathcal {B}_u\) is the unit ball of the norm \(\Vert \cdot \Vert _\infty \); \(\rho =\sup _{{\mathbf {v}}\in \mathcal {B}_u}\Vert {\mathbf {v}}\Vert _2\); \(\varPsi (\varvec{\beta }^*)=\sup _{{\mathbf {v}}\in \mathcal {A}(\varvec{\beta }^*)}\Vert {\mathbf {v}}_{{\mathbf {w}}^{-1}}\Vert _{2, 1}\); and \(\bar{C_1}, C_2, C_4, C_5, \bar{C}\) are positive constants.

With Theorem A.2, we are able to provide bounds for performance metrics, such as the relative risk (RR), relative test error (RTE), and proportion of variance explained (PVE) [16]. All these metrics evaluate the accuracy of the regression coefficient estimates on a new test sample drawn from the same probability distribution as the training samples. Let \(({\mathbf {x}}_0, y_0)\) be such a test sample satisfying \(y_0 = {\mathbf {x}}_0' \varvec{\beta }^* + \eta _0\), where \(\eta _0\) is random noise with zero mean and variance \(\sigma ^2\), independent of the zero-mean predictor \({\mathbf {x}}_0\). For a fixed set of training samples, let the solution to GWGL-LR be \(\hat{\varvec{\beta }}\). As in [16], define

$$\begin{aligned} \text {RR}(\hat{\varvec{\beta }}) = \frac{\mathbb {E}({\mathbf {x}}_0'\hat{\varvec{\beta }} - {\mathbf {x}}_0' \varvec{\beta }^*)^2}{\mathbb {E}({\mathbf {x}}_0' \varvec{\beta }^*)^2} = \frac{(\hat{\varvec{\beta }} - \varvec{\beta }^*)'\varvec{\varSigma }(\hat{\varvec{\beta }} - \varvec{\beta }^*)}{(\varvec{\beta }^*)' \varvec{\varSigma } \varvec{\beta }^*}, \end{aligned}$$

where \(\varvec{\varSigma }\) is the covariance matrix of \({\mathbf {x}}_0\), which is just the top left block of the matrix \(\varvec{\varGamma }\) in Assumption F. RTE is defined as:

$$\begin{aligned} \text {RTE}(\hat{\varvec{\beta }}) = \frac{\mathbb {E}(y_0 - {\mathbf {x}}_0'\hat{\varvec{\beta }})^2}{\sigma ^2} = \frac{(\hat{\varvec{\beta }} - \varvec{\beta }^*)'{\varvec{\varSigma }}(\hat{\varvec{\beta }} - \varvec{\beta }^*) + \sigma ^2}{\sigma ^2}. \end{aligned}$$

PVE is defined as:

$$\begin{aligned} \text {PVE}(\hat{\varvec{\beta }}) = 1- \frac{\mathbb {E}(y_0 - {\mathbf {x}}_0' \hat{\varvec{\beta }})^2}{Var(y_0)} = 1 - \frac{(\hat{\varvec{\beta }} - \varvec{\beta }^*)'{\varvec{\varSigma }}(\hat{\varvec{\beta }} - \varvec{\beta }^*) + \sigma ^2}{(\varvec{\beta }^*)' \varvec{\varSigma } \varvec{\beta }^* + \sigma ^2}. \end{aligned}$$
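The three metrics can be computed directly from \(\varvec{\varSigma }\), \(\sigma ^2\), \(\varvec{\beta }^*\), and \(\hat{\varvec{\beta }}\); the helper below (an illustration with placeholder inputs) implements the definitions above verbatim.

```python
import numpy as np

def rr_rte_pve(beta_hat, beta_star, Sigma, sigma2):
    """Relative risk, relative test error, and proportion of variance explained."""
    d = beta_hat - beta_star
    excess = d @ Sigma @ d                    # (beta_hat - beta*)' Sigma (beta_hat - beta*)
    signal = beta_star @ Sigma @ beta_star    # (beta*)' Sigma beta*
    rr = excess / signal
    rte = (excess + sigma2) / sigma2
    pve = 1.0 - (excess + sigma2) / (signal + sigma2)
    return rr, rte, pve

# Example with placeholder values.
Sigma = np.eye(5)
beta_star = np.array([1.0, 0.5, 0.0, 0.0, -1.0])
beta_hat = beta_star + 0.1 * np.ones(5)
print(rr_rte_pve(beta_hat, beta_star, Sigma, sigma2=1.0))
```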

Using Theorem A.2, we can bound the term \((\hat{\varvec{\beta }} - \varvec{\beta }^*)'{\varvec{\varSigma }}(\hat{\varvec{\beta }} - \varvec{\beta }^*)\) as follows:

$$\begin{aligned} (\hat{\varvec{\beta }} - \varvec{\beta }^*)'{\varvec{\varSigma }}(\hat{\varvec{\beta }} - \varvec{\beta }^*) \le \lambda _{max}({\varvec{\varSigma }}) \Vert \hat{\varvec{\beta }} - \varvec{\beta }^*\Vert _2^2 \le \lambda _{max}({\varvec{\varSigma }}) \biggl (\frac{\bar{C}R\bar{B}_2\mu }{N\lambda _{\text {min}}} w(\mathcal {B}_u)\varPsi (\varvec{\beta }^*)\biggr )^2, \end{aligned}$$
(25)

where \(\lambda _{max}({\varvec{\varSigma }})\) is the maximum eigenvalue of \({\varvec{\varSigma }}\). Using (25), bounds for RR, RTE, and PVE can be readily obtained and are summarized in the following Corollary.

Corollary A.2

Under the specifications in Theorem A.2, when the sample size

$$\begin{aligned} N\ge \bar{C_1}\bar{\mu }^4 \mu _0^2 (\lambda _{\text {max}}/\lambda _{\text {min}}) (w(\mathcal {A}(\varvec{\beta }^*))+3)^2, \end{aligned}$$

with probability at least \(1-\exp (-C_2N/\bar{\mu }^4)- C_4\exp (-C_5^2 (w(\mathcal {B}_u))^2/(4\rho ^2))\),

$$\begin{aligned}&\text {RR}(\hat{\varvec{\beta }}) \le \frac{\lambda _{max}({\varvec{\varSigma }}) \biggl (\frac{\bar{C}R\bar{B}_2\mu }{N\lambda _{\text {min}}} w(\mathcal {B}_u)\varPsi (\varvec{\beta }^*)\biggr )^2}{(\varvec{\beta }^*)' \varvec{\varSigma } \varvec{\beta }^*}, \\&\text {RTE}(\hat{\varvec{\beta }}) \le \frac{ \lambda _{max}({\varvec{\varSigma }}) \biggl (\frac{\bar{C}R\bar{B}_2\mu }{N\lambda _{\text {min}}} w(\mathcal {B}_u)\varPsi (\varvec{\beta }^*)\biggr )^2 + \sigma ^2}{\sigma ^2},\\&\text {PVE}(\hat{\varvec{\beta }}) \ge 1 - \frac{\lambda _{max}({\varvec{\varSigma }}) \biggl (\frac{\bar{C}R\bar{B}_2\mu }{N\lambda _{\text {min}}} w(\mathcal {B}_u)\varPsi (\varvec{\beta }^*)\biggr )^2 + \sigma ^2}{(\varvec{\beta }^*)' \varvec{\varSigma } \varvec{\beta }^* + \sigma ^2}, \end{aligned}$$

where all parameters are defined in the same way as in Theorem A.2.

A.2 Predictive Performance of the GWGL-LG Estimator

In this subsection, we establish bounds on the prediction error of the GWGL-LG solution. As in [8], we use the Rademacher complexity of the class of logloss (negative log-likelihood) functions to bound the generalization error. Two assumptions are needed, which bound the magnitude of the regularizer and the uncertainty level of the predictors.

Assumption G

The weighted \((2, \infty )\)-norm of \({\mathbf {x}}\) with weight \({\mathbf {w}}= (\frac{1}{\sqrt{p_1}}, \ldots , \frac{1}{\sqrt{p_L}})\) is bounded above almost surely, i.e., \(\Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{2, \infty } \le R_{{\mathbf {x}}}\).

Assumption H

The weighted (2, 1)-norm of \(\varvec{\beta }\) with \({\mathbf {w}}^{-1} = (\sqrt{p_1}, \ldots , \sqrt{p_L})\) is bounded above, namely \(\sup _{\varvec{\beta }}\Vert \varvec{\beta }_{{\mathbf {w}}^{-1}}\Vert _{2, 1}=\bar{B}_1\).

Under these two assumptions, the logloss can be bounded via the Cauchy–Schwarz inequality.

Lemma A.3

Under Assumptions G and H, it follows

$$\begin{aligned} \log \big (1+\exp (-y \varvec{\beta }'{\mathbf {x}})\big ) \le \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big ), \quad \text {almost surely}. \end{aligned}$$

Now consider the following class of loss functions:

$$\begin{aligned} \mathcal {L}=\big \{({\mathbf {x}}, y) \mapsto l_{\varvec{\beta }}({\mathbf {x}}, y): l_{\varvec{\beta }}({\mathbf {x}}, y)= \log \big (1+\exp (-y \varvec{\beta }'{\mathbf {x}})\big ),\ \Vert \varvec{\beta }_{{\mathbf {w}}^{-1}}\Vert _{2, 1}\le \bar{B}_1 \big \}. \end{aligned}$$

It follows from [4, 8] that the empirical Rademacher complexity of \(\mathcal {L}\), denoted by \(\mathcal {R}_N(\mathcal {L})\), can be upper bounded by:

$$\begin{aligned} \mathcal {R}_N(\mathcal {L})\le \frac{2 \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )}{\sqrt{N}}. \end{aligned}$$

Then, applying Theorem 8 in [2], we have the following result on the prediction error of our GWGL-LG estimator.

Theorem A.4

Let \(\hat{\varvec{\beta }}\) be an optimal solution to (11), obtained using N training samples \(({\mathbf {x}}_i, y_i)\), \(i=1,\ldots ,N\). Suppose we draw a new i.i.d. sample \(({\mathbf {x}},y)\). Under Assumptions G and H, for any \(0<\delta <1\), with probability at least \(1-\delta \) with respect to the sampling,

$$\begin{aligned} \mathbb {E}\big [\log \big (1+\exp (-y {\mathbf {x}}'\hat{\varvec{\beta }})\big )\big ]&\le \frac{1}{N}\sum _{i=1}^N \log \big (1+\exp (-y_i {\mathbf {x}}_i'\hat{\varvec{\beta }})\big )+\frac{2 \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )}{\sqrt{N}}\nonumber \\&\quad + \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )\sqrt{\frac{8\log (2/\delta )}{N}}\ , \end{aligned}$$
(26)

and for any \(\zeta >\frac{2 \log (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1))}{\sqrt{N}} + \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )\sqrt{\frac{8\log (2/\delta )}{N}}\),

$$\begin{aligned}&\ \mathbb {P}\Bigl ( \log \big (1+\exp ( -y {\mathbf {x}}'\hat{\varvec{\beta }})\big ) \ge \frac{1}{N}\sum _{i=1}^N \log \big (1+\exp (-y_i {\mathbf {x}}_i'\hat{\varvec{\beta }})\big )+\zeta \Bigr ) \nonumber \\&\quad \le \ \frac{\frac{1}{N}\sum _{i=1}^N \log \big (1+\exp (-y_i {\mathbf {x}}_i'\hat{\varvec{\beta }})\big )+\frac{2 \log (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1))}{\sqrt{N}} + \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )\sqrt{\frac{8\log (2/\delta )}{N}}}{\frac{1}{N}\sum _{i=1}^N \log \big (1+\exp (-y_i {\mathbf {x}}_i'\hat{\varvec{\beta }})\big )+\zeta }\ . \nonumber \\ \end{aligned}$$
(27)

Theorem A.4 implies that the groupwise regularized LG formulation (11) yields a solution with a small generalization error on new i.i.d. samples.

A.3 Proof of Theorem 3.1 for GWGL-LR

Proof

By the optimality condition associated with formulation (7), \(\hat{\varvec{\beta }}\) satisfies:

$$\begin{aligned}&{\mathbf {x}}_{,i}'\text {sgn}({\mathbf {y}}-{\mathbf {X}}\hat{\varvec{\beta }}) = N \epsilon \sqrt{p_{l_1}} \frac{\hat{\beta }_i}{\Vert \hat{\varvec{\beta }}^{l_1}\Vert _2} , \end{aligned}$$
(28)
$$\begin{aligned}&{\mathbf {x}}_{,j}'\text {sgn}({\mathbf {y}}-{\mathbf {X}}\hat{\varvec{\beta }}) = N \epsilon \sqrt{p_{l_2}} \frac{\hat{\beta }_j}{\Vert \hat{\varvec{\beta }}^{l_2}\Vert _2}, \end{aligned}$$
(29)

where the \(\text {sgn}(\cdot )\) function is applied to a vector elementwise. Subtracting (29) from (28), we obtain:

$$\begin{aligned} ({\mathbf {x}}_{,i}- {\mathbf {x}}_{,j})'\text {sgn}({\mathbf {y}}-{\mathbf {X}}\hat{\varvec{\beta }}) = N \epsilon \Biggl ( \frac{\sqrt{p_{l_1}}\hat{\beta }_i}{\Vert \hat{\varvec{\beta }}^{l_1}\Vert _2} - \frac{\sqrt{p_{l_2}}\hat{\beta }_j}{\Vert \hat{\varvec{\beta }}^{l_2}\Vert _2}\Biggr ). \end{aligned}$$

Using the Cauchy–Schwarz inequality, \(\Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2^2=2(1-\rho )\), and \(\Vert \text {sgn}({\mathbf {y}}-{\mathbf {X}}\hat{\varvec{\beta }})\Vert _2 \le \sqrt{N}\), we obtain

$$\begin{aligned} \begin{aligned} D(i, j)&= \Biggl |\frac{\sqrt{p_{l_1}}\hat{\beta }_i}{\Vert \hat{\varvec{\beta }}^{l_1}\Vert _2} - \frac{\sqrt{p_{l_2}}\hat{\beta }_j}{\Vert \hat{\varvec{\beta }}^{l_2}\Vert _2}\Biggr | \\&\le \frac{1}{N \epsilon } \Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2 \Vert \text {sgn}({\mathbf {y}}-{\mathbf {X}}\hat{\varvec{\beta }})\Vert _2 \\&\le \frac{\sqrt{2(1-\rho )}}{\sqrt{N}\epsilon }. \end{aligned} \end{aligned}$$

\(\square \)
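A hedged numerical check of this bound is sketched below (our illustration, assuming cvxpy; we take formulation (7) to be the \(\ell _1\) loss scaled by 1/N plus \(\epsilon \) times the weighted group-norm penalty, which is consistent with the optimality conditions (28)–(29)). Two nearly identical, unit-norm predictor columns are placed in different groups, and \(D(i, j)\) is compared with \(\sqrt{2(1-\rho )}/(\sqrt{N}\epsilon )\).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
N, eps = 100, 0.01
groups = [np.arange(0, 3), np.arange(3, 6)]              # two groups of size p_l = 3

X = rng.normal(size=(N, 6))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=N)            # column 3 nearly copies column 0
X /= np.linalg.norm(X, axis=0)                           # unit-norm columns, so ||x_i - x_j||^2 = 2(1 - rho)
y = X @ np.array([2.0, 1.0, 0.0, 2.0, 1.0, 0.0]) + 0.01 * rng.normal(size=N)

beta = cp.Variable(6)
loss = cp.sum(cp.abs(y - X @ beta)) / N                  # ell_1 loss (our scaling assumption for (7))
penalty = sum(np.sqrt(len(g)) * cp.norm(beta[g], 2) for g in groups)
cp.Problem(cp.Minimize(loss + eps * penalty)).solve()

b = beta.value
i, j = 0, 3                                              # the two correlated predictors, in different groups
rho = float(X[:, i] @ X[:, j])                           # inner product of the unit-norm columns
D = abs(np.sqrt(3) * b[i] / np.linalg.norm(b[groups[0]])
        - np.sqrt(3) * b[j] / np.linalg.norm(b[groups[1]]))
print(f"D(i,j) = {D:.4f}  <=  bound = {np.sqrt(2 * (1 - rho)) / (np.sqrt(N) * eps):.4f}")
```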

A.4 Proof of Theorem 3.1 for GWGL-LG

Proof

By the optimality condition associated with formulation (11), \(\hat{\varvec{\beta }}\) satisfies:

$$\begin{aligned} \sum _{k=1}^N \frac{\exp (-y_k {\mathbf {x}}_k' \hat{\varvec{\beta }})}{1+ \exp (-y_k {\mathbf {x}}_k' \hat{\varvec{\beta }})} y_k x_{k,i} = N \epsilon \sqrt{p_{l_1}} \frac{\hat{\beta }_i}{\Vert \hat{\varvec{\beta }}^{l_1}\Vert _2} , \end{aligned}$$
(30)
$$\begin{aligned} \sum _{k=1}^N \frac{\exp (-y_k {\mathbf {x}}_k' \hat{\varvec{\beta }})}{1+ \exp (-y_k {\mathbf {x}}_k' \hat{\varvec{\beta }})} y_k x_{k,j} = N \epsilon \sqrt{p_{l_2}} \frac{\hat{\beta }_j}{\Vert \hat{\varvec{\beta }}^{l_2}\Vert _2}, \end{aligned}$$
(31)

where \(x_{k,i}\) and \(x_{k, j}\) denote the ith and jth elements of \({\mathbf {x}}_k\), respectively. Subtracting (31) from (30), we get:

$$\begin{aligned} \sum _{k=1}^N \frac{\exp (-y_k {\mathbf {x}}_k' \hat{\varvec{\beta }})}{1+ \exp (-y_k {\mathbf {x}}_k' \hat{\varvec{\beta }})} \big (y_k x_{k,i} - y_k x_{k,j}\big ) = N \epsilon \Biggl ( \frac{\sqrt{p_{l_1}}\hat{\beta }_i}{\Vert \hat{\varvec{\beta }}^{l_1}\Vert _2} - \frac{\sqrt{p_{l_2}}\hat{\beta }_j}{\Vert \hat{\varvec{\beta }}^{l_2}\Vert _2}\Biggr ). \end{aligned}$$
(32)

Note that the LHS of (32) can be written as \({\mathbf {v}}_1'{\mathbf {v}}_2\), where

$$\begin{aligned} {\mathbf {v}}_1= & {} \bigg ( \frac{\exp (-y_1 {\mathbf {x}}_1' \hat{\varvec{\beta }})}{1+ \exp (-y_1 {\mathbf {x}}_1' \hat{\varvec{\beta }})}, \ldots , \frac{\exp (-y_N {\mathbf {x}}_N' \hat{\varvec{\beta }})}{1+ \exp (-y_N {\mathbf {x}}_N' \hat{\varvec{\beta }})}\bigg ),\\ {\mathbf {v}}_2= & {} \big ( y_1 ( x_{1,i} - x_{1,j}), \ldots , y_N (x_{N,i} - x_{N,j})\big ). \end{aligned}$$

Using the Cauchy–Schwarz inequality, \(\Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2^2=2(1-\rho )\), and the facts that \(\Vert {\mathbf {v}}_1\Vert _2 \le \sqrt{N}\) (every entry of \({\mathbf {v}}_1\) lies in (0, 1)) and \(\Vert {\mathbf {v}}_2\Vert _2 = \Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2\) (since \(y_k \in \{-1, +1\}\)), we obtain

$$\begin{aligned} \begin{aligned} D(i, j)&= \Biggl |\frac{\sqrt{p_{l_1}}\hat{\beta }_i}{\Vert \hat{\varvec{\beta }}^{l_1}\Vert _2} - \frac{\sqrt{p_{l_2}}\hat{\beta }_j}{\Vert \hat{\varvec{\beta }}^{l_2}\Vert _2}\Biggr | \\&\le \frac{1}{N \epsilon } \Vert {\mathbf {v}}_1\Vert _2 \Vert {\mathbf {v}}_2\Vert _2 \\&\le \frac{1}{N \epsilon } \sqrt{N} \Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2 = \frac{\sqrt{2(1-\rho )}}{\sqrt{N}\epsilon }. \end{aligned} \end{aligned}$$

\(\square \)

Appendix B: Omitted Numerical Results

This section contains the experimental setup and results that are omitted in Sect. 4.

B.1 Omitted Results in Sect. 4.1

B.1.1 Hyperparameter Tuning

All the penalty parameters are tuned using a separate validation dataset. Specifically, we divide all the N training samples into two sets, dataset 1 and dataset 2 (validation set). For a pre-specified range of values for the penalty parameters, dataset 1 is used to train the models and derive \(\hat{\varvec{\beta }}\), and the performance of \(\hat{\varvec{\beta }}\) is evaluated on dataset 2. We choose the penalty parameter that yields the minimum unpenalized loss of the respective approaches on the validation set. As to the range of values for the tuned parameters, we borrow ideas from [16], where the LASSO was tuned over 50 values ranging from \(\lambda _m \triangleq \Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }\) to a small fraction of \(\lambda _m\) on a log scale. In our experiments, this range is properly adjusted for the GLASSO estimators. Specifically, for GWGL and GSRL, the tuning range is: \(\sqrt{\exp (\text {lin}(\log (0.005\cdot \Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),\log (\Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),50)) /\max (p_1, \ldots , p_L)},\) where the function \(\text {lin}(a, b, n)\) takes in scalars a, b and n (integer) and outputs a set of n values equally spaced between a and b; the \(\exp \) function is applied elementwise to a vector. Compared to LASSO, the values are scaled by \(\max (p_1, \ldots , p_L)\), and the square-root operation is due to the \(\ell _1\)-loss function, or the square root of the \(\ell _2\)-loss used in these formulations. For the GLASSO with \(\ell _2\)-loss, the range is: \(\exp (\text {lin}(\log (0.005\cdot \Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),\log (\Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),50)) /\sqrt{\max (p_1, \ldots , p_L)}.\)
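For reference, the tuning grids described above can be generated as follows (a sketch with our own function names and synthetic data, not the paper's code).

```python
import numpy as np

def gwgl_tuning_grid(X, y, group_sizes, n_values=50, lower_frac=0.005):
    """Grid for GWGL and GSRL: sqrt(exp(lin(log(0.005*lam), log(lam), 50)) / max_l p_l)."""
    lam_max = np.max(np.abs(X.T @ y))                        # lambda_m = ||X'y||_inf
    log_grid = np.linspace(np.log(lower_frac * lam_max), np.log(lam_max), n_values)
    return np.sqrt(np.exp(log_grid) / max(group_sizes))

def glasso_l2_tuning_grid(X, y, group_sizes, n_values=50, lower_frac=0.005):
    """Grid for the GLASSO with ell_2 loss: exp(...) / sqrt(max_l p_l)."""
    lam_max = np.max(np.abs(X.T @ y))
    log_grid = np.linspace(np.log(lower_frac * lam_max), np.log(lam_max), n_values)
    return np.exp(log_grid) / np.sqrt(max(group_sizes))

# Example with synthetic data and an assumed grouping of sizes (3, 3, 2).
rng = np.random.default_rng(5)
X, y = rng.normal(size=(100, 8)), rng.normal(size=100)
print(gwgl_tuning_grid(X, y, group_sizes=[3, 3, 2])[:5])
```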

B.1.2 Implementation of Spectral Clustering

In our implementation, we construct a k-nearest-neighbor similarity graph, connecting \({\mathbf {x}}_{,i}\) and \({\mathbf {x}}_{,j}\) with an undirected edge if \({\mathbf {x}}_{,i}\) is among the k nearest neighbors of \({\mathbf {x}}_{,j}\) (in the sense of Euclidean distance) or \({\mathbf {x}}_{,j}\) is among the k nearest neighbors of \({\mathbf {x}}_{,i}\). The parameter k is chosen such that the resulting graph is connected. Recall that we use the Gaussian similarity function

$$\begin{aligned} \text {Gs}({\mathbf {x}}_{,i}, {\mathbf {x}}_{,j}) \triangleq \exp \big ( -\Vert {\mathbf {x}}_{,i} - {\mathbf {x}}_{,j}\Vert _2^2/(2\sigma _s^2)\big ), \end{aligned}$$
(33)

to construct the graph. The scale parameter \(\sigma _s\) in (33) is set to the mean distance of a point to its kth nearest neighbor [30]. We assume that the number of clusters is known in order to perform spectral clustering, but in case it is unknown, the eigengap heuristic [30] can be used, where the goal is to choose the number of clusters c such that all eigenvalues \(\lambda _1, \ldots , \lambda _c\) of the graph Laplacian are very small, but \(\lambda _{c+1}\) is relatively large.
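A compact sketch of this grouping step is given below (an illustration assuming scikit-learn and SciPy; the synthetic example with two latent groups of correlated predictors is ours). It builds the symmetrized k-nearest-neighbor graph, weights the edges with the Gaussian similarity (33) using the choice of \(\sigma _s\) above, prints the smallest Laplacian eigenvalues so the eigengap heuristic can be inspected, and clusters the predictors with k-means on the leading eigenvectors.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def cluster_predictors(X, n_clusters, k=10):
    cols = X.T                                           # each predictor column is a "point"
    nn = NearestNeighbors(n_neighbors=k + 1).fit(cols)
    dist, idx = nn.kneighbors(cols)                      # idx[:, 0] is the point itself
    sigma_s = dist[:, -1].mean()                         # mean distance to the k-th neighbor

    p = cols.shape[0]
    W = np.zeros((p, p))
    for i in range(p):
        for j in idx[i, 1:]:                             # connect i to its k nearest neighbors
            w = np.exp(-np.linalg.norm(cols[i] - cols[j])**2 / (2 * sigma_s**2))
            W[i, j] = W[j, i] = w                        # symmetrize: undirected graph

    L = np.diag(W.sum(axis=1)) - W                       # unnormalized graph Laplacian
    eigvals, eigvecs = eigh(L)
    # Eigengap heuristic: lambda_1, ..., lambda_c small, lambda_{c+1} comparatively large.
    print("smallest Laplacian eigenvalues:", np.round(eigvals[:n_clusters + 2], 3))
    U = eigvecs[:, :n_clusters]                          # embed predictors in R^{n_clusters}
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

# Example: two latent groups of correlated predictors.
rng = np.random.default_rng(6)
z1, z2 = rng.normal(size=(2, 200))
X = np.column_stack([z1 + 0.1 * rng.normal(size=200) for _ in range(4)]
                    + [z2 + 0.1 * rng.normal(size=200) for _ in range(4)])
print(cluster_predictors(X, n_clusters=2, k=3))
```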

B.1.3 The Number of Dropped Groups/Features on the Surgery Data

These results are in Table 5.

Table 5 Number of dropped groups/features on the surgery data

B.1.4 The Impact on the Performance Metrics when \(q = 20\%\)

See Figs. 3 and 4.

Fig. 3: Impact of SNR on the performance metrics, \(q = 20\%\)

Fig. 4: Impact of within-group correlation on the performance metrics, \(q = 20\%\)

B.2 Omitted Results in Sect. 4.2

B.2.1 Preprocessing the Dataset

Data were preprocessed as follows: (i) categorical variables (such as race, discharge destination, insurance type) were numerically encoded and units homogenized; (ii) missing values were replaced by the mode; (iii) all variables were normalized by subtracting the mean and dividing by the standard deviation; and (iv) patients who died within 30 days of discharge or had a postoperative length of stay greater than 30 days were excluded.
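A pandas sketch of these steps is shown below (our illustration; the column names are placeholders, since the dataset's schema is not reproduced here, and the steps are applied in a slightly different order so that the exclusion criterion and the imputation operate on raw values).

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # (iv) exclusion criteria, applied on raw values: 30-day mortality or
    # postoperative length of stay greater than 30 days
    df = df[(df["died_within_30d"] == 0) & (df["post_op_los"] <= 30)].copy()
    df = df.drop(columns=["died_within_30d"])        # constant after filtering

    # (ii) impute missing values with the column mode (done before encoding)
    df = df.fillna(df.mode().iloc[0])

    # (i) numerically encode categorical variables (placeholder column names)
    for col in ["race", "discharge_destination", "insurance"]:
        df[col] = df[col].astype("category").cat.codes

    # (iii) standardize every variable: subtract the mean, divide by the standard deviation
    numeric = df.select_dtypes("number").columns
    df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()
    return df
```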


Cite this article

Chen, R., Paschalidis, I.C. Robust Grouped Variable Selection Using Distributionally Robust Optimization. J Optim Theory Appl 194, 1042–1071 (2022). https://doi.org/10.1007/s10957-022-02065-4

