Abstract
We propose a distributionally robust optimization formulation with a Wasserstein-based uncertainty set for selecting grouped variables under data perturbations, for both linear regression and classification problems. The resulting model offers a robustness explanation for grouped least absolute shrinkage and selection operator (GLASSO) algorithms and highlights the connection between robustness and regularization. We prove probabilistic bounds on the out-of-sample loss and the estimation bias, and establish the grouping effect of our estimator, showing that coefficients in the same group converge to the same value as the sample correlation between covariates approaches 1. Based on this result, we propose to use the spectral clustering algorithm with the Gaussian similarity function to group the predictors, which makes our approach applicable without knowing the grouping structure a priori. We compare our approach to an array of alternatives and provide extensive numerical results on both synthetic data and a large real dataset of surgery-related medical records. These results show that our formulation produces an interpretable and parsimonious model that encourages sparsity at the group level and achieves better prediction and estimation performance in the presence of outliers.
References
Bakin, S.: Adaptive regression and model selection in data mining problems, Doctoral dissertation, Australian National University (1999). https://openresearch-repository.anu.edu.au/handle/1885/9449
Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002)
Bertsimas, D., Copenhaver, M.S.: Characterization of the equivalence of robustification and regularization in linear and matrix regression. Eur. J. Oper. Res. 270(3), 931–942 (2018)
Bertsimas, D., Gupta, V., Paschalidis, I.C.: Data-driven estimation in equilibrium using inverse optimization. Math. Program. 153(2), 595–633 (2015)
Blanchet, J., Kang, Y.: Distributionally robust groupwise regularization estimator. arXiv preprint arXiv:1705.04241 (2017)
Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.-H.: Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Inference 143(11), 1835–1858 (2013)
Bunea, F., Lederer, J., She, Y.: The group square-root LASSO: theoretical properties and fast algorithms. IEEE Trans. Inf. Theory 60(2), 1313–1325 (2014)
Chen, R., Paschalidis, I.C.: A robust learning approach for regression models based on distributionally robust optimization. J. Mach. Learn. Res. 19(1), 517–564 (2018)
Chen, R., Paschalidis, I.C.: Distributionally robust learning. Found. Trends Optim. 4(1–2), 1–243 (2020)
Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3), 595–612 (2010)
Duchi, J.C., Namkoong, H.: Learning models with uniform performance via distributionally robust optimization. Ann. Stat. 49(3), 1378–1406 (2021)
Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Available at Optimization Online (2015)
Gao, R., Chen, X., Kleywegt, A.J.: Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050 (2017)
Gao, R., Kleywegt, A.J.: Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199 (2016)
Goh, J., Sim, M.: Distributionally robust optimization and its tractable approximations. Oper. Res. 58(4, Part 1), 902–917 (2010)
Hastie, T., Tibshirani, R., Tibshirani, R.J.: Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692 (2017)
Huang, J., Zhang, T.: The benefit of group sparsity. Ann. Stat. 38(4), 1978–2004 (2010)
Jacob, L., Obozinski, G., Vert, J.-P.: Group LASSO with overlap and graph LASSO. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 433–440. ACM (2009)
Jenatton, R., Audibert, J.-Y., Bach, F.: Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res. 12(Oct), 2777–2824 (2011)
Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A.B.: Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 39(4), 2164–2204 (2011)
Meier, L., Van De Geer, S., Bühlmann, P.: The group LASSO for logistic regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70(1), 53–71 (2008)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)
Obozinski, G., Jacob, L., Vert, J.-P.: Group LASSO with overlaps: the latent group LASSO approach. arXiv preprint arXiv:1110.0413 (2011)
Roth, V., Fischer, B.: The group-LASSO for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th International Conference on Machine Learning, pp. 848–855. ACM (2008)
Shafieezadeh-Abadeh, S., Esfahani, P.M., Kuhn, D.: Distributionally robust logistic regression. In: Advances in Neural Information Processing Systems, pp. 1576–1584 (2015)
Shafieezadeh-Abadeh, S., Kuhn, D., Esfahani, P.M.: Regularization via mass transportation. arXiv preprint arXiv:1710.10016 (2017)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: A sparse-group LASSO. J. Comput. Graph. Stat. 22(2), 231–245 (2013)
Tibshirani, R.: Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)
Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Xu, H., Caramanis, C., Mannor, S.: Robust regression and LASSO. In: Advances in Neural Information Processing Systems, pp. 1801–1808 (2009)
Yang, W., Xu, H.: A unified robust regression model for LASSO-like algorithms. In: International Conference on Machine Learning, pp. 585–593 (2013)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)
Zhao, P., Rocha, G., Yu, B.: The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat. 37(6A), 3468–3497 (2009)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Zymler, S., Kuhn, D., Rustem, B.: Distributionally robust joint chance constraints with second-order moment information. Math. Program. 137(1–2), 167–198 (2013)
Acknowledgements
We thank George Kasotakis, MD, MPH, for providing access to the surgery dataset. We also thank Taiyao Wang for help in processing this dataset. The research was partially supported by the NSF under Grants DMS-1664644, CNS-1645681, and IIS-1914792, by the ONR under Grants N00014-19-1-2571 and N00014-21-1-2844, by the NIH under Grants R01 GM135930 and UL54 TR004130, by the DOE under grants DE-AR-0001282 and NETL-EE0009696, and by the Boston University Kilachand Fund for Integrated Life Science and Engineering.
A Omitted Theoretical Results and Proofs
This section contains the theoretical statements and proofs that are omitted in Sects. 2 and 3.
Proof of Theorem 2.1
From the definition of the Wasserstein distance, \(W_1 (\mathbb {P}_{\text {out}}, \mathbb {P}_{\text {mix}})\) is the optimal value of the following optimization problem:
Similarly, \(W_1 (\mathbb {P}, \mathbb {P}_{\text {mix}})\) is the optimal value of the following optimization problem:
We propose a decomposition strategy. For Problem (16), decompose the joint distribution \(\varPi \) as \(\varPi = (1-q)S + q T\), where S and T are two joint distributions of \({\mathbf {z}}_1\) and \({\mathbf {z}}_2\). The first set of constraints in Problem (16) can be equivalently expressed as:
which is satisfied if
The second set of constraints can be expressed as:
which is satisfied if
The objective function can be decomposed as:
Therefore, Problem (16) can be decomposed into the following two subproblems.
Assume that the optimal solutions to the two subproblems are \(S^*\) and \(T^*\), respectively. Then \(\varPi _0 = (1-q)S^* + q T^*\) is a feasible solution to Problem (16). Therefore,
Similarly,
On the other hand, based on the subadditivity of the Wasserstein metric, we have,
We thus conclude that
To achieve the equality in (21), (18) and (19) must be equalities, i.e.,
and,
To see this, notice that if either (18) or (19) is a strict inequality, then (20) becomes a strict inequality, which contradicts (21). Thus,
\(\square \)
Proof of Theorem 2.2
We will use Hölder’s inequality, which we state for convenience.
Hölder’s inequality: Suppose we have two scalars \(p, q \ge 1\) with \(1/p + 1/q = 1\). Then, for any two vectors \({\mathbf {a}}= (a_1, \ldots , a_n)\) and \({\mathbf {b}}= (b_1, \ldots , b_n)\), \(\sum _{i=1}^n |a_i b_i| \le \Vert {\mathbf {a}}\Vert _p\, \Vert {\mathbf {b}}\Vert _q\).
The dual norm of \(\Vert \cdot \Vert _{r, s}\) evaluated at some vector \(\varvec{\beta }\) is the optimal value of problem (22):
We assume that \(\varvec{\beta }\) has the same group structure as \({\mathbf {x}}\), i.e., \(\varvec{\beta }= (\varvec{\beta }^1, \ldots , \varvec{\beta }^L)\). Using Hölder’s inequality, we can write
Define two new vectors in \(\mathbb {R}^L\)
Applying Hölder’s inequality again to \({\mathbf {x}}_{new}\) and \(\varvec{\beta }_{new}\), we obtain:
Therefore,
due to the constraint \(\Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{r, s} \le 1\). The result then follows. \(\square \)
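The duality claim, specialized to the weighted \((2, \infty )\)-norm used later for GWGL-LR, can be checked numerically. The following sketch is ours and assumes the convention \(\Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{2,\infty } = \max _l w_l \Vert {\mathbf {x}}^l\Vert _2\); it verifies that no feasible direction beats the weighted \((2,1)\)-norm with inverted weights, and that the explicit groupwise maximizer attains it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grouping: 3 groups of sizes p_l = 2, 3, 4; weights w_l = 1/sqrt(p_l)
sizes = [2, 3, 4]
w = [1.0 / np.sqrt(p) for p in sizes]
beta = rng.normal(size=sum(sizes))

def split(v):
    out, i = [], 0
    for p in sizes:
        out.append(v[i:i + p]); i += p
    return out

def norm_2inf_w(x):   # weighted (2, infinity)-norm: max_l w_l * ||x^l||_2
    return max(wl * np.linalg.norm(g) for wl, g in zip(w, split(x)))

def norm_21_winv(b):  # weighted (2, 1)-norm with inverted weights 1/w_l
    return sum(np.linalg.norm(g) / wl for wl, g in zip(w, split(b)))

# Random feasible directions never exceed the claimed dual norm ...
vals = []
for _ in range(2000):
    x = rng.normal(size=beta.size)
    x /= norm_2inf_w(x)              # project onto the unit ball of the primal norm
    vals.append(x @ beta)
dual = norm_21_winv(beta)

# ... and x^l = beta^l / (w_l ||beta^l||_2) attains it with primal norm 1
x_star = np.concatenate([g / (wl * np.linalg.norm(g))
                         for wl, g in zip(w, split(beta))])
```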
Proof of Theorem 2.3
To derive a tractable reformulation of the DRO-LG problem (9), we borrow an idea from [8] and [14], namely that for any \(\mathbb {Q} \in \varOmega \),
where \(\varPi _0\) is the optimal solution in the definition of the Wasserstein metric, i.e., it is the joint distribution of \(({\mathbf {x}}_1, y_1)\) and \(({\mathbf {x}}_2, y_2)\) with marginals \(\mathbb {Q}\) and \(\hat{\mathbb {P}}_N\) that achieves the minimum mass transportation cost. Comparing (23) with the definition of the Wasserstein distance, we wish to bound the following growth rate of \(l_{\varvec{\beta }}({\mathbf {x}}, y)\):
in order to relate \(\big |\mathbb {E}^{\mathbb {Q}}[ l_{\varvec{\beta }}({\mathbf {x}}, y)] - \mathbb {E}^{\hat{\mathbb {P}}_N}[ l_{\varvec{\beta }}({\mathbf {x}}, y)]\big |\) with \(W_1 (\mathbb {Q}, \ \hat{\mathbb {P}}_N)\). To this end, we define a continuous and differentiable univariate function \(h(a) \triangleq \log (1+\exp (-a))\), and apply the mean value theorem to it, which yields that for any \(a, b\in \mathbb {R}\), \(\exists c \in (a,b)\) such that:
By noting that \(l_{\varvec{\beta }}({\mathbf {x}}, y) = h(y\varvec{\beta }'{\mathbf {x}})\), we immediately have:
where the second step uses the Cauchy–Schwarz inequality, and the last step is due to the definition of the metric s in (8). Combining (24) with (23), it follows that for any \(\mathbb {Q} \in \varOmega \),
Therefore, the DRO-LG problem can be reformulated as:
\(\square \)
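The omitted displays in the proof above reduce to a standard 1-Lipschitz estimate for the logloss; the following reconstruction is ours, under the convention \(y \in \{-1, +1\}\). By the mean value theorem, \(h(a) - h(b) = h'(c)(a - b)\) for some \(c \in (a, b)\), with \(h'(c) = -e^{-c}/(1+e^{-c}) \in (-1, 0)\), so \(|h(a) - h(b)| \le |a - b|\). Hence \(\big |l_{\varvec{\beta }}({\mathbf {x}}_1, y_1) - l_{\varvec{\beta }}({\mathbf {x}}_2, y_2)\big | = \big |h(y_1\varvec{\beta }'{\mathbf {x}}_1) - h(y_2\varvec{\beta }'{\mathbf {x}}_2)\big | \le \big |y_1\varvec{\beta }'{\mathbf {x}}_1 - y_2\varvec{\beta }'{\mathbf {x}}_2\big |\), which the Cauchy–Schwarz inequality then bounds in terms of the metric s in (8).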
1.1 Prediction and Estimation Performance of the GWGL-LR Estimator
We are interested in two types of performance criteria: (1) prediction quality, or out-of-sample performance, which measures the predictive power of the GWGL solutions on new, unseen samples, and (2) estimation quality, which measures the discrepancy between the GWGL solutions and the underlying unknown true coefficients.
We note that GWGL-LR is a special case of the Wasserstein DRO formulation derived in [8, Eq. 10], and thus, the two types of performance guarantees derived in [8], one for generalization ability (prediction error) and the other for the discrepancy between the estimated and the true regression coefficients (estimation error), still apply to our GWGL-LR formulation.
We first establish a bound for the prediction bias of the solution to the GWGL-LR formulation, where the Wasserstein metric is induced by the weighted \((2, \infty )\)-norm with weight \({\mathbf {w}}= (\frac{1}{\sqrt{p_1}}, \ldots , \frac{1}{\sqrt{p_L}}, M)\). The dual norm in this case is just the weighted (2, 1)-norm with weight \({\mathbf {w}}^{-1}= (\sqrt{p_1}, \ldots , \sqrt{p_L}, 1/M)\). Throughout this section, we use \(\varvec{\beta }^*\) and \(\hat{\varvec{\beta }}\) to denote the true and estimated regression coefficient vectors, respectively. We first state several assumptions that are needed to establish the results.
Assumption A
The weighted \((2, \infty )\)-norm of the uncertainty parameter \(({\mathbf {x}}, y)\) with weight \({\mathbf {w}}= (\frac{1}{\sqrt{p_1}}, \ldots , \frac{1}{\sqrt{p_L}}, M)\) is bounded above by R almost surely.
Assumption B
For every feasible \(\varvec{\beta }\), \(\Vert (-\varvec{\beta }^1, \ldots , -\varvec{\beta }^L, 1)_{{\mathbf {w}}^{-1}}\Vert _{2, 1} \le \bar{B}\), where \({\mathbf {w}}^{-1} = (\sqrt{p_1}, \ldots , \sqrt{p_L}, 1/M)\).
Let \(\hat{\varvec{\beta }}\) be an optimal solution to (7), obtained using the samples \(({\mathbf {x}}_i, y_i)\), \(i=1,\ldots ,N\). Suppose we draw a new i.i.d. sample \(({\mathbf {x}},y)\). Using Theorem 3.3 in [8], Theorem A.1 establishes bounds on the error \(|y - {\mathbf {x}}'\hat{\varvec{\beta }}|\).
Theorem A.1
Under Assumptions A and B, for any \(0<\delta <1\), with probability at least \(1-\delta \) with respect to the sampling,
and for any \(\zeta >(2\bar{B}R/\sqrt{N})+ \bar{B}R\sqrt{8\log (2/\delta )/N}\),
Theorem A.1 essentially says that, with high probability, the expected loss on new test samples using our GWGL-LR estimator is upper bounded by the average loss on the training samples plus two terms that depend on the magnitude of the regularizer \(\bar{B}\), the uncertainty level R, and the confidence level \(\delta \), and that converge to zero as \(O(1/\sqrt{N})\). This result justifies the form of the regularizer used in (7) and guarantees a small generalization error for the GWGL-LR solution.
We next discuss the estimation performance of the GWGL-LR solution. The following theorem, a specialization of Theorem 3.11 in [8], provides a bound on the estimation bias of GWGL-LR. We first state the assumptions needed to establish the result.
Assumption C
The \(\ell _2\) norm of \((-\varvec{\beta }, 1)\) is bounded above by \(\bar{B}_2\).
Assumption D
For some set
and some positive scalar \(\underline{\alpha }\), the following holds,
where \({\mathbf {Z}}=[({\mathbf {x}}_1, y_1), \ldots , ({\mathbf {x}}_N, y_N)]\) is the matrix with columns \(({\mathbf {x}}_i, y_i), i = 1, \ldots , N\), and \(\mathbb {S}^{p+1}\) is the unit sphere in the \((p+1)\)-dimensional Euclidean space.
Assumption E
\(({\mathbf {x}}, y)\) is a centered sub-Gaussian random vector, i.e., it has zero mean and satisfies the following condition:
Assumption F
The covariance matrix of \(({\mathbf {x}}, y)\) has bounded positive eigenvalues. Set \(\varvec{\varGamma }=\mathbb {E}[({\mathbf {x}}, y)({\mathbf {x}}, y)']\); then,
Definition A.1
(Sub-Gaussian random variable) A random variable z is sub-Gaussian if it is zero mean, and the \(\psi _2\)-norm defined below is finite, i.e.,
An equivalent property for sub-Gaussian random variables is that their tail distribution decays at least as fast as a Gaussian, i.e.,
for some constant C. A random vector \({\mathbf {z}}\in \mathbb {R}^{p+1}\) is sub-Gaussian if \({\mathbf {z}}'{\mathbf {u}}\) is sub-Gaussian for any \({\mathbf {u}}\in \mathbb {R}^{p+1}\). The \(\psi _2\)-norm of a vector \({\mathbf {z}}\) is defined as:
where \(\mathbb {S}^{p+1}\) denotes the unit sphere in the \((p+1)\)-dimensional Euclidean space.
Definition A.2
(Gaussian width) For any set \(\mathcal {A}\subseteq \mathbb {R}^{p+1}\), its Gaussian width is defined as:
where \({\mathbf {g}}\sim \mathcal{N}({\mathbf {0}},{\mathbf {I}})\) is a \((p+1)\)-dimensional standard Gaussian random vector.
Theorem A.1
Suppose the true regression coefficient vector is \(\varvec{\beta }^*\) and the solution to GWGL-LR is \(\hat{\varvec{\beta }}\). Under Assumptions A, C, D, E, and F, when the sample size \(N\ge \bar{C_1}\bar{\mu }^4 \mu _0^2 (\lambda _{\text {max}}/\lambda _{\text {min}})\cdot (w(\mathcal {A}(\varvec{\beta }^*))+3)^2\), with probability at least \(1-\exp (-C_2N/\bar{\mu }^4)-C_4\exp (-C_5^2 (w(\mathcal {B}_u))^2/(4\rho ^2))\),
where \(\bar{\mu }=\mu \sqrt{(1/\lambda _{\text {min}})}\); \(\mu _0\) is the \(\psi _2\)-norm of a standard Gaussian random vector \({\mathbf {g}}\in \mathbb {R}^{p+1}\); \(w(\mathcal {A}(\varvec{\beta }^*))\) is the Gaussian width (see Definition A.2) of \(\mathcal {A}(\varvec{\beta }^*)\) (cf. Assumption D); \(w(\mathcal {B}_u)\) is the Gaussian width of \(\mathcal {B}_u\), where \(\mathcal {B}_u\) is the unit ball of the norm \(\Vert \cdot \Vert _\infty \); \(\rho =\sup _{{\mathbf {v}}\in \mathcal {B}_u}\Vert {\mathbf {v}}\Vert _2\); \(\varPsi (\varvec{\beta }^*)=\sup _{{\mathbf {v}}\in \mathcal {A}(\varvec{\beta }^*)}\Vert {\mathbf {v}}_{{\mathbf {w}}^{-1}}\Vert _{2, 1}\); and \(\bar{C_1}, C_2, C_4, C_5, \bar{C}\) are positive constants.
With Theorem A.1, we are able to provide bounds for performance metrics, such as the relative risk (RR), relative test error (RTE), and proportion of variance explained (PVE) [16]. All these metrics evaluate the accuracy of the regression coefficient estimates on a new test sample drawn from the same probability distribution as the training samples. Let \(({\mathbf {x}}_0, y_0)\) be such a test sample satisfying \(y_0 = {\mathbf {x}}_0' \varvec{\beta }^* + \eta _0\), where \(\eta _0\) is random noise with zero mean and variance \(\sigma ^2\), and independent of the zero mean predictor \({\mathbf {x}}_0\). For a fixed set of training samples, let the solution to GWGL-LR be \(\hat{\varvec{\beta }}\). As in [16], define
where \(\varvec{\varSigma }\) is the covariance matrix of \({\mathbf {x}}_0\), which is just the top left block of the matrix \(\varvec{\varGamma }\) in Assumption F. RTE is defined as:
PVE is defined as:
Using Theorem A.1, we can bound the term \((\hat{\varvec{\beta }} - \varvec{\beta }^*)'{\varvec{\varSigma }}(\hat{\varvec{\beta }} - \varvec{\beta }^*)\) as follows:
where \(\lambda _{\text {max}}({\varvec{\varSigma }})\) is the maximum eigenvalue of \({\varvec{\varSigma }}\). Using (25), bounds for RR, RTE, and PVE can be readily obtained; they are summarized in the following corollary.
Corollary A.2
Under the specifications in Theorem A.1, when the sample size
with probability at least \(1-\exp (-C_2N/\bar{\mu }^4)- C_4\exp (-C_5^2 (w(\mathcal {B}_u))^2/(4\rho ^2))\),
where all parameters are defined in the same way as in Theorem A.1.
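Given \(\hat{\varvec{\beta }}\), \(\varvec{\beta }^*\), \(\varvec{\varSigma }\), and \(\sigma ^2\), the three metrics are direct to compute. A minimal sketch (the RR/RTE/PVE formulas below are reconstructed from their definitions in [16], since the displays are omitted here; the function name is ours):

```python
import numpy as np

def regression_metrics(beta_hat, beta_star, Sigma, sigma2):
    """RR, RTE, PVE as defined in Hastie et al. [16] (reconstructed)."""
    d = beta_hat - beta_star
    excess = d @ Sigma @ d                  # (beta_hat - beta*)' Sigma (beta_hat - beta*)
    signal = beta_star @ Sigma @ beta_star  # beta*' Sigma beta*
    rr = excess / signal                    # relative risk
    rte = (sigma2 + excess) / sigma2        # relative test error
    pve = 1.0 - rte * sigma2 / (signal + sigma2)  # proportion of variance explained
    return rr, rte, pve

# Sanity check: the true coefficients give RR = 0, RTE = 1, and the oracle PVE
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
beta_star = np.array([1.0, -2.0])
rr, rte, pve = regression_metrics(beta_star, beta_star, Sigma, sigma2=1.0)
```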
1.2 Predictive Performance of the GWGL-LG Estimator
In this subsection, we establish bounds on the prediction error of the GWGL-LG solution. As in [8], we use the Rademacher complexity of the class of logloss (negative log-likelihood) functions to bound the generalization error. Two assumptions are needed, imposing conditions on the magnitude of the regularizer and on the uncertainty level of the predictors.
Assumption G
The weighted \((2, \infty )\)-norm of \({\mathbf {x}}\) with weight \({\mathbf {w}}= (\frac{1}{\sqrt{p_1}}, \ldots , \frac{1}{\sqrt{p_L}})\) is bounded above almost surely, i.e., \(\Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{2, \infty } \le R_{{\mathbf {x}}}\).
Assumption H
The weighted (2, 1)-norm of \(\varvec{\beta }\) with \({\mathbf {w}}^{-1} = (\sqrt{p_1}, \ldots , \sqrt{p_L})\) is bounded above, namely \(\sup _{\varvec{\beta }}\Vert \varvec{\beta }_{{\mathbf {w}}^{-1}}\Vert _{2, 1}=\bar{B}_1\).
Under these two assumptions, the logloss can be bounded via the Cauchy–Schwarz inequality.
Lemma A.3
Under Assumptions G and H, it follows that \(l_{\varvec{\beta }}({\mathbf {x}}, y) \le \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )\), since \(|\varvec{\beta }'{\mathbf {x}}| \le \Vert {\mathbf {x}}_{{\mathbf {w}}}\Vert _{2, \infty }\, \Vert \varvec{\beta }_{{\mathbf {w}}^{-1}}\Vert _{2, 1} \le R_{{\mathbf {x}}} \bar{B}_1\).
Now consider the following class of loss functions:
It follows from [4, 8] that the empirical Rademacher complexity of \(\mathcal {L}\), denoted by \(\mathcal {R}_N(\mathcal {L})\), can be upper bounded by:
Then, applying Theorem 8 in [2], we have the following result on the prediction error of our GWGL-LG estimator.
Theorem A.4
Let \(\hat{\varvec{\beta }}\) be an optimal solution to (11), obtained using N training samples \(({\mathbf {x}}_i, y_i)\), \(i=1,\ldots ,N\). Suppose we draw a new i.i.d. sample \(({\mathbf {x}},y)\). Under Assumptions G and H, for any \(0<\delta <1\), with probability at least \(1-\delta \) with respect to the sampling,
and for any \(\zeta >\frac{2 \log (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1))}{\sqrt{N}} + \log \big (1 + \exp (R_{{\mathbf {x}}} \bar{B}_1)\big )\sqrt{\frac{8\log (2/\delta )}{N}}\),
Theorem A.4 implies that the groupwise regularized LG formulation (11) yields a solution with a small generalization error on new i.i.d. samples.
1.3 Proof of Theorem 3.1 for GWGL-LR
Proof
By the optimality condition associated with formulation (7), \(\hat{\varvec{\beta }}\) satisfies:
where the \(\text {sgn}(\cdot )\) function is applied to a vector elementwise. Subtracting (29) from (28), we obtain:
Using the Cauchy–Schwarz inequality and \(\Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2^2=2(1-\rho )\), we obtain
\(\square \)
1.4 Proof of Theorem 3.1 for GWGL-LG
Proof
By the optimality condition associated with formulation (11), \(\hat{\varvec{\beta }}\) satisfies:
where \(x_{k,i}\) and \(x_{k, j}\) denote the ith and jth elements of \({\mathbf {x}}_k\), respectively. Subtracting (31) from (30), we get:
Note that the LHS of (32) can be written as \({\mathbf {v}}_1'{\mathbf {v}}_2\), where
Using the Cauchy–Schwarz inequality and \(\Vert {\mathbf {x}}_{,i}- {\mathbf {x}}_{,j}\Vert _2^2=2(1-\rho )\), we obtain
\(\square \)
B Omitted Numerical Results
This section contains the experimental setup and results that are omitted in Sect. 4.
1.1 Omitted Results in Sect. 4.1
1.1.1 Hyperparameter Tuning
All the penalty parameters are tuned using a separate validation dataset. Specifically, we divide the N training samples into two sets, dataset 1 and dataset 2 (the validation set). For a pre-specified range of values of the penalty parameters, dataset 1 is used to train the models and derive \(\hat{\varvec{\beta }}\), and the performance of \(\hat{\varvec{\beta }}\) is evaluated on dataset 2. We choose the penalty parameter that yields the minimum unpenalized loss of the respective approach on the validation set.

As for the range of values of the tuned parameters, we borrow ideas from [16], where the LASSO was tuned over 50 values ranging from \(\lambda _m \triangleq \Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }\) down to a small fraction of \(\lambda _m\) on a log scale. In our experiments, this range is adjusted for the GLASSO estimators. Specifically, for GWGL and GSRL, the tuning range is: \(\sqrt{\exp (\text {lin}(\log (0.005\cdot \Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),\log (\Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),50)) /\max (p_1, \ldots , p_L)},\) where the function \(\text {lin}(a, b, n)\) takes scalars a and b and an integer n, and outputs a set of n values equally spaced between a and b; the \(\exp \) function is applied elementwise to a vector. Compared to LASSO, the values are scaled by \(\max (p_1, \ldots , p_L)\), and the square-root operation accounts for the \(\ell _1\)-loss (the square root of the \(\ell _2\)-loss) used in these formulations. For the GLASSO with \(\ell _2\)-loss, the range is: \(\exp (\text {lin}(\log (0.005\cdot \Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),\log (\Vert {\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }),50)) /\sqrt{\max (p_1, \ldots , p_L)}.\)
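The two tuning grids can be generated directly from these formulas. A sketch (the function names and synthetic data are ours):

```python
import numpy as np

def lin(a, b, n):
    """n values equally spaced between a and b (the lin(a, b, n) of the text)."""
    return np.linspace(a, b, n)

def tuning_grids(X, y, group_sizes, n=50):
    lam_m = np.abs(X.T @ y).max()                    # lambda_m = ||X'y||_inf
    log_grid = lin(np.log(0.005 * lam_m), np.log(lam_m), n)
    p_max = max(group_sizes)                         # max(p_1, ..., p_L)
    grid_gwgl = np.sqrt(np.exp(log_grid) / p_max)    # GWGL / GSRL (ell_1-type loss)
    grid_glasso = np.exp(log_grid) / np.sqrt(p_max)  # GLASSO with ell_2 loss
    return grid_gwgl, grid_glasso

rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 9)), rng.normal(size=40)
g1, g2 = tuning_grids(X, y, group_sizes=[2, 3, 4])
```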
1.1.2 Implementation of Spectral Clustering
In our implementation, the k-nearest neighbor similarity graph is constructed, where we connect \({\mathbf {x}}_{,i}\) and \({\mathbf {x}}_{,j}\) with an undirected edge if \({\mathbf {x}}_{,i}\) is among the k-nearest neighbors of \({\mathbf {x}}_{,j}\) (in the sense of Euclidean distance) or if \({\mathbf {x}}_{,j}\) is among the k-nearest neighbors of \({\mathbf {x}}_{,i}\). The parameter k is chosen such that the resulting graph is connected. Recall that we use the Gaussian similarity function
to construct the graph. The scale parameter \(\sigma _s\) in (33) is set to the mean distance of a point to its kth nearest neighbor [30]. We assume that the number of clusters is known in order to perform spectral clustering, but in case it is unknown, the eigengap heuristic [30] can be used, where the goal is to choose the number of clusters c such that all eigenvalues \(\lambda _1, \ldots , \lambda _c\) of the graph Laplacian are very small, but \(\lambda _{c+1}\) is relatively large.
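A minimal sketch of this construction and the eigengap heuristic, written from the recipe above (function names are ours, not the authors' released code; we use the normalized Laplacian \(L_{\text {sym}}\) recommended in [30]):

```python
import numpy as np

def knn_gaussian_affinity(X, k):
    """Symmetric k-NN graph with Gaussian edge weights; sigma_s is the mean
    distance of a point to its k-th nearest neighbor, as in [30]."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n = len(X)
    sigma_s = np.sort(D, axis=1)[:, k].mean()   # column 0 is the self-distance 0
    W = np.exp(-D**2 / (2.0 * sigma_s**2))      # Gaussian similarity (33)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]      # k nearest neighbors of each point
    M = np.zeros((n, n), dtype=bool)
    M[np.repeat(np.arange(n), k), nn.ravel()] = True
    W *= (M | M.T)   # keep an edge if either endpoint is a k-NN of the other
    np.fill_diagonal(W, 0.0)
    return W

def eigengap_num_clusters(W, max_c=6):
    """Eigengap heuristic: choose c so that lambda_1, ..., lambda_c are all
    small while lambda_{c+1} is comparatively large."""
    d = W.sum(axis=1)
    L = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))  # L_sym = I - D^{-1/2} W D^{-1/2}
    vals = np.sort(np.linalg.eigvalsh(L))[:max_c + 1]
    return int(np.argmax(np.diff(vals))) + 1

# Two well-separated blobs should yield c = 2
rng = np.random.default_rng(0)
blob = lambda c: c + 0.01 * rng.normal(size=(8, 2))
X = np.vstack([blob(0.0), blob(10.0)])
W = knn_gaussian_affinity(X, k=5)
```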
1.1.3 The Number of Dropped Groups/Features on the Surgery Data
These results are in Table 5.
1.1.4 The Impact on the Performance Metrics when \(q = 20\%\)
1.2 Omitted Results in Sect. 4.2
1.2.1 Preprocessing the Dataset
Data were preprocessed as follows: (i) categorical variables (such as race, discharge destination, and insurance type) were numerically encoded and units were homogenized; (ii) missing values were replaced by the mode; (iii) all variables were normalized by subtracting the mean and dividing by the standard deviation; and (iv) patients who died within 30 days of discharge or had a postoperative length of stay greater than 30 days were excluded.
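These steps can be sketched on a toy table (the column names below are hypothetical; the actual surgery-data schema is not given here). In the sketch, the exclusion step (iv) is applied first so that imputation and normalization statistics are computed on the retained cohort:

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for the surgery-data schema
df = pd.DataFrame({
    "race": ["A", "B", None, "A", "B"],
    "los_postop": [3, 35, 7, 2, 10],   # postoperative length of stay (days)
    "died_30d": [0, 0, 1, 0, 0],
    "age": [44.0, 61.0, 70.0, np.nan, 55.0],
})

# (iv) exclude 30-day deaths and postoperative stays longer than 30 days
df = df[(df["died_30d"] == 0) & (df["los_postop"] <= 30)].copy()

# (ii) replace missing values by the column mode
for col in df.columns:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])

# (i) numerically encode categorical variables
df["race"] = df["race"].astype("category").cat.codes

# (iii) normalize: subtract the mean, divide by the standard deviation
num = ["age", "los_postop"]
df[num] = (df[num] - df[num].mean()) / df[num].std()
```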
Chen, R., Paschalidis, I.C. Robust Grouped Variable Selection Using Distributionally Robust Optimization. J Optim Theory Appl 194, 1042–1071 (2022). https://doi.org/10.1007/s10957-022-02065-4