Abstract
PolieDRO is a novel analytics framework for classification and regression that harnesses the power and flexibility of data-driven distributionally robust optimization (DRO) to circumvent the need for regularization hyperparameters. Recent literature shows that traditional machine learning methods such as SVM and the (square-root) LASSO can be written as Wasserstein-based DRO problems. Inspired by those results, we propose a hyperparameter-free ambiguity set that exploits the polyhedral structure of data-driven convex hulls, generating computationally tractable regression and classification methods for any convex loss function. Numerical results based on 100 real-world databases and an extensive experiment with synthetically generated data show that our methods consistently outperform their traditional counterparts.
Data availability
The real-world experiments use publicly available data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The synthetic data were generated with a Julia script available upon request.
Code availability
The current version of the code is available upon request. An open-source Julia package, PolieDRO.jl, is under development; it will allow users to apply different loss functions to regression and classification problems using the PolieDRO framework.
Notes
Note that we enforce \(\underline{p}_0=\overline{p}_0=1\) to ensure that \(P\) is a probability measure.
Funding
Gutierrez's research was funded by FAPERJ grant 26/200.598/2021, a CAPES grant, and CNPq grant 141185/2019-8. Valladão's research was funded by FAPERJ grant E-26/201.287/2021 and CNPq grant 309456/2020-7.
Author information
Contributions
Valladão devised the project and the main conceptual ideas. Gutierrez, Valladão, and Pagnoncelli worked out the technical details and proofs. Valladão and Pagnoncelli conceived and planned the experiments, while Gutierrez performed the implementation and ran the algorithms for all the numerical results in the paper. Gutierrez analyzed the results and organized them in the paper, and Gutierrez, Valladão, and Pagnoncelli wrote the manuscript.
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Johannes Fürnkranz.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Problem reformulation
Consider the problem formulation in (4). In its current form the problem is not directly solvable, so we write its Lagrangian formulation:
Reorganizing the terms, we can then write:
The Lagrange dual function \(g(\lambda , \kappa )\) can then be written as:
Notice that the term within parentheses can be analyzed as:
Since we are only interested in the finite cost case, the dual problem \(\displaystyle \min _{\lambda \ge 0,\kappa \ge 0} g(\lambda ,\kappa )\) is given by:
Although writing the dual version of the problem removed the integrals, the problem is still not tractable, as the first constraint must hold at an infinite number of points.
Proposition 2
Let \({\mathcal {R}}(w)\) be a constraint that is valid \(\forall w \in {\mathcal {C}}_0\), and let \(\bigcup _{i \in {\mathcal {F}}} \overline{{\mathcal {C}}}_i\) be a partition of \({\mathcal {C}}_0\). Then, the following are equivalent:
Based on Proposition 2, we can rewrite the constraint:
as
Proposition 3
Consider the sets \({\mathcal {F}} = \{0, 1,\ldots , I\}\), \(\{{\mathcal {V}}_i\}_{i \in {\mathcal {F}}}\), and \(\{{\mathcal {C}}_i\}_{i \in {\mathcal {F}}}\) obtained from a procedure as defined in Algorithm 1. In addition, let \(w' \in {\mathcal {C}}_0\) and let \(\bigcup _{i \in {\mathcal {F}}} \overline{{\mathcal {C}}}_i\) be a partition of \({\mathcal {C}}_0\), so that \(w' \in \overline{{\mathcal {C}}}_i\) for some \(i \in {\mathcal {F}}\). We define the index set of all supersets (antecedents of \({\mathcal {C}}_i\)) by \({\mathcal {A}}(i)= \{i\} \cup \{ i' \in {\mathcal {F}}: {\mathcal {C}}_i \subsetneq {\mathcal {C}}_{i'}\}\) and the index set of all remaining hulls (descendants of \({\mathcal {C}}_i\)) by \({\mathcal {D}}(i)= {\mathcal {F}} {\setminus } {\mathcal {A}}(i)\).
Thus we can write:
1. \(w' \in {\mathcal {C}}_j\) for all \(j \in {\mathcal {A}}(i)\), and \(w' \notin {\mathcal {C}}_j\) for all \(j \in {\mathcal {D}}(i)\);
2. \(\displaystyle \sum _{i' \in {\mathcal {F}}} {\mathbb {I}}_{\{w' \in {\mathcal {C}}_{i'}\}} = \sum _{i' \in {\mathcal {A}}(i)} 1 = |{\mathcal {A}}(i)|\).
Based on Proposition 3, we can rewrite the constraint:
as
Since the constraint above is valid \(\forall w \in \overline{{\mathcal {C}}}_i, \forall i \in {\mathcal {F}}\), we can write:
Concluding the proof \(\blacksquare\).
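The nesting identity behind Proposition 3 can be checked numerically. The sketch below is illustrative only: it is written in Python with SciPy (the authors' implementation is in Julia), and it assumes that Algorithm 1 produces nested convex hulls by repeatedly peeling off the vertices of the current hull. For hulls obtained this way, the membership indicators of any point form a non-increasing sequence, so their sum equals \(|{\mathcal {A}}(i)|\):

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(0)
pts = rng.normal(size=(300, 2))

# Peel nested convex hulls C_0 ⊇ C_1 ⊇ ...: each hull is built on the
# points that remain after removing the previous hull's vertices.
hulls, remaining = [], pts
while len(remaining) > 8:
    hull = ConvexHull(remaining)
    hulls.append(Delaunay(remaining))  # Delaunay provides a point-in-hull test
    remaining = np.delete(remaining, hull.vertices, axis=0)

def membership_flags(w):
    """Indicators of w ∈ C_i, outermost hull first."""
    return [int(tri.find_simplex(w) >= 0) for tri in hulls]

# Because the hulls are nested, a point inside C_i lies in every antecedent
# hull, so the indicators are non-increasing and their sum equals |A(i)|.
for w in rng.normal(size=(50, 2)):
    flags = membership_flags(w)
    assert flags == sorted(flags, reverse=True)
```

The `membership_flags` helper is hypothetical and only serves to make the indicator sum of Proposition 3 concrete.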
Appendix 2: The estimation of empirical probabilities
During the process of estimating the data-driven ambiguity set, we must obtain two results: (i) the polyhedral convex sets and (ii) their associated probability coverage intervals. In Sect. 2.2, we describe the data-driven procedure and present the approximation used to estimate the intervals. In this section, we present results that support the use of this approximation methodology.
Consider the outcome of applying Algorithm 1 and the coverage intervals associated with each convex hull \({\mathcal {C}}_i\). Ideally, the probability interval \([\underline{p}_i, \hspace{1mm} \overline{p}_i]\) should include the true probability that a random point drawn from the original data distribution falls within the convex hull \({\mathcal {C}}_i\), given the specified significance level \(\alpha\). In this context, we refer to accuracy as the percentage of probability intervals \([\underline{p}_i, \hspace{1mm} \overline{p}_i]\) that include the true probability \(p_i\). Another property of interest is how well the outermost hull \({\mathcal {C}}_0\) approximates the true distribution support, which we refer to as coverage.
To assess these properties, we ran the following experiment, with a random variable X following a multivariate normal distribution as the data-generating process:
1. Let \(X \sim N(\mu , \Sigma )\), where \(\mu\) is the unit vector of dimension \(d = 3\) and \(\Sigma\) is the \(d \times d\) identity matrix;
2. Let N be the sample size used to construct the ambiguity set;
3. For a given N, we generate a sample and apply Algorithm 1, obtaining the convex hulls \(\{{\mathcal {C}}_i\}_{i = 0}^{{\mathcal {I}}}\) and the probability intervals \(\{[\underline{p}_i, \hspace{1mm} \overline{p}_i]\}_{i = 0}^{{\mathcal {I}}}\);
4. For each convex hull \({\mathcal {C}}_i\), we approximate the true probability coverage \(p_i\) under the data-generating distribution by drawing a very large sample of size \(S = 1{,}000{,}000\) and verify whether \(p_i \in [\underline{p}_i, \hspace{1mm} \overline{p}_i]\), where the interval is the one computed in step 3 from the sample of size N;
5. We calculate the experiment's accuracy (percentage of estimated intervals that contain the true probability) and coverage (percentage of the distribution support within \({\mathcal {C}}_0\)).
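The steps above can be sketched as follows. This is a minimal illustration in Python with SciPy (the authors' implementation is a Julia script); it assumes that Algorithm 1 peels nested convex hulls and that the intervals \([\underline{p}_i, \overline{p}_i]\) are normal-approximation binomial confidence intervals, as stand-ins for the actual constructions of Sect. 2.2 and Algorithm 1:

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay
from scipy.stats import norm

rng = np.random.default_rng(42)
d, N, S, alpha = 3, 500, 100_000, 0.10
z = norm.ppf(1 - alpha / 2)

# Steps 1-3: sample X ~ N(mu, I) and peel nested convex hulls from it.
sample = rng.normal(loc=1.0, size=(N, d))
hulls, remaining = [], sample
while len(remaining) > 4 * (d + 1):
    h = ConvexHull(remaining)
    hulls.append(Delaunay(remaining))  # Delaunay provides a point-in-hull test
    remaining = np.delete(remaining, h.vertices, axis=0)

# Step 4: approximate each hull's true coverage with a very large sample.
big = rng.normal(loc=1.0, size=(S, d))
hits = 0
for tri in hulls[1:]:  # p_0 is fixed to 1, so C_0 is assessed via coverage instead
    p_hat = np.mean(tri.find_simplex(sample) >= 0)  # empirical coverage
    half = z * np.sqrt(p_hat * (1 - p_hat) / N)     # assumed Wald-style interval
    p_true = np.mean(tri.find_simplex(big) >= 0)    # proxy for the true probability
    hits += int(p_hat - half <= p_true <= p_hat + half)

# Step 5: accuracy = share of intervals containing the true probability;
# coverage = share of the distribution's mass inside the outer hull C_0.
accuracy = hits / (len(hulls) - 1)
coverage = np.mean(hulls[0].find_simplex(big) >= 0)
```

Both the peeling loop and the interval formula are assumptions for illustration; the paper's exact constructions may differ.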
We vary the sample size N from 10 to 10,000 and repeat the experiment 5,000 times for each value, using a significance level of \(\alpha = 10\%\). The average accuracy is presented in Fig. 10 and the average coverage in Fig. 11.
One can observe that as the sample size grows, both accuracy and coverage converge to the values expected given the experiment's significance level. Naturally, the quality of the approximation improves with the sample size; however, even for relatively small samples the approximation is reasonable. Moreover, our empirical results validate the method, as the experiments in Sects. 4.2 and 4.3 show.
Appendix 3: The choice of significance level
To apply the PolieDRO framework to classification or regression models, one must choose the significance level \(\alpha\). This value can be interpreted as the flexibility of the probability coverage of each convex hull defining the space of distributions under consideration, that is, the ambiguity set. The idea is that the convex hulls capture structure that arises from the data, and \(\alpha\) controls how broad the set of distributions admitted by the ambiguity set is.
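To make the role of \(\alpha\) concrete, the sketch below shows how shrinking \(\alpha\) widens each hull's coverage interval and hence enlarges the ambiguity set. The `coverage_interval` helper is a hypothetical, normal-approximation stand-in for the interval construction of Sect. 2.2:

```python
import math

# Two-sided z-values for the typical significance levels used in the paper.
Z = {0.10: 1.645, 0.05: 1.960, 0.01: 2.576}

def coverage_interval(p_hat, n, alpha):
    """Assumed Wald-style interval around an empirical coverage p_hat
    estimated from n points, clipped to [0, 1]."""
    half = Z[alpha] * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

# Smaller alpha -> wider interval -> more distributions in the ambiguity set.
widths = []
for alpha in (0.10, 0.05, 0.01):
    lo, hi = coverage_interval(0.8, 200, alpha)
    widths.append(hi - lo)
assert widths[0] < widths[1] < widths[2]
```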
These values should be treated as statistical significance parameters, with typical values such as \(\alpha = 10\%\), \(\alpha = 5\%\), or \(\alpha = 1\%\). We repeat the experiments for these values and report the results in Tables 11, 12, 13, 14 and 15 for the real-world data sets and in Tables 16, 17 and 18 for the synthetic data.
In Tables 11, 12, 13, 14 and 15, we have highlighted in bold the cases where the PolieDRO version outperformed its nominal counterpart for each value of \(\alpha\). We have summarized the results in Table 19.
For the synthetic datasets, we followed the same criteria as in Sect. 4.3. We identified the highest number of wins (W), ties (T), or losses (L) for each experiment in Tables 16, 17 and 18, and provided a summary of the results in Table 20.
Our results indicate that the choice of the statistical parameter \(\alpha\) has little impact on the study results. In most cases, it does not alter the performance of the PolieDRO models, and in the few cases where it does, the change is not substantial.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gutierrez, T., Valladão, D. & Pagnoncelli, B.K. PolieDRO: a novel classification and regression framework with non-parametric data-driven regularization. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06544-9