
PolieDRO: a novel classification and regression framework with non-parametric data-driven regularization


Abstract

PolieDRO is a novel analytics framework for classification and regression that harnesses the power and flexibility of data-driven distributionally robust optimization (DRO) to circumvent the need for regularization hyperparameters. Recent literature shows that traditional machine learning methods such as SVM and (square-root) LASSO can be written as Wasserstein-based DRO problems. Inspired by those results we propose a hyperparameter-free ambiguity set that explores the polyhedral structure of data-driven convex hulls, generating computationally tractable regression and classification methods for any convex loss function. Numerical results based on 100 real-world databases and an extensive experiment with synthetically generated data show that our methods consistently outperform their traditional counterparts.


Data availability

The real-world experiments use publicly available data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The synthetic data were generated with a Julia script that is available upon request.

Code availability

The current version of the code is available upon request. An open-source Julia package, PolieDRO.jl, is under development; it will allow users to apply different loss functions to regression and classification problems using the PolieDRO framework.

Notes

  1. Note that we enforce \(\underline{p}_0=\overline{p}_0=1\) to ensure that P is a probability measure.


Funding

Gutierrez's research was funded by FAPERJ grant 26/200.598/2021, a CAPES grant, and CNPq grant 141185/2019-8. Valladão's research was funded by FAPERJ grant E-26/201.287/2021 and CNPq grant 309456/2020-7.

Author information


Contributions

Valladão devised the project and the main conceptual ideas. Gutierrez, Valladão, and Pagnoncelli worked out the technical details and proofs. Valladão and Pagnoncelli conceived and planned the experiments, while Gutierrez performed the implementation and ran the algorithm for all the numerical results in the paper. Gutierrez analyzed the results and organized them in the paper, and Gutierrez, Valladão, and Pagnoncelli wrote the manuscript.

Corresponding author

Correspondence to Tomás Gutierrez.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Problem reformulation

Consider the problem formulation in (4). In its current form the problem is not directly solvable, so we write its Lagrangian formulation:

$$\begin{aligned} {\mathcal {L}}(P, \kappa , \lambda ) = \int _{{\mathcal {C}}_0} h(w;\beta )\, dP - \sum _{i \in {\mathcal {F}}} \Big ( \underline{p}_i - \int _{{\mathcal {C}}_0}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}}\, dP \Big ) \lambda _i - \sum _{i \in {\mathcal {F}}} \Big ( \int _{{\mathcal {C}}_0}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}}\, dP - \overline{p}_i \Big ) \kappa _i \end{aligned}$$
(22)

Reorganizing the terms, we can then write:

$$\begin{aligned} {\mathcal {L}}(P, \kappa , \lambda ) = \int _{{\mathcal {C}}_0}\Big [ h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i)\Big ] dP - \sum _{i \in {\mathcal {F}}}(\lambda _i \underline{p}_i - \kappa _i\overline{p}_i) \end{aligned}$$
(23)

The Lagrange dual function \(g(\lambda , \kappa )\) can then be written as:

$$\begin{aligned} g(\lambda , \kappa ) = \sup _{P \in {\mathcal {P}}} {\mathcal {L}}(P, \kappa , \lambda ) = \sup _{P \in {\mathcal {P}}} \bigg \{ \int _{{\mathcal {C}}_0}\Big [ h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i)\Big ] dP \bigg \} - \sum _{i \in {\mathcal {F}}}(\lambda _i \underline{p}_i - \kappa _i\overline{p}_i) \end{aligned}$$
(24)

Notice that the supremum term in (24) can be evaluated as:

$$\begin{aligned} \sup _{P \in {\mathcal {P}}} \bigg \{ \int _{{\mathcal {C}}_0}\Big [ h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i)\Big ] dP \bigg \} = {\left\{ \begin{array}{ll} 0, & \text {if } h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i) \le 0 \quad \forall w \in {\mathcal {C}}_0\\ \infty , & \text {otherwise} \end{array}\right. } \end{aligned}$$
(25)

For the supremum in (25) to be finite (indeed zero), the integrand must be nonpositive for every \(w \in {\mathcal {C}}_0\). Since we are only interested in the finite-cost case, the dual problem \(\displaystyle \min _{\lambda \ge 0,\kappa \ge 0} g(\lambda ,\kappa )\) is given by:

$$\begin{aligned} \begin{array}{llc} \displaystyle \min _{\lambda ,\kappa } & \displaystyle \sum _{i \in {\mathcal {F}}} (\kappa _i \overline{p}_i - \lambda _i \underline{p}_i) & \\ \text {s.t.} & \displaystyle h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i) \le 0, & \forall w \in {\mathcal {C}}_0\\ & \lambda _i \ge 0, & \forall i \in {\mathcal {F}}\\ & \kappa _i \ge 0, & \forall i \in {\mathcal {F}} \end{array} \end{aligned}$$
(26)

Although writing the dual removed the integrals, the problem is still not tractable, as the first constraint must hold at an infinite number of points.

Proposition 2

Let \({\mathcal {R}}(w)\) be a constraint on \(w\), and let \(\{\overline{{\mathcal {C}}}_i\}_{i \in {\mathcal {F}}}\) be a partition of \({\mathcal {C}}_0\), so that \(\bigcup _{i \in {\mathcal {F}}} \overline{{\mathcal {C}}}_i = {\mathcal {C}}_0\). Then, the following are equivalent:

$$\begin{aligned} {\mathcal {R}}(w), \forall w \in {\mathcal {C}}_0 \iff {\mathcal {R}}(w), \forall w \in \overline{{\mathcal {C}}}_i, \forall i \in {\mathcal {F}} \end{aligned}$$
(27)

Based on Proposition 2, we can rewrite the constraint:

$$\begin{aligned} h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i) \le 0, \forall w \in {\mathcal {C}}_0 \end{aligned}$$
(28)

as

$$\begin{aligned} h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i) \le 0, \forall w \in \overline{{\mathcal {C}}}_i, \forall i \in {\mathcal {F}} \end{aligned}$$
(29)

Proposition 3

Consider the sets \({\mathcal {F}} = \{0, 1,\ldots , I\}\), \(\{{\mathcal {V}}_i\}_{i \in {\mathcal {F}}}\) and \(\{{\mathcal {C}}_i\}_{i \in {\mathcal {F}}}\) obtained from the procedure defined in Algorithm 1. In addition, let \(w' \in {\mathcal {C}}_0\) and let \(\{\overline{{\mathcal {C}}}_i\}_{i \in {\mathcal {F}}}\) be a partition of \({\mathcal {C}}_0\); therefore, \(w' \in \overline{{\mathcal {C}}}_i\) for some \(i \in {\mathcal {F}}\). We also define the index set of all supersets of \({\mathcal {C}}_i\) (its antecedents) by \({\mathcal {A}}(i)= \{i\} \cup \{ i' \in {\mathcal {F}}: {\mathcal {C}}_i \subsetneq {\mathcal {C}}_{i'}\}\) and the index set of all subsets (the descendants of \({\mathcal {C}}_i\)) by \({\mathcal {D}}(i)= {\mathcal {F}} {\setminus } {\mathcal {A}}(i)\).

Thus we can write:

  1. \(w'\in {\mathcal {C}}_j\) for all \(j \in {\mathcal {A}}(i)\), and \(w' \notin {\mathcal {C}}_j\) for all \(j \in {\mathcal {D}}(i)\);

  2. \(\displaystyle \sum _{i \in {\mathcal {F}}} {\mathbb {I}}_{\{w' \in {\mathcal {C}}_i\}} = \sum _{i' \in {\mathcal {A}}(i)} 1 = |{\mathcal {A}}(i)|\).

Based on Proposition 3, we can rewrite the constraint:

$$\begin{aligned} h(w;\beta ) - \sum _{i \in {\mathcal {F}}}{\mathbb {I}}_{\{w \in {\mathcal {C}}_i\}} (\kappa _i - \lambda _i) \le 0, \forall w \in \overline{{\mathcal {C}}}_i, \forall i \in {\mathcal {F}} \end{aligned}$$
(30)

as

$$\begin{aligned} h(w;\beta ) - \sum _{i' \in {\mathcal {A}}(i)} (\kappa _{i'} - \lambda _{i'}) \le 0, \forall w \in \overline{{\mathcal {C}}}_i, \forall i \in {\mathcal {F}} \end{aligned}$$
(31)

Since the constraint above must hold for every \(w \in \overline{{\mathcal {C}}}_i\), for each \(i \in {\mathcal {F}}\), we can write:

$$\begin{aligned} \underset{w \in \overline{{\mathcal {C}}}_i}{\max } \big \{h(w;\beta )\big \} - \sum _{i' \in {\mathcal {A}}(i)} (\kappa _{i'} - \lambda _{i'}) \le 0, \forall i \in {\mathcal {F}} \end{aligned}$$
(32)

This concludes the proof. \(\blacksquare\)
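To make the final constraint set (32) concrete, the sketch below assembles the dual problem as a linear program in JuMP for a hinge loss, assuming, as a simplification, that the worst case over each region \(\overline{{\mathcal {C}}}_i\) is enforced at the observed vertices of the hulls. The inputs V, A, p_lo and p_hi are hypothetical placeholders (vertices, antecedent index sets \({\mathcal {A}}(i)\), and probability bounds), not the PolieDRO.jl API.

```julia
using JuMP, HiGHS, LinearAlgebra

# Minimal sketch of (32) for a hinge loss h((x, y); β, β0) = max(0, 1 - y(β'x + β0)).
# Hypothetical inputs (placeholders, not the paper's or PolieDRO.jl's API):
#   V[i]             :: vector of vertex pairs (x::Vector{Float64}, y::Float64) of hull i
#   A[i]             :: indices of hull i and of every hull containing it (the set A(i))
#   p_lo[i], p_hi[i] :: probability bounds of hull i
function poliedro_hinge_sketch(V, A, p_lo, p_hi, d)
    nh = length(V)                        # number of hulls
    model = Model(HiGHS.Optimizer)
    @variable(model, β[1:d])
    @variable(model, β0)
    @variable(model, κ[1:nh] >= 0)
    @variable(model, λ[1:nh] >= 0)
    # dual objective: Σ_i (κ_i p̄_i - λ_i p̲_i)
    @objective(model, Min, sum(κ[i] * p_hi[i] - λ[i] * p_lo[i] for i in 1:nh))
    for i in 1:nh
        rhs = sum(κ[j] - λ[j] for j in A[i])      # Σ_{i' ∈ A(i)} (κ_{i'} - λ_{i'})
        @constraint(model, 0 <= rhs)              # zero branch of the hinge loss
        for (x, y) in V[i]                        # enforce h(v; β) ≤ rhs at each vertex
            @constraint(model, 1 - y * (dot(β, x) + β0) <= rhs)
        end
    end
    optimize!(model)
    return value.(β), value(β0)
end
```

Swapping in another convex loss only changes the per-vertex constraints; a quadratic regression loss, for instance, would turn this LP into a second-order cone program.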

Appendix 2: The estimation of empirical probabilities

During the process of estimating the data-driven ambiguity set, we must obtain two objects: (i) the polyhedral convex sets and (ii) their associated probability coverage intervals. In Sect. 2.2, we describe the data-driven procedure and present the approximation used to estimate the intervals. In this section, we present results that support the use of this approximation methodology.

Consider the output of Algorithm 1 and the coverage intervals associated with each convex hull \({\mathcal {C}}_i\). Ideally, the probability interval \([\underline{p}_i, \overline{p}_i]\) should contain the true probability that a point drawn from the original data distribution falls within the convex hull \({\mathcal {C}}_i\), at the specified significance level \(\alpha\). In this context, we refer to accuracy as the percentage of probability intervals \([\underline{p}_i, \overline{p}_i]\) that contain the true probability \(p_i\). Another property of interest is how well the outer hull \({\mathcal {C}}_0\) replicates the true distribution support, which we refer to as coverage.

To assess these properties, we ran the following experiment, using a random variable X that follows a multivariate normal distribution as the data-generating process:

  1. Let \(X \sim N(\mu , \Sigma )\), where \(\mu\) is the unit vector of dimension \(d = 3\) and \(\Sigma\) is the \(d \times d\) identity matrix;

  2. Let N be the sample size used to construct the ambiguity set;

  3. For a given N, we generate a sample and apply Algorithm 1, obtaining the convex hulls \(\{{\mathcal {C}}_i\}_{i = 0}^{{\mathcal {I}}}\) and probability intervals \(\{[\underline{p}_i, \overline{p}_i]\}_{i = 0}^{{\mathcal {I}}}\);

  4. For each convex hull \({\mathcal {C}}_i\), we approximate the true probability coverage \(p_i\) under the data-generating distribution by drawing an extremely large sample of size \(S = 1{,}000{,}000\) and verifying whether \(p_i \in [\underline{p}_i, \overline{p}_i]\), where the interval is the one calculated in step 3 from the sample of size N;

  5. We compute the experiment's accuracy (the percentage of estimated intervals that contain the true probability) and its coverage (the percentage of the distribution support contained within the outer hull \({\mathcal {C}}_0\)).

We vary the sample size N from 10 to 10,000 and repeat the experiment 5,000 times for each value, using the significance level \(\alpha = 10\%\). The average accuracy is presented in Fig. 10 and the average coverage in Fig. 11.
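For concreteness, the sketch below implements the core of steps 4 and 5 for a single hull, with the interval of step 3 replaced by a normal-approximation binomial confidence interval at level \(\alpha\) (an illustrative assumption; the paper's construction is described in Sect. 2.2). The membership test inhull and the hull object are hypothetical placeholders, and Algorithm 1 itself is not reproduced here.

```julia
using Distributions, LinearAlgebra

# Sketch of the accuracy check for one convex hull C_i.
#   inhull(x, hull)::Bool -- hypothetical membership test (e.g. built with Polyhedra.jl)
#   sample                -- d × N matrix used to build the ambiguity set
#   dist                  -- data-generating distribution, e.g. MvNormal(ones(3), Matrix(1.0I, 3, 3))
function interval_covers_truth(inhull, hull, sample, dist; α = 0.10, S = 1_000_000)
    N    = size(sample, 2)
    phat = count(j -> inhull(sample[:, j], hull), 1:N) / N    # empirical hull coverage
    hw   = quantile(Normal(), 1 - α / 2) * sqrt(phat * (1 - phat) / N)
    p_lo, p_hi = max(phat - hw, 0.0), min(phat + hw, 1.0)     # interval around phat
    big    = rand(dist, S)                                    # very large sample ≈ ground truth
    p_true = count(j -> inhull(big[:, j], hull), 1:S) / S     # Monte Carlo estimate of p_i
    return p_lo <= p_true <= p_hi                             # does the interval cover p_i?
end
```

Averaging the returned values over all hulls and over the 5,000 repetitions gives an estimate of the accuracy reported in Fig. 10; coverage (Fig. 11) is obtained analogously by testing the large sample against the outer hull \({\mathcal {C}}_0\).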

Fig. 10: Average accuracy for each sample size

Fig. 11: Average coverage for each sample size

One can observe that, as the sample size grows, both accuracy and coverage converge to the values expected under the experiment's significance level. The quality of the approximation therefore improves with the sample size, but we argue that even a relatively small sample yields a reasonable approximation. Moreover, the experiments in Sects. 4.2 and 4.3 provide further empirical validation of the method.

Appendix 3: The choice of significance level

To apply the PolieDRO framework to classification or regression models, one must choose the significance level \(\alpha\). This value can be interpreted as the flexibility allowed in the probability coverage of each convex hull, and these coverages jointly define the space of distributions considered, that is, the ambiguity set. The idea is that the convex hulls capture structure that arises from the data, while the added flexibility controls how broad the resulting set of candidate distributions is.
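As a rough illustration of this flexibility, the snippet below shows how the half-width of a hull's probability interval grows as \(\alpha\) decreases, under the same normal-approximation interval assumed in the Appendix 2 sketch (an illustrative assumption, not necessarily the paper's exact construction):

```julia
using Distributions

# Half-width of a normal-approximation interval around an empirical hull probability.
half_width(phat, N, α) = quantile(Normal(), 1 - α / 2) * sqrt(phat * (1 - phat) / N)

half_width(0.5, 200, 0.10)   # ≈ 0.058
half_width(0.5, 200, 0.05)   # ≈ 0.069
half_width(0.5, 200, 0.01)   # ≈ 0.091
```

Smaller values of \(\alpha\) therefore widen every interval, enlarging the ambiguity set and making the resulting model more conservative.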

These values should be treated as statistical significance parameters, with typical choices such as \(\alpha = 10\%\), \(\alpha = 5\%\), or \(\alpha = 1\%\). We repeat the experiments for these values and display the results in Tables 11, 12, 13, 14 and 15 for the real-world data sets and in Tables 16, 17 and 18 for the synthetic data.

In Tables 11, 12, 13, 14 and 15, we have highlighted in bold the cases where the PolieDRO version outperformed its nominal counterpart for each value of \(\alpha\). We have summarized the results in Table 19.

For the synthetic data sets, we followed the same criteria as in Sect. 4.3. We identified the highest number of wins (W), ties (T), or losses (L) for each experiment in Tables 16, 17 and 18, and provided a summary of the results in Table 20.

Our results indicate that the choice of the statistical parameter \(\alpha\) has little impact on the study results. In most cases, it does not alter the performance of the PolieDRO models, and in the few cases where it does, the change is not substantial.

Table 11 Mean out-of-sample accuracy for different \(\alpha\)
Table 12 Mean out-of-sample accuracy for different \(\alpha\)
Table 13 Mean out-of-sample accuracy for different \(\alpha\)
Table 14 Average out-of-sample MSE for different \(\alpha\)
Table 15 Average out-of-sample MSE for different \(\alpha\)
Table 16 Experiments with synthetically generated data sets, \(\alpha = 0.10\)
Table 17 Experiments with synthetically generated data sets, \(\alpha = 0.05\)
Table 18 Experiments with synthetically generated data sets, \(\alpha = 0.01\)
Table 19 Pairwise performance of the PolieDRO models against their nominal counterparts using real-world data sets, for varying \(\alpha\)
Table 20 Pairwise performance of the PolieDRO models against their nominal counterparts using synthetic data sets, for varying \(\alpha\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gutierrez, T., Valladão, D. & Pagnoncelli, B.K. PolieDRO: a novel classification and regression framework with non-parametric data-driven regularization. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06544-9

