Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts

Nodet, Pierre; Lemaire, Vincent; Bondu, Alexis; Cornuéjols, Antoine

doi:10.1007/s10994-023-06372-3

Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts

Published: 21 September 2023

Volume 112, pages 4663–4692, (2023)
Cite this article

Machine Learning Aims and scope Submit manuscript

Pierre Nodet ORCID: orcid.org/0000-0001-5118-8072^1,3,
Vincent Lemaire²,
Alexis Bondu¹ &
…
Antoine Cornuéjols³

362 Accesses
2 Altmetric
Explore all metrics

Abstract

Training machine learning models from data with weak supervision and dataset shifts is still challenging. Designing algorithms when these two situations arise has not been explored much, and existing algorithms cannot always handle the most complex distributional shifts. We think the biquality data setup is a suitable framework for designing such algorithms. Biquality Learning assumes that two datasets are available at training time: a trusted dataset sampled from the distribution of interest and the untrusted dataset with dataset shifts and weaknesses of supervision (aka distribution shifts). The trusted and untrusted datasets available at training time make designing algorithms dealing with any distribution shifts possible. We propose two methods, one inspired by the label noise literature and another by the covariate shift literature for biquality learning. We experiment with two novel methods to synthetically introduce concept drift and class-conditional shifts in real-world datasets across many of them. We opened some discussions and assessed that developing biquality learning algorithms robust to distributional changes remains an interesting problem for future research.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Self-Enhancer: A Self-supervised Framework for Low-Supervision, Drifted Data with Significant Missing Values

Safe semi-supervised learning: a brief introduction

Article 18 June 2019

Nuclear discrepancy for single-shot batch active learning

Article Open access 26 June 2019

Data availability

All data used in this study are available publicly online on openML: https://www.openml.org/

Code availability

Code is available at the following url: https://github.com/pierrenodet/blds

References

Bickel, S., Brückner, M., Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In: Proceedings of the 24th international conference on Machine learning, pp 81–88.
Boser, B.E., Guyon, I.M., Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory, pp 144–152.
Breiman, L. (1984). Classification and regression trees. Routledge.
MATH Google Scholar
Chang, C. C., & Lin, C. J. (2011). Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.
Article Google Scholar
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Article Google Scholar
Cormen, T. H., Leiserson, C. E., Rivest, R. L., et al. (2022). Introduction to algorithms. MIT press.
David, S.B., Lu, T., Luu, T., et al. (2010). Impossibility theorems for domain adaptation. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, pp 129–136.
Dua, D., Graff, C. (2017). Uci machine learning repository. http://archive.ics.uci.edu/ml.
Fang, T., Lu, N., Niu, G., et al. (2020). Rethinking importance weighting for deep learning under distribution shift. Advances in Neural Information Processing Systems, 33, 11,996-12,007.
Google Scholar
Gama, J., Žliobaitė, I., Bifet, A., et al. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 1–37.
Article MATH Google Scholar
Gretton, A., Smola, A., Huang, J., et al. (2009). Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4), 5.
Google Scholar
Guyon, I. (2010). Datasets of the active learning challenge. Tech. rep.: University of Wisconsin-Madison Department of Computer Sciences.
Hastie, T., Tibshirani, R., Friedman, J. H., et al. (2009). The elements of statistical learning: data mining, inference, and prediction, (Vol. 2). Springer.
Book MATH Google Scholar
Hendrycks, D., Mazeika, M., Wilson, D., et al. (2018). Using trusted data to train deep networks on labels corrupted by severe noise. Advances in Neural Information Processing Systems, 31, 10456–10465.
Google Scholar
Huang, J., Gretton, A., Borgwardt, K., et al. (2007). Correcting sample selection bias by unlabeled data. In: Advances in neural information processing systems, pp 601–608.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449.
Article MATH Google Scholar
Jiang, L., Zhou, Z., Leung, T., et al. (2018). Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML.
Ke, G., Meng, Q., Finley, T., et al. (2017). Lightgbm: A highly efficient gradient boosting decision tree. In: Guyon I, Luxburg UV, Bengio S, et al (eds) Advances in Neural Information Processing Systems, vol 30. Curran Associates, Inc., https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
Kreuzberger, D., Kühl, N., Hirschl, S. (2022). Machine learning operations (mlops): Overview, definition, and architecture. arXiv preprint arXiv:2205.02302.
Lloyd, S. (1982). Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2), 129–137.
Article MathSciNet MATH Google Scholar
Miao, Y.Q., Farahat, A.K., Kamel, M.S. (2015). Ensemble kernel mean matching. In: 2015 IEEE International Conference on Data Mining, IEEE, pp 330–338.
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., et al. (2012). A unifying view on dataset shift in classification. Pattern Recognition, 45(1), 521–530.
Article Google Scholar
Nemenyi, P. (1962). Distribution-free multiple comparisons. Biometrics, 18(2), 263.
Google Scholar
Nikodym, O. (1930). Sur une généralisation des intégrales de m. j. radon. Fundamenta Mathematicae, 15(1),131–179.
Nodet, P., Lemaire, V., Bondu, A., et al. (2021a). From Weakly Supervised Learning to Biquality Learning: an Introduction. In: International Joint Conference on Neural Networks (IJCNN). IEEE.
Nodet, P., Lemaire, V., Bondu, A., et al. (2021b). Importance reweighting for biquality learning. In: International Joint Conference on Neural Networks (IJCNN). IEEE.
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2825–2830.
MathSciNet MATH Google Scholar
Ratner, A., Bach, S. H., Ehrenberg, H., et al. (2020). Snorkel: Rapid training data creation with weak supervision. The VLDB Journal, 29(2), 709–730.
Article Google Scholar
Ren, M., Zeng, W., Yang, B., et al. (2018). Learning to reweight examples for robust deep learning. In: International conference on machine learning, PMLR, pp 4334–4343.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Article MATH Google Scholar
Rudin, W. (1975). Analyse réelle et complexe. Dunod.
MATH Google Scholar
Shu, J., Xie, Q., Yi, L., et al. (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. Advances in Neural Information Processing Systems 32.
Steinhardt, J., Koh, P.W.W., Liang, P.S. (2017) Certified defenses for data poisoning attacks. Advances in Neural Information Processing Systems 30.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge University Press.
Book MATH Google Scholar
Vanschoren, J., van Rijn, J. N., Bischl, B., et al. (2014). Openml: Networked science in machine learning. SIGKDD Explor Newsl, 15(2), 49–60. https://doi.org/10.1145/2641190.2641198
Article Google Scholar
Veeramachaneni, K., Arnaldo, I., Korrapati, V., et al (2016) Ai2: Training a big data machine to defend. In: 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), pp 49–54.
Vert, J. P., Tsuda, K., & Schölkopf, B. (2004). A primer on kernel methods. Kernel Methods in Computational Biology, 47, 35–70.
Article Google Scholar
Wilcoxon, F. (1992). Individual comparisons by ranking methods. In: Breakthroughs in statistics. Springer, p 196–202.
Wright, S.J. (1999). Continuous optimization (nonlinear and linear programming). Foundations of Computer-Aided Process Design.
Yang, J., Zhou, K., Li, Y., et al. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.
Ye, Y., & Tse, E. (1989). An extension of karmarkar’s projective algorithm for convex quadratic programming. Mathematical Programming, 44(1), 157–179.
Article MathSciNet MATH Google Scholar
Yuen, M.C., King, I., Leung, K.S. (2011). A survey of crowdsourcing systems. In: 2011 IEEE third international conference on privacy, security, risk and trust and 2011 IEEE third international conference on social computing, IEEE, pp 766–773.
Zadrozny, B., Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In: Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’01, p 609–616
Zadrozny, B., Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 694–699.
Zheng, G., Awadallah, A.H., Dumais, S. (2021). Meta label correction for noisy label learning. Proceedings of the AAAI Conference on Artificial Intelligence 35
Zhou, Z. H. (2017). A brief introduction to weakly supervised learning. National Science Review, 5(1), 44–53.
Article MathSciNet Google Scholar

Download references

Funding

Pierre Nodet, Vincent Lemaire, Alexis Bondu, and Antoine Cornuéjols received funding from Orange SA.

Author information

Authors and Affiliations

Orange Labs, Châtillon, 92320, France
Pierre Nodet & Alexis Bondu
Orange Labs, Lannion, 22300, France
Vincent Lemaire
UMR MIA-Paris, AgroParisTech, INRAe, Université Paris-Saclay, Palaiseau, 91120, France
Pierre Nodet & Antoine Cornuéjols

Authors

Pierre Nodet
View author publications
You can also search for this author in PubMed Google Scholar
Vincent Lemaire
View author publications
You can also search for this author in PubMed Google Scholar
Alexis Bondu
View author publications
You can also search for this author in PubMed Google Scholar
Antoine Cornuéjols
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

PN, VL, AB, and AC contributed to the manuscript equally.

Corresponding author

Correspondence to Pierre Nodet.

Ethics declarations

Conflict of interest

Pierre Nodet, Vincent Lemaire, and Alexis Bondu have received research support from Orange SA. Antoine Cornuéjols has received research support from AgroParisTech and Orange SA.

Ethics approval

Not Applicable.

Consent to participate

All authors have read and approved the final manuscript.

Consent for publication

Not Applicable.

Additional information

Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 4 Table of conversion between p and the actual ratio of trusted data (actual)

Full size table

Table 5 Table of conversion between \(\rho\) and the actual ratio between the number of samples after and before subsampling (actual)

Full size table

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Nodet, P., Lemaire, V., Bondu, A. et al. Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts. Mach Learn 112, 4663–4692 (2023). https://doi.org/10.1007/s10994-023-06372-3

Download citation

Received: 23 November 2022
Revised: 17 May 2023
Accepted: 27 July 2023
Published: 21 September 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10994-023-06372-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts

Abstract

Access this article

Similar content being viewed by others

Self-Enhancer: A Self-supervised Framework for Low-Supervision, Drifted Data with Significant Missing Values

Safe semi-supervised learning: a brief introduction

Nuclear discrepancy for single-shot batch active learning

Data availability

Code availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendix

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts

Abstract

Access this article

Similar content being viewed by others

Self-Enhancer: A Self-supervised Framework for Low-Supervision, Drifted Data with Significant Missing Values

Safe semi-supervised learning: a brief introduction

Nuclear discrepancy for single-shot batch active learning

Data availability

Code availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation