
Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts

Published in: Machine Learning

Abstract

Training machine learning models from weakly supervised data under dataset shifts remains challenging. Algorithm design for settings where both problems arise has received little attention, and existing algorithms cannot always handle the most complex distribution shifts. We argue that the biquality data setup is a suitable framework for designing such algorithms. Biquality learning assumes that two datasets are available at training time: a trusted dataset sampled from the distribution of interest, and an untrusted dataset affected by dataset shifts and weaknesses of supervision (together, distribution shifts). Having both datasets available at training time makes it possible to design algorithms that deal with any distribution shift. We propose two methods for biquality learning, one inspired by the label-noise literature and the other by the covariate-shift literature. We evaluate them on many real-world datasets using two novel procedures that synthetically introduce concept drift and class-conditional shifts. Finally, we discuss the results and conclude that developing biquality learning algorithms robust to distributional changes remains an interesting problem for future research.
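The abstract's covariate-shift-inspired idea can be illustrated with a minimal sketch: train a domain discriminator to tell trusted from untrusted samples, convert its probabilities into importance weights for the untrusted data, and fit a final classifier on the reweighted union. This is a simplified illustration under assumed synthetic data, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: a small trusted set and a larger, shifted untrusted set.
X_trusted = rng.normal(0.0, 1.0, size=(200, 2))
y_trusted = (X_trusted[:, 0] > 0).astype(int)
X_untrusted = rng.normal(0.5, 1.0, size=(1000, 2))        # covariate shift
y_untrusted = (X_untrusted[:, 0] > 0).astype(int)
flip = rng.random(len(y_untrusted)) < 0.2                 # 20% label noise
y_untrusted[flip] = 1 - y_untrusted[flip]

# Density-ratio estimation via a domain discriminator (trusted vs untrusted):
# w(x) ~ P(trusted | x) / P(untrusted | x), up to a constant factor.
X_dom = np.vstack([X_trusted, X_untrusted])
d = np.r_[np.ones(len(X_trusted)), np.zeros(len(X_untrusted))]
disc = LogisticRegression().fit(X_dom, d)
p = disc.predict_proba(X_untrusted)[:, 1]
w_untrusted = p / (1.0 - p)

# Final classifier: trusted samples get weight 1, untrusted get w(x).
X_all = np.vstack([X_trusted, X_untrusted])
y_all = np.r_[y_trusted, y_untrusted]
w_all = np.r_[np.ones(len(y_trusted)), w_untrusted]
clf = LogisticRegression().fit(X_all, y_all, sample_weight=w_all)
```

The reweighting down-weights untrusted regions that are unlikely under the trusted distribution, which is the general mechanism the covariate-shift literature (e.g., Bickel et al., 2007) builds on.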



Data availability

All data used in this study are publicly available on OpenML: https://www.openml.org/
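Since the datasets live on OpenML, one way to load them is scikit-learn's `fetch_openml` helper. The loader below is a sketch; the dataset name and version passed to it are up to the reader, and the call needs network access at run time.

```python
from sklearn.datasets import fetch_openml


def load_openml_dataset(name: str, version: int = 1):
    """Return (X, y) for a named OpenML dataset as NumPy arrays.

    Note: this downloads the dataset on first call (network required);
    the name/version are placeholders, not the paper's exact benchmark list.
    """
    return fetch_openml(name, version=version, as_frame=False, return_X_y=True)
```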

Code availability

Code is available at the following URL: https://github.com/pierrenodet/blds


Funding

Pierre Nodet, Vincent Lemaire, Alexis Bondu, and Antoine Cornuéjols received funding from Orange SA.

Author information


Contributions

PN, VL, AB, and AC contributed to the manuscript equally.

Corresponding author

Correspondence to Pierre Nodet.

Ethics declarations

Conflict of interest

Pierre Nodet, Vincent Lemaire, and Alexis Bondu have received research support from Orange SA. Antoine Cornuéjols has received research support from AgroParisTech and Orange SA.

Ethics approval

Not Applicable.

Consent to participate

All authors have read and approved the final manuscript.

Consent for publication

Not Applicable.

Additional information

Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Fig. 7: Additional results of the Wilcoxon signed-rank test computed on all datasets. Each panel compares one competitor against another for a given trusted-data ratio; panels in the same row show the same pair of competitors at trusted-data ratios \(p=0.25\), \(p=0.5\), and \(p=0.75\). In each panel, "\(\circ\)", "\(\cdot\)", and "\(\bullet\)" indicate respectively a win, a tie, and a loss of the first competitor against the second; the vertical axis is r and the horizontal axis is \(\rho\)

Table 4: Conversion between p and the actual ratio of trusted data
Table 5: Conversion between \(\rho\) and the actual ratio between the number of samples after and before subsampling
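The pairwise comparisons summarized in Fig. 7 rest on the Wilcoxon signed-rank test, a paired non-parametric test over per-dataset scores. A minimal sketch with SciPy follows; the accuracy values are hypothetical, not the paper's actual results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for two competitors across 30 datasets
# (illustrative numbers only, not taken from the paper).
rng = np.random.default_rng(0)
acc_first = rng.uniform(0.70, 0.90, size=30)
acc_second = acc_first - rng.uniform(0.005, 0.05, size=30)  # first is better

# Paired, non-parametric comparison over datasets, as in the appendix figures:
# a small p-value lets us call a "win" for the first competitor, else a "tie".
stat, p_value = wilcoxon(acc_first, acc_second)
outcome = "win" if p_value < 0.05 else "tie"
```

Because the test is paired, it compares the two competitors dataset by dataset rather than pooling scores, which is why it suits per-figure win/tie/loss summaries.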

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nodet, P., Lemaire, V., Bondu, A. et al. Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts. Mach Learn 112, 4663–4692 (2023). https://doi.org/10.1007/s10994-023-06372-3

