Abstract
Training machine learning models from data with weak supervision and dataset shifts remains challenging. Algorithm design for the case where both situations arise together has received little attention, and existing algorithms cannot always handle the most complex distributional shifts. We argue that the biquality data setup is a suitable framework for designing such algorithms. Biquality learning assumes that two datasets are available at training time: a trusted dataset sampled from the distribution of interest, and an untrusted dataset affected by dataset shifts and weaknesses of supervision (together referred to as distribution shifts). Having both datasets at training time makes it possible to design algorithms that deal with any distribution shift. We propose two biquality learning methods, one inspired by the label noise literature and the other by the covariate shift literature. We also introduce two novel methods for synthetically injecting concept drift and class-conditional shifts into real-world datasets, and use them in experiments across many such datasets. We discuss the results and conclude that developing biquality learning algorithms robust to distributional changes remains an interesting problem for future research.
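To make the biquality setup concrete, the following is a minimal sketch of one generic way a trusted dataset can be used to correct an untrusted one: reweighting each untrusted sample by an estimate of the joint density ratio p_T(x, y) / p_U(x, y). It is an illustration only and not one of the two methods proposed in the article. It assumes discrete, hashable features so that the ratio can be estimated from simple counts; the function name `joint_importance_weights` and the toy data are hypothetical.

```python
from collections import Counter


def joint_importance_weights(trusted, untrusted):
    """Estimate w(x, y) = p_T(x, y) / p_U(x, y) from empirical frequencies.

    `trusted` and `untrusted` are lists of (x, y) pairs with discrete,
    hashable x. Untrusted pairs never seen in the trusted set get weight 0.
    """
    n_t, n_u = len(trusted), len(untrusted)
    count_t = Counter(trusted)
    count_u = Counter(untrusted)
    # Every pair below comes from `untrusted`, so count_u[pair] >= 1.
    return [
        (count_t[pair] / n_t) / (count_u[pair] / n_u)
        for pair in untrusted
    ]


# Toy data (assumption): x is a discrete feature, y a binary label.
trusted = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
# Untrusted set: one "a" sample is mislabelled and "b" is over-represented.
untrusted = [("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 1), ("b", 1)]

weights = joint_importance_weights(trusted, untrusted)
for pair, w in zip(untrusted, weights):
    print(pair, round(w, 3))
```

In this toy case the mislabelled pair ("a", 1) receives weight 0, the under-represented pair ("a", 0) is up-weighted to about 3, and each over-represented ("b", 1) sample is down-weighted to 0.75. Such weights could then be passed as sample weights to any learner that supports them. With continuous features, counting no longer works and the ratio would have to be estimated by other means, for example with a probabilistic classifier.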
Data availability
All data used in this study are publicly available online on OpenML: https://www.openml.org/
Code availability
Code is available at the following URL: https://github.com/pierrenodet/blds
Funding
Pierre Nodet, Vincent Lemaire, Alexis Bondu, and Antoine Cornuéjols received funding from Orange SA.
Author information
Contributions
PN, VL, AB, and AC contributed to the manuscript equally.
Ethics declarations
Conflict of interest
Pierre Nodet, Vincent Lemaire, and Alexis Bondu have received research support from Orange SA. Antoine Cornuéjols has received research support from AgroParisTech and Orange SA.
Ethics approval
Not Applicable.
Consent to participate
All authors have read and approved the final manuscript.
Consent for publication
Not Applicable.
Additional information
Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, Charalampos Tsourakakis.
Appendix
About this article
Cite this article
Nodet, P., Lemaire, V., Bondu, A. et al. Biquality learning: a framework to design algorithms dealing with closed-set distribution shifts. Mach Learn 112, 4663–4692 (2023). https://doi.org/10.1007/s10994-023-06372-3
DOI: https://doi.org/10.1007/s10994-023-06372-3