
Machine Learning, Volume 107, Issue 8–10, pp. 1561–1595

Learning from binary labels with instance-dependent noise

  • Aditya Krishna Menon
  • Brendan van Rooyen
  • Nagarajan Natarajan
Article
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track

Abstract

Supervised learning has seen numerous theoretical and practical advances over the last few decades. However, its basic assumption of identical train and test distributions often fails to hold in practice. One important example of this is when the training instances are subject to label noise: that is, where the observed labels do not accurately reflect the underlying ground truth. While the impact of simple noise models has been extensively studied, relatively less attention has been paid to the practically relevant setting of instance-dependent label noise. It is thus unclear whether one can learn, both in theory and in practice, good models from data subject to such noise, with no access to clean labels. We provide a theoretical analysis of this issue, with three contributions. First, we prove that for instance-dependent (but label-independent) noise, any algorithm that is consistent for classification on the noisy distribution is also consistent on the noise-free distribution. Second, we prove that consistency also holds for the area under the ROC curve, assuming the noise scales (in a precise sense) with the inherent difficulty of an instance. Third, we show that the Isotron algorithm can efficiently and provably learn from noisy samples when the noise-free distribution is a generalised linear model. We empirically confirm our theoretical findings, which we hope may stimulate further analysis of this important learning setting.
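To make the setting concrete, the following is a minimal, illustrative sketch (not the authors' code): it simulates instance-dependent but label-independent noise on data drawn from a generalised linear model, fits the noisy labels with a basic Isotron-style loop (alternating a perceptron-like weight update with a one-dimensional isotonic fit of the link function), and then scores the learned ranking against the clean labels. The particular noise function rho(x), which grows near the decision boundary to mimic noise that scales with instance difficulty, and all hyperparameters are assumptions for illustration only.

```python
# Illustrative sketch of learning under instance-dependent label noise.
# Assumptions: a logistic link for the clean model, a hypothetical noise
# rate rho(x) <= 0.4 that peaks where eta(x) is near 1/2, and a fixed
# number of Isotron iterations.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 5000, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
eta = sigmoid(X @ w_true)                       # clean class-probability P(Y = 1 | x)
y_clean = (rng.random(n) < eta).astype(int)

# Instance-dependent, label-independent flip probability: largest for
# "hard" instances whose clean class-probability is close to 1/2.
rho = 0.4 * (1.0 - 2.0 * np.abs(eta - 0.5))
flip = rng.random(n) < rho
y_noisy = np.where(flip, 1 - y_clean, y_clean)

# Isotron-style loop trained only on the noisy labels: fit a monotone link
# to the current scores by isotonic regression, then take a gradient-style
# step on the weights.
w = np.zeros(d)
for _ in range(100):
    scores = X @ w
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    u_hat = iso.fit(scores, y_noisy).predict(scores)   # current link estimate
    w += (X.T @ (y_noisy - u_hat)) / n                 # weight update

# Ranking by the learned score should still order instances well under the
# *clean* labels, illustrating the AUC-consistency claim of the paper.
print("AUC against clean labels:", roc_auc_score(y_clean, X @ w))
```

In this toy setup the AUC measured against the clean labels remains high even though training only ever sees the noisy labels, which is the behaviour the paper's consistency results describe.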

Keywords

Label noise · Instance-dependent noise · Consistency


Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Aditya Krishna Menon (1)
  • Brendan van Rooyen (1)
  • Nagarajan Natarajan (2)
  1. The Australian National University, Canberra, Australia
  2. Microsoft Research, Bangalore, India
