
Rethinking Confidence Calibration for Failure Prediction

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13685)

Abstract

Reliable confidence estimation for predictions is important in many safety-critical applications. However, modern deep neural networks are often overconfident in their incorrect predictions. Recently, many calibration methods have been proposed to alleviate this overconfidence problem. With calibrated confidence, a primary and practical goal is to detect misclassification errors by filtering out low-confidence predictions (known as failure prediction). In this paper, we identify a general, widespread, yet largely neglected phenomenon: most confidence calibration methods are useless or even harmful for failure prediction. We investigate this problem and reveal that popular confidence calibration methods often lead to worse confidence separation between correct and incorrect samples, making it harder to decide whether to trust a prediction. Finally, inspired by the natural connection between flat minima and confidence separation, we propose a simple hypothesis: flat minima are beneficial for failure prediction. We verify this hypothesis via extensive experiments and further boost performance by combining two different flat minima techniques. Our code is available at https://github.com/Impression2805/FMFP.
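
The failure-prediction setting described above can be sketched with the standard maximum softmax probability (MSP) baseline: score each prediction by its top softmax probability and reject predictions below a threshold. This is a minimal illustrative sketch, not the paper's FMFP method; the function names are our own.

```python
import numpy as np

def msp_confidence(logits):
    """Maximum softmax probability (MSP), the standard confidence baseline.

    Returns per-sample confidence scores and predicted class indices.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1), probs.argmax(axis=1)

def selective_risk(conf, pred, labels, threshold):
    """Error rate among accepted (high-confidence) predictions, plus coverage."""
    accept = conf >= threshold
    if accept.sum() == 0:
        return 0.0, 0.0
    risk = float((pred[accept] != labels[accept]).mean())
    coverage = float(accept.mean())
    return risk, coverage

# Toy example: two confident correct predictions, one uncertain wrong one.
logits = np.array([[5.0, 0.0], [0.1, 0.0], [0.0, 5.0]])
labels = np.array([0, 1, 1])
conf, pred = msp_confidence(logits)
risk, cov = selective_risk(conf, pred, labels, threshold=0.9)
```

A better confidence estimator pushes incorrect predictions below the threshold, lowering selective risk at a given coverage; it is this separation between correct and incorrect samples, not calibration error, that failure prediction rewards.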

Keywords

  • Failure prediction
  • Confidence calibration
  • Flat minima
  • Uncertainty
  • Misclassification detection
  • Selective classification
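
Of the flat-minima techniques the abstract alludes to, one widely used option is weight averaging along the training trajectory (SWA-style). Below is a minimal, framework-agnostic sketch under the assumption that model weights can be flattened into a single vector; the class name is hypothetical and this is not the paper's exact FMFP recipe.

```python
import numpy as np

class WeightAverager:
    """Running mean of weight snapshots taken along the training trajectory.

    Averaging SGD iterates tends to land in wider, flatter regions of the
    loss surface than any individual iterate (SWA-style averaging).
    """
    def __init__(self):
        self.avg = None  # running mean of snapshots
        self.n = 0       # number of snapshots seen

    def update(self, weights):
        w = np.asarray(weights, dtype=float)
        if self.avg is None:
            self.avg = w.copy()
        else:
            # Incremental mean: avg_{n+1} = avg_n + (w - avg_n) / (n + 1)
            self.avg += (w - self.avg) / (self.n + 1)
        self.n += 1
        return self.avg
```

In practice each snapshot would be taken at the end of an epoch (often under a cyclical or constant learning rate), and batch-normalization statistics are recomputed for the averaged weights before evaluation.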



Acknowledgement

This work has been supported by the National Key Research and Development Program under Grant No. 2018AAA0100400, the National Natural Science Foundation of China grants U20A20223, 62076236, 61721004, the Key Research Program of Frontier Sciences of CAS under Grant ZDBS-LY-7004, and the Youth Innovation Promotion Association of CAS under Grant 2019141.

Author information


Corresponding author

Correspondence to Xu-Yao Zhang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3569 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, F., Cheng, Z., Zhang, X.-Y., Liu, C.-L. (2022). Rethinking Confidence Calibration for Failure Prediction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13685. Springer, Cham. https://doi.org/10.1007/978-3-031-19806-9_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19806-9_30


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19805-2

  • Online ISBN: 978-3-031-19806-9

  • eBook Packages: Computer Science (R0)