Skip to main content

Clinical Explainability Failure (CEF) & Explainability Failure Ratio (EFR) – Changing the Way We Validate Classification Algorithms


Adoption of Artificial Intelligence (AI) algorithms into the clinical realm will depend on their inherent trustworthiness, which is built not only by robust validation studies but is also deeply linked to the explainability and interpretability of the algorithms. Most validation studies for medical imaging AI report the performance of algorithms on study-level labels and lay little emphasis on measuring the accuracy of explanations generated by these algorithms in the form of heat maps or bounding boxes, especially in true positive cases. We propose a new metric – Explainability Failure Ratio (EFR) – derived from Clinical Explainability Failure (CEF) to address this gap in AI evaluation. We define an Explainability Failure as a case where the classification generated by an AI algorithm matches with study-level ground truth but the explanation output generated by the algorithm is inadequate to explain the algorithm's output. We measured EFR for two algorithms that automatically detect consolidation on chest X-rays to determine the applicability of the metric and observed a lower EFR for the model that had lower sensitivity for identifying consolidation on chest X-rays, implying that the trustworthiness of a model should be determined not only by routine statistical metrics but also by novel ‘clinically-oriented’ models.

This is a preview of subscription content, access via your institution.

Fig. 1

Data availability

Chest X-Ray data with bounding boxes drawn by AI and humans is available at


  1. Qin ZZ, Sander MS, Rai B, Titahong CN, Sudrungrot S, Laah SN, Adhikari LM, Carter EJ, Puri L, Codlin AJ, Creswell J (2019) Using artificial intelligence to read chest radiographs for tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of three deep learning systems. Sci Rep 9:15000.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Hurt B, Kligerman S, Hsiao A (2020) Deep learning localization of pneumonia: 2019 coronavirus (COVID-19) outbreak. J Thorac Imaging 35:W87–W89.

    Article  PubMed  Google Scholar 

  3. Crosby J, Chen S, Li F, MacMahon H, Giger M (2020) Network output visualization to uncover limitations of deep learning detection of pneumothorax. 11316:113160O.

    Article  Google Scholar 

  4. Pan I, Cadrin-Chênevert A, Cheng PM (2019) Tackling the radiological society of North America pneumonia detection challenge. AJR Am J Roentgenol 213:568–574.

    Article  PubMed  Google Scholar 

  5. Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz CP, Patel BN, Yeom KW, Shpanskaya K, Blankenberg FG, Seekins J, Amrhein TJ, Mong DA, Halabi SS, Zucker EJ, Ng AY, Lungren MP (2018) Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15:e1002686.

    Article  PubMed  PubMed Central  Google Scholar 

  6. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2020) Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int J Comput Vis 128:336–359.

    Article  Google Scholar 

  7. Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Arun N, Gaw N, Singh P, Chang K, Aggarwal M, Chen B, Hoebel K, Gupta S, Patel J, Gidwani M, Adebayo J, Li MD, Kalpathy-Cramer J (2021) Assessing the (un)trustworthiness of saliency maps for localizing abnormalities in medical imaging

  9. Venugopal VK, Vaidhya K, Murugavel M, Chunduru A, Mahajan V, Vaidya S, Mahra D, Rangasai A, Mahajan H (2020) Unboxing AI - radiological insights into a deep neural network for lung nodule characterization. Acad Radiol 27:88–95.

    Article  PubMed  Google Scholar 

  10. Yala A, Lehman C, Schuster T, Portnoi T, Barzilay R (2019) A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 292:60–66.

    Article  PubMed  Google Scholar 

  11. Ding Y, Sohn JH, Kawczynski MG, Trivedi H, Harnish R, Jenkins NW, Lituiev D, Copeland TP, Aboian MS, Mari Aparici C, Behr SC, Flavell RR, Huang S-Y, Zalocusky KA, Nardo L, Seo Y, Hawkins RA, Hernandez Pampaloni M, Hadley D, Franc BL (2019) A deep learning model to predict a diagnosis of alzheimer disease by using 18F-FDG PET of the brain. Radiology 290:456–464.

    Article  PubMed  Google Scholar 

  12. Mahajan V, Venugopal VK, Murugavel M, Mahajan H (2020) The algorithmic audit: Working with vendors to validate radiology-ai algorithms-how we do it. Acad Radiol 27:132–135.

    Article  PubMed  Google Scholar 

Download references


We would like to thank members of the CovBase ( initiative for helping spark discussion about this concept during its initial stages. We also thank Dr. Alexandre Cadrin-Chenevert for providing his Pneumonia Detection & Classification algorithm which was the winner of the RSNA- 2018 Kaggle Chest X-Ray challenge. We would also like to thank Ms. Nisha Syed Nasser who helped critically revise the paper in keeping with important intellectual content.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations



All authors whose names appear on the submission: (1) made substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data; or the creation of new software used in the work; (2) drafted the work/ revised it critically for important intellectual content and approved the version to be published.

Corresponding author

Correspondence to Vasantha Kumar Venugopal.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Conflict of interest

All authors deny any potential conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Image & Signal Processing

Rights and permissions

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Venugopal, V.K., Takhar, R., Gupta, S. et al. Clinical Explainability Failure (CEF) & Explainability Failure Ratio (EFR) – Changing the Way We Validate Classification Algorithms. J Med Syst 46, 20 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: