Abstract
Clearly explaining the rationale behind a classification decision to an end user can be as important as the decision itself. Existing approaches to deep visual recognition are generally opaque and output no justification text, while contemporary vision-language models can describe image content but fail to account for the class-discriminative image aspects that justify a visual prediction. Our model focuses on the discriminative properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. A loss function based on sampling and reinforcement learning learns to generate sentences that realize a global sentence property, such as class specificity. Our results on a fine-grained bird species classification dataset show that this model generates explanations that are not only consistent with the image but also more discriminative than descriptions produced by existing captioning methods. In this work, we emphasize the importance of producing an explanation for an observed action; such explanations could be applied to a black-box decision agent, much as one human explains the actions of another when asked.
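The sampling-and-reinforcement-learning loss described above can be sketched as a REINFORCE-style sentence-level objective: sample a full sentence, score it with a global reward, and weight the sentence log-likelihood by that reward. The sketch below is a minimal illustration under stated assumptions, not the chapter's implementation: the "policy" here is a trivial word distribution rather than an image-conditioned recurrent language model, and `reward` is a hypothetical stand-in for the learned class-specificity reward.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bird", "red", "beak", "wing", "the"]
logits = np.zeros(len(vocab))  # toy "policy" parameters over words

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sentence(length=3):
    # Sample a whole sentence and return its summed log-probability.
    p = softmax(logits)
    idx = rng.choice(len(vocab), size=length, p=p)
    return idx, np.log(p[idx]).sum()

def reward(idx):
    # Hypothetical global sentence reward: fraction of words deemed
    # class-discriminative (indices 1 and 2, i.e. "red" and "beak").
    discriminative = {1, 2}
    return float(np.mean([int(i) in discriminative for i in idx]))

# REINFORCE surrogate loss: minimizing -reward * log p(sentence)
# yields the gradient estimate reward * grad(log p(sentence)).
idx, logp = sample_sentence()
loss = -reward(idx) * logp
```

Because the reward is computed on the completed sentence, this kind of objective can optimize non-decomposable, sentence-level properties (such as class specificity) that a per-word cross-entropy loss cannot express.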
Keywords
- Explainable AI
- Rationalizations
- Fine-grained classification
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this chapter
Akata, Z., Hendricks, L.A., Alaniz, S., Darrell, T. (2018). Generating Post-Hoc Rationales of Deep Visual Classification Decisions. In: Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-98131-4_6
DOI: https://doi.org/10.1007/978-3-319-98131-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98130-7
Online ISBN: 978-3-319-98131-4