Ensembling Visual Explanations

  • Nazneen Fatema Rajani
  • Raymond J. Mooney
Chapter
Part of The Springer Series on Challenges in Machine Learning book series (SSCML)

Abstract

Many machine learning systems deployed for real-world applications, such as recommender systems, image captioning, and object detection, are ensembles of multiple models; the top-ranked systems in many data-mining and computer-vision competitions also use ensembles. Although ensembles are popular, they are opaque and hard to interpret. Explanations make AI systems more transparent and justify their predictions, yet there has been little work on generating explanations for ensembles. In this chapter, we propose two new methods for ensembling visual explanations for visual question answering (VQA) using the localization maps of the component systems. Our novel approach scales with the number of component models in the ensemble. Evaluating explanations is itself a challenging research problem; we introduce two new approaches to evaluating explanations, the comparison metric and the uncovering metric. Our crowd-sourced human evaluation indicates that our ensemble visual explanation significantly outperforms each individual system's visual explanation qualitatively. Overall, our ensemble explanation is judged better than any individual system's explanation 61% of the time, and it is sufficient for humans to arrive at the correct answer from the explanation alone at least 64% of the time.
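The abstract does not spell out the two ensembling methods, but the general idea of combining per-model localization maps can be illustrated with a simple weighted-average sketch. The function name, the normalization steps, and the uniform default weighting below are assumptions for illustration, not the chapter's actual algorithms.

```python
import numpy as np

def ensemble_localization_maps(maps, weights=None):
    """Combine per-model localization maps into one ensemble explanation.

    maps:    list of 2-D arrays (H x W), one per component VQA model.
    weights: optional per-model weights (e.g., reflecting each model's
             confidence); defaults to a uniform average.  Weighting by
             model confidence is one plausible choice, not necessarily
             the scheme used in the chapter.
    """
    # Normalize each map to [0, 1] so no single model dominates by scale.
    stacked = np.stack([m / (m.max() + 1e-8) for m in maps])
    if weights is None:
        weights = np.ones(len(maps)) / len(maps)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted average over the model axis, then rescale to [0, 1].
    combined = np.tensordot(weights, stacked, axes=1)
    return combined / (combined.max() + 1e-8)

# Toy example: three 2x2 maps from three hypothetical component models.
maps = [np.array([[1.0, 0.0], [0.0, 0.0]]),
        np.array([[1.0, 1.0], [0.0, 0.0]]),
        np.array([[0.0, 1.0], [0.0, 0.0]])]
heat = ensemble_localization_maps(maps)
```

Regions where the component maps agree (the top row in the toy example) end up with high ensemble weight, while regions highlighted by no model stay at zero, which is the intuition behind ensembling explanations.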

Keywords

Stacking with auxiliary features · Visual question answering · Visualization techniques for deep networks


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science, The University of Texas at Austin, Austin, USA