Textual Explanations for Self-Driving Vehicles

  • Jinkyu Kim
  • Anna Rohrbach
  • Trevor Darrell
  • John Canny
  • Zeynep Akata
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)


Abstract

Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller’s output, namely rationalizations. We propose a new approach to introspective explanations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to the vehicle control commands, i.e., acceleration and change of course. The controller’s attention identifies image regions that potentially influence the network’s output. Second, we use an attention-based video-to-text model to produce textual explanations of model actions. The attention maps of the controller and the explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment: strong and weak alignment. Finally, we explore a version of our model that generates rationalizations, and compare it with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at
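The strong-alignment idea in the abstract — constraining the explanation generator to attend to the same image regions as the controller — can be sketched as a divergence penalty between the two spatial attention distributions. The sketch below is illustrative only, not the paper’s implementation: the function name, the use of a KL-divergence penalty, and plain NumPy (rather than a deep-learning framework) are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Turn unnormalized attention logits into a probability distribution."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def attention_alignment_kl(controller_logits, explainer_logits, eps=1e-8):
    """KL divergence between the controller's and the explanation model's
    spatial attention over image regions (one logit per conv-grid cell).
    A strong-alignment training objective could add this term to the loss,
    pushing the explainer to attend where the controller attended."""
    p = softmax(np.asarray(controller_logits, dtype=float))
    q = softmax(np.asarray(explainer_logits, dtype=float))
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Identical attention maps incur no penalty.
logits = np.random.randn(10 * 20)  # e.g. a flattened 10x20 attention grid
assert attention_alignment_kl(logits, logits) < 1e-6
```

Weak alignment, by contrast, would not penalize the divergence directly during training but only softly encourage (or post-hoc check) overlap between the two maps.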


Keywords: Explainable deep driving · BDD-X dataset



This work was supported by the DARPA XAI program and Berkeley DeepDrive.

Supplementary material

Supplementary material 1: 474176_1_En_35_MOESM1_ESM.pdf (PDF, 3.8 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. EECS, University of California, Berkeley, USA
  2. MPI for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
  3. AMLab, University of Amsterdam, Amsterdam, The Netherlands
