Explainable Deep Driving by Visualizing Causal Attention

  • Jinkyu Kim
  • John Canny
Part of the Springer Series on Challenges in Machine Learning book series (SSCML)


Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable—they should provide easy-to-interpret rationales for their behavior—so that passengers, insurance companies, law enforcement, developers, etc., can understand what triggered a particular behavior. Here, we explore the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network’s output (steering control). Our approach is two-stage. In the first stage, we use a visual attention model to train a convolutional network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network’s output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network’s behavior. We demonstrate the effectiveness of our model on three datasets totaling 16 hours of driving. We first show that training with attention does not degrade the performance of the end-to-end network. Then we show that the network highlights interpretable features that are used by humans while driving, and that causal filtering achieves a useful reduction in explanation complexity by removing features which do not significantly affect the output.
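The two-stage idea above can be sketched in a few lines: soft attention weights over a grid of convolutional features produce a steering prediction, and causal filtering then masks each highly attended region and keeps only those whose removal meaningfully changes the output. The sketch below is a minimal NumPy toy, not the chapter's implementation: the projection vectors `w_att` and `w_out` stand in for learned weights, the feature grid is random, and the 0.1 change threshold is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a conv feature map: a 4x4 grid of 8-dim feature vectors.
H, W, D = 4, 4, 8
features = rng.normal(size=(H * W, D))

# Attention scores from a hypothetical learned projection w_att,
# normalized with a softmax over the H*W grid cells.
w_att = rng.normal(size=D)
scores = features @ w_att
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Steering prediction: attention-weighted context vector through a
# hypothetical linear output head w_out.
w_out = rng.normal(size=D)

def predict(masked_cell=None):
    a = alpha.copy()
    if masked_cell is not None:
        a[masked_cell] = 0.0      # remove one attended region
        a = a / a.sum()           # renormalize the remaining attention
    context = a @ features
    return context @ w_out

baseline = predict()

# Causal filtering: mask each of the top attended cells in turn and keep
# only those whose removal shifts the output by more than the threshold.
threshold = 0.1
causal_cells = [
    int(i)
    for i in np.argsort(alpha)[::-1][:5]          # top-5 attended cells
    if abs(predict(masked_cell=i) - baseline) > threshold
]
```

In the real model the attention weights come from a trained network and the masked regions are contiguous clusters of attended pixels, but the filtering logic is the same: spurious attention survives the softmax yet has little causal effect on the prediction, so it is pruned from the explanation.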


Explainable AI · Self-driving vehicles · Visual attention



This work was supported by Berkeley DeepDrive and Samsung Scholarship.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, USA
