VIENA\(^2\): A Driving Anticipation Dataset

  • Mohammad Sadegh Aliakbarian
  • Fatemeh Sadat Saleh
  • Mathieu Salzmann
  • Basura Fernando
  • Lars Petersson
  • Lars Andersson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11361)


Action anticipation is critical in scenarios where one needs to react before the action is finalized. This is, for instance, the case in automated driving, where a car needs to, e.g., avoid hitting pedestrians and respect traffic lights. While solutions have been proposed to tackle subsets of the driving anticipation tasks by making use of diverse, task-specific sensors, there is no single dataset or framework that addresses them all in a consistent manner. In this paper, we therefore introduce a new, large-scale dataset, called VIENA\(^2\), covering 5 generic driving scenarios, with a total of 25 distinct action classes. It contains more than 15K full HD, 5 s long videos acquired in various driving conditions, weather conditions, times of day and environments, complemented with a common and realistic set of sensor measurements. This amounts to more than 2.25M frames, each annotated with an action label, with 600 samples per action class. We discuss our data acquisition strategy and the statistics of our dataset, and benchmark state-of-the-art action anticipation techniques, including a new multi-modal LSTM architecture with an effective loss function for action anticipation in driving scenarios.
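
As a quick consistency check on the figures above, the totals can be worked out explicitly. The frame rate of 30 frames per second is an assumption inferred from the stated totals; the abstract does not give it.

\[
25 \text{ classes} \times 600 \text{ samples per class} = 15{,}000 \text{ videos}, \qquad
15{,}000 \text{ videos} \times 5\,\text{s} \times 30\,\text{frames/s} = 2{,}250{,}000 \text{ frames}.
\]

Under that assumption, the per-class sample count, the video count (15K) and the frame count (2.25M) agree with one another.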




Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mohammad Sadegh Aliakbarian (1, 2, 4; corresponding author)
  • Fatemeh Sadat Saleh (1, 4)
  • Mathieu Salzmann (3)
  • Basura Fernando (2)
  • Lars Petersson (1, 4)
  • Lars Andersson (4)

  1. ANU, Canberra, Australia
  2. ACRV, Canberra, Australia
  3. CVLab, EPFL, Lausanne, Switzerland
  4. Data61-CSIRO, Canberra, Australia
