Context-driven Multi-stream LSTM (M-LSTM) for Recognizing Fine-Grained Activity of Drivers

  • Ardhendu Behera
  • Alexander Keidel
  • Bappaditya Debnath
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11269)


Automatic recognition of in-vehicle activities has a significant impact on the next generation of intelligent vehicles. In this paper, we present a novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities. We bring together ideas from recent work on LSTMs and on transfer learning for object detection and body pose, exploring the use of deep convolutional neural networks (CNNs). Recent work has also shown that representations such as hand-object interactions are important cues for characterizing human activities. The proposed M-LSTM integrates these ideas under one framework, in which two streams focus on appearance information at two different levels of abstraction, while the other two streams analyze contextual information involving the configuration of body parts and body-object interactions. The proposed contextual descriptor is built to be semantically rich and meaningful, and even when coupled with appearance features it turns out to be highly discriminative. We validate this on two challenging datasets consisting of driver activities.
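The multi-stream idea described above can be sketched in code: one LSTM per feature stream, with the final hidden states fused before classification. This is a minimal, hedged illustration only; the stream dimensions, the number of streams, and fusion by simple concatenation are assumptions for the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiStreamLSTM(nn.Module):
    """Illustrative multi-stream LSTM: one LSTM per input stream
    (e.g. two appearance streams plus body-pose and body-object
    context streams), fused by concatenating final hidden states."""

    def __init__(self, stream_dims, hidden_dim, num_classes):
        super().__init__()
        # One LSTM per stream; each stream may have a different feature size.
        self.streams = nn.ModuleList(
            nn.LSTM(d, hidden_dim, batch_first=True) for d in stream_dims
        )
        self.classifier = nn.Linear(hidden_dim * len(stream_dims), num_classes)

    def forward(self, inputs):
        # inputs: list of (batch, time, feat_dim) tensors, one per stream
        finals = []
        for lstm, x in zip(self.streams, inputs):
            _, (h_n, _) = lstm(x)   # h_n: (num_layers, batch, hidden_dim)
            finals.append(h_n[-1])  # last layer's final hidden state
        fused = torch.cat(finals, dim=1)  # late fusion by concatenation
        return self.classifier(fused)     # unnormalised class scores

# Example with hypothetical sizes: four streams over 16 frames,
# ten driver-activity classes.
stream_dims = [512, 512, 36, 64]
model = MultiStreamLSTM(stream_dims, hidden_dim=128, num_classes=10)
batch = [torch.randn(2, 16, d) for d in stream_dims]
logits = model(batch)
```

A design note: concatenating final hidden states is the simplest fusion choice; weighted or attention-based fusion over the streams is an equally plausible alternative.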



This research is supported by Edge Hill University's Research Investment Fund (RIF). We would like to thank Taylor Smith at State Farm Corporation for providing information about their dataset. The GPU used in this research was generously donated by NVIDIA Corporation.

Supplementary material

Supplementary material 1 (mp4 16870 KB)

Supplementary material 2 (pdf 1242 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, Edge Hill University, Ormskirk, UK
