
VPN: Learning Video-Pose Embedding for Activities of Daily Living

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)

Abstract

In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying over time. As a result, ADL may look very similar, and distinguishing them often requires examining their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of this VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments (code/models: https://github.com/srijandas07/VPN) show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset NTU-RGB+D 60; a real-world, challenging human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern-UCLA.
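To make the two components concrete, the following is a minimal, pure-Python sketch of the idea described above, not the authors' implementation: a spatial embedding projects pose and RGB features into a common space via linear maps, and a coupler produces attention weights normalised jointly over space (joints) and time (frames). All dimensions, weight matrices, and helper names (`W_pose`, `W_rgb`, `w_att`, etc.) are hypothetical placeholders.

```python
import math
import random

random.seed(0)

def linear(x, w):
    """Apply a linear map: x is [d_in], w is [d_out][d_in]; returns [d_out]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(xs):
    """Numerically stable softmax over a flat list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical sizes: T frames, J joints, pose dim d_p, RGB dim d_r, shared dim d
T, J, d_p, d_r, d = 4, 5, 3, 6, 8

rand_mat = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]

W_pose = rand_mat(d, d_p)   # projects pose features into the shared space
W_rgb  = rand_mat(d, d_r)   # projects RGB features into the shared space
w_att  = [random.uniform(-1, 1) for _ in range(d)]  # scores the coupled features

# Toy inputs: per-frame, per-joint pose features and per-frame RGB features
pose = [[[random.uniform(-1, 1) for _ in range(d_p)] for _ in range(J)] for _ in range(T)]
rgb  = [[random.uniform(-1, 1) for _ in range(d_r)] for _ in range(T)]

# 1) Spatial embedding: both modalities land in the same d-dimensional space
pose_emb = [[linear(pose[t][j], W_pose) for j in range(J)] for t in range(T)]
rgb_emb  = [linear(rgb[t], W_rgb) for t in range(T)]

# 2) Coupler: one attention score per (frame, joint) pair, combining modalities,
#    normalised jointly across the whole video rather than per frame
scores = [sum(a * (p + r) for a, p, r in zip(w_att, pose_emb[t][j], rgb_emb[t]))
          for t in range(T) for j in range(J)]
weights = softmax(scores)  # length T*J; sums to 1 over space AND time

# Attention-weighted video-level feature in the shared space
feat = [sum(weights[t * J + j] * pose_emb[t][j][k]
            for t in range(T) for j in range(J)) for k in range(d)]
print(len(weights), round(sum(weights), 6))
```

The point of the joint normalisation is that salient joints in salient frames compete against every other space-time location, which is how the coupler can down-weight visually similar but irrelevant moments of a video.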

Keywords

Action recognition · Video · Pose · Embedding · Attention

Supplementary material

Supplementary material 1: 504446_1_En_5_MOESM1_ESM.pdf (PDF, 1.8 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

INRIA, Université Côte d'Azur, Nice, France
