Directional Temporal Modeling for Action Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12351)

Abstract

Many current activity recognition models use 3D convolutional neural networks (e.g., I3D, I3D-NL) to generate local spatial-temporal features. However, such features do not encode clip-level, ordered temporal information. In this paper, we introduce a channel-independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features. By applying multiple CIDC units, we construct a light-weight network that models the clip-level temporal evolution across multiple spatial scales. Our CIDC network can be attached to any activity recognition backbone. We evaluate our method on four popular activity recognition datasets and consistently improve upon state-of-the-art techniques. We further visualize the activation maps of our CIDC network and show that it focuses on more meaningful, action-related parts of the frame.
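
Purely as an illustration of the idea summarized above, the following PyTorch sketch shows one plausible way a channel-independent directional convolution could be realized: each channel carries its own learned T × T temporal mixing matrix, and a triangular mask restricts the mixing to a single temporal direction. The class name `CIDC`, the shapes, and the masked-matrix formulation are assumptions made for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CIDC(nn.Module):
    """Channel-independent directional convolution (illustrative sketch).

    Each channel mixes features across the T frames of a clip with its own
    learned T x T weight matrix; a triangular mask keeps the mixing
    directional, so frame t only aggregates frames <= t (forward) or
    >= t (backward). This is one way to encode the clip-level ordered
    temporal information that plain 3D convolutions miss.
    """

    def __init__(self, channels: int, num_frames: int, forward_dir: bool = True):
        super().__init__()
        # One T x T temporal mixing matrix per channel ("channel independent").
        self.weight = nn.Parameter(0.01 * torch.randn(channels, num_frames, num_frames))
        # A lower-triangular mask enforces one temporal direction;
        # its transpose gives the opposite direction.
        mask = torch.tril(torch.ones(num_frames, num_frames))
        self.register_buffer("mask", mask if forward_dir else mask.t())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) local spatial-temporal features from a backbone.
        b, c, t, h, w = x.shape
        w_dir = self.weight * self.mask       # zero out the masked direction
        x_flat = x.reshape(b, c, t, h * w)    # flatten space: (B, C, T, HW)
        # Per-channel temporal mixing:
        # out[b, c, s, :] = sum_t w_dir[c, s, t] * x[b, c, t, :]
        out = torch.einsum("cst,bcth->bcsh", w_dir, x_flat)
        return out.reshape(b, c, t, h, w)


# The unit preserves the (B, C, T, H, W) shape, so it can sit on top of,
# e.g., 8-frame backbone features (sizes here are illustrative).
feats = torch.randn(2, 256, 8, 14, 14)
unit = CIDC(channels=256, num_frames=8)
print(unit(feats).shape)  # torch.Size([2, 256, 8, 14, 14])
```

Because the output keeps the input's shape, such a unit can in principle be inserted after any backbone stage, which is consistent with the abstract's claim that the CIDC network attaches to arbitrary activity recognition backbones.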

Keywords

Action recognition · Temporal modeling · Directional convolution

Supplementary material

Supplementary material 1: 504443_1_En_17_MOESM1_ESM.pdf (PDF, 5.7 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Amazon Web Services, Seattle, USA
