RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12364)

Abstract

Video action recognition is a complex task that depends on modeling both spatial and temporal context. Standard approaches rely on 2D or 3D convolutions to process this context, resulting in expensive operations with millions of parameters. Recent efficient architectures replace temporal convolutions with a channel-wise shift-based primitive, but they remain bottlenecked by the spatial convolutions needed to maintain strong accuracy and are constrained to a fixed (non-learned) shift scheme. Naively extending such shift operations to the full 3D setting is intractable. To this end, we introduce RubiksNet, a new efficient architecture for video action recognition built instead on a proposed learnable 3D spatiotemporal shift operation. We analyze the suitability of this new primitive for video action recognition and explore several novel variations of our approach that enable stronger representational flexibility while maintaining an efficient design. We benchmark our approach on several standard video recognition datasets and observe that it achieves comparable or better accuracy than prior work on efficient video action recognition at a fraction of the cost, with 2.9–5.9× fewer parameters and 2.1–3.7× fewer FLOPs. We also perform a series of controlled ablation studies to verify that the significant boost in the efficiency-accuracy tradeoff curve is rooted in the core contributions of our RubiksNet architecture.
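
The core primitive described above is a learnable spatiotemporal shift. As a rough illustration of how such a shift can be made differentiable, the sketch below covers only the temporal part of the idea: each channel gets a real-valued offset, and the output linearly interpolates between the two neighboring integer shifts so that gradients can flow into the offsets. The class name, the (N, C, T, H, W) tensor layout, and the edge-replication boundary handling are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class LearnableTemporalShift(nn.Module):
    """Minimal sketch of a learnable channel-wise shift along the temporal
    axis. Real-valued offsets stay differentiable because the output is a
    linear interpolation between the two neighboring integer shifts.
    Illustrative only (temporal axis, edge replication at the borders)."""

    def __init__(self, channels: int):
        super().__init__()
        # One real-valued temporal offset per channel, learned by backprop.
        self.shift = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        base = torch.arange(t, device=x.device).view(1, t)  # (1, T)
        lo = torch.floor(self.shift).view(c, 1)             # integer part, (C, 1)
        frac = self.shift.view(c, 1) - lo                    # fractional part, (C, 1)

        # Indices for the two neighboring integer shifts per channel;
        # clamping replicates the border frames instead of zero-padding.
        idx_lo = (base - lo).clamp(0, t - 1).long()          # (C, T)
        idx_hi = (base - lo - 1).clamp(0, t - 1).long()      # (C, T)
        idx_lo = idx_lo.view(1, c, t, 1, 1).expand(n, c, t, h, w)
        idx_hi = idx_hi.view(1, c, t, 1, 1).expand(n, c, t, h, w)

        frac = frac.view(1, c, 1, 1, 1)
        # Blend the two shifted copies; gradients reach self.shift through
        # `frac`, so the per-channel offsets are trainable end to end.
        return (1.0 - frac) * x.gather(2, idx_lo) + frac * x.gather(2, idx_hi)


if __name__ == "__main__":
    layer = LearnableTemporalShift(channels=16)
    video = torch.randn(2, 16, 8, 56, 56)   # (N, C, T, H, W)
    print(layer(video).shape)               # torch.Size([2, 16, 8, 56, 56])
```

Extending the same interpolation trick to the height and width axes gives a learnable 3D shift of the kind the abstract refers to; the operation itself adds only one offset parameter per channel per axis.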

Keywords

Efficient action recognition · Spatiotemporal · Learnable shift · Budget-constrained · Video understanding

Notes

Acknowledgements

L. Fan and S. Buch are supported by SGF and NDSEG fellowships, respectively. This research was sponsored in part by grants from Toyota Research Institute (TRI). Some computational support for experiments was provided by Google Cloud and NVIDIA. This article reflects the authors’ opinions and conclusions, and not those of any other entity. We thank Ji Lin, Song Han, De-An Huang, Danfei Xu, the general Stanford Vision Lab (SVL) community, and our anonymous reviewers for helpful comments and discussion.

Supplementary material

Supplementary material 1: 504475_1_En_30_MOESM1_ESM.pdf (PDF, 40 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Stanford Vision and Learning Lab, Stanford, USA
  2. UT Austin, Austin, USA
  3. NVIDIA, Santa Clara, USA