
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12353)

Abstract

Convolutional operations have two limitations: (1) they do not explicitly model where to focus, since the same filter is applied to all positions, and (2) they are ill-suited to modeling long-range dependencies, since they only operate on a small neighborhood. While both limitations can be alleviated by attention operations, many design choices remain to be determined when applying attention, especially to videos. Towards a principled way of applying attention to videos, we address the task of spatiotemporal attention cell search. We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both the Kinetics-600 and MiT datasets. The discovered cells outperform non-local blocks on both datasets and generalize well across different modalities, backbones, and datasets. Inserting our attention cells into I3D-R50 yields state-of-the-art performance on both datasets.
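
The abstract's claim that attention cells can be inserted into existing backbones without architectural surgery is easiest to see with a residually wrapped cell whose output shape matches its input. Below is a minimal PyTorch sketch of a single-head, non-local-style spatiotemporal attention cell; the class name, the channel-reduction factor, and the dot-product attention form are illustrative assumptions for exposition, not the searched cells from the paper.

    # A minimal sketch (not the paper's searched cell) of a spatiotemporal
    # attention block in the spirit of non-local blocks. It attends over all
    # T*H*W positions and is wrapped residually so it can be dropped between
    # stages of a 3D-CNN backbone such as I3D or S3D without changing shapes.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatiotemporalAttentionCell(nn.Module):
        def __init__(self, channels: int, reduction: int = 2):
            super().__init__()
            inner = channels // reduction
            # 1x1x1 convolutions produce query/key/value embeddings.
            self.query = nn.Conv3d(channels, inner, kernel_size=1)
            self.key = nn.Conv3d(channels, inner, kernel_size=1)
            self.value = nn.Conv3d(channels, inner, kernel_size=1)
            self.proj = nn.Conv3d(inner, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, T, H, W)
            b, c, t, h, w = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)  # (b, THW, inner)
            k = self.key(x).flatten(2)                    # (b, inner, THW)
            v = self.value(x).flatten(2).transpose(1, 2)  # (b, THW, inner)
            # Scaled dot-product attention over every spatiotemporal position.
            attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
            return x + self.proj(out)  # residual: output shape equals input

    # Usage: feature map sized like an I3D intermediate stage (shape assumed
    # here for illustration); the cell leaves the shape unchanged.
    cell = SpatiotemporalAttentionCell(channels=832)
    x = torch.randn(2, 832, 8, 14, 14)
    assert cell(x).shape == x.shape

Because the cell is residual and shape-preserving, it can be placed after any stage of a 3D backbone, which mirrors how the paper describes inserting discovered cells into I3D or S3D.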

X. Wang—Work done while an intern at Google.


Acknowledgement

We thank Guanhang Wu and Yinxiao Li for insightful discussions and the larger Google Cloud Video AI team for the support.

Author information


Corresponding author

Correspondence to Xiaofang Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 263 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, X. et al. (2020). AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12353. Springer, Cham. https://doi.org/10.1007/978-3-030-58598-3_27


  • DOI: https://doi.org/10.1007/978-3-030-58598-3_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58597-6

  • Online ISBN: 978-3-030-58598-3

  • eBook Packages: Computer Science; Computer Science (R0)
