NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification

  • Rongcheng Lin
  • Jing Xiao
  • Jianping Fan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


This paper introduces NeXtVLAD, a fast and efficient network architecture that aggregates frame-level features into a compact feature vector for large-scale video classification. The basic idea is to decompose a high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time. This approach turns out to be both effective and parameter-efficient in aggregating temporal information. In the 2nd YouTube-8M video understanding challenge, a single NeXtVLAD model with fewer than 80M parameters achieved a GAP score of 0.87846 on the private leaderboard. A mixture of 3 NeXtVLAD models reached 0.88722, ranking 3rd among 394 teams. The code is publicly available.
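The decompose-then-aggregate idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration of the shapes involved, not the authors' implementation: all weights are randomly initialized, and the names (`lam` for the expansion factor, `G` groups, `K` clusters) are hypothetical labels for the quantities the abstract describes. A feature is first expanded, split into `G` low-dimensional groups, each group is gated by an attention weight, and the gated residuals against `K` cluster centers are summed over time, NetVLAD-style.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nextvlad(X, K=8, G=4, lam=2):
    """Illustrative NeXtVLAD-style aggregation with random weights.

    X: (T, D) frame-level features.
    Returns a vector of length K * (lam * D // G).
    """
    T, D = X.shape
    De = lam * D                                     # expanded dimension
    Dg = De // G                                     # per-group dimension
    W_exp = rng.normal(scale=0.1, size=(D, De))      # expansion weights
    W_att = rng.normal(scale=0.1, size=(D, G))       # per-group attention
    W_asg = rng.normal(scale=0.1, size=(De, G * K))  # cluster assignment
    C = rng.normal(scale=0.1, size=(K, Dg))          # cluster centers

    Xe = X @ W_exp                                   # (T, De) expanded
    alpha = sigmoid(X @ W_att)                       # (T, G) attention gates
    a = softmax((Xe @ W_asg).reshape(T, G, K))       # (T, G, K) soft assignment
    Xg = Xe.reshape(T, G, Dg)                        # grouped low-dim vectors
    w = alpha[..., None] * a                         # gated assignment weights
    # aggregate weighted residuals over time and groups:
    # V[k] = sum_{t,g} w[t,g,k] * (Xg[t,g] - C[k])
    V = np.einsum('tgk,tgd->kd', w, Xg) - w.sum(axis=(0, 1))[:, None] * C
    # intra-normalize each cluster, then flatten and L2-normalize
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-6)
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-6)

desc = nextvlad(rng.normal(size=(30, 16)))           # 30 frames, 16-dim features
```

The parameter saving comes from the grouping: the assignment and residual computation operate on `Dg = lam * D / G`-dimensional vectors rather than the full `D`-dimensional feature, so the output descriptor (here `K * Dg = 64` values) and the downstream classifier shrink accordingly.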


Keywords: Neural network · VLAD · Video classification · YouTube-8M



Acknowledgements

The authors would like to thank Kaggle and the Google team for hosting the YouTube-8M video understanding challenge and providing the YouTube-8M TensorFlow Starter Code.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of North Carolina at Charlotte, Charlotte, USA
