Building a Size Constrained Predictive Model for Video Classification

  • Miha Skalic
  • David Austin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


Herein we present our solution to the \(2^\mathrm{nd}\) YouTube-8M video understanding challenge, which placed \(1^\mathrm{st}\). Competition participants were tasked with building a video labeling model whose size was constrained to less than 1 GB. Our final solution consists of several submodels belonging to the Fisher vector, NetVLAD, Deep Bag of Frames and recurrent neural network model families. To make the classifier efficient under the size constraint, we introduced model distillation, partial weight quantization and training with an exponential moving average.
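The three size-related techniques named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the decay value, the blending weight `alpha`, the uniform 8-bit scheme and the `min_elements` threshold are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def ema_update(shadow, weights, decay=0.999):
    """One exponential-moving-average step over model weights.

    The decay value is an assumed placeholder, not taken from the paper.
    """
    return decay * shadow + (1.0 - decay) * weights


def distillation_target(hard_labels, teacher_probs, alpha=0.5):
    """Blend ground-truth labels with teacher predictions into soft targets.

    alpha is a hypothetical mixing weight; a student model would be
    trained against the returned targets instead of the hard labels.
    """
    return alpha * hard_labels + (1.0 - alpha) * teacher_probs


def quantize_uint8(w):
    """Uniformly quantize a float array to 8-bit codes plus (min, scale)."""
    w_min = float(w.min())
    w_max = float(w.max())
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant tensor: any positive scale works
    codes = np.round((w - w_min) / scale).astype(np.uint8)
    return codes, w_min, scale


def dequantize_uint8(codes, w_min, scale):
    """Map 8-bit codes back to approximate float32 values."""
    return codes.astype(np.float32) * scale + w_min


def quantize_partially(params, min_elements=10000):
    """Quantize only tensors with at least min_elements entries.

    Small tensors stay in float32, which is one plausible reading of
    "partial" quantization; the threshold here is an assumption.
    """
    out = {}
    for name, w in params.items():
        if w.size >= min_elements:
            out[name] = quantize_uint8(w)
        else:
            out[name] = w.astype(np.float32)
    return out
```

Storing a large tensor as 8-bit codes plus two floats takes roughly a quarter of the space of float32 storage, which is why quantizing only the largest tensors can recover most of the size savings while leaving small, sensitive parameters untouched.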


Deep learning, Multi-label classification, Video processing



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University Pompeu Fabra, Barcelona, Spain
  2. Intel Corporation, Chandler, USA
