Temporal Attention Mechanism with Conditional Inference for Large-Scale Multi-label Video Classification

  • Eun-Sol Kim (email author)
  • Kyoung-Woon On
  • Jongseok Kim
  • Yu-Jung Heo
  • Seong-Ho Choi
  • Hyun-Dong Lee
  • Byoung-Tak Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


Here we present neural-network-based methods that effectively combine multimodal sequential inputs and classify them into multiple categories. The two key ideas are (1) to select informative frames from a sequence using an attention mechanism, and (2) to exploit correlations between labels when solving the multi-label classification problem. The attention mechanism is applied along both the modality (spatial) and sequential (temporal) dimensions to ignore noisy and uninformative frames. Furthermore, to address the fundamental limitation of conventional multi-label classification methods, which predict each label independently, the proposed method models dependencies among labels by decomposing their joint probability into conditional terms. Based on the experimental results (5th place in the Kaggle competition), we discuss how the proposed methods operate on the YouTube-8M Classification Task, what insights they offer, and why they succeed or fail.
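The two ideas in the abstract can be sketched concretely. The snippet below is a minimal, hypothetical illustration (not the authors' implementation): a single learned scoring vector produces temporal attention weights that pool frame features into one video representation, and each label's probability is then conditioned on the labels predicted before it, decomposing p(y) = Πₖ p(yₖ | x, y₍<k₎) rather than treating labels independently. All shapes, weights, and the greedy conditioning scheme are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Temporal attention pooling (hypothetical shapes and weights) ---
rng = np.random.default_rng(0)
T, D, L = 5, 8, 3                      # 5 frames, 8-dim features, 3 labels
frames = rng.normal(size=(T, D))       # frame-level features
w_att = rng.normal(size=D)             # learned frame-scoring vector

scores = frames @ w_att                # one relevance score per frame
alpha = softmax(scores)                # attention weights, sum to 1
video_repr = alpha @ frames            # (D,) attention-weighted frame average

# --- Conditional label inference: p(y) = prod_k p(y_k | x, y_<k) ---
# Each label's logit also sees previously predicted labels, capturing
# label correlations instead of using independent per-label sigmoids.
W_x = rng.normal(size=(L, D))          # input-to-label weights (illustrative)
W_y = rng.normal(size=(L, L))          # label-dependency weights (illustrative)
y_prev = np.zeros(L)
probs = np.zeros(L)
for k in range(L):
    logit = W_x[k] @ video_repr + W_y[k] @ y_prev
    probs[k] = 1.0 / (1.0 + np.exp(-logit))
    y_prev[k] = probs[k] > 0.5         # greedy conditioning on earlier labels
```

In practice the greedy left-to-right conditioning shown here is the simplest decoding choice; classifier-chain methods also explore beam search or sampling over label orderings.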


Keywords: Multimodal sequential learning, Attention, Multi-label classification, Video understanding



This work was partly supported by the Institute for Information & Communications Technology Promotion (R0126-16-1072-SW.StarLab, 2017-0-01772-VTT, 2018-0-00622-RMI) and the Korea Evaluation Institute of Industrial Technology (10060086-RISF), grants funded by the Korea government (MSIP, DAPA).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Eun-Sol Kim (1, email author)
  • Kyoung-Woon On (2)
  • Jongseok Kim (1)
  • Yu-Jung Heo (2)
  • Seong-Ho Choi (2)
  • Hyun-Dong Lee (2)
  • Byoung-Tak Zhang (2)
  1. Kakao Brain, Seongnam, South Korea
  2. Seoul National University, Seoul, South Korea
