A Multi-tier Fusion Strategy for Event Classification in Unconstrained Videos

  • Prithwish Jana
  • Swarnabja Bhaumik
  • Partha Pratim Mohanta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11942)


In this paper, we propose a novel strategy for fusing the prediction vectors obtained from two different deep neural networks designed for event recognition in unconstrained videos. Each video is represented by a set of key-frames. Two types of features, spatial and temporal, are computed for each video using a hybrid pre-trained CNN-RNN (Convolutional Neural Network - Recurrent Neural Network) framework. These features capture both the transient and the long-term dependencies needed to understand the events. Frame-level and video-level prediction vectors are generated from two separate CNN-RNN (ResNet50-LSTM) frameworks that exploit the spatial and temporal features, respectively. Fusion is performed on these prediction vectors at several levels. The entire fusion framework rests on the consolidation of probability distributions, implemented using conflation together with a biasing technique. A multi-level fusion is thus achieved, and a significant improvement in classification accuracy is observed at each level. Experiments are performed on four benchmark datasets: Columbia Consumer Video (CCV), Kodak's Consumer Video, UCF-101 and the Human Motion Database (HMDB). The increase in mAP achieved by the proposed fusion strategy is much higher than that of conventional fusion strategies, and the classification accuracies on all four datasets are comparable to other state-of-the-art methods for event classification in unconstrained videos.
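The consolidation step described above relies on conflation, which combines several probability distributions into one by taking their normalized element-wise product, so classes on which the input distributions agree are reinforced. A minimal sketch in Python (the function name and the toy three-class prediction vectors are illustrative only; the paper's biasing technique and multi-level scheme are not reproduced here):

```python
import numpy as np

def conflate(*dists, eps=1e-12):
    """Conflation of discrete probability distributions:
    normalized element-wise product of the inputs.
    eps guards against a single zero entry annihilating a class."""
    prod = np.ones_like(np.asarray(dists[0], dtype=float))
    for d in dists:
        prod *= np.asarray(d, dtype=float) + eps
    return prod / prod.sum()

# Toy class-probability vectors from a spatial and a temporal stream
spatial = np.array([0.6, 0.3, 0.1])
temporal = np.array([0.5, 0.4, 0.1])
fused = conflate(spatial, temporal)  # sums to 1; class 0 is reinforced
```

Because the product sharpens agreement, the fused vector here concentrates more mass on class 0 than either input stream does on its own.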


Keywords: Motion and video analysis · Event classification · Deep neural networks · Spatio-temporal features · Fusion · Conflation



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Prithwish Jana, Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  2. Swarnabja Bhaumik, Department of Computer Science and Engineering, Meghnad Saha Institute of Technology, Kolkata, India
  3. Partha Pratim Mohanta, Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India
