
Multimedia event detection with multimodal feature fusion and temporal concept localization

  • Special Issue Paper
Machine Vision and Applications


We present a system for multimedia event detection. The system characterizes complex multimedia events using a large array of multimodal features and classifies unseen videos by effectively fusing diverse responses. We describe three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities; in particular, we build mid-level and high-level features on top of low-level features, often in an unsupervised manner, to enable semantic understanding. Second, we present a novel Latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, the model produces a summary for every retrieved video through its use of high-level concepts and temporal evidence localization. This summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and a methodology for improving fusion learning when training data are limited. A thorough evaluation on the large TRECVID MED 2011 dataset demonstrates the benefits of the presented system.
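The two ideas named in the abstract, temporal evidence localization and score fusion, can be illustrated with a minimal sketch. It is not the authors' model. It assumes a generic latent-variable scoring rule (the video score is the maximum linear score over temporal segments, and the maximizing segment is reported as the evidence) and a plain weighted-average late fusion. The function names, weights, and concept scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def score_segment(weights: Sequence[float], concept_scores: Sequence[float]) -> float:
    """Linear score of one temporal segment from its concept responses."""
    return sum(w * x for w, x in zip(weights, concept_scores))


def detect_and_localize(
    weights: Sequence[float], segments: Sequence[Sequence[float]]
) -> Tuple[float, int]:
    """Generic latent-variable inference (an assumption, not the paper's model).

    The segment index is treated as the latent variable: the video score is
    the best segment score, and the arg-max segment is the localized evidence.
    """
    if not segments:
        raise ValueError("a video needs at least one segment")
    scores = [score_segment(weights, seg) for seg in segments]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def late_fusion(scores: Sequence[float], fusion_weights: Sequence[float]) -> float:
    """Weighted average of per-modality detection scores (hypothetical weights)."""
    total = sum(fusion_weights)
    if total == 0:
        raise ValueError("fusion weights must not sum to zero")
    return sum(w * s for w, s in zip(fusion_weights, scores)) / total
```

Calling `detect_and_localize` on a list of per-segment concept scores returns both the video-level score and the index of the segment that produced it, which is the kind of temporal evidence a summary could point to. The fusion learning described in the abstract would learn the fusion weights from data, whereas here they are supplied by hand.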


  1. Note that our use of the terms “mid-level” and “high-level” may differ from their use in other work.

  2. The TRECVID MED’12 dataset is larger; however, its ground truth will not be publicly released for several years.




This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

Author information

Corresponding author

Correspondence to Sangmin Oh.

About this article

Cite this article

Oh, S., McCloskey, S., Kim, I. et al. Multimedia event detection with multimodal feature fusion and temporal concept localization. Machine Vision and Applications 25, 49–69 (2014).
