Multimedia event detection with multimodal feature fusion and temporal concept localization

Oh, Sangmin; McCloskey, Scott; Kim, Ilseo; Vahdat, Arash; Cannons, Kevin J.; Hajimirsadeghi, Hossein; Mori, Greg; Perera, A. G. Amitha; Pandey, Megha; Corso, Jason J.

doi:10.1007/s00138-013-0525-x

Special Issue Paper
Published: 16 July 2013

Machine Vision and Applications
Volume 25, pages 49–69 (2014)

Sangmin Oh¹,
Scott McCloskey²,
Ilseo Kim¹,
Arash Vahdat³,
Kevin J. Cannons³,
Hossein Hajimirsadeghi³,
Greg Mori³,
A. G. Amitha Perera¹,
Megha Pandey¹ &
Jason J. Corso⁴

Abstract

We present a system for multimedia event detection. The system characterizes complex multimedia events using a large array of multimodal features and classifies unseen videos by effectively fusing diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities; this includes building mid-level and high-level features on top of low-level features, often in an unsupervised manner, to enable semantic understanding. Second, we present a novel Latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, the model produces a unique summary for every retrieved video through its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms, along with our methodology for improving fusion learning when training data are limited. A thorough evaluation on the large TRECVID MED 2011 dataset showcases the benefits of the presented system.
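
The abstract does not give the model's formulation, so the following is only a minimal sketch of the general idea behind latent-variable scoring with temporal localization, not the authors' method. It assumes a video is split into segments, that each segment carries a vector of high-level concept responses, and that the latent variable is simply the index of the segment that best supports the event. The names `latent_score` and `top_concepts`, the linear scoring rule, and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def latent_score(
    segments: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> Tuple[float, int]:
    """Return (video score, index of the best-scoring segment).

    Each entry of `segments` holds hypothetical concept-detector responses
    for one temporal segment. The latent variable is the segment index,
    and inference takes the maximum linear score over it.
    """
    if not segments:
        raise ValueError("video must contain at least one segment")
    best_score, best_index = float("-inf"), -1
    for index, responses in enumerate(segments):
        if len(responses) != len(weights):
            raise ValueError("segment and weight dimensions differ")
        score = sum(w * r for w, r in zip(weights, responses)) + bias
        if score > best_score:
            best_score, best_index = score, index
    return best_score, best_index


def top_concepts(
    responses: Sequence[float],
    weights: Sequence[float],
    names: Sequence[str],
    k: int = 2,
) -> List[str]:
    """Name the k concepts contributing most to one segment's score."""
    contributions = sorted(
        ((w * r, name) for w, r, name in zip(weights, responses, names)),
        reverse=True,
    )
    return [name for _, name in contributions[:k]]


# Illustrative (made-up) responses for three segments and three concepts.
video = [[0.1, 0.2, 0.0], [0.9, 0.1, 0.8], [0.3, 0.3, 0.3]]
weights = [1.0, -0.5, 2.0]
names = ["concept_a", "concept_b", "concept_c"]

score, where = latent_score(video, weights)
summary = top_concepts(video[where], weights, names)
```

In this toy setting, the selected segment index plays the role of temporal evidence, and the highest-contributing concepts in that segment form a crude summary of why the video scored as it did. A trained model would learn the weights from data rather than fix them by hand.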

Notes

Note that our use of the terms “mid-level” and “high-level” may differ from their use in other work.
The TRECVID MED’12 dataset is larger; however, its ground truth will not be publicly released for several years.

Acknowledgments

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

Author information

Authors and Affiliations

Kitware Inc., Clifton Park, New York, USA
Sangmin Oh, Ilseo Kim, A. G. Amitha Perera & Megha Pandey
Honeywell Labs, Minneapolis, USA
Scott McCloskey
School of Computing Science, Simon Fraser University, Burnaby, Canada
Arash Vahdat, Kevin J. Cannons, Hossein Hajimirsadeghi & Greg Mori
Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, USA
Jason J. Corso

Corresponding author

Correspondence to Sangmin Oh.

About this article

Cite this article

Oh, S., McCloskey, S., Kim, I. et al. Multimedia event detection with multimodal feature fusion and temporal concept localization. Machine Vision and Applications 25, 49–69 (2014). https://doi.org/10.1007/s00138-013-0525-x

Received: 11 January 2013
Revised: 23 May 2013
Accepted: 30 May 2013
Published: 16 July 2013
Issue Date: January 2014
DOI: https://doi.org/10.1007/s00138-013-0525-x
