Semantic Concept Annotation of Consumer Videos at Frame-Level Using Audio
Abstract
With the increasing use of audio sensors in user-generated content (UGC) collection, semantic concept annotation from audio streams has become an important research problem. Huawei launched a grand challenge at the International Conference on Multimedia & Expo (ICME) 2014: the Huawei Accurate and Fast Mobile Video Annotation Challenge. In this paper, we present our semantic concept annotation system for the Huawei challenge, which uses the audio stream only. The system extracts the audio stream from the video data and then extracts low-level acoustic features from the audio stream. A bag-of-features representation is built from the low-level features and used as the input to train support vector machine (SVM) concept classifiers. The experimental results show that our audio-only concept annotation system detects semantic concepts significantly better than random guessing. It can also provide important complementary information to a visual-based concept annotation system and thereby boost performance.
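The bag-of-features step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the acoustic features, the codebook size, or the clustering method, so the k-means codebook, the Euclidean distance, and the helper names `nearest`, `kmeans`, and `bag_of_features` are all assumptions made for illustration. The SVM training step is omitted; the resulting histogram is what such a classifier would take as input.

```python
import math
import random


def nearest(codebook, frame):
    """Return the index of the codeword closest to `frame` (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(codebook[k], frame))


def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed codebook learner); returns k codewords."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(codebook, f)].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old codeword if its cluster went empty
                codebook[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook


def bag_of_features(frames, codebook):
    """L1-normalised histogram of codeword assignments for one clip."""
    hist = [0.0] * len(codebook)
    for f in frames:
        hist[nearest(codebook, f)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy stand-in for low-level acoustic feature vectors (one per audio frame).
rng = random.Random(1)
low = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
high = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(50)]

codebook = kmeans(low + high, k=2)
clip_hist = bag_of_features(low[:10] + high[:30], codebook)
```

Here `clip_hist` is a fixed-length vector regardless of how many frames the clip contains, which is what lets clips of different durations be fed to the same classifier.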
Keywords
Semantic Concept Annotation, Video Content Analysis, Soundtrack Analysis