Semantic Concept Annotation of Consumer Videos at Frame-Level Using Audio

  • Junwei Liang
  • Qin Jin
  • Xixi He
  • Gang Yang
  • Jieping Xu
  • Xirong Li
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8879)


With the increasing use of audio sensors in user-generated content (UGC) collection, semantic concept annotation using audio streams has become an important research problem. Huawei initiated a grand challenge at the International Conference on Multimedia & Expo (ICME) 2014: the Huawei Accurate and Fast Mobile Video Annotation Challenge. In this paper, we present our semantic concept annotation system for the Huawei challenge, which uses only the audio stream. The system extracts the audio stream from the video data and then extracts low-level acoustic features from that stream. A bag-of-features representation is generated from the low-level features and used as the input feature to train support vector machine (SVM) concept classifiers. The experimental results show that our audio-only concept annotation system detects semantic concepts significantly better than random guessing. It also provides important complementary information to a visual-based concept annotation system, boosting overall performance.
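The bag-of-features step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the acoustic feature type, the codebook size, or how frames are assigned to codewords, so everything below is an assumption made for illustration: the function name `bag_of_features`, the toy 2-D feature vectors, the three-entry codebook, and the hard nearest-codeword assignment are all hypothetical. The sketch only shows how a variable number of frame-level feature vectors can be turned into a fixed-length histogram of the kind that could then be fed to an SVM.

```python
from math import dist


def bag_of_features(frames, codebook):
    """Turn a variable-length list of frame-level feature vectors into a
    fixed-length, L1-normalized histogram over codewords.

    Assumption: hard assignment to the nearest codeword (Euclidean distance).
    """
    counts = [0] * len(codebook)
    for frame in frames:
        # Index of the codeword closest to this frame.
        nearest = min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    # An empty clip yields an all-zero histogram rather than a division error.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy data (hypothetical): a 3-word codebook and 4 frames of 2-D features.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
histogram = bag_of_features(frames, codebook)
print(histogram)
```

In practice the codebook would typically be learned by clustering features from training data, and each clip's histogram would serve as one training example for a per-concept classifier; those choices are not specified here.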


Keywords: Semantic Concept Annotation, Video Content Analysis, Soundtrack Analysis





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Junwei Liang (1)
  • Qin Jin (1)
  • Xixi He (1)
  • Gang Yang (1)
  • Jieping Xu (1)
  • Xirong Li (1)
  1. Multimedia Computing Lab, School of Information, Renmin University of China, China
