Multi-label audio concept detection using correlated-aspect Gaussian Mixture Model

Multimedia Tools and Applications

Abstract

Audio concept detection is inherently a multi-label classification problem, yet it is usually solved by treating each concept independently, which discards useful information about how concepts correlate. This paper proposes a new model, the Correlated-Aspect Gaussian Mixture Model (C-AGMM), that exploits this correlation to improve multi-label audio concept detection. C-AGMM builds on the Aspect Gaussian Mixture Model (AGMM), which improves on the GMM by embedding it in probabilistic Latent Semantic Analysis (pLSA). Like AGMM, C-AGMM learns a probabilistic model of the whole audio clip in which concepts are the component elements. Unlike AGMM, which assumes that concepts are independent of one another, C-AGMM treats concepts as distributed on a sub-manifold embedded in the ambient space. Under the assumption that two concepts that are close in the intrinsic geometry of this distribution are likely to have similar conditional probability distributions, a graph regularizer is used to model the correlation between concepts. Following the maximum likelihood estimation principle, the C-AGMM parameters that encode concept correlation are derived and used directly as the detection criterion. Experiments on two datasets show the effectiveness of the proposed model.
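
The graph-regularizer idea in the abstract can be illustrated with a small sketch. This is not the paper's formulation, which is not visible in this preview. It assumes a generic Laplacian-style smoothness penalty with squared Euclidean distance, a k-nearest-neighbour concept graph, and hypothetical names (`knn_affinity`, `graph_regularizer`, `features`, `P`). Row i of `P` stands for the conditional distribution associated with concept i. The penalty is small when neighbouring concepts have similar distributions.

```python
import numpy as np

def knn_affinity(features, k=1):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix (assumed graph construction)."""
    n = features.shape[0]
    # Pairwise squared Euclidean distances between concept descriptors.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a concept is not its own neighbour
    W = np.zeros((n, n))
    nearest = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)  # symmetrise the graph

def graph_regularizer(P, W):
    """Return 0.5 * sum_ij W_ij * ||P_i - P_j||^2, computed as trace(P^T L P)."""
    L = np.diag(W.sum(axis=1)) - W  # unnormalised graph Laplacian
    return float(np.trace(P.T @ L @ P))

# Hypothetical toy data: four concepts forming two tight pairs.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W = knn_affinity(features, k=1)

# Neighbouring concepts share a distribution (smooth) or disagree (rough).
P_smooth = np.array([[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.2, 0.8]])
P_rough = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.8, 0.2]])

penalty_smooth = graph_regularizer(P_smooth, W)
penalty_rough = graph_regularizer(P_rough, W)
print(penalty_smooth, penalty_rough)
```

In a regularized maximum likelihood setting, a term of this kind would typically be subtracted from the log-likelihood with a weighting coefficient, so that parameter estimates are pulled toward agreement between correlated concepts. How C-AGMM weights and optimizes its regularizer is described in the full article, not here.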

Acknowledgments

This work is supported by the National Science Foundation of China (61273274, 4123104), National 973 Key Research Program of China (2011CB302203), Ph.D. Programs Foundation of Ministry of Education of China (20100009110004), National Key Technology R&D Program of China (2012BAH01F03) and Tsinghua-Tencent Joint Lab for IIT.

Author information

Corresponding author

Correspondence to Cencen Zhong.

About this article

Cite this article

Zhong, C., Miao, Z. Multi-label audio concept detection using correlated-aspect Gaussian Mixture Model. Multimed Tools Appl 74, 4817–4832 (2015). https://doi.org/10.1007/s11042-013-1842-9

  • DOI: https://doi.org/10.1007/s11042-013-1842-9
