
Variability modelling for audio events detection in movies

Published in: Multimedia Tools and Applications

Abstract

Detecting audio events in Hollywood movies is a complex task due to the variability between the soundtracks of different movies. This inter-movie variability is shown to impair audio event detection results in a realistic framework. In this article, we propose to model the variability using a factor analysis technique, which we then use to compensate the audio features. The factor analysis compensation is validated using the state-of-the-art system based on multiple audio word sequences and contextual Bayesian networks that we previously developed in Penet et al. (2013). Results obtained on the same publicly available dataset for the detection of gunshots and explosions show an improvement in the handling of the variability, while keeping the robustness capabilities of the previous system. Furthermore, the system is applied to the detection of screams and proves its ability to generalise to other types of events. The obtained results also emphasise that, in addition to modelling variability, adding concepts to the system may also be beneficial for the precision rates.
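
As a rough illustration of the feature-domain compensation idea described in the abstract, the sketch below removes a factor-analysis-style nuisance offset from frame-level audio features, in the spirit of Vair et al. [30] and Matrouf et al. [18]. Everything in it is assumed for illustration only: the function name, the array shapes, and the fact that the low-rank variability matrix U and the per-movie latent factor x have already been estimated. It is a sketch of the general technique, not the authors' implementation.

    import numpy as np

    def compensate_features(feats, ubm_means, ubm_covs, ubm_weights, U, x):
        """Subtract a movie-dependent nuisance offset from each audio frame.

        feats       : (T, D) frame-level features of one movie
        ubm_means   : (G, D) means of a diagonal-covariance UBM
        ubm_covs    : (G, D) diagonal covariances of the UBM
        ubm_weights : (G,)   mixture weights of the UBM
        U           : (G, D, R) low-rank variability matrix (hypothetical layout)
        x           : (R,)   latent factor already estimated for this movie
        """
        T, G = feats.shape[0], ubm_means.shape[0]

        # Frame-by-frame Gaussian posteriors gamma_g(t) under the UBM.
        log_g = np.empty((T, G))
        for g in range(G):
            diff = feats - ubm_means[g]
            log_g[:, g] = (np.log(ubm_weights[g])
                           - 0.5 * np.sum(np.log(2.0 * np.pi * ubm_covs[g]))
                           - 0.5 * np.sum(diff ** 2 / ubm_covs[g], axis=1))
        log_g -= log_g.max(axis=1, keepdims=True)
        post = np.exp(log_g)
        post /= post.sum(axis=1, keepdims=True)

        # Per-Gaussian nuisance offset U_g x, blended by the posteriors and
        # removed from every frame: o'(t) = o(t) - sum_g gamma_g(t) U_g x.
        offsets = np.einsum('gdr,r->gd', U, x)   # (G, D)
        return feats - post @ offsets            # (T, D)

Compensating in the feature domain rather than in the model domain leaves the downstream stages untouched, which is consistent with reusing the audio-word quantisation and Bayesian-network pipeline of Penet et al. (2013) on the compensated frames.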

Notes

  1. This corresponds to \(E[\mathbf{X}]\) in g, multiplied by |X|.

  2. Silence segmentation is performed using the AudioSeg software, gforge.inria.fr/projects/audioseg.

  3. via the Yael & LibPQ libraries.

  4. The dataset and its annotations are publicly available at: https://research.technicolor.com/rennes/vsd/.

References

  1. André-Obrecht R (1988) A new statistical approach for the automatic segmentation of continuous speech signals. IEEE Trans Acoust Speech Signal Process 36(1):29–40

  2. Atrey PK, Maddage NC, Kankanhalli MS (2006) Audio based event detection for multimedia surveillance. In: 31st IEEE international conference on acoustics, speech and signal processing. Toulouse, pp V–V

  3. Bello JP, Duxbury C, Davies M, Sandler M (2004) On the use of phase and energy for musical onset detection in the complex domain. IEEE Signal Process Lett 11(6):553–556

  4. Bonastre JF, Scheffer N, Matrouf D, Fredouille C, Larcher A, Preti A, Pouchoulin G, Evans N, Fauve B, Mason J (2008) ALIZE/SpkDet: a state-of-the-art open source software for speaker Recognition. In: IEEE proceedings of odyssey: the speaker and language recognition workshop. Stellenbosch

  5. Burred JJ (2012) Genetic motif discovery applied to audio analysis. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 361–364

  6. Chin ML, Burred JJ (2012) Audio event detection based on layered symbolic sequence representations. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 1953–1956

  7. Demarty C-H, Penet C, Gravier G, Soleymani M (2012) A benchmarking campaign for the multimodal detection of violent scenes in movies. In: ECCV 2012 workshop on information fusion in computer vision for concept recognition. Firenze, pp 416–425

  8. Gravier G, Demarty C-H, Baghdadi S, Gros P (2012) Classification-oriented structure learning in Bayesian networks for multimodal event detection in videos. Multimed Tools Appl, pp 1–17

  9. Giannakopoulos T, Kosmopoulos DI, Aristidou A, Theodoridis S (2007) A multi-class audio classification method with respect to violent content in movies using Bayesian networks. In: IEEE workshop on multimedia signal processing. Crete, pp 90–93

  10. Ionescu B, Schluter J, Mironică I, Schedl M (2013) A naïve mid-level concept-based fusion approach to violence detection in Hollywood movies. In: ACM international conference on multimedia retrieval. Dallas, pp 215–222

  11. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128

  12. Jin Q, Schulam PF, Rawat S, Burger S, Ding D, Metze F (2012) Event-based video retrieval using audio. In: INTERSPEECH. Portland

  13. Joder C, Essid S, Richard G (2009) Temporal integration for audio classification with application to musical instrument classification. IEEE Trans Audio, Speech, Lang Process 17(1):174–186

  14. Kenny P (2006) Joint factor analysis of speaker and session variability: theory and algorithms. Technical report, Centre de recherche en informatique de Montréal (CRIM)

  15. Kuncheva LI (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience

  16. Lucas P (2002) Restricted Bayesian network structure learning. In: Advances in Bayesian networks, studies in fuzziness and soft computing. pp 217–232

  17. McKinney MF, Breebaart J (2003) Features for audio and music classification. In: Proceeding of the international society for music information retrieval. Washington, DC, pp 151–158

  18. Matrouf D, Scheffer N, Fauve B, Bonastre JF (2007) A straightforward and efficient implementation of the factor analysis model for speaker verification. In: INTERSPEECH. Antwerp, pp 1242–1245

  19. Matrouf D, Verdet F, Rouvier M, Bonastre JF, Linares G (2011) Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification. Comput Speech Lang 25(3):481–498

  20. Ono N, Miyamoto K, Kameoka H, Sagayama S (2008) A real-time equalizer of harmonic and percussive components in music signals. In: International conference on music information retrieval. Philadelphia, pp 139–144

  21. Penet C, Demarty C-H, Gravier G, Gros P (2012) Multimodal information fusion and temporal integration for violence detection in movies. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 2393–2396

  22. Penet C, Demarty C-H, Gravier G, Gros P (2013) Audio event detection in movies using multiple audio words and contextual Bayesian networks. In: 11th international workshop on content-based multimedia indexing. Veszprém, pp 17–22

  23. Ramona M, Richard G (2009) Comparison of different strategies for a SVM-Based audio segmentation. In: European conference on signal processing. Glasgow, pp 20–24

  24. Rouvier M, Matrouf D, Linares G (2009) Factor analysis for audio-based video genre classification. In: INTERSPEECH. Brighton

  25. Saunders J (1996) Real-time discrimination of broadcast speech/music. In: IEEE international conference on acoustics, speech and signal processing, vol 2. Atlanta, pp 993–996

  26. Schluter J, Ionescu B, Mironica I, Schedl M (2012) ARF @ MediaEval 2012: an uninformed approach to violence detection in Hollywood movies. In: MediaEval 2012 workshop. ceur-ws.org

  27. Trancoso I, Pellegrini T, Portelo J, Meinedo H, Bugalho M, Abad A, Neto J (2009) Audio contributions to semantic video search. In: International conference on multimedia and expo. New York, pp 630–633

  28. Vogt R, Sridharan S (2008) Explicit modelling of session variability for speaker verification. Comput Speech Lang 22:17–38

  29. Valenzise G, Gerosa L, Tagliasacchi M, Antonacci F, Sarti A (2007) Scream and gunshot detection in noisy environments. In: EUSIPCO

  30. Vair C, Colibro D, Castaldo F, Dalmasso E, Laface P (2006) Channel factors compensation in model and feature domain for speaker recognition. In: IEEE proceedings of odyssey: the speaker and language recognition workshop. San Juan, pp 1–6

  31. Zhuang X, Tsakalidis S, Wu S, Natarajan P, Prasad R, Natarajan P (2012) Compact audio representation for event detection in consumer media. In: INTERSPEECH. Portland

Author information

Corresponding author

Correspondence to Cédric Penet.

Additional information

This work was partly achieved as part of the Quaero Program, funded by OSEO, the French State agency for innovation. We would like to acknowledge the MediaEval Multimedia Benchmark (http://www.multimediaeval.org/), and in particular the Affect Task 2011/2012, for providing the data used in this research.

About this article

Cite this article

Penet, C., Demarty, CH., Gravier, G. et al. Variability modelling for audio events detection in movies. Multimed Tools Appl 74, 1143–1173 (2015). https://doi.org/10.1007/s11042-014-2038-7
