Abstract
Detecting audio events in Hollywood movies is a complex task because of the variability between movie soundtracks. We show that this inter-movie variability impairs audio event detection results in a realistic setting. In this article, we propose to model this variability with a factor analysis technique, which we then use to compensate the audio features. The factor analysis compensation is validated with the state-of-the-art system based on multiple audio word sequences and contextual Bayesian networks that we previously developed in Penet et al. (2013). Results obtained on the same publicly available dataset for the detection of gunshots and explosions show better handling of the variability, while retaining the robustness of the previous system. Furthermore, the system is applied to the detection of screams and proves able to generalise to other types of events. The results also emphasise that, in addition to modelling variability, adding concepts to the system may be beneficial for the precision rates.
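The feature-domain compensation the abstract refers to can be sketched in a heavily simplified, single-Gaussian form (the paper builds on a GMM/UBM factor-analysis model; the subspace `U`, the toy data, and all names below are illustrative assumptions, not the authors' implementation): the movie-specific nuisance is assumed to live in a low-rank subspace, its factor `w` is estimated from the frames of one movie, and `U @ w` is subtracted from every feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 13-dim features, a rank-2 "inter-movie variability" subspace U.
dim, rank, n_frames = 13, 2, 200
m = np.zeros(dim)                  # background-model mean (single-Gaussian simplification)
sigma = np.ones(dim)               # diagonal covariance
U = rng.normal(size=(dim, rank))   # nuisance subspace (trained on many movies in practice)

# Simulate frames from one movie: clean features plus a movie-specific offset U @ w_true.
w_true = rng.normal(size=rank)
X = rng.normal(size=(n_frames, dim)) + U @ w_true

def compensate(X, m, sigma, U):
    """MAP-estimate the session factor w, then subtract U @ w from every frame."""
    n = X.shape[0]
    F = U.T @ ((X - m) / sigma).sum(axis=0)          # variance-normalised first-order stats
    A = np.eye(U.shape[1]) + n * U.T @ (U / sigma[:, None])  # posterior precision of w
    w = np.linalg.solve(A, F)
    return X - U @ w, w

X_comp, w_hat = compensate(X, m, sigma, U)
```

With enough frames, `w_hat` approaches the simulated offset and the compensated features are re-centred on the background mean, which is the intended effect: per-movie nuisance is removed before the audio-word quantisation stage.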
Notes
This corresponds to \(E\left [\mathbf {X}\right ]\) in g, multiplied by |X|.
Silence segmentation is performed using the AudioSeg software, gforge.inria.fr/projects/audioseg.
via the Yael & LibPQ libraries.
The dataset and its annotations are publicly available at: https://research.technicolor.com/rennes/vsd/.
References
Andre-Obrecht R (1988) A new statistical approach for the automatic segmentation of continuous speech signals. IEEE Trans Acoust Speech Signal Process 36(1):29–40
Atrey PK, Maddage NC, Kankanhalli MS (2006) Audio based event detection for multimedia surveillance. In: 31st IEEE international conference on acoustics, speech and signal processing. Toulouse, pp V–V
Bello JP, Duxbury C, Davies M, Sandler M (2004) On the use of phase and energy for musical onset detection in the complex domain. IEEE Signal Process Lett 11(6):553–556
Bonastre JF, Scheffer N, Matrouf D, Fredouille C, Larcher A, Preti A, Pouchoulin G, Evans N, Fauve B, Mason J (2008) ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In: IEEE proceedings of odyssey: the speaker and language recognition workshop. Stellenbosch
Burred JJ (2012) Genetic motif discovery applied to audio analysis. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 361–364
Chin ML, Burred JJ (2012) Audio event detection based on layered symbolic sequence representations. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 1953–1956
Demarty C-H, Penet C, Gravier G, Soleymani M (2012) A benchmarking campaign for the multimodal detection of violent scenes in movies. In: ECCV 2012 workshop on information fusion in computer vision for concept recognition. Firenze, pp 416–425
Gravier G, Demarty C-H, Baghdadi S, Gros P (2012) Classification-oriented structure learning in Bayesian networks for multimodal event detection in videos. Multimed Tools Appl, pp 1–17
Giannakopoulos T, Kosmopoulos DI, Aristidou A, Theodoridis S (2007) A multi-class audio classification method with respect to violent content in movies using Bayesian networks. In: IEEE workshop on multimedia signal processing. Crete, pp 90–93
Ionescu B, Schluter J, Mironicǎ I, Schedl M (2013) A naïve mid-level concept-based fusion approach to violence detection in Hollywood movies. In: ACM international conference on multimedia retrieval. Dallas, Texas, pp 215–222
Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
Jin Q, Schulam PF, Rawat S, Burger S, Ding D, Metze F (2012) Event-based video retrieval using audio. In: INTERSPEECH. Portland
Joder C, Essid S, Richard G (2009) Temporal integration for audio classification with application to musical instrument classification. IEEE Trans Audio Speech Lang Process 17(1):174–186
Kenny P (2006) Joint factor analysis of speaker and session variability: theory and algorithms. Technical report, Centre de recherche en informatique de Montréal (CRIM)
Kuncheva LI (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience
Lucas P (2002) Restricted Bayesian network structure learning. In: Advances in Bayesian networks, studies in fuzziness and soft computing. pp 217–232
McKinney MF, Breebaart J (2003) Features for audio and music classification. In: Proceeding of the international society for music information retrieval. Washington, DC, pp 151–158
Matrouf D, Scheffer N, Fauve B, Bonastrem JF (2007) A straightforward and efficient implementation of the factor analysis model for speaker verification. In: INTERSPEECH. Antwerp, pp 1242–1245
Matrouf D, Verdet F, Rouvier M, Bonastre JF, Linares G (2011) Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification. Comput Speech Lang 25(3):481–498
Ono N, Miyamoto K, Kameoka H, Sagayama S (2008) A real-time equalizer of harmonic and percussive components in music signals. In: International symposium on music information retrieval. Philadelphia, pp 139–144
Penet C, Demarty C-H, Gravier G, Gros P (2012) Multimodal information fusion and temporal integration for violence detection in movies. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 2393–2396
Penet C, Demarty C-H, Gravier G, Gros P (2013) Audio event detection in movies using multiple audio words and contextual Bayesian networks. In: 11th international workshop on content-based multimedia indexing. Veszprém, pp 17–22
Ramona M, Richard G (2009) Comparison of different strategies for a SVM-Based audio segmentation. In: European conference on signal processing. Glasgow, pp 20–24
Rouvier M, Matrouf D, Linarès G (2009) Factor analysis for audio-based video genre classification. In: INTERSPEECH. Brighton
Saunders J (1996) Real-time discrimination of broadcast speech/music. In: IEEE international conference on acoustics, speech and signal processing, vol 2. Atlanta, pp 993–996
Schluter J, Ionescu B, Mironica I, Schedl M (2012) ARF @ MediaEval 2012: an uninformed approach to violence detection in Hollywood movies. In: MediaEval 2012 workshop. ceur-ws.org
Trancoso I, Pellegrini T, Portelo J, Meinedo H, Bugalho M, Abad A, Neto J (2009) Audio contributions to semantic video search. In: International conference on multimedia and expo. New York, pp 630–633
Vogt R, Sridharan S (2008) Explicit modelling of session variability for speaker verification. Comput Speech Lang 22:17–38
Valenzise G, Gerosa L, Tagliasacchi M, Antonacci F, Sarti A (2007) Scream and gunshot detection in noisy environments. In: EUSIPCO
Vair C, Colibro D, Castaldo F, Dalmasso E, Laface P (2006) Channel factors compensation in model and feature domain for speaker recognition. In: IEEE proceedings of odyssey: the speaker and language recognition workshop. San Juan pp 1–6
Zhuang X, Tsakalidis S, Wu S, Natarajan P, Prasad R, Natarajan P (2012) Compact audio representation for event detection in consumer media. In: INTERSPEECH. Portland
Additional information
This work was partly achieved as part of the Quaero Program, funded by OSEO, French State agency for innovation. We would like to acknowledge the MediaEval Multimedia Benchmark http://www.multimediaeval.org/ and in particular the Affect Task 2011/2012 for providing the data used in this research.
Cite this article
Penet, C., Demarty, CH., Gravier, G. et al. Variability modelling for audio events detection in movies. Multimed Tools Appl 74, 1143–1173 (2015). https://doi.org/10.1007/s11042-014-2038-7