Abstract
Detecting audio events in Hollywood movies is a complex task because of the variability between movie soundtracks. We show that this inter-movie variability impairs audio event detection results in a realistic setting. In this article, we propose to model this variability with a factor analysis technique, which we then use to compensate the audio features. The factor analysis compensation is validated with the state-of-the-art system based on multiple audio word sequences and contextual Bayesian networks that we previously developed in Penet et al. (2013). Results obtained on the same publicly available dataset for the detection of gunshots and explosions show better handling of the variability, while retaining the robustness of the previous system. Furthermore, the system is applied to the detection of screams and proves able to generalise to other types of events. The results also emphasise that, in addition to modelling variability, adding concepts to the system may be beneficial for the precision rates.
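The feature-domain compensation the abstract refers to can be sketched in a heavily simplified, single-Gaussian form (the paper builds on a GMM/UBM factor-analysis model; the subspace `U`, the toy data, and all names below are illustrative assumptions, not the authors' implementation): the movie-specific nuisance is assumed to live in a low-rank subspace, its factor `w` is estimated from the frames of one movie, and `U @ w` is subtracted from every feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 13-dim features, a rank-2 "inter-movie variability" subspace U.
dim, rank, n_frames = 13, 2, 200
m = np.zeros(dim)                  # background-model mean (single-Gaussian simplification)
sigma = np.ones(dim)               # diagonal covariance
U = rng.normal(size=(dim, rank))   # nuisance subspace (trained on many movies in practice)

# Simulate frames from one movie: clean features plus a movie-specific offset U @ w_true.
w_true = rng.normal(size=rank)
X = rng.normal(size=(n_frames, dim)) + U @ w_true

def compensate(X, m, sigma, U):
    """MAP-estimate the session factor w, then subtract U @ w from every frame."""
    n = X.shape[0]
    F = U.T @ ((X - m) / sigma).sum(axis=0)          # variance-normalised first-order stats
    A = np.eye(U.shape[1]) + n * U.T @ (U / sigma[:, None])  # posterior precision of w
    w = np.linalg.solve(A, F)
    return X - U @ w, w

X_comp, w_hat = compensate(X, m, sigma, U)
```

With enough frames, `w_hat` approaches the simulated offset and the compensated features are re-centred on the background mean, which is the intended effect: per-movie nuisance is removed before the audio-word quantisation stage.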
Notes
This corresponds to \(E\left [\mathbf {X}\right ]\) in g, multiplied by |X|.
Silence segmentation is performed using the AudioSeg software, gforge.inria.fr/projects/audioseg.
via the Yael & LibPQ libraries.
The dataset and its annotations are publicly available at: https://research.technicolor.com/rennes/vsd/.
References
Andre-Obrecht R (1988) A new statistical approach for the automatic segmentation of continuous speech signals. IEEE Trans Acoust Speech Signal Process 36(1):29–40
Atrey PK, Maddage NC, Kankanhalli MS (2006) Audio based event detection for multimedia surveillance. In: 31st IEEE international conference on acoustics, speech and signal processing. Toulouse, pp V–V
Bello JP, Duxbury C, Davies M, Sandler M (2004) On the use of phase and energy for musical onset detection in the complex domain. IEEE Signal Process Lett 11(6):553–556
Bonastre JF, Scheffer N, Matrouf D, Fredouille C, Larcher A, Preti A, Pouchoulin G, Evans N, Fauve B, Mason J (2008) ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In: IEEE proceedings of odyssey: the speaker and language recognition workshop. Stellenbosch
Burred JJ (2012) Genetic motif discovery applied to audio analysis. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 361–364
Chin ML, Burred JJ (2012) Audio event detection based on layered symbolic sequence representations. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 1953–1956
Demarty C-H, Penet C, Gravier G, Soleymani M (2012) A benchmarking campaign for the multimodal detection of violent scenes in movies. In: ECCV 2012 workshop on information fusion in computer vision for concept recognition. Firenze, pp 416–425
Gravier G, Demarty C-H, Baghdadi S, Gros P (2012) Classification-oriented structure learning in Bayesian networks for multimodal event detection in videos. Multimed Tools Appl, pp 1–17
Giannakopoulos T, Kosmopoulos DI, Aristidou A, Theodoridis S (2007) A multi-class audio classification method with respect to violent content in movies using Bayesian networks. In: IEEE workshop on multimedia signal processing. Crete, pp 90–93
Ionescu B, Schluter J, Mironicǎ I, Schedl M (2013) A naïve mid-level concept-based fusion approach to violence detection in Hollywood movies. In: ACM international conference on multimedia retrieval. Dallas, Texas, pp 215–222
Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
Jin Q, Schulam PF, Rawat S, Burger S, Ding D, Metze F (2012) Event-based video retrieval using audio. In: INTERSPEECH. Portland
Joder C, Essid S, Richard G (2009) Temporal integration for audio classification with application to musical instrument classification. IEEE Trans Audio Speech Lang Process 17(1):174–186
Kenny P (2006) Joint factor analysis of speaker and session variability: theory and algorithms. Technical report, Centre de recherche en informatique de Montréal (CRIM)
Kuncheva LI (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience
Lucas P (2002) Restricted Bayesian network structure learning. In: Advances in Bayesian networks, studies in fuzziness and soft computing. pp 217–232
McKinney MF, Breebaart J (2003) Features for audio and music classification. In: Proceeding of the international society for music information retrieval. Washington, DC, pp 151–158
Matrouf D, Scheffer N, Fauve B, Bonastrem JF (2007) A straightforward and efficient implementation of the factor analysis model for speaker verification. In: INTERSPEECH. Antwerp, pp 1242–1245
Matrouf D, Verdet F, Rouvier M, Bonastre JF, Linares G (2011) Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification. Comput Speech Lang 25(3):481–498
Ono N, Miyamoto K, Kameoka H, Sagayama S (2008) A real-time equalizer of harmonic and percussive components in music signals. In: International symposium on music information retrieval. Philadelphia, pp 139–144
Penet C, Demarty C-H, Gravier G, Gros P (2012) Multimodal information fusion and temporal integration for violence detection in movies. In: 37th international conference on acoustics, speech, and signal processing. Kyoto, pp 2393–2396
Penet C, Demarty C-H, Gravier G, Gros P (2013) Audio event detection in movies using multiple audio words and contextual Bayesian networks. In: 11th international workshop on content-based multimedia indexing. Veszprém, pp 17–22
Ramona M, Richard G (2009) Comparison of different strategies for a SVM-Based audio segmentation. In: European conference on signal processing. Glasgow, pp 20–24
Rouvier M, Matrouf D, Linarès G (2009) Factor analysis for audio-based video genre classification. In: INTERSPEECH. Brighton
Saunders J (1996) Real-time discrimination of broadcast speech/music. In: IEEE international conference on acoustics, speech and signal processing, vol 2. Atlanta, pp 993–996
Schluter J, Ionescu B, Mironica I, Schedl M (2012) ARF @ MediaEval 2012: an uninformed approach to violence detection in Hollywood movies. In: MediaEval 2012 workshop. ceur-ws.org
Trancoso I, Pellegrini T, Portelo J, Meinedo H, Bugalho M, Abad A, Neto J (2009) Audio contributions to semantic video search. In: International conference on multimedia and expo. New York, pp 630–633
Vogt R, Sridharan S (2008) Explicit modelling of session variability for speaker verification. Comput Speech Lang 22:17–38
Valenzise G, Gerosa L, Tagliasacchi M, Antonacci F, Sarti A (2007) Scream and gunshot detection in noisy environments. In: EUSIPCO
Vair C, Colibro D, Castaldo F, Dalmasso E, Laface P (2006) Channel factors compensation in model and feature domain for speaker recognition. In: IEEE proceedings of odyssey: the speaker and language recognition workshop. San Juan pp 1–6
Zhuang X, Tsakalidis S, Wu S, Natarajan P, Prasad R, Natarajan P (2012) Compact audio representation for event detection in consumer media. In: INTERSPEECH. Portland
Additional information
This work was partly achieved as part of the Quaero Program, funded by OSEO, French State agency for innovation. We would like to acknowledge the MediaEval Multimedia Benchmark http://www.multimediaeval.org/ and in particular the Affect Task 2011/2012 for providing the data used in this research.
Cite this article
Penet, C., Demarty, CH., Gravier, G. et al. Variability modelling for audio events detection in movies. Multimed Tools Appl 74, 1143–1173 (2015). https://doi.org/10.1007/s11042-014-2038-7