Real-Time Activity Detection in a Multi-Talker Reverberated Environment

Abstract

This paper proposes a real-time person activity detection framework operating in the presence of multiple sources in reverberated environments. The framework is composed of two main parts: the speech enhancement front-end and the activity detector. The aim of the former is to automatically reduce the distortions introduced by room reverberation in the available distant speech signals, and thus to achieve a significant improvement of speech quality for each speaker. The front-end itself is composed of three cooperating blocks, each fulfilling a specific task: speaker diarization, room impulse response identification, and speech dereverberation. In particular, the speaker diarization algorithm is essential to pilot the operations performed in the other two stages in accordance with the speakers' activity in the room. The activity estimation algorithm is based on bidirectional Long Short-Term Memory (BLSTM) networks, which allow for context-sensitive activity classification from audio feature functionals extracted with the real-time speech feature extraction toolkit openSMILE. Extensive computer simulations have been performed on a subset of the AMI database for activity evaluation in meetings; the obtained results confirm the effectiveness of the approach.
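
To make the two stages concrete, two minimal sketches follow; neither is the authors' implementation. The first illustrates only the control flow in which diarization pilots the other two front-end blocks: every class and helper in it is an illustrative stub (an assumption, not taken from the paper), and a real system would run blind channel identification and inverse filtering where the stubs stand.

```python
# Hypothetical front-end orchestration: speaker diarization pilots RIR
# identification and dereverberation, as described in the abstract.
# All classes below are illustrative stubs.
import numpy as np

class Diarizer:
    """Stub: returns the active speaker's index, or None during silence."""
    def label(self, frame):
        return 0 if np.abs(frame).mean() > 1e-3 else None

class RIRIdentifier:
    """Stub: holds a running room impulse response estimate for one speaker."""
    def __init__(self, length=512):
        self.h = np.zeros(length)
        self.h[0] = 1.0  # start from an identity (anechoic) channel
    def update(self, frame):
        pass  # a real system would run blind channel identification here

def dereverberate(frame, h):
    return frame  # a real system would inverse-filter with the RIR estimate

def enhance(frame, diarizer, rir_ids):
    speaker = diarizer.label(frame)   # diarization decides who is active
    if speaker is None:
        return frame                  # silence: nothing to adapt or filter
    rir_ids[speaker].update(frame)    # adapt only that speaker's RIR estimate
    return dereverberate(frame, rir_ids[speaker].h)

# Example: process one frame for a two-speaker room.
rir_ids = [RIRIdentifier(), RIRIdentifier()]
clean = enhance(np.random.randn(1024) * 0.01, Diarizer(), rir_ids)
```

The second sketches a bidirectional LSTM classifier over per-frame feature functionals, in the spirit of the activity detector described above. PyTorch is an assumption of this sketch, as are the feature dimensionality, the number of activity classes, and the layer sizes; none of these values come from the paper.

```python
# Minimal BLSTM activity classifier sketch (PyTorch). Assumed, not from
# the paper: 39 feature functionals per frame, 4 activity classes, one
# BLSTM layer with 128 units per direction.
import torch
import torch.nn as nn

class BLSTMActivityDetector(nn.Module):
    def __init__(self, n_features=39, n_hidden=128, n_classes=4):
        super().__init__()
        # bidirectional=True supplies the past and future context that
        # makes the classification context-sensitive
        self.blstm = nn.LSTM(n_features, n_hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_classes)  # 2x: both directions

    def forward(self, x):
        h, _ = self.blstm(x)  # x: (batch, time, n_features)
        return self.out(h)    # per-frame class logits

# Usage on dummy data standing in for openSMILE functionals:
model = BLSTMActivityDetector()
feats = torch.randn(10, 100, 39)        # 10 sequences, 100 frames each
activity = model(feats).argmax(dim=-1)  # framewise activity labels
```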

Corresponding author

Correspondence to Emanuele Principi.

Cite this article

Principi, E., Rotili, R., Wöllmer, M. et al. Real-Time Activity Detection in a Multi-Talker Reverberated Environment. Cogn Comput 4, 386–397 (2012). https://doi.org/10.1007/s12559-012-9133-8
