Real-Time Activity Detection in a Multi-Talker Reverberated Environment

Abstract

This paper proposes a real-time person activity detection framework that operates in the presence of multiple sources in reverberant environments. The framework consists of two main parts: a speech enhancement front-end and an activity detector. The former automatically reduces the distortions introduced by room reverberation in the available distant speech signals, thereby achieving a significant improvement in speech quality for each speaker. The front-end comprises three cooperating blocks, each fulfilling a specific task: speaker diarization, room impulse response identification, and speech dereverberation. In particular, the speaker diarization algorithm is essential to pilot the operations performed in the other two stages according to the speakers' activity in the room. The activity estimation algorithm is based on bidirectional Long Short-Term Memory networks, which allow context-sensitive activity classification from audio feature functionals extracted with the real-time speech feature extraction toolkit openSMILE. Extensive computer simulations have been performed on a subset of the AMI meeting corpus for activity evaluation in meetings; the obtained results confirm the effectiveness of the approach.
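The control flow the abstract describes — diarization output piloting the room impulse response (RIR) identification and dereverberation stages on a per-speaker basis — can be sketched as follows. This is a minimal illustrative sketch only: every name, data structure, and stage below is an assumption made for exposition, not the authors' implementation.

```python
# Hypothetical sketch of the speech enhancement front-end's control flow:
# the diarizer segments the signal by active speaker, and each segment
# drives RIR identification and dereverberation for that speaker.
# All names and signatures are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float


def diarize(frames):
    """Toy diarizer: merge consecutive (time, speaker) frames into segments."""
    segments, current = [], None
    for t, spk in frames:
        if current is None or current.speaker != spk:
            if current is not None:
                segments.append(current)
            current = Segment(speaker=spk, start=t, end=t + 1.0)
        else:
            current.end = t + 1.0  # extend the running segment
    if current is not None:
        segments.append(current)
    return segments


def front_end(frames, identify_rir, dereverberate):
    """For each diarized segment, update that speaker's RIR estimate and
    dereverberate the corresponding portion of the signal."""
    enhanced = []
    for seg in diarize(frames):
        rir = identify_rir(seg.speaker)  # blind channel identification stage
        enhanced.append((seg.speaker, dereverberate(seg, rir)))
    return enhanced


# Toy usage with stub identification/dereverberation stages.
frames = [(0.0, "A"), (1.0, "A"), (2.0, "B"), (3.0, "A")]
out = front_end(frames,
                identify_rir=lambda spk: f"rir[{spk}]",
                dereverberate=lambda seg, rir: (seg.start, seg.end, rir))
```

The key design point mirrored here is the one the abstract emphasizes: the diarizer is the pilot, so channel identification and dereverberation are only ever run for the speaker currently active, rather than for all speakers at once.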



Notes

  1.

    http://home.tiscali.nl/ehabets/rir_generator.html.

  2.

    http://www.itl.nist.gov/iad/mig/tests/rt/2004-fall/.



Author information


Corresponding author

Correspondence to Emanuele Principi.


About this article

Cite this article

Principi, E., Rotili, R., Wöllmer, M. et al. Real-Time Activity Detection in a Multi-Talker Reverberated Environment. Cogn Comput 4, 386–397 (2012). https://doi.org/10.1007/s12559-012-9133-8


Keywords

  • Speech enhancement
  • Blind channel identification
  • Speech dereverberation
  • Speaker diarization
  • Real-time signal processing
  • Activity detection