Abstract
Automatic segmentation is an important technology for both automatic speech recognition and automatic speech understanding. In meetings, participants typically vocalize for only a fraction of the recorded time, yet standard vocal activity detection algorithms for close-talk microphones in meetings continue to treat participants independently. In this work we present a multispeaker segmentation system which models a particular aspect of human-human communication, that of vocal interaction, or the interdependence between participants' on-off speech patterns. We describe our vocal interaction model, its training, and its use during vocal activity decoding. Our experiments show that this approach almost completely eliminates the problem of crosstalk, and word error rates on our development set are lower than those obtained with human-generated reference segmentation. We also observe significant performance improvements on unseen data.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Laskowski, K., Schultz, T. (2008). Modeling Vocal Interaction for Segmentation in Meeting Recognition. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_23
Print ISBN: 978-3-540-78154-7
Online ISBN: 978-3-540-78155-4