Abstract
Automatic segmentation is an important technology for both automatic speech recognition and automatic speech understanding. In meetings, participants typically vocalize for only a fraction of the recorded time, yet standard vocal activity detection algorithms for close-talk microphones in meetings continue to treat participants independently. In this work we present a multispeaker segmentation system which models a particular aspect of human-human communication, that of vocal interaction, or the interdependence between participants' on-off speech patterns. We describe our vocal interaction model, its training, and its use during vocal activity decoding. Our experiments show that this approach almost completely eliminates the problem of crosstalk, and word error rates on our development set are lower than those obtained with human-generated reference segmentation. We also observe significant performance improvements on unseen data.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Laskowski, K., Schultz, T. (2008). Modeling Vocal Interaction for Segmentation in Meeting Recognition. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_23
Print ISBN: 978-3-540-78154-7
Online ISBN: 978-3-540-78155-4