
Modeling Vocal Interaction for Segmentation in Meeting Recognition

  • Conference paper
Machine Learning for Multimodal Interaction (MLMI 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4892)

Abstract

Automatic segmentation is an important technology for both automatic speech recognition and automatic speech understanding. In meetings, participants typically vocalize for only a fraction of the recorded time, yet standard vocal activity detection algorithms for close-talk microphones in meetings continue to treat participants independently. In this work we present a multispeaker segmentation system which models a particular aspect of human-human communication, that of vocal interaction, or the interdependence between participants’ on-off speech patterns. We describe our vocal interaction model, its training, and its use during vocal activity decoding. Our experiments show that this approach almost completely eliminates the problem of crosstalk, and word error rates on our development set are lower than those obtained with human-generated reference segmentation. We also observe significant performance improvements on unseen data.
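The core idea in the abstract — decoding all participants' on/off states jointly, with transitions that depend on what the other participants were doing — can be illustrated with a toy sketch. This is not the authors' implementation; the state space, transition probabilities (the 0.8 "stickiness" and 0.3 overlap penalty), and acoustic scores below are all hypothetical, chosen only to show how an interaction-aware joint decoder discourages crosstalk:

```python
import itertools
import math

K = 2  # number of participants (illustrative)
# Joint vocal-activity state space: every on/off combination across speakers.
STATES = list(itertools.product((0, 1), repeat=K))

def trans_logprob(prev, cur):
    """Toy vocal interaction model: each speaker's on/off transition
    depends on the *joint* previous state, so a speaker is less likely
    to vocalize while another speaker was already vocalizing."""
    lp = 0.0
    for k in range(K):
        p = 0.8 if cur[k] == prev[k] else 0.2          # sticky on/off states
        others_prev_on = sum(prev) - prev[k]           # other speakers active?
        if cur[k] == 1 and others_prev_on > 0:
            p *= 0.3                                   # discourage overlap
        lp += math.log(p)
    return lp

def viterbi(frame_loglik):
    """Joint Viterbi decode. frame_loglik[t] maps joint state -> acoustic
    log-likelihood at frame t; returns the best joint state sequence."""
    delta = {s: frame_loglik[0][s] for s in STATES}
    backptrs = []
    for t in range(1, len(frame_loglik)):
        new, bp = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: delta[p] + trans_logprob(p, s))
            new[s] = delta[best] + trans_logprob(best, s) + frame_loglik[t][s]
            bp[s] = best
        delta, backptrs = new, backptrs + [bp]
    path = [max(STATES, key=lambda s: delta[s])]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

Because transitions are scored over the joint state, acoustic evidence that weakly suggests a second speaker (e.g. crosstalk bleeding into a close-talk channel) is outweighed by the interaction model's overlap penalty, which is the intuition behind the crosstalk reduction reported above.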





Editor information

Andrei Popescu-Belis, Steve Renals, Hervé Bourlard


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Laskowski, K., Schultz, T. (2008). Modeling Vocal Interaction for Segmentation in Meeting Recognition. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_23


  • DOI: https://doi.org/10.1007/978-3-540-78155-4_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78154-7

  • Online ISBN: 978-3-540-78155-4

  • eBook Packages: Computer Science (R0)
