The IBM RT06s Evaluation System for Speech Activity Detection in CHIL Seminars

  • Conference paper
Machine Learning for Multimodal Interaction (MLMI 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4299)

Abstract

In this paper, we describe the IBM system submitted to the NIST Rich Transcription Spring 2006 (RT06s) evaluation campaign for automatic speech activity detection (SAD). The system was developed and evaluated on CHIL lecture meeting data recorded with far-field microphones, under two conditions: single distant microphone (SDM) and multiple distant microphone (MDM). It employs a three-class statistical classifier trained on features that augment traditional signal-energy features with ones based on acoustic-phonetic likelihoods. The latter are obtained using a large speaker-independent acoustic model trained on meeting data. In the detection stage, after feature extraction and classification, the resulting sequence of classified states is collapsed into segments belonging to only two classes, speech or silence, following two levels of smoothing. In the MDM condition, the process is repeated for every available microphone channel, and the outputs are combined using a simple majority voting rule that is biased towards speech. The system performed well in the RT06s evaluation campaign, achieving 8.62% and 5.01% “speaker diarization error” in the SDM and MDM conditions, respectively.
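The detection stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions, since the abstract does not specify them: class id 0 is taken to be silence and ids 1 and 2 are treated as speech; the two levels of smoothing are replaced by a single sliding-window majority filter; and the bias towards speech is modeled by resolving ties in favor of speech. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

SILENCE = 0  # assumed label; class ids 1 and 2 are treated as speech


def collapse_to_binary(states: Sequence[int]) -> List[int]:
    """Map three-class frame labels to 1 (speech) or 0 (silence)."""
    return [0 if s == SILENCE else 1 for s in states]


def smooth(frames: Sequence[int], half_window: int = 2) -> List[int]:
    """Sliding-window majority filter (a stand-in for the paper's smoothing).

    Ties inside a window resolve to speech.
    """
    n = len(frames)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = frames[lo:hi]
        out.append(1 if 2 * sum(window) >= len(window) else 0)
    return out


def vote(channels: Sequence[Sequence[int]]) -> List[int]:
    """Per-frame majority vote across channels; ties resolve to speech."""
    return [1 if 2 * sum(col) >= len(col) else 0 for col in zip(*channels)]


def to_segments(frames: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Run-length encode frames into (start, end_exclusive, label) tuples."""
    segments = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((start, i, frames[start]))
            start = i
    return segments


if __name__ == "__main__":
    # Two hypothetical microphone channels with three-class frame labels.
    ch_a = collapse_to_binary([0, 1, 2, 2, 0, 0])
    ch_b = collapse_to_binary([0, 0, 2, 1, 1, 0])
    combined = vote([smooth(ch_a, 1), smooth(ch_b, 1)])
    print(to_segments(combined))
```

With an even number of channels, the tie rule is what makes the vote lean towards speech: a frame marked as speech by exactly half of the channels is kept as speech. A stronger bias could be obtained by lowering the vote threshold below one half, which is a design choice the abstract leaves open.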

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Marcheret, E., Potamianos, G., Visweswariah, K., Huang, J. (2006). The IBM RT06s Evaluation System for Speech Activity Detection in CHIL Seminars. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_29

  • DOI: https://doi.org/10.1007/11965152_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69267-6

  • Online ISBN: 978-3-540-69268-3

  • eBook Packages: Computer Science, Computer Science (R0)
