The IBM RT06s Evaluation System for Speech Activity Detection in CHIL Seminars

  • Etienne Marcheret
  • Gerasimos Potamianos
  • Karthik Visweswariah
  • Jing Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4299)


In this paper, we describe the IBM system submitted to the NIST Rich Transcription Spring 2006 (RT06s) evaluation campaign for automatic speech activity detection (SAD). This SAD system has been developed and evaluated on CHIL lecture meeting data using far-field microphone sensors, namely the single distant microphone (SDM) and multiple distant microphone (MDM) conditions. The IBM SAD system employs a three-class statistical classifier, trained on features that augment traditional signal energy features with ones based on acoustic phonetic likelihoods. The latter are obtained using a large speaker-independent acoustic model trained on meeting data. In the detection stage, after feature extraction and classification, the resulting sequence of classified states is collapsed into segments belonging to only two classes, speech or silence, following two levels of smoothing. In the MDM condition, the process is repeated for every available microphone channel, and the outputs are combined by a simple majority voting rule, biased towards speech. The system performed well at the RT06s evaluation campaign, resulting in 8.62% and 5.01% "speaker diarization error" in the SDM and MDM conditions, respectively.
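The channel-fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name `fuse_frames` and the `speech_bias` parameter are assumptions introduced here, and the paper does not specify the precise bias value used in its speech-favoring majority vote.

```python
def fuse_frames(channel_decisions, speech_bias=0.5):
    """Majority-vote fusion of per-channel SAD decisions, biased toward speech.

    channel_decisions: one list of 0/1 frame labels per microphone channel
                       (1 = speech), all lists of equal length.
    speech_bias: fraction of speech votes needed to declare speech; a value
                 of 0.5 (or lower) makes ties favor the speech class.
                 (Illustrative parameter, not from the paper.)
    """
    n_channels = len(channel_decisions)
    n_frames = len(channel_decisions[0])
    fused = []
    for t in range(n_frames):
        speech_votes = sum(ch[t] for ch in channel_decisions)
        fused.append(1 if speech_votes / n_channels >= speech_bias else 0)
    return fused
```

For example, with four channels voting `[1, 0, 1, 0]` on a frame, the 2-of-4 tie is resolved as speech, reflecting the speech-biased rule.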


Keywords: Hidden Markov Model · Gaussian Mixture Model · Automatic Speech Recognition · Acoustic Model · Automatic Speech Recognition System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Etienne Marcheret (1)
  • Gerasimos Potamianos (1)
  • Karthik Visweswariah (1)
  • Jing Huang (1)

  1. IBM T.J. Watson Research Center, Yorktown Heights, USA
