The IBM RT06s Evaluation System for Speech Activity Detection in CHIL Seminars
In this paper, we describe the IBM system submitted to the NIST Rich Transcription Spring 2006 (RT06s) evaluation campaign for automatic speech activity detection (SAD). The system has been developed and evaluated on CHIL lecture meeting data recorded with far-field microphones, under two conditions: single distant microphone (SDM) and multiple distant microphone (MDM). It employs a three-class statistical classifier trained on features that augment traditional signal energy features with ones based on acoustic phonetic likelihoods. The latter are obtained using a large speaker-independent acoustic model trained on meeting data. In the detection stage, after feature extraction and classification, the resulting sequence of classified states is passed through two levels of smoothing and collapsed into segments belonging to only two classes, speech or silence. In the MDM condition, the process is repeated for every available microphone channel, and the outputs are combined using a simple majority voting rule that is biased towards speech. The system performed well in the RT06s evaluation campaign, achieving 8.62% and 5.01% “speaker diarization error” in the SDM and MDM conditions, respectively.
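The MDM combination step can be illustrated with a minimal sketch. The abstract does not specify how the vote is biased towards speech, so the sketch assumes the simplest reading: a frame is labeled speech when at least half of the channels vote speech, so that ties go to speech. The function name `combine_channels` and the 0/1 frame encoding are hypothetical and are not taken from the IBM system.

```python
from typing import List, Sequence


def combine_channels(decisions: Sequence[Sequence[int]]) -> List[int]:
    """Fuse per-channel frame decisions by majority vote, ties going to speech.

    decisions[c][t] is 1 if channel c labels frame t as speech, 0 if silence.
    ASSUMPTION: "biased towards speech" is modeled as a tie-break in favor of
    speech; the actual rule used in the evaluated system may differ.
    """
    if not decisions:
        return []
    n_channels = len(decisions)
    n_frames = len(decisions[0])
    if any(len(channel) != n_frames for channel in decisions):
        raise ValueError("all channels must cover the same number of frames")

    combined = []
    for t in range(n_frames):
        speech_votes = sum(channel[t] for channel in decisions)
        # Speech wins when it has at least half of the votes (ties -> speech).
        combined.append(1 if 2 * speech_votes >= n_channels else 0)
    return combined
```

With four channels, a frame on which two channels vote speech and two vote silence is labeled speech under this assumption; with an odd number of channels no ties occur and the rule reduces to a plain majority vote.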