The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings
We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting data for three conditions: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM). The system building process is similar to that of the IBM conversational telephone speech recognition system. However, the best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal tract length normalization. Instead, feature-space minimum phone error discriminative training yielded the best results. Due to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting corpora, with maximum a posteriori (MAP) adaptation applied twice on CHIL data during training: first at the initial speaker-independent model, and subsequently at the minimum phone error model. For language modeling, we utilized meeting transcripts, text from scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute reduction in word error rate (WER) compared to the model used in last year's CHIL evaluation. Furthermore, the developed STT system significantly outperformed our last year's results, reducing WER on close-talking microphone data from 36.9% to 25.4% on our development set. In the NIST RT06s evaluation campaign, both the MDM and SDM systems scored well; the IHM system, however, performed poorly due to unsuccessful cross-talk removal.
Keywords: Acoustic Model · Word Error Rate · Evaluation Campaign · Speaker Adaptation · Speaker Cluster
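The MAP adaptation step described in the abstract interpolates each Gaussian mean of the out-of-domain model toward the in-domain (CHIL) sufficient statistics, with a prior weight that controls how much the small in-domain set can move the parameters. A minimal sketch of the mean update (the function name, array shapes, and the value of the prior weight `tau` are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def map_adapt_means(prior_means, counts, obs_sums, tau=20.0):
    """MAP update of Gaussian means toward in-domain statistics.

    prior_means: (G, D) mean vectors from the out-of-domain model
    counts:      (G,)   soft occupation counts per Gaussian on in-domain data
    obs_sums:    (G, D) count-weighted sums of in-domain observations
    tau:         prior weight; larger tau keeps means closer to the prior
    """
    # Count-weighted interpolation: Gaussians with little in-domain
    # evidence stay near the prior; well-observed ones move toward
    # the in-domain maximum-likelihood estimate obs_sums / counts.
    return (tau * prior_means + obs_sums) / (tau + counts[:, None])
```

With zero in-domain counts the adapted means equal the prior means, and as the counts grow they approach the in-domain maximum-likelihood estimate, which is why the paper can apply this step safely on a small amount of CHIL data.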