The IBM RT07 Evaluation Systems for Speaker Diarization on Lecture Meetings
We present the IBM systems for the Rich Transcription 2007 (RT07) speaker diarization evaluation task on lecture meeting data. We first give an overview of our baseline system, which was developed last year as part of our speech-to-text system for the RT06s evaluation. We then present a number of simple schemes considered this year in our effort to improve speaker diarization performance, namely: (i) a better speech activity detection (SAD) system, a necessary pre-processing step to speaker diarization; (ii) use of word information from a speaker-independent speech recognizer; (iii) modifications to the speaker cluster merging criteria and the underlying segment model; and (iv) use of speaker models based on Gaussian mixture models, and their iterative refinement by frame-level re-labeling and smoothing of decision likelihoods. We report development experiments on the RT06s evaluation test set that demonstrate that these methods are effective, resulting in dramatic performance improvements over our baseline diarization system. For example, changes in the cluster segment models and cluster merging methodology result in a 24.2% relative reduction in speaker error rate, whereas use of the iterative model refinement process and of word-level alignment produces relative reductions in speaker error rate of 36.0% and 9.2%, respectively. The importance of the SAD subsystem is also shown, with a reduction in SAD error from 12.3% to 4.3% translating to a 20.3% relative reduction in speaker error rate. Unfortunately, the developed diarization system depends heavily on appropriate tuning of the thresholds in the speaker cluster merging process. Possibly as a result of over-tuning such thresholds, performance on the RT07 evaluation test set degrades significantly compared to that observed on development data. Nevertheless, our experiments show that the introduced techniques of cluster merging, speaker model refinement and alignment remain valuable in the RT07 evaluation.
Keywords: False Alarm, Baseline System, Speaker Model, Speaker Diarization, Speaker Cluster
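The abstract mixes absolute error rates (SAD error of 12.3% and 4.3%) with relative reductions (24.2%, 36.0%, 9.2%, 20.3%). The short sketch below shows how a relative reduction is conventionally computed from two absolute error rates. The function name `relative_reduction` is an illustrative assumption, not something taken from the paper, and the only figures used are the two SAD error rates quoted above.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction, in percent, from baseline to improved.

    Both arguments are absolute error rates expressed in percent.
    """
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline - improved) / baseline


# SAD error rates quoted in the abstract: 12.3% before, 4.3% after.
sad_rel = relative_reduction(12.3, 4.3)
print(f"SAD error relative reduction: {sad_rel:.1f}%")  # prints 65.0%
```

Under this convention, the drop in SAD error from 12.3% to 4.3% is 8.0 points absolute, or roughly 65% relative. The 20.3% figure in the abstract is a separate quantity: the relative reduction in speaker error rate that followed from the SAD improvement.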