The IBM RT07 Evaluation Systems for Speaker Diarization on Lecture Meetings

  • Jing Huang
  • Etienne Marcheret
  • Karthik Visweswariah
  • Gerasimos Potamianos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4625)

Abstract

We present the IBM systems for the Rich Transcription 2007 (RT07) speaker diarization evaluation task on lecture meeting data. We first give an overview of our baseline system, which was developed last year as part of our speech-to-text system for the RT06s evaluation. We then present a number of simple schemes considered this year in our effort to improve speaker diarization performance, namely: (i) a better speech activity detection (SAD) system, a necessary pre-processing step to speaker diarization; (ii) use of word information from a speaker-independent speech recognizer; (iii) modifications to the speaker cluster merging criteria and the underlying segment model; and (iv) use of speaker models based on Gaussian mixture models, and their iterative refinement by frame-level re-labeling and smoothing of decision likelihoods. We report development experiments on the RT06s evaluation test set that demonstrate that these methods are effective, resulting in dramatic performance improvements over our baseline diarization system. For example, changes in the cluster segment models and cluster merging methodology result in a 24.2% relative reduction in speaker error rate, whereas use of the iterative model refinement process and word-level alignment produce relative speaker error reductions of 36.0% and 9.2%, respectively. The importance of the SAD subsystem is also shown, with a SAD error reduction from 12.3% to 4.3% translating to a 20.3% relative reduction in speaker error rate. Unfortunately, however, the developed diarization system depends heavily on appropriately tuning the thresholds in the speaker cluster merging process. Possibly as a result of over-tuning such thresholds, performance on the RT07 evaluation test set degrades significantly compared to that observed on development data. Nevertheless, our experiments show that the introduced techniques of cluster merging, speaker model refinement, and alignment remain valuable in the RT07 evaluation.
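
The abstract quotes several relative error reductions. As a small illustration of that arithmetic, the sketch below assumes the conventional definition, (before - after) / before, and applies it to the only pair of absolute figures the abstract gives, the SAD error dropping from 12.3% to 4.3%. The function name `relative_reduction` is hypothetical and is not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent.

    Assumes the conventional definition (before - after) / before;
    the paper itself does not spell out the formula in the abstract.
    """
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# SAD error figures quoted in the abstract: 12.3% before, 4.3% after.
sad_drop = relative_reduction(12.3, 4.3)
print(f"SAD error 12.3% -> 4.3%: {sad_drop:.1f}% relative reduction")
```

Under this definition the SAD improvement is itself about a 65% relative reduction, which the abstract reports as translating into a 20.3% relative reduction in speaker error rate.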

Keywords

False Alarm; Baseline System; Speaker Model; Speaker Diarization; Speaker Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jing Huang (1)
  • Etienne Marcheret (1)
  • Karthik Visweswariah (1)
  • Gerasimos Potamianos (1)
  1. IBM Thomas J. Watson Research Center, U.S.A.