Multi-stage Speaker Diarization for Conference and Lecture Meetings

  • X. Zhu
  • C. Barras
  • L. Lamel
  • J-L. Gauvain
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4625)


This paper presents the LIMSI RT-07S speaker diarization system for conference and lecture meetings. The system builds upon the RT-06S diarization system designed for lecture data. The baseline system combines agglomerative clustering based on the Bayesian information criterion (BIC) with a second clustering stage using state-of-the-art speaker identification (SID) techniques. Since the baseline system gives a high speech activity detection (SAD) error of around 10% on lecture data, several acoustic representations with various normalization techniques are investigated within the framework of a log-likelihood ratio (LLR) based speech activity detector. Universal background models (UBMs) trained on the different types of acoustic features are also examined in the SID clustering stage. All SAD acoustic models and UBMs are trained using forced-alignment segmentations of the conference data. The diarization system integrating the new SAD models and UBM gives comparable results on both the RT-07S conference and lecture evaluation data for the multiple distant microphone (MDM) condition.
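
As an illustration of the BIC-based agglomerative clustering mentioned above, the following is a minimal sketch of a delta-BIC merge test between two clusters of feature frames. It is not the system's implementation: it assumes a single diagonal-covariance Gaussian per cluster (a simplification chosen to keep the example dependency-free), the names `diag_log_det`, `delta_bic`, `make_cluster` and `penalty_weight` are hypothetical, and the data are synthetic.

```python
import math
import random


def diag_log_det(frames):
    """Log-determinant of the maximum-likelihood diagonal covariance of `frames`."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for d in range(dim):
        col = [f[d] for f in frames]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        total += math.log(max(var, 1e-10))  # floor the variance to avoid log(0)
    return total


def delta_bic(a, b, penalty_weight=1.0):
    """Delta-BIC for merging clusters a and b (lists of feature vectors).

    Negative values favour merging (one Gaussian explains both clusters well
    enough); positive values favour keeping the clusters separate.
    Assumption: diagonal Gaussians, so each model has 2 * dim free parameters.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    dim = len(a[0])
    # Log-likelihood lost by modelling both clusters with one pooled Gaussian.
    fit = 0.5 * (n * diag_log_det(a + b)
                 - n_a * diag_log_det(a)
                 - n_b * diag_log_det(b))
    # Complexity penalty for the extra parameters of the two-model hypothesis.
    penalty = penalty_weight * 0.5 * (2 * dim) * math.log(n)
    return fit - penalty


def make_cluster(rng, mean, n, dim=4):
    """Synthetic frames drawn from an isotropic unit-variance Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
same_a = make_cluster(rng, 0.0, 300)   # two clusters from the same "speaker"
same_b = make_cluster(rng, 0.0, 300)
other = make_cluster(rng, 5.0, 300)    # a clearly different "speaker"

merge_same = delta_bic(same_a, same_b) < 0    # expected to favour merging
merge_other = delta_bic(same_a, other) < 0    # expected to reject merging
```

In an agglomerative loop, the pair of clusters with the lowest delta-BIC would be merged repeatedly until no pair has a negative value; `penalty_weight` plays the role of the tunable penalty factor that controls how readily clusters merge.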


Bayesian Information Criterion; Broadcast News; Universal Background Model; Lecture Data; Speaker Diarization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • X. Zhu (1, 2)
  • C. Barras (1, 2)
  • L. Lamel (1)
  • J-L. Gauvain (1)
  1. Spoken Language Processing Group, LIMSI-CNRS, Orsay cedex, France
  2. Univ Paris-Sud, Orsay, France
