Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech

  • Zbyněk Zajíc
  • Jan Zelinka
  • Luděk Müller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


In this paper, we investigate a speaker representation for a diarization system that clusters short segments of telephone conversations, grouping the segments produced by the same speaker. The proposed approach uses a neural-network-based descriptor in place of the i-vector descriptor commonly used in state-of-the-art diarization systems. The two techniques were compared on the English part of the CallHome corpus. The final results indicate that the i-vector approach is superior, although the proposed descriptor brings additional information. Thus, the combined descriptor represents the speaker in a segment for diarization purposes with a lower diarization error (an almost 20% relative improvement over using the i-vector alone).
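The idea of a combined descriptor and the notion of relative improvement can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say how the two descriptors are combined or how segments are clustered, so plain concatenation after length normalization and a two-cluster spherical k-means stand in for those steps, and all function names are hypothetical.

```python
import numpy as np


def length_normalize(x):
    """Scale each row (one segment descriptor) to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def combine_descriptors(ivectors, nn_descriptors):
    """Concatenate the two per-segment descriptors.

    Assumption: plain concatenation after length normalization; the
    abstract does not say how the descriptors are actually combined.
    """
    return np.hstack([length_normalize(ivectors),
                      length_normalize(nn_descriptors)])


def cluster_two_speakers(descriptors, n_iter=20):
    """Assign each segment to one of two speakers by cosine similarity.

    Assumption: a simple spherical k-means with k=2, used here only as a
    stand-in for the clustering step of a diarization system.
    """
    x = length_normalize(descriptors)
    # Deterministic start: segment 0 and the segment least similar to it.
    far = int(np.argmin(x @ x[0]))
    centroids = np.stack([x[0], x[far]])
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Each segment joins the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for k in range(2):
            members = x[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = length_normalize(
                    members.mean(axis=0, keepdims=True))[0]
    return labels


def relative_improvement(baseline_der, new_der):
    """Relative reduction of the diarization error rate (DER)."""
    return (baseline_der - new_der) / baseline_der
```

As a usage example with made-up numbers (not results from the paper), a baseline DER of 10.0% reduced to 8.0% gives `relative_improvement(10.0, 8.0)` equal to 0.2, that is, a 20% relative improvement.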


Neural network, Speaker diarization, i-Vector



This research was supported by the Ministry of Culture Czech Republic, project No. DG16P02B048.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Faculty of Applied Sciences, NTIS - New Technologies for the Information Society, University of West Bohemia, Plzeň, Czech Republic
  2. Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
