LSTM Neural Network for Speaker Change Detection in Telephone Conversations

  • Marek Hrúz
  • Miroslav Hlaváč
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)


In this paper, we analyze an approach to speaker change detection in telephone conversations based on recurrent Long Short-Term Memory (LSTM) neural networks, and we compare it to speaker change detection via Convolutional Neural Networks. We show that by fine-tuning the architecture and using suitable input data in the form of spectrograms, we obtain results that are 2% better in relative terms. We found that a smaller architecture performs better on unseen data. We also found that stateful LSTM layers, which try to remember whole conversations, perform much worse than recurrent networks that memorize only short sequences of speech.
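The contrast between remembering whole conversations and memorizing only short sequences can be illustrated by how the input is prepared. The following is a minimal sketch, not the authors' implementation: it cuts a sequence of spectrogram frames into short, fixed-length windows that a recurrent network could process independently, with its state reset for each window. The function name `split_into_windows` and the values of `window` and `hop` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_windows(frames: Sequence[T], window: int, hop: int) -> List[List[T]]:
    """Cut a frame sequence into fixed-length windows.

    `window` is the number of frames per short sequence and `hop` is the
    shift between consecutive windows (both hypothetical parameters).
    An incomplete tail shorter than `window` is dropped.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    last_start = len(frames) - window
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, hop)]


# Toy "spectrogram": 10 frames, each integer standing in for one
# vector of frequency bins (assumption made for illustration only).
toy_frames = list(range(10))

# Short-sequence setting: each window is an independent training example.
short_sequences = split_into_windows(toy_frames, window=4, hop=2)

# Whole-conversation setting: a single sequence covering every frame,
# which is what a stateful network would have to remember.
whole_conversation = [toy_frames]
```

In the short-sequence setting the network only ever sees `window` frames of context at a time, whereas in the whole-conversation setting the recurrent state would have to carry information across all frames.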


Speaker change, Diarization, Stateful LSTM



This paper was supported by the project no. P103/12/G084 of the Grant Agency of the Czech Republic. The work has also been supported by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Faculty of Applied Sciences, NTIS, UWB, Pilsen, Czech Republic
  2. Faculty of Applied Sciences, Department of Cybernetics, UWB, Pilsen, Czech Republic
  3. ITMO University, St. Petersburg, Russia
