LSTM Neural Network for Speaker Change Detection in Telephone Conversations
In this paper, we analyze an approach to speaker change detection in telephone conversations based on recurrent Long Short-Term Memory (LSTM) neural networks, and we compare it to speaker change detection via Convolutional Neural Networks. We show that by fine-tuning the architecture and using suitable input data in the form of spectrograms, we obtain a relative improvement of 2% in the results. We found that a smaller architecture performs better on unseen data. We also found that stateful LSTM layers, which try to remember whole conversations, perform much worse than recurrent networks that memorize only short sequences of speech.
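The distinction between stateful and stateless recurrent processing mentioned above can be illustrated with a minimal sketch. It is not the architecture evaluated in the paper: the single-unit cell `TinyLSTMCell`, the helper `run_windows`, the random untrained weights, the window length, and the 12-frame, 4-bin "spectrogram" are all assumptions made for illustration. The sketch only shows that a stateful network carries its hidden and cell state from one window of frames to the next, whereas a stateless network starts every short window from a zero state.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Single-unit LSTM cell over feature vectors (illustrative, untrained)."""

    def __init__(self, n_features, seed=0):
        rng = random.Random(seed)
        # One weight vector per gate: input weights, one recurrent weight, one bias.
        # Gates: input (i), forget (f), output (o), candidate (g).
        self.w = {gate: [rng.uniform(-0.5, 0.5) for _ in range(n_features + 2)]
                  for gate in "ifog"}

    def _pre(self, gate, x, h):
        w = self.w[gate]
        # zip stops after the input weights; w[-2] is recurrent, w[-1] is the bias.
        return sum(wi * xi for wi, xi in zip(w, x)) + w[-2] * h + w[-1]

    def step(self, x, state):
        h, c = state
        i = sigmoid(self._pre("i", x, h))
        f = sigmoid(self._pre("f", x, h))
        o = sigmoid(self._pre("o", x, h))
        g = math.tanh(self._pre("g", x, h))
        c = f * c + i * g          # update the cell state
        h = o * math.tanh(c)       # expose the gated hidden output
        return h, c


def run_windows(cell, frames, window, stateful):
    """Return the last hidden output of each non-overlapping window of frames."""
    state = (0.0, 0.0)
    outputs = []
    for start in range(0, len(frames) - window + 1, window):
        if not stateful:
            state = (0.0, 0.0)     # stateless: forget everything before this window
        for x in frames[start:start + window]:
            state = cell.step(x, state)
        outputs.append(state[0])
    return outputs


# Hypothetical "spectrogram": 12 frames with 4 frequency bins each.
rng = random.Random(1)
frames = [[rng.random() for _ in range(4)] for _ in range(12)]

cell = TinyLSTMCell(n_features=4)
stateless = run_windows(cell, frames, window=4, stateful=False)
stateful = run_windows(cell, frames, window=4, stateful=True)
```

In this sketch the first window yields the same output in both modes, because both start from a zero state. From the second window onward the stateful run depends on everything seen earlier in the conversation, while the stateless run depends only on the current short sequence.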
Keywords: Speaker change, Diarization, Stateful LSTM
This paper was supported by project No. P103/12/G084 of the Grant Agency of the Czech Republic. The work has also been supported by a grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.