Abstract
In this paper, we propose a speaker change detection system based on lexical information from the transcribed speech. For this purpose, we applied a recurrent neural network to decide if there is an end of an utterance at the end of a spoken word. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system to refine the segmentation step to achieve better accuracy of the whole diarization system. We compare the proposed speaker change detection system based on transcription (text) with our previous system based on information from spectrogram (audio) and combine these two modalities to improve the results of diarization. We cut the conversation into segments according to the detected changes and represent them by an i-vector. We conducted experiments on the English part of the CallHome corpus. The results indicate improvement in speaker change detection (by 0.5% relatively) and also in speaker diarization (by 1% relatively) when both modalities are used.
This research was supported by the Ministry of Culture Czech Republic, project No. DG16P02B009.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
Available on https://www.tensorflow.org.
References
Rouvier, M., Dupuy, G., Gay, P., Khoury, E., Merlin, T., Meignier, S.: An open-source state-of-the-art toolbox for broadcast news diarization. In: Interspeech, Lyon, pp. 1477–1481 (2013)
Sell, G., Garcia-Romero, D.: Speaker diarization with PLDA I-vector scoring and unsupervised calibration. In: IEEE Spoken Language Technology Workshop, South Lake Tahoe, pp. 413–417 (2014)
Hrúz, M., Zajíc, Z.: Convolutional neural network for speaker change detection in telephone speaker diarization system. In: ICASSP, New Orleans, pp. 4945–4949 (2017)
Zajíc, Z., Hrúz, M., Müller, L.: Speaker diarization using convolutional neural network for statistics accumulation refinement. In: Interpeech, Stockholm, pp. 3562–3566 (2017)
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)
Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D., Glass, J.: Exploiting intra-conversation variability for speaker diarization. In: Interspeech, Florence, pp. 945–948 (2011)
Valente, F., Vijayasenan, D., Motlicek, P.: Speaker diarization of meetings based on speaker role n-gram models. In: ICASSP, pp. 4416–4419. IEEE, Prague (2011)
Tranter, S.E., Yu, K., Evermann, G., Woodland, P.C.: Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech. In: ICASSP, pp. 753–756. IEEE, Montreal (2004)
Kunešová, M., Zajíc, Z., Radová, V.: Experiments with segmentation in an online speaker diarization system. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 429–437. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64206-2_48
Hrúz, M., Kunešová, M.: Convolutional neural network in the task of speaker change detection. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 191–198. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_22
Soutner, D., Müller, L.: Application of LSTM neural networks in language modelling. In: Habernal, I., Matoušek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 105–112. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40585-3_14
Hochreiter, S., Urgen Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Zajíc, Z., Machlica, L., Müller, L.: Robust adaptation techniques dealing with small amount of data. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 480–487. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_58
Kenny, P., Dumouchel, P.: Experiments in speaker verification using factor analysis likelihood ratios. In: Odyssey, Toledo, pp. 219–226 (2004)
Canavan, A., Graff, D., Zipperlen, G.: CALLHOME American English speech, LDC97S42. In: LDC Catalog. Linguistic Data Consortium, Philadelphia (1997)
Godfrey, J.J., Holliman, E.: Switchboard-1 release 2. In: LDC Catalog. Linguistics Data Consortium, Philadelphia (1997)
Daniel, P., et al.: Modelos animales de dolor neuropático. In: Workshop on Automatic Speech Recognition and Understanding, IEEE Catalog No.: CFP11SRW-USB (2011)
Harris, M., Aubert, X., Haeb-Umbach, R., Beyerlein, P.: A study of broadcast news audio stream segmentation and segment clustering. In: EUROSPEECH, Budapest, pp. 1027–1030 (1999)
Bredin, H.: TristouNet: triplet loss for speaker turn embedding. In: ICASSP, New Orleans, pp. 5430–5434 (2017)
Bredin, H.: pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In: Interspeech, Stockholm, pp. 3587–3591 (2017)
Sell, G., Garcia-Romero, D., Mccree, A.: Speaker diarization with I-vectors from DNN senone posteriors. In: Interspeech, Dresden, pp. 3096–3099 (2015)
Fiscus, J.G., Radde, N., Garofolo, J.S., Le, A., Ajot, J., Laprun, C.: The rich transcription 2006 spring meeting recognition evaluation. Mach. Learn. Multimodal Interact. 4299, 309–322 (2006)
India, M., Fonollosa, J., Hernando, J.: LSTM neural network-based speaker segmentation using acoustic and language modelling. In: Interspeech, Stockholm, pp. 2834–2838 (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zajíc, Z., Soutner, D., Hrúz, M., Müller, L., Radová, V. (2018). Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science(), vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_37
Download citation
DOI: https://doi.org/10.1007/978-3-030-00794-2_37
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer ScienceComputer Science (R0)