Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System

Zajíc, Zbyněk; Soutner, Daniel; Hrúz, Marek; Müller, Luděk; Radová, Vlasta

doi:10.1007/978-3-030-00794-2_37

Conference paper
First Online: 08 September 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

In this paper, we propose a speaker change detection system based on lexical information from the transcribed speech. For this purpose, we apply a recurrent neural network to decide whether the end of a spoken word is also the end of an utterance. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system, refining the segmentation step to improve the accuracy of the whole diarization system. We compare the proposed speaker change detection system based on the transcription (text) with our previous system based on information from the spectrogram (audio), and we combine these two modalities to improve the diarization results. We cut the conversation into segments according to the detected changes and represent each segment by an i-vector. We conducted experiments on the English part of the CallHome corpus. The results indicate that using both modalities improves speaker change detection (by 0.5% relative) and speaker diarization (by 1% relative).
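The segmentation step described in the abstract, cutting the conversation wherever a change is detected at the end of a word, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Word` record, the `p_change` score (standing in for the output of any change detector, text-based or audio-based), the `cut_segments` helper, and the 0.5 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float     # word start time in seconds
    end: float       # word end time in seconds
    p_change: float  # hypothetical score: probability that a speaker change follows this word


def cut_segments(words: List[Word], threshold: float = 0.5) -> List[Tuple[float, float, str]]:
    """Cut a transcript into (start, end, text) segments.

    A segment is closed after every word whose change score reaches the
    threshold (an assumed decision rule, not taken from the paper).
    """
    segments: List[Tuple[float, float, str]] = []
    current: List[Word] = []
    for word in words:
        current.append(word)
        if word.p_change >= threshold:
            segments.append((current[0].start, current[-1].end,
                             " ".join(w.text for w in current)))
            current = []
    if current:  # flush the trailing words as the last segment
        segments.append((current[0].start, current[-1].end,
                         " ".join(w.text for w in current)))
    return segments


example = [
    Word("hello", 0.0, 0.4, 0.1),
    Word("there", 0.4, 0.9, 0.8),
    Word("hi", 1.0, 1.2, 0.2),
    Word("again", 1.2, 1.6, 0.3),
]
segments = cut_segments(example)
```

In a diarization pipeline of the kind the abstract describes, each resulting segment would then be mapped to a fixed-length representation such as an i-vector before clustering; that step is outside the scope of this sketch.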

This research was supported by the Ministry of Culture Czech Republic, project No. DG16P02B009.

Notes

1. Available at https://www.tensorflow.org.

Author information

Authors and Affiliations

Faculty of Applied Sciences, NTIS - New Technologies for the Information Society and Department of Cybernetics, University of West Bohemia, Univerzitní 8, 306 14, Plzeň, Czech Republic
Zbyněk Zajíc, Daniel Soutner, Marek Hrúz, Luděk Müller & Vlasta Radová

Corresponding author

Correspondence to Zbyněk Zajíc.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

Zajíc, Z., Soutner, D., Hrúz, M., Müller, L., Radová, V. (2018). Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_37

DOI: https://doi.org/10.1007/978-3-030-00794-2_37
Published: 08 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science (R0)
