Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

In this paper, we propose a speaker change detection system based on lexical information from transcribed speech. For this purpose, we applied a recurrent neural network to decide whether an utterance ends at the end of each spoken word. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system, refining the segmentation step to improve the accuracy of the whole diarization system. We compare the proposed transcription-based (text) speaker change detection system with our previous system based on spectrogram information (audio), and we combine the two modalities to improve the diarization results. We cut the conversation into segments according to the detected changes and represent each segment by an i-vector. We conducted experiments on the English part of the CallHome corpus. The results indicate a relative improvement of 0.5% in speaker change detection and of 1% in speaker diarization when both modalities are used.
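
The segmentation step described above can be illustrated with a minimal sketch. It assumes that each modality already yields one speaker change probability per word end, and it fuses them with a weighted average. The function names, the fusion rule, and the threshold are hypothetical choices made for illustration only; the abstract does not specify how the two modalities are combined. The i-vector extraction for each resulting segment is not shown.

```python
from typing import List, Tuple


def fuse_scores(text_probs: List[float],
                audio_probs: List[float],
                text_weight: float = 0.5) -> List[float]:
    """Weighted average of per-word change probabilities.

    Assumption: this fusion rule is illustrative, not the paper's method.
    """
    if len(text_probs) != len(audio_probs):
        raise ValueError("both modalities need one score per word")
    return [text_weight * t + (1.0 - text_weight) * a
            for t, a in zip(text_probs, audio_probs)]


def cut_segments(word_ends: List[float],
                 change_probs: List[float],
                 threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Split the conversation at word ends whose score reaches the threshold."""
    segments: List[Tuple[float, float]] = []
    start = 0.0
    for end, prob in zip(word_ends, change_probs):
        if prob >= threshold:
            segments.append((start, end))
            start = end
    # Close the final segment at the last word if no cut was made there.
    if word_ends and start < word_ends[-1]:
        segments.append((start, word_ends[-1]))
    return segments
```

Each returned pair is a (start, end) time in seconds; in a pipeline like the one described, every such segment would then be represented by an i-vector before clustering.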

This research was supported by the Ministry of Culture of the Czech Republic, project No. DG16P02B009.

Notes

  1. Available on https://www.tensorflow.org.

Author information

Corresponding author

Correspondence to Zbyněk Zajíc.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zajíc, Z., Soutner, D., Hrúz, M., Müller, L., Radová, V. (2018). Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_37

  • DOI: https://doi.org/10.1007/978-3-030-00794-2_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00793-5

  • Online ISBN: 978-3-030-00794-2

  • eBook Packages: Computer Science, Computer Science (R0)
