Abstract
The performance of modern automatic meeting transcription suffers from speaker diarization errors in segments of overlapping speech. This problem could be mitigated if each segment of a recording were labeled with the number of active speakers. However, overlapped speech with more than two simultaneous speakers remains a weak point for speaker number estimation. The problem becomes even harder when the speaker number estimation system is applied to far-field recordings of multiple speakers acquired by a distant microphone. In this paper we propose to improve speaker number estimation by combining it with an overlapped speech detector. We apply different configurations of speaker number estimation and overlapped speech detection models, trained and evaluated on the AMI and LibriSpeech datasets with several types of signal representation. Experimental evaluation shows that fusing the models improves speaker number estimation performance by up to 10% in terms of F1-score compared with the baseline speaker number estimation model.
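The abstract does not state how the two models are fused. As an illustration only, the following Python sketch assumes a segment-level speaker count classifier that outputs posteriors over the classes 0, 1, 2 and 3 (the last standing for "three or more"), and an overlapped speech detector that outputs an overlap probability per segment. It reweights the count posteriors with the detector score (a simple product-of-experts rule) and compares macro-averaged F1-scores for the baseline and fused decisions. The function names, class layout and toy numbers are hypothetical and are not the authors' method or data.

```python
from __future__ import annotations

from typing import Sequence

# Hypothetical class layout: index c = number of active speakers,
# with the last index standing for "3 or more".
LABELS = [0, 1, 2, 3]


def argmax_counts(count_probs: Sequence[Sequence[float]]) -> list[int]:
    """Baseline: pick the most probable speaker count for each segment."""
    return [max(range(len(p)), key=p.__getitem__) for p in count_probs]


def fuse_counts(count_probs: Sequence[Sequence[float]],
                overlap_probs: Sequence[float]) -> list[int]:
    """Reweight count posteriors with an overlapped speech detector score.

    Classes with c >= 2 (overlap) are scaled by p_ov, classes with c <= 1
    by 1 - p_ov. Renormalizing is skipped because it does not change argmax.
    """
    fused = []
    for probs, p_ov in zip(count_probs, overlap_probs):
        weighted = [p * (p_ov if c >= 2 else 1.0 - p_ov)
                    for c, p in enumerate(probs)]
        fused.append(max(range(len(weighted)), key=weighted.__getitem__))
    return fused


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for k in labels:
        tp = sum(t == k and p == k for t, p in pairs)
        fp = sum(t != k and p == k for t, p in pairs)
        fn = sum(t == k and p != k for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class absent from both references and predictions scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 4 segments with count posteriors and overlap probabilities.
count_probs = [[0.1, 0.5, 0.4, 0.0],
               [0.0, 0.3, 0.6, 0.1],
               [0.7, 0.2, 0.1, 0.0],
               [0.0, 0.1, 0.3, 0.6]]
overlap_probs = [0.9, 0.2, 0.05, 0.95]
y_true = [2, 1, 0, 3]

baseline = argmax_counts(count_probs)            # [1, 2, 0, 3]
fused = fuse_counts(count_probs, overlap_probs)  # [2, 1, 0, 3]
print(macro_f1(y_true, baseline, LABELS))  # 0.5
print(macro_f1(y_true, fused, LABELS))     # 1.0
```

The toy numbers are constructed only to show the mechanism and say nothing about the improvement reported in the paper. A hard rule, such as restricting the decision to counts of two or more whenever the detector fires, would be an alternative design with the same structure.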
Acknowledgments
This research was financially supported by ITMO University.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Timofeeva, E., Evseeva, E., Zaluskaia, V., Kapranova, V., Astapov, S., Kabarov, V. (2021). Improvement of Speaker Number Estimation by Applying an Overlapped Speech Detector. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_62
DOI: https://doi.org/10.1007/978-3-030-87802-3_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science, Computer Science (R0)