Abstract
Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can later be used for speaker identification in the audio domain (answering the question "Who was speaking and when?") and/or for face recognition ("Who was seen and when?") in videos that contain speaking persons. The proposed system is based on an audio-video diarization system that compensates for the weaknesses of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of the individual audio and video diarization systems reduces the diarization error rate (DER).
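The DER reported in the experiments follows the standard NIST-style definition: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total scored speech time. As an illustration only (this is not the authors' code), a minimal frame-level sketch might look as follows; it assumes reference and hypothesis labels are given per frame, with `None` marking non-speech, and that hypothesis speaker labels have already been optimally mapped to reference speakers:

```python
def der(reference, hypothesis):
    """Frame-level diarization error rate:
    (missed speech + false alarms + speaker confusion) / scored speech frames.
    Assumes hypothesis labels are already mapped to reference speakers."""
    assert len(reference) == len(hypothesis)
    missed = false_alarm = confusion = scored = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:          # reference says someone is speaking
            scored += 1
            if hyp is None:
                missed += 1          # system missed the speech
            elif hyp != ref:
                confusion += 1       # wrong speaker attributed
        elif hyp is not None:
            false_alarm += 1         # system hallucinated speech
    return (missed + false_alarm + confusion) / scored

# Hypothetical 10-frame example: one confusion, one miss, one false alarm
# over 8 scored speech frames.
ref = ['A', 'A', 'A', None, 'B', 'B', 'B', 'B', None, 'A']
hyp = ['A', 'A', 'B', None, 'B', 'B', 'B', None, 'A', 'A']
print(der(ref, hyp))  # → 0.375
```

Real DER scoring works on time segments rather than fixed frames and applies a forgiveness collar around reference boundaries, which this sketch omits for brevity.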
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Campr, P., Kunešová, M., Vaněk, J., Čech, J., Psutka, J. (2014). Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science(), vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_56
DOI: https://doi.org/10.1007/978-3-319-10816-2_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2