
Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation

  • Pavel Campr
  • Marie Kunešová
  • Jan Vaněk
  • Jan Čech
  • Josef Psutka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)

Abstract

Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can later be used for speaker identification in the audio domain (answering the question "Who was speaking and when?") and/or for face recognition ("Who was seen and when?") in videos that contain speaking persons. The proposed system is based on an audio-video diarization system that aims to overcome the disadvantages of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of the individual audio and video diarization systems yields an improvement in the diarization error rate (DER).
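The DER mentioned above is the standard NIST diarization measure: the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. The following minimal Python sketch illustrates how these components combine; it is not code from the paper, and the function name and example durations are our own illustrative assumptions.

    # Minimal sketch (not from the paper): the standard NIST diarization error
    # rate (DER), computed from error durations accumulated over a recording.
    # The function name and the example numbers below are illustrative.
    def diarization_error_rate(missed_speech: float,
                               false_alarm: float,
                               speaker_confusion: float,
                               total_speech: float) -> float:
        """All arguments are durations in seconds.

        missed_speech     -- reference speech not assigned to any speaker
        false_alarm       -- hypothesized speech where the reference has none
        speaker_confusion -- speech assigned to the wrong speaker
        total_speech      -- total duration of reference speech
        """
        return (missed_speech + false_alarm + speaker_confusion) / total_speech

    # Example: 12 s missed, 8 s false alarms, 30 s confusion in 1000 s of speech
    print(diarization_error_rate(12.0, 8.0, 30.0, 1000.0))  # 0.05, i.e. 5 % DER

A lower DER is better; the paper reports that the audio-video combination reduces this value relative to the individual modalities.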

Keywords

audio-video speaker diarization · audio speaker recognition · face recognition


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Pavel Campr (1)
  • Marie Kunešová (1)
  • Jan Vaněk (1)
  • Jan Čech (2)
  • Josef Psutka (1)
  1. Faculty of Applied Sciences, Dept. of Cybernetics, University of West Bohemia, Plzeň, Czech Republic
  2. Faculty of Electrical Engineering, Department of Cybernetics, Center for Machine Perception, Czech Technical University in Prague, Praha 6, Czech Republic
