Multimedia Tools and Applications, Volume 75, Issue 15, pp 8999–9023

Naming multi-modal clusters to identify persons in TV broadcast

  • Johann Poignant
  • Guillaume Fortier
  • Laurent Besacier
  • Georges Quénot

Abstract

Person identification in TV broadcasts is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix, with the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2 % for speaker identification and 60.2 % for face identification. Adding a few biometric models improves the results, leading to 82.4 % and 65.6 % for speaker and face identification respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models, trained on matching development data and additional TV and radio data, provides 67.8 % F-measure, while 908 face models provide only 30.5 % F-measure.
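
The name-constrained clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the average-linkage criterion, the stopping threshold, the data layout and the function name `constrained_agglomerative` are all assumptions made for illustration. Each item stands for a speaker turn or face track, optionally carrying a written name, and two clusters are merged only if they do not carry different names.

```python
from itertools import combinations


def constrained_agglomerative(dist, names, threshold):
    """Greedy agglomerative clustering that never merges two clusters
    carrying different names (hypothetical helper, for illustration only).

    dist      -- symmetric matrix (list of lists) of pairwise distances
    names     -- names[i] is the written name attached to item i, or None
    threshold -- stop once the closest allowed pair is farther than this
    """
    clusters = [{"items": [i], "name": names[i]} for i in range(len(names))]

    def linkage(a, b):
        # Average linkage: an assumption of this sketch, not taken from the paper.
        pairs = [(i, j) for i in a["items"] for j in b["items"]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    def compatible(a, b):
        # Unnamed clusters may merge with anything; named ones must agree.
        return a["name"] is None or b["name"] is None or a["name"] == b["name"]

    while len(clusters) > 1:
        candidates = [
            (linkage(a, b), ia, ib)
            for (ia, a), (ib, b) in combinations(enumerate(clusters), 2)
            if compatible(a, b)
        ]
        if not candidates:
            break  # every remaining pair is named differently
        d, ia, ib = min(candidates)
        if d > threshold:
            break  # closest allowed pair is too far apart
        a, b = clusters[ia], clusters[ib]
        merged = {
            "items": sorted(a["items"] + b["items"]),
            "name": a["name"] or b["name"],  # the name propagates to the merged cluster
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return clusters


# Toy example: items 0 and 2 carry different names, items 1 and 3 are unnamed.
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.15, 0.9],
    [0.2, 0.15, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
names = ["Alice", None, "Bob", None]
result = constrained_agglomerative(dist, names, threshold=0.5)
for cluster in result:
    print(cluster["items"], cluster["name"])
```

In this toy run, item 1 joins item 0 and inherits its name, while item 2 stays separate even though it is close to item 1, because the resulting clusters would carry different names. Item 3 remains unnamed because it is farther than the threshold from every other cluster.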

Keywords

Multimodal fusion, VideoOCR, Face and speaker identification, TV broadcast

Notes

Acknowledgments

This work was partly carried out as part of the Quaero Program and the QCompere project, funded respectively by OSEO (French State agency for innovation) and ANR (French national research agency).

References

  1. Barras C, Zhu X, Meignier S, Gauvain J-L (2006) Multi-stage speaker diarization of broadcast news. IEEE Trans Audio, Speech Language Processing 14(5):1505–1512
  2. Bendris M, Favre B, Charlet D, Damnati G, Auguste R, Martinet J, Senay G (2013) Unsupervised face identification in TV content using audio-visual sources. In: Proceedings of the 11th international workshop on content-based multimedia indexing (CBMI), pp 243–249
  3. Béchet F, Bendris M, Charlet D, Damnati G, Favre B, Rouvier M, Auguste R, Bigot B, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2014) Multimodal understanding for person recognition in video broadcasts. In: 15th annual conference of the international speech communication association (INTERSPEECH)
  4. Bredin H, Poignant J (2013) Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In: the 14th annual conference of the international speech communication association (INTERSPEECH)
  5. Bredin H, Poignant J, Tapaswi M, Fortier G, Le VB, Napoleon T, Gao H, Barras C, Rosset S, Besacier L, Verbeek J, Quénot G, Jurie F, Kemal Ekenel H (2012) Fusion of speech, faces and text for person identification in TV broadcast. In: Workshop on information fusion in computer vision for concept recognition, ECCV-IFCVCR, pp 385–394
  6. Bredin H, Poignant J, Fortier G, Tapaswi M, Le VB, Sarkar A, Barras C, Rosset S, Roy A, Yang Q, Gao H, Mignon A, Verbeek J, Besacier L, Quénot G, Kemal Ekenel H, Stiefelhagen R (2013) QCompere at REPERE 2013. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM
  7. Bredin H, Roy A, Le VB, Barras C (2014) Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. In: International journal of multimedia information retrieval
  8. Bäuml M, Bernardin K, Fischer M, Ekenel HK, Stiefelhagen R (2010) Multi-pose face recognition for person retrieval in camera networks. In: 7th International conference on advanced video and signal-based surveillance, AVSS, pp 441–447
  9. Canseco-Rodriguez L, Lamel L, Gauvain J-L (2004) Speaker diarization from speech transcripts. In: the 5th annual conference of the international speech communication association, INTERSPEECH
  10. Canseco L, Lamel L, Gauvain J-L (2005) A comparative study using manual and automatic transcriptions for diarization. In: IEEE workshop on automatic speech recognition and understanding, pp 415–419
  11. Chen SS, Gopalakrishnan PS (1998) Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: DARPA broadcast news transcription and understanding workshop, pp 127–132
  12. Estève Y, Meignier S, Deléglise P, Mauclair J (2007) Extracting true speaker identities from transcriptions. In: the 8th annual conference of the international speech communication association, INTERSPEECH, pp 2601–2604
  13. Favre B, Damnati G, Béchet F, Bendris M, Charlet D, Auguste R, Ayache S, Bigot B, Delteil A, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2013) PERCOLI: a person identification system for the 2013 REPERE challenge. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH
  14. Gay P, Dupuy G, Lailler C, Odobez J-M, Meignier S, Deléglise P (2014) Comparison of two methods for unsupervised person identification in TV shows. In: 12th international workshop on content-based multimedia indexing (CBMI)
  15. Giraudel A, Carré M, Mapelli V, Kahn J, Galibert O, Quintard L (2012) The REPERE corpus: a multimodal corpus for person recognition. In: the 8th international conference on language resources and evaluation, LREC
  16. Guillaumin M, Verbeek J, Schmid C (2009) Is that you? Metric learning approaches for face identification. In: the IEEE 12th international conference on computer vision, pp 498–505
  17. Houghton R (1999) Named faces: putting names to faces. IEEE Intell Syst 14:45–50
  18. Jousse V, Petit-Renaud S, Meignier S, Estève Y, Jacquin C (2009) Automatic named identification of speakers using diarization and ASR systems
  19. Kahn J, Galibert O, Quintard L, Carré M, Giraudel A, Joly P (2012) A presentation of the REPERE challenge. In: the 10th international workshop on content-based multimedia indexing (CBMI), pp 1–6
  20. Khoury E, Sénac C, Joly P (2012) Audiovisual diarization of people in video content. In: Multimedia tools and applications
  21. Le VB, Barras C, Ferràs M (2010) On the use of GSV-SVM for speaker diarization and tracking. In: Odyssey - the speaker and language recognition workshop, pp 146–150
  22. Mauclair J, Meignier S, Estève Y (2006) Speaker diarization: about whom the speaker is talking? In: IEEE Odyssey 2006 - the speaker and language recognition workshop
  23. Petit-Renaud S, Jousse V, Meignier S, Estève Y (2010) Identification of speakers by name using belief functions. In: the 13th international conference on information processing and management of uncertainty in knowledge-based systems, theory and methods, IPMU, pp 179–188
  24. Pham PT, Moens M-F, Tuytelaars T (2010) Naming persons in news video with label propagation. In: IEEE international conference on Multimedia and Expo, ICME, pp 1528–1533
  25. Pham PT, Tuytelaars T, Moens M-F (2011) Naming people in news videos with label propagation. IEEE MultiMedia 18(3):44–55
  26. Poignant J, Besacier L, Quénot G, Thollard F (2012) From text detection in videos to person identification. In: IEEE international conference on multimedia and expo, ICME, pp 854–859
  27. Poignant J, Bredin H, Le VB, Besacier L, Barras C, Quénot G (2012) Unsupervised speaker identification using overlaid texts in TV broadcast. In: the 13th annual conference of the international speech communication association, INTERSPEECH, pp 2650–2653
  28. Poignant J, Besacier L, Quénot G (2013) Nommage non-supervisé des personnes dans les émissions de télévision: une revue du potentiel de chaque modalité. In: la 10ème cOnférence en recherche d’Information et applications, CORIA
  29. Poignant J, Besacier L, Le VB, Rosset S, Quénot G (2013) Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both? In: the 14th annual conference of the international speech communication association, INTERSPEECH
  30. Poignant J, Bredin H, Besacier L, Quénot G, Barras C (2013) Towards a better integration of written names for unsupervised speakers identification in videos. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM
  31. Poignant J, Besacier L, Quénot G (2014) Nommage non-supervisé des personnes dans les émissions de télévision: utilisation des noms écrits, des noms prononcés ou des deux? In: Documents numériques, pp 37–60
  32. Rouvier M, Meignier S (2012) A Global Optimization Framework For Speaker Diarization. In: Odyssey - the speaker and language recognition workshop
  33. Rouvier M, Favre B, Bendris M, Charlet D, Damnati G (2014) Scene understanding for identifying persons in TV shows: beyond face authentication. In: 12th international workshop on content-based multimedia indexing (CBMI)
  34. Sato T, Kanade T, Hughes TK, Smith MA, Satoh S (1999) Video OCR: Indexing digital news libraries by recognition of superimposed caption. In: ACM Multimedia Systems
  35. Satoh S, Nakamura Y, Kanade T (1999) Name-It: naming and detecting faces in news videos. IEEE Multimedia 6:22–35
  36. Tranter SE (2006) Who really spoke when? Finding speaker turns and identities in broadcast news audio. In: the 31st IEEE international conference on acoustics, speech and signal processing, ICASSP, pp 1013–1016
  37. Uřičář M, Franc V, Hlaváč V (2012) Detector of Facial Landmarks Learned by the Structured Output SVM. In: the 7th international conference on computer vision theory and applications, pp 547–556
  38. Yang J, Hauptmann AG (2004) Naming every individual in news video monologues
  39. Yang J, Yan R, Hauptmann AG (2005) Multiple instance learning for labeling faces in broadcasting news video. In: the 13th ACM international conference on multimedia, ACMMM, pp 31–40

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Johann Poignant (1, 2)
  • Guillaume Fortier (1, 2)
  • Laurent Besacier (1, 2)
  • Georges Quénot (1, 2)
  1. Université Grenoble Alpes, LIG, Grenoble, France
  2. CNRS, LIG, Grenoble, France
