Naming multi-modal clusters to identify persons in TV broadcast
Abstract
Person identification in TV broadcast is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix, under the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2 % for speaker identification and 60.2 % for face identification. Adding a few biometric models improves the results to 82.4 % and 65.6 % for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data provides a 67.8 % F-measure, while 908 face models provide only a 30.5 % F-measure.
Keywords
Multimodal fusion, Video OCR, Face and speaker identification, TV broadcast
Notes
Acknowledgments
This work was partly carried out as part of the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency).
References
- 1. Barras C, Zhu X, Meignier S, Gauvain J-L (2006) Multi-stage speaker diarization of broadcast news. IEEE Trans Audio, Speech Language Processing 14(5):1505–1512
- 2. Bendris M, Favre B, Charlet D, Damnati G, Auguste R, Martinet J, Senay G (2013) Unsupervised face identification in TV content using audio-visual sources. In: Proceedings of the 11th international workshop on content-based multimedia indexing (CBMI), pp 243–249
- 3. Béchet F, Bendris M, Charlet D, Damnati G, Favre B, Rouvier M, Auguste R, Bigot B, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2014) Multimodal understanding for person recognition in video broadcasts. In: 15th annual conference of the international speech communication association (INTERSPEECH)
- 4. Bredin H, Poignant J (2013) Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In: the 14th annual conference of the international speech communication association (INTERSPEECH)
- 5. Bredin H, Poignant J, Tapaswi M, Fortier G, Le VB, Napoleon T, Gao H, Barras C, Rosset S, Besacier L, Verbeek J, Quénot G, Jurie F, Kemal Ekenel H (2012) Fusion of speech, faces and text for person identification in TV broadcast. In: Workshop on information fusion in computer vision for concept recognition, ECCV-IFCVCR, pp 385–394
- 6. Bredin H, Poignant J, Fortier G, Tapaswi M, Le VB, Sarkar A, Barras C, Rosset S, Roy A, Yang Q, Gao H, Mignon A, Verbeek J, Besacier L, Quénot G, Kemal Ekenel H, Stiefelhagen R (2013) QCompere at REPERE 2013. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM
- 7. Bredin H, Roy A, Le VB, Barras C (2014) Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. In: International journal of multimedia information retrieval
- 8. Bäuml M, Bernardin K, Fischer M, Ekenel HK, Stiefelhagen R (2010) Multi-pose face recognition for person retrieval in camera networks. In: 7th International conference on advanced video and signal-based surveillance, AVSS, pp 441–447
- 9. Canseco-Rodriguez L, Lamel L, Gauvain J-L (2004) Speaker diarization from speech transcripts. In: the 5th annual conference of the international speech communication association, INTERSPEECH
- 10. Canseco L, Lamel L, Gauvain J-L (2005) A comparative study using manual and automatic transcriptions for diarization. In: IEEE workshop on automatic speech recognition and understanding, pp 415–419
- 11. Chen SS, Gopalakrishnan PS (1998) Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: DARPA broadcast news transcription and understanding workshop, pp 127–132
- 12. Estève Y, Meignier S, Deléglise P, Mauclair J (2007) Extracting true speaker identities from transcriptions. In: the 8th annual conference of the international speech communication association, INTERSPEECH, pp 2601–2604
- 13. Favre B, Damnati G, Béchet F, Bendris M, Charlet D, Auguste R, Ayache S, Bigot B, Delteil A, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2013) PERCOLI: a person identification system for the 2013 REPERE challenge. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH
- 14. Gay P, Dupuy G, Lailler C, Odobez J-M, Meignier S, Deléglise P (2014) Comparison of two methods for unsupervised person identification in TV shows. In: 12th international workshop on content-based multimedia indexing (CBMI)
- 15. Giraudel A, Carré M, Mapelli V, Kahn J, Galibert O, Quintard L (2012) The REPERE corpus: a multimodal corpus for person recognition. In: the 8th international conference on language resources and evaluation, LREC
- 16. Guillaumin M, Verbeek J, Schmid C (2009) Is that you? Metric learning approaches for face identification. In: the IEEE 12th international conference on computer vision, pp 498–505
- 17. Houghton R (1999) Named faces: putting names to faces. IEEE Intell Syst 14:45–50
- 18. Jousse V, Petit-Renaud S, Meignier S, Estève Y, Jacquin C (2009) Automatic named identification of speakers using diarization and ASR systems
- 19. Kahn J, Galibert O, Quintard L, Carré M, Giraudel A, Joly P (2012) A presentation of the REPERE challenge. In: the 10th international workshop on content-based multimedia indexing (CBMI), pp 1–6
- 20. Khoury E, Sénac C, Joly P (2012) Audiovisual diarization of people in video content. In: Multimedia tools and applications
- 21. Le VB, Barras C, Ferràs M (2010) On the use of GSV-SVM for speaker diarization and tracking. In: Odyssey - the speaker and language recognition workshop, pp 146–150
- 22. Mauclair J, Meignier S, Estève Y (2006) Speaker diarization: about whom the speaker is talking? In: IEEE Odyssey 2006 - the speaker and language recognition workshop
- 23. Petit-Renaud S, Jousse V, Meignier S, Estève Y (2010) Identification of speakers by name using belief functions. In: the 13th international conference on information processing and management of uncertainty in knowledge-based systems, theory and methods, IPMU, pp 179–188
- 24. Pham PT, Moens M-F, Tuytelaars T (2010) Naming persons in news video with label propagation. In: IEEE international conference on multimedia and expo, ICME, pp 1528–1533
- 25. Pham PT, Tuytelaars T, Moens M-F (2011) Naming people in news videos with label propagation. IEEE MultiMedia 18(3):44–55
- 26. Poignant J, Besacier L, Quénot G, Thollard F (2012) From text detection in videos to person identification. In: IEEE international conference on multimedia and expo, ICME, pp 854–859
- 27. Poignant J, Bredin H, Le VB, Besacier L, Barras C, Quénot G (2012) Unsupervised speaker identification using overlaid texts in TV broadcast. In: the 13th annual conference of the international speech communication association, INTERSPEECH, pp 2650–2653
- 28. Poignant J, Besacier L, Quénot G (2013) Nommage non-supervisé des personnes dans les émissions de télévision: une revue du potentiel de chaque modalité. In: la 10ème COnférence en Recherche d’Information et Applications, CORIA
- 29. Poignant J, Besacier L, Le VB, Rosset S, Quénot G (2013) Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both? In: the 14th annual conference of the international speech communication association, INTERSPEECH
- 30. Poignant J, Bredin H, Besacier L, Quénot G, Barras C (2013) Towards a better integration of written names for unsupervised speakers identification in videos. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM
- 31. Poignant J, Besacier L, Quénot G (2014) Nommage non-supervisé des personnes dans les émissions de télévision: utilisation des noms écrits, des noms prononcés ou des deux? In: Documents numériques, pp 37–60
- 32. Rouvier M, Meignier S (2012) A Global Optimization Framework For Speaker Diarization. In: Odyssey - the speaker and language recognition workshop
- 33. Rouvier M, Favre B, Bendris M, Charlet D, Damnati G (2014) Scene understanding for identifying persons in TV shows: beyond face authentication. In: 12th international workshop on content-based multimedia indexing (CBMI)
- 34. Sato T, Kanade T, Hughes TK, Smith MA, Satoh S (1999) Video OCR: Indexing digital news libraries by recognition of superimposed caption. In: ACM Multimedia Systems
- 35. Satoh S, Nakamura Y, Kanade T (1999) Name-It: naming and detecting faces in news videos. IEEE Multimedia 6:22–35
- 36. Tranter SE (2006) Who really spoke when? Finding speaker turns and identities in broadcast news audio. In: the 31st IEEE international conference on acoustics, speech and signal processing, ICASSP, pp 1013–1016
- 37. Uřičář M, Franc V, Hlaváč V (2012) Detector of Facial Landmarks Learned by the Structured Output SVM. In: the 7th international conference on computer vision theory and applications, pp 547–556
- 38. Yang J, Hauptmann AG (2004) Naming every individual in news video monologues
- 39. Yang J, Yan R, Hauptmann AG (2005) Multiple instance learning for labeling faces in broadcasting news video. In: the 13th ACM international conference on multimedia, ACMMM, pp 31–40