Multimedia Tools and Applications, Volume 68, Issue 3, pp 747–775

Audiovisual diarization of people in video content

  • Elie El Khoury
  • Christine Sénac
  • Philippe Joly

Abstract

Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. We first review the literature on people diarization for audio and for video content, including our own contributions, and discuss the limitations of existing approaches. We then describe a method that associates audio and video information using co-occurrence matrices, and present experiments conducted on a corpus of TV news, TV debates, and movies. The results show the effectiveness of the overall diarization system and confirm the gains that audio information can bring to video indexing, and vice versa.
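
To make the idea of a co-occurrence matrix concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that audio diarization yields speaker-labelled time segments and that video diarization yields face-labelled time segments, accumulates the overlap duration for every (speaker, face) pair, and then pairs clusters greedily. The segment format, the labels, and the greedy pairing rule are all hypothetical choices made for this example.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length of the temporal overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cooccurrence(audio_segments, video_segments):
    """Accumulate overlap duration for each (speaker, face) cluster pair.

    Each segment is a hypothetical (start_sec, end_sec, cluster_label) tuple.
    """
    matrix = defaultdict(float)
    for a_start, a_end, speaker in audio_segments:
        for v_start, v_end, face in video_segments:
            shared = overlap(a_start, a_end, v_start, v_end)
            if shared > 0:
                matrix[(speaker, face)] += shared
    return dict(matrix)


def associate(matrix):
    """Greedy one-to-one pairing: repeatedly take the largest remaining cell.

    This pairing rule is an assumption for illustration only.
    """
    pairs = {}
    used_faces = set()
    ranked = sorted(matrix.items(), key=lambda item: item[1], reverse=True)
    for (speaker, face), _duration in ranked:
        if speaker not in pairs and face not in used_faces:
            pairs[speaker] = face
            used_faces.add(face)
    return pairs


# Toy data: two speakers and two faces over 12 seconds.
audio = [(0.0, 4.0, "spk_A"), (4.0, 9.0, "spk_B"), (9.0, 12.0, "spk_A")]
video = [(0.0, 5.0, "face_1"), (5.0, 9.0, "face_2"), (9.0, 12.0, "face_1")]

matrix = cooccurrence(audio, video)
pairs = associate(matrix)
print(matrix)
print(pairs)
```

In this toy example the largest cell links spk_A to face_1, so that pair is associated first and spk_B is then left with face_2. Real content is harder, since the person on screen is not always the person speaking, which is one reason a simple greedy rule like this should be read only as a starting point.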

Keywords

People diarization Segmentation Unsupervised clustering Audiovisual fusion Video indexing 

Notes

Acknowledgements

This work was supported by a 3-year individual fellowship from the French Ministry of Higher Education and Research, and by the SODA project funded by the French National Research Agency (ANR).

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Elie El Khoury (1, 2), corresponding author
  • Christine Sénac (3)
  • Philippe Joly (3)

  1. Idiap Research Institute, Martigny, Switzerland
  2. Laboratoire d’Informatique de l’Université du Maine, Le Mans, France
  3. Institut de Recherche en Informatique de Toulouse, Toulouse, France
