Multimedia Tools and Applications, Volume 60, Issue 2, pp 347–369

Detecting individual role using features extracted from speaker diarization results

  • Benjamin Bigot
  • Isabelle Ferrané
  • Julien Pinquier
  • Régine André-Obrecht


In automatic content-based indexing and structuring of audiovisual documents, finding events such as interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and high-level event detection. In this work, we treat the detection of speaker roles (Anchor, Journalist and Other) as a first step towards enriching interaction sequences between speakers. We assume that clues about speaker roles exist in temporal, prosodic and basic signal features extracted from audio files and from speaker segmentations. Each speaker is therefore represented by a 36-feature vector. Contrary to most state-of-the-art approaches, we do not use the structure of the document to recognize the roles of the speakers. We investigate the influence of two dimensionality reduction techniques (Principal Component Analysis and Linear Discriminant Analysis) and of different classification methods (Gaussian Mixture Models, K-nearest neighbours and Support Vector Machines). Experiments are carried out on the 13-h corpus of the ESTER2 evaluation campaign. The best configuration correctly recognizes about 82% of speaker roles, which corresponds to more than 89% of speech duration correctly labelled.
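The abstract reports two different scores: the share of speakers whose role is recognized, and the share of speech duration that ends up correctly labelled. The sketch below illustrates how the two can differ. It is a minimal illustration under our own assumptions, not the evaluation code of the paper: the segment format `(speaker, start, end)`, the function names and the toy data are all hypothetical.

```python
from collections import defaultdict


def speech_time_per_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label.

    `segments` is an assumed diarization output: (speaker, start, end) tuples.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


def role_scores(durations, true_roles, predicted_roles):
    """Return (fraction of speakers correct, fraction of speech time correct)."""
    correct = [s for s in true_roles if predicted_roles.get(s) == true_roles[s]]
    speaker_rate = len(correct) / len(true_roles)
    total_time = sum(durations[s] for s in true_roles)
    duration_rate = sum(durations[s] for s in correct) / total_time
    return speaker_rate, duration_rate


# Toy, made-up data: four speakers in a 150-second excerpt.
segments = [
    ("spk1", 0.0, 60.0),
    ("spk2", 60.0, 75.0),
    ("spk1", 75.0, 120.0),
    ("spk3", 120.0, 130.0),
    ("spk4", 130.0, 150.0),
]
true_roles = {
    "spk1": "Anchor",
    "spk2": "Journalist",
    "spk3": "Other",
    "spk4": "Journalist",
}
# Hypothetical classifier output: only the short "Other" speaker is wrong.
predicted_roles = {
    "spk1": "Anchor",
    "spk2": "Journalist",
    "spk3": "Journalist",
    "spk4": "Journalist",
}

durations = speech_time_per_speaker(segments)
speaker_rate, duration_rate = role_scores(durations, true_roles, predicted_roles)
print(f"speakers correct: {speaker_rate:.2f}, speech time correct: {duration_rate:.2f}")
# prints "speakers correct: 0.75, speech time correct: 0.93"
```

In this toy case the only misclassified speaker talks for 10 seconds out of 150, so the duration-based score is higher than the speaker-based one. This is one plausible way for a gap like the reported 82% versus 89% to arise, although the abstract does not say which speakers account for the errors.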


Keywords: Speaker segmentation, Speaker role detection, Temporal and prosodic features, Dimensionality reduction, Classification methods



This work is conducted within the EPAC Project—ANR-06-CIS6-MDCA-006.



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Benjamin Bigot (1)
  • Isabelle Ferrané (1)
  • Julien Pinquier (1)
  • Régine André-Obrecht (1)

  1. IRIT, Université de Toulouse, Cedex 09, France
