
A Free Synthetic Corpus for Speaker Diarization Research

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)

Abstract

A synthetic corpus of dialogs was constructed from the LibriSpeech corpus and is made freely available for diarization research. It includes over 90 h of training data, and over 9 h each of development and test data. Both 2-person and 3-person dialogs, with and without overlap, are included. Timing information is provided in several formats and covers both speaker segmentations and phoneme segmentations. The corpus is therefore a useful starting point for diarization system development in general, and for early-stage development in particular.

Keywords

Speaker diarization, Speech activity detection, Open-source corpora

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. EMR.AI Inc., San Francisco, USA
  2. University of California Berkeley, Berkeley, USA
  3. DHBW, Karlsruhe, Germany
