Evolving Systems, Volume 2, Issue 3, pp 199–214

Adaptive systems for unsupervised speaker tracking and speech recognition

  • Tobias Herbig
  • Franz Gerl
  • Wolfgang Minker
  • Reinhold Haeb-Umbach
Original Paper

Abstract

Speech recognition offers an intuitive and convenient interface for controlling technical devices. Ongoing research enables users to handle increasingly complex tasks via speech. For special applications such as dictation, highly sophisticated techniques have been developed to achieve high recognition accuracy. Many use cases, however, are characterized by changing conditions such as different speakers or time-variant environments. Many approaches have been published that handle changes in the acoustic environment or in speaker-specific voice characteristics by adapting the statistical models of a speech recognizer and by tracking speakers. Combining speaker adaptation and speaker tracking may be advantageous because it allows a system to adapt to more than one user at the same time. The performance of speech-controlled systems may be continuously improved over time. In this article we review some techniques and systems for unsupervised speaker tracking that may be combined with speech recognition. We discuss a unified view of speaker identification and speech recognition embedded in a self-learning system. This system adapts individually to its main users without requiring additional user intervention such as enrollment. Robustness is continuously improved by progressive speaker adaptation. We analyze our evaluation results for a realistic in-car application to validate the evolution of the system in terms of speech recognition accuracy and identification rate.

Keywords

  • Speaker change detection
  • Speaker identification
  • Speaker adaptation

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Tobias Herbig (1, 3)
  • Franz Gerl (2)
  • Wolfgang Minker (3)
  • Reinhold Haeb-Umbach (4)

  1. Nuance Communications Aachen GmbH, Ulm, Germany
  2. SVOX Deutschland GmbH, Ulm, Germany
  3. Institute of Information Technology, University of Ulm, Ulm, Germany
  4. Department of Communications Engineering, University of Paderborn, Paderborn, Germany