Abstract
Speech recognition offers an intuitive and convenient interface for controlling technical devices. Ongoing research enables users to handle increasingly complex tasks via speech. For special applications such as dictation, highly sophisticated techniques have been developed to yield high recognition accuracy. Many use cases, however, are characterized by changing conditions such as different speakers or time-variant environments. Numerous approaches have been published that handle changes in the acoustic environment or in speaker-specific voice characteristics by adapting the statistical models of a speech recognizer and by tracking speakers. Combining speaker adaptation and speaker tracking may be advantageous because it allows a system to adapt to more than one user at the same time, so that the performance of speech-controlled systems may improve continuously over time. In this article we review some techniques and systems for unsupervised speaker tracking that may be combined with speech recognition. We discuss a unified view of speaker identification and speech recognition embedded in a self-learning system. This system adapts individually to its main users without requiring additional user intervention such as an enrollment. Robustness is continuously improved by progressive speaker adaptation. We analyze our evaluation results for a realistic in-car application to validate the evolution of the system in terms of speech recognition accuracy and identification rate.
Notes
The smallest linguistic units of a language are given by phonemes. Many languages comprise 20–40 phonemes (O’Shaughnessy 2000).
The USKCP database has been internally collected by TEMIC Speech Dialog Systems, Ulm, Germany. The speech database contains command and control utterances recorded for realistic in-car applications. The language is US-English.
References
Ajmera J, McCowan I, Bourlard H (2004) Robust speaker change detection. IEEE Signal Process Lett 11(8):649–651
Angkititrakul P, Hansen JHL (2007) Discriminative in-set/out-of-set speaker recognition. IEEE Trans Audio Speech Lang Process 15(2):498–508
Botterweck H (2001) Anisotropic MAP defined by eigenvoices for large vocabulary continuous speech recognition. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2001, vol 1, pp 353–356
Campbell JP (1997) Speaker recognition—a tutorial. Proc IEEE 85(9):1437–1462
Chen SS, Gopalakrishnan PS (1998) Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: Proceedings of the DARPA broadcast news transcription and understanding workshop, pp 127–132
Cheng S-S, Wang H-M (2003) A sequential metric-based audio segmentation method via the Bayesian information criterion. In: EUROSPEECH-2003, pp 945–948
Class F, Kaltenmeier A, Regel-Brietzmann P (1993) Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH-1993, pp 803–806
Class F, Haiber U, Kaltenmeier A (2003) Automatic detection of change in speaker in speaker adaptive speech recognition systems. US patent application 2003/0187645 A1
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
Dobler S, Rühl H-W (1995) Speaker adaptation for telephone based speech dialogue systems. In: EUROSPEECH-1995, pp 1139–1143
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley-Interscience, New York
Eatock JP, Mason JS (1994) A quantitative assessment of the relative speaker discriminating properties of phonemes. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-1994, vol 1, pp 133–136
Espi M, Miyabe S, Nishimoto T, Ono N, Sagayama S (2010) Analysis on speech characteristics for robust voice activity detection. In: IEEE workshop on spoken language technology, SLT-2010, pp 139–144
Fink GA (2003) Mustererkennung mit Markov-Modellen: Theorie-Praxis-Anwendungsgebiete. Leitfäden der Informatik. B. G. Teubner, Stuttgart (in German)
Fortuna J, Sivakumaran P, Ariyaeeinia A, Malegaonkar A (2005) Open-set speaker identification using adapted Gaussian mixture models. In: INTERSPEECH-2005, pp 1997–2000
Furui S (2009) Selected topics from 40 years of research in speech and speaker recognition. In: INTERSPEECH-2009, pp 1–8
Gauvain J-L, Lee C-H (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 2(2):291–298
Geiger J, Wallhoff F, Rigoll G (2010) GMM-UBM based open-set online speaker diarization. In: INTERSPEECH-2010, pp 2330–2333
Gutman D, Bistritz Y (2002) Speaker verification using phoneme-adapted Gaussian mixture models. In: The XI European signal processing conference, EUSIPCO-2002, vol 3, pp 85–88
Hain T, Johnson SE, Tuerk A, Woodland PC, Young SJ (1998) Segment generation and clustering in the HTK broadcast news transcription system. In: Proceedings of the broadcast news transcription and understanding workshop, pp 133–137
Harrag A, Mohamadi T, Serignat JF (2005) LDA combination of pitch and MFCC features in speaker recognition. In: IEEE Indicon conference, pp 237–240
Herbig T, Gerl F, Minker W (2010a) Detection of unknown speakers in an unsupervised speech controlled system. In: Lee GG, Mariani J, Minker W, Nakamura S (eds) Spoken dialogue systems for ambient environments: second international workshop on spoken dialogue systems technology, IWSDS-2010. Lecture notes in computer science, vol 6392. Springer, Heidelberg, pp 25–35
Herbig T, Gerl F, Minker W (2010b) Evaluation of two approaches for speaker specific speech recognition. In: Lee GG, Mariani J, Minker W, Nakamura S (eds) Spoken dialogue systems for ambient environments: second international workshop on spoken dialogue systems technology, IWSDS-2010. Lecture notes in computer science, vol 6392. Springer, Heidelberg, pp 36–47
Herbig T, Gerl F, Minker W (2010c) Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th international conference on intelligent environments, IE-2010, pp 100–105
Herbig T, Gerl F, Minker W (2010d) Simultaneous speech recognition and speaker identification. In: IEEE workshop on spoken language technology, SLT-2010, pp 206–210
Herbig T, Gerl F, Minker W (2010e) Speaker tracking in an unsupervised speech controlled system. In: INTERSPEECH-2010, pp 2666–2669
Herbig T, Gerl F, Minker W (2011a) Evolution of an adaptive unsupervised speech controlled system. In: IEEE workshop on evolving and adaptive intelligent systems, EAIS-2011, pp 163–169
Herbig T, Gerl F, Minker W (2011b) Self-learning speaker identification: a system for enhanced speech recognition. Signals and communication technology. Springer, Heidelberg
Iskra D, Grosskopf B, Marasek K, van den Heuvel H, Diehl F, Kiessling A (2002) SPEECON-speech databases for consumer devices: database specification and validation. In: Proceedings of the third international conference on language resources and evaluation, LREC-2002, pp 329–333
Johnson SE (1999) Who spoke when? Automatic segmentation and clustering for determining speaker turns. In: EUROSPEECH-1999, vol 5, pp 2211–2214
Junqua J-C (2000) Robust speech recognition in embedded systems and PC applications. Kluwer, Dordrecht
Kuhn R, Junqua J-C, Nguyen P, Niedzielski N (2000) Rapid speaker adaptation in eigenvoice space. IEEE Trans Speech Audio Process 8(6):695–707
Kwon S, Narayanan SS (2002) Speaker change detection using a new weighted distance measure. In: International conference on spoken language processing, ICSLP-2002, pp 2537–2540
Kwon S, Narayanan S (2005) Unsupervised speaker indexing using generic models. IEEE Trans Speech Audio Process 13(5):1004–1013
Lu L, Zhang HJ (2002) Speaker change detection and tracking in real-time news broadcasting analysis. In: Proceedings of the tenth ACM international conference on multimedia, MULTIMEDIA-2002, pp 602–610
Meinedo H, Neto J (2003) Audio segmentation, classification and clustering in a broadcast news task. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2003, vol 2, pp 5–8
Mori K, Nakagawa S (2001) Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2001, vol 1, pp 413–416
Nakagawa S, Zhang W, Takahashi M (2006) Text-independent/text-prompted speaker recognition by combining speaker-specific GMM with speaker adapted syllable-based HMM. IEICE Trans Inf Syst E89-D(3):1058–1065
Nishida M, Kawahara T (2004) Speaker indexing and adaptation using speaker clustering based on statistical model selection. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2004, vol 1, pp 353–356
Nishida M, Kawahara T (2005) Speaker model selection based on the Bayesian information criterion applied to unsupervised speaker indexing. IEEE Trans Speech Audio Process 13(4):583–592
O’Shaughnessy D (2000) Speech communications: human and machine, 2nd edn. IEEE Press, New York
Park A, Hazen TJ (2002) ASR dependent techniques for speaker identification. In: International conference on spoken language processing, ICSLP-2002, pp 1337–1340
Pelecanos J, Slomka S, Sridharan S (1999) Enhancing automatic speaker identification using phoneme clustering and frame based parameter and frame size selection. In: Proceedings of the fifth international symposium on signal processing and its applications, ISSPA-1999, vol 2, pp 633–636
Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs
Ramírez J, Górriz JM, Segura JC (2007) Voice activity detection. Fundamentals and speech recognition system robustness. In: Grimm M, Kroschel K (eds) Robust speech recognition and understanding. I-Tech Education and Publishing, Vienna, pp 1–22
Reynolds DA (1995) Large population speaker identification using clean and telephone speech. IEEE Signal Process Lett 2(3):46–48
Reynolds DA, Carlson BA (1995) Text-dependent speaker verification using decoupled and integrated speaker and speech recognizers. In: EUROSPEECH-1995, pp 647–650
Reynolds DA, Rose RC (1995) Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech Audio Process 3(1):72–83
Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Process 10(1–3):19–41
Rodríguez-Liñares L, García-Mateo C (1997) On the use of acoustic segmentation in speaker identification. In: EUROSPEECH-1997, pp 2315–2318
Schmalenstroeer J, Haeb-Umbach R (2007) Joint speaker segmentation, localization and identification for streaming audio. In: INTERSPEECH-2007, pp 570–573
Schukat-Talamazzini EG (1995) Automatische Spracherkennung. Vieweg, Braunschweig (in German)
Siegler MA, Jain U, Raj B, Stern RM (1997) Automatic segmentation, classification and clustering of broadcast news audio. In: Proceedings of the DARPA speech recognition workshop, pp 97–99
Thompson J, Mason JS (1993) Cepstral statistics within phonetic subgroups. In: International conference on signal processing, ICSP-1993, pp 737–740
Tritschler A, Gopinath R (1999) Improved speaker segmentation and segments clustering using the Bayesian information criterion. In: EUROSPEECH-1999, vol 2, pp 679–682
Wilcox L, Kimber D, Chen F (1994) Audio indexing using speaker identification. In: Proceedings of the SPIE conference on automatic systems for the inspection and identification of humans, vol 2277, pp 149–157
Wu T, Lu L, Chen K, Zhang H-J (2003) UBM-based real-time speaker segmentation for broadcasting news. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2003, vol 2, pp 193–196
Yella SH, Varma V, Prahallad K (2010) Significance of anchor speaker segments for constructing extractive audio summaries of broadcast news. In: IEEE workshop on spoken language technology, SLT-2010, pp 13–18
Yin S-C, Rose R, Kenny P (2008) Adaptive score normalization for progressive model adaptation in text independent speaker verification. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2008, pp 4857–4860
Zhang Z-P, Furui S, Ohtsuki K (2000) On-line incremental speaker adaptation with automatic speaker change detection. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2000, vol 2, pp 961–964
Zhou B, Hansen JHL (2000) Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. In: International conference on spoken language processing, ICSLP-2000, vol 3, pp 714–717
Zhu X, Barras C, Meignier S, Gauvain J-L (2005) Combining speaker identification and BIC for speaker diarization. In: INTERSPEECH-2005, pp 2441–2444
Zochová P, Radová V (2005) Modified DISTBIC algorithm for speaker change detection. In: INTERSPEECH-2005, pp 3073–3076
Additional information
This work was conducted at Harman-Becker. T. Herbig is now with Nuance Communications. F. Gerl is now with SVOX.
Cite this article
Herbig, T., Gerl, F., Minker, W. et al. Adaptive systems for unsupervised speaker tracking and speech recognition. Evolving Systems 2, 199–214 (2011). https://doi.org/10.1007/s12530-011-9034-1
DOI: https://doi.org/10.1007/s12530-011-9034-1