Adaptive systems for unsupervised speaker tracking and speech recognition

  • Original Paper
  • Evolving Systems

Abstract

Speech recognition offers an intuitive and convenient interface for controlling technical devices. Improvements achieved through ongoing research enable users to handle increasingly complex tasks via speech. For special applications, e.g., dictation, highly sophisticated techniques have been developed to yield high recognition accuracy. Many use cases, however, are characterized by changing conditions such as different speakers or time-variant environments. Numerous approaches have been published that handle changes in the acoustic environment or in speaker-specific voice characteristics by adapting the statistical models of a speech recognizer and by tracking speakers. Combining speaker adaptation and speaker tracking may be advantageous because it allows a system to adapt to more than one user at the same time, so that the performance of speech-controlled systems may be continuously improved over time. In this article we review some techniques and systems for unsupervised speaker tracking which may be combined with speech recognition. We discuss a unified view of speaker identification and speech recognition embedded in a self-learning system. This system adapts individually to its main users without requiring additional user intervention such as enrollment. Robustness is continuously improved by progressive speaker adaptation. We analyze our evaluation results for a realistic in-car application to validate the evolution of the system in terms of speech recognition accuracy and identification rate.

Notes

  1. The smallest linguistic units of a language are given by phonemes. Many languages comprise 20–40 phonemes (O’Shaughnessy 2000).

  2. The USKCP database was collected internally by TEMIC Speech Dialog Systems, Ulm, Germany. It contains command and control utterances recorded for realistic in-car applications. The language is US English.

References

  • Ajmera J, McCowan I, Bourlard H (2004) Robust speaker change detection. IEEE Signal Process Lett 11(8):649–651

  • Angkititrakul P, Hansen JHL (2007) Discriminative in-set/out-of-set speaker recognition. IEEE Trans Audio Speech Lang Process 15(2):498–508

  • Botterweck H (2001) Anisotropic MAP defined by eigenvoices for large vocabulary continuous speech recognition. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2001, vol 1, pp 353–356

  • Campbell JP (1997) Speaker recognition—a tutorial. Proc IEEE 85(9):1437–1462

  • Chen SS, Gopalakrishnan PS (1998) Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: Proceedings of the DARPA broadcast news transcription and understanding workshop, pp 127–132

  • Cheng S-S, Wang H-M (2003) A sequential metric-based audio segmentation method via the Bayesian information criterion. In: EUROSPEECH-2003, pp 945–948

  • Class F, Kaltenmeier A, Regel-Brietzmann P (1993) Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH-1993, pp 803–806

  • Class F, Haiber U, Kaltenmeier A (2003) Automatic detection of change in speaker in speaker adaptive speech recognition systems. US patent application 2003/0187645 A1

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38

  • Dobler S, Rühl H-W (1995) Speaker adaptation for telephone based speech dialogue systems. In: EUROSPEECH-1995, pp 1139–1143

  • Duda RO, Hart PE, Stork DG (2001) Pattern classification. 2nd edn. Wiley-Interscience, New York

  • Eatock JP, Mason JS (1994) A quantitative assessment of the relative speaker discriminating properties of phonemes. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-1994, vol 1, pp 133–136

  • Espi M, Miyabe S, Nishimoto T, Ono N, Sagayama S (2010) Analysis on speech characteristics for robust voice activity detection. In: IEEE workshop on spoken language technology, SLT-2010, pp 139–144

  • Fink GA (2003) Mustererkennung mit Markov-Modellen: Theorie-Praxis-Anwendungsgebiete. Leitfäden der Informatik. B. G. Teubner, Stuttgart (in German)

  • Fortuna J, Sivakumaran P, Ariyaeeinia A, Malegaonkar A (2005) Open-set speaker identification using adapted Gaussian mixture models. In: INTERSPEECH-2005, pp 1997–2000

  • Furui S (2009) Selected topics from 40 years of research in speech and speaker recognition. In: INTERSPEECH-2009, pp 1–8

  • Gauvain J-L, Lee C-H (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 2(2):291–298

  • Geiger J, Wallhoff F, Rigoll G (2010) GMM-UBM based open-set online speaker diarization. In: INTERSPEECH-2010, pp 2330–2333

  • Gutman D, Bistritz Y (2002) Speaker verification using phoneme-adapted Gaussian mixture models. In: The XI European signal processing conference, EUSIPCO-2002, vol 3, pp 85–88

  • Hain T, Johnson SE, Tuerk A, Woodland PC, Young SJ (1998) Segment generation and clustering in the HTK broadcast news transcription system. In: Proceedings of the broadcast news transcription and understanding workshop, pp 133–137

  • Harrag A, Mohamadi T, Serignat JF (2005) LDA combination of pitch and MFCC features in speaker recognition. In: IEEE Indicon conference, pp 237–240

  • Herbig T, Gerl F, Minker W (2010a) Detection of unknown speakers in an unsupervised speech controlled system. In: Lee GG, Mariani J, Minker W, Nakamura S (eds) Spoken dialogue systems for ambient environments: second international workshop on spoken dialogue systems technology, IWSDS-2010. Lecture notes in computer science, vol 6392. Springer, Heidelberg, pp 25–35

  • Herbig T, Gerl F, Minker W (2010b) Evaluation of two approaches for speaker specific speech recognition. In: Lee GG, Mariani J, Minker W, Nakamura S (eds) Spoken dialogue systems for ambient environments: second international workshop on spoken dialogue systems technology, IWSDS-2010. Lecture notes in computer science, vol 6392. Springer, Heidelberg, pp 36–47

  • Herbig T, Gerl F, Minker W (2010c) Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th international conference on intelligent environments, IE-2010, pp 100–105

  • Herbig T, Gerl F, Minker W (2010d) Simultaneous speech recognition and speaker identification. In: IEEE workshop on spoken language technology, SLT-2010, pp 206–210

  • Herbig T, Gerl F, Minker W (2010e) Speaker tracking in an unsupervised speech controlled system. In: INTERSPEECH-2010, pp 2666–2669

  • Herbig T, Gerl F, Minker W (2011a) Evolution of an adaptive unsupervised speech controlled system. In: IEEE workshop on evolving and adaptive intelligent systems, EAIS-2011, pp 163–169

  • Herbig T, Gerl F, Minker W (2011b) Self-learning speaker identification: a system for enhanced speech recognition. Signals and communication technology. Springer, Heidelberg

  • Iskra D, Grosskopf B, Marasek K, van den Heuvel H, Diehl F, Kiessling A (2002) SPEECON-speech databases for consumer devices: database specification and validation. In: Proceedings of the third international conference on language resources and evaluation, LREC-2002, pp 329–333

  • Johnson SE (1999) Who spoke when? Automatic segmentation and clustering for determining speaker turns. In: EUROSPEECH-1999, vol 5, pp 2211–2214

  • Junqua J-C (2000) Robust speech recognition in embedded systems and PC applications. Kluwer, Dordrecht

  • Kuhn R, Junqua J-C, Nguyen P, Niedzielski N (2000) Rapid speaker adaptation in eigenvoice space. IEEE Trans Speech Audio Process 8(6):695–707

  • Kwon S, Narayanan SS (2002) Speaker change detection using a new weighted distance measure. In: International conference on spoken language processing, ICSLP-2002, pp 2537–2540

  • Kwon S, Narayanan S (2005) Unsupervised speaker indexing using generic models. IEEE Trans Speech Audio Process 13(5):1004–1013

  • Lu L, Zhang HJ (2002) Speaker change detection and tracking in real-time news broadcasting analysis. In: Proceedings of the tenth ACM international conference on multimedia, MULTIMEDIA-2002, pp 602–610

  • Meinedo H, Neto J (2003) Audio segmentation, classification and clustering in a broadcast news task. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2003, vol 2, pp 5–8

  • Mori K, Nakagawa S (2001) Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2001, vol 1, pp 413–416

  • Nakagawa S, Zhang W, Takahashi M (2006) Text-independent/text-prompted speaker recognition by combining speaker-specific GMM with speaker adapted syllable-based HMM. IEICE Trans Inf Syst E89-D(3):1058–1065

  • Nishida M, Kawahara T (2004) Speaker indexing and adaptation using speaker clustering based on statistical model selection. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2004, vol 1, pp 353–356

  • Nishida M, Kawahara T (2005) Speaker model selection based on the Bayesian information criterion applied to unsupervised speaker indexing. IEEE Trans Speech Audio Process 13(4):583–592

  • O’Shaughnessy D (2000) Speech communications: human and machine, 2nd edn. IEEE Press, New York

  • Park A, Hazen TJ (2002) ASR dependent techniques for speaker identification. In: International conference on spoken language processing, ICSLP-2002, pp 1337–1340

  • Pelecanos J, Slomka S, Sridharan S (1999) Enhancing automatic speaker identification using phoneme clustering and frame based parameter and frame size selection. In: Proceedings of the fifth international symposium on signal processing and its applications, ISSPA-1999, vol 2, pp 633–636

  • Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs

  • Ramírez J, Górriz JM, Segura JC (2007) Voice activity detection. Fundamentals and speech recognition system robustness. In: Grimm M, Kroschel K (eds) Robust speech recognition and understanding. I-Tech Education and Publishing, Vienna, pp 1–22

  • Reynolds DA (1995) Large population speaker identification using clean and telephone speech. IEEE Signal Process Lett 2(3):46–48

  • Reynolds DA, Carlson BA (1995) Text-dependent speaker verification using decoupled and integrated speaker and speech recognizers. In: EUROSPEECH-1995, pp 647–650

  • Reynolds DA, Rose RC (1995) Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech Audio Process 3(1):72–83

  • Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Process 10(1–3):19–41

  • Rodríguez-Liñares L, García-Mateo C (1997) On the use of acoustic segmentation in speaker identification. In: EUROSPEECH-1997, pp 2315–2318

  • Schmalenstroeer J, Haeb-Umbach R (2007) Joint speaker segmentation, localization and identification for streaming audio. In: INTERSPEECH-2007, pp 570–573

  • Schukat-Talamazzini EG (1995) Automatische Spracherkennung. Vieweg, Braunschweig (in German)

  • Siegler MA, Jain U, Raj B, Stern RM (1997) Automatic segmentation, classification and clustering of broadcast news audio. In: Proceedings of the DARPA speech recognition workshop, pp 97–99

  • Thompson J, Mason JS (1993) Cepstral statistics within phonetic subgroups. In: International conference on signal processing, ICSP-1993, pp 737–740

  • Tritschler A, Gopinath R (1999) Improved speaker segmentation and segments clustering using the Bayesian information criterion. In: EUROSPEECH-1999, vol 2, pp 679–682

  • Wilcox L, Kimber D, Chen F (1994) Audio indexing using speaker identification. In: Proceedings of the SPIE conference on automatic systems for the inspection and identification of humans, vol 2277, pp 149–157

  • Wu T, Lu L, Chen K, Zhang H-J (2003) UBM-based real-time speaker segmentation for broadcasting news. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2003, vol 2, pp 193–196

  • Yella SH, Varma V, Prahallad K (2010) Significance of anchor speaker segments for constructing extractive audio summaries of broadcast news. In: IEEE workshop on spoken language technology, SLT-2010, pp 13–18

  • Yin S-C, Rose R, Kenny P (2008) Adaptive score normalization for progressive model adaptation in text independent speaker verification. In: IEEE international conference on acoustics, speech and signal processing, ICASSP-2008, pp 4857–4860

  • Zhang Z-P, Furui S, Ohtsuki K (2000) On-line incremental speaker adaptation with automatic speaker change detection. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP-2000, vol 2, pp 961–964

  • Zhou B, Hansen JHL (2000) Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. In: International conference on spoken language processing, ICSLP-2000, vol 3, pp 714–717

  • Zhu X, Barras C, Meignier S, Gauvain J-L (2005) Combining speaker identification and BIC for speaker diarization. In: INTERSPEECH-2005, pp 2441–2444

  • Zochová P, Radová V (2005) Modified DISTBIC algorithm for speaker change detection. In: INTERSPEECH-2005, pp 3073–3076

Author information

Corresponding author

Correspondence to Tobias Herbig.

Additional information

This work was conducted at Harman-Becker. T. Herbig is now with Nuance Communications. F. Gerl is now with SVOX.

About this article

Cite this article

Herbig, T., Gerl, F., Minker, W. et al. Adaptive systems for unsupervised speaker tracking and speech recognition. Evolving Systems 2, 199–214 (2011). https://doi.org/10.1007/s12530-011-9034-1

  • DOI: https://doi.org/10.1007/s12530-011-9034-1
