Studies on inter-speaker variability in speech and its application in automatic speech recognition

Sadhana

Abstract

This paper gives an overview of inter-speaker variability and how it has been studied across diverse areas of speech signal processing. We first review vowel-normalization studies, which aim to minimize variation in the acoustic representation of vowels realized by different speakers. We then describe the universal-warping approach to speaker normalization, which unifies many of the vowel-normalization approaches and also shows the relation between speech production, perception and auditory processing. Finally, we address inter-speaker variability in automatic speech recognition (ASR) and describe techniques used to reduce its effects and thereby improve the performance of speaker-independent ASR systems.

Author information

Corresponding author

Correspondence to S UMESH.

About this article

Cite this article

UMESH, S. Studies on inter-speaker variability in speech and its application in automatic speech recognition. Sadhana 36, 853–883 (2011). https://doi.org/10.1007/s12046-011-0049-x

  • DOI: https://doi.org/10.1007/s12046-011-0049-x
