Abstract
This paper gives an overview of inter-speaker variability and of how it has been studied across diverse areas of speech signal processing. We first review vowel-normalization studies, which seek to minimize variations in the acoustic representation of vowels produced by different speakers. We then describe the universal-warping approach to speaker normalization, which unifies many of the vowel-normalization approaches and also shows the relation between speech production, perception and auditory processing. Finally, we address inter-speaker variability in automatic speech recognition (ASR) and describe techniques used to reduce its effects and thereby improve the performance of speaker-independent ASR systems.
UMESH, S. Studies on inter-speaker variability in speech and its application in automatic speech recognition. Sadhana 36, 853–883 (2011). https://doi.org/10.1007/s12046-011-0049-x