Studies on inter-speaker variability in speech and its application in automatic speech recognition

UMESH, S

doi:10.1007/s12046-011-0049-x

Published: 22 November 2011

Sadhana, Volume 36, pages 853–883 (2011)


Abstract

In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.



Author information

Authors and Affiliations

Department of Electrical Engineering, Indian Institute of Technology-Madras, Chennai, 600 036, India
S UMESH


Corresponding author

Correspondence to S UMESH.


About this article

Cite this article

UMESH, S. Studies on inter-speaker variability in speech and its application in automatic speech recognition. Sadhana 36, 853–883 (2011). https://doi.org/10.1007/s12046-011-0049-x

Issue Date: October 2011
