On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling

Hämäläinen, Annika; Boves, Lou; de Veth, Johan; ten Bosch, Louis

doi:10.1155/2007/46460

Research Article
Open access
Published: 11 July 2007

Volume 2007, article number 046460 (2007)
EURASIP Journal on Audio, Speech, and Music Processing

Annika Hämäläinen¹,
Lou Boves¹,
Johan de Veth¹ &
Louis ten Bosch¹

Abstract

Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.

Author information

Authors and Affiliations

Centre for Language and Speech Technology (CLST), Faculty of Arts, Radboud University Nijmegen, P.O. Box 9103, Nijmegen, 6500 HD, The Netherlands
Annika Hämäläinen, Lou Boves, Johan de Veth & Louis ten Bosch

Corresponding author

Correspondence to Annika Hämäläinen.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hämäläinen, A., Boves, L., de Veth, J. et al. On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling. J AUDIO SPEECH MUSIC PROC. 2007, 046460 (2007). https://doi.org/10.1155/2007/46460

Received: 06 December 2006
Accepted: 18 May 2007
Published: 11 July 2007
DOI: https://doi.org/10.1155/2007/46460
