Abstract
Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. Even when initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Hämäläinen, A., Boves, L., de Veth, J. et al. On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling. J AUDIO SPEECH MUSIC PROC. 2007, 046460 (2007). https://doi.org/10.1155/2007/46460
DOI: https://doi.org/10.1155/2007/46460