Abstract
This paper presents a language model adaptation framework for highly inflected languages that use sub-word units as the basic units of the language model in large vocabulary speech recognition. The proposed adaptation method uses information retrieval based on latent semantic analysis to find documents similar to a tiny adaptation corpus. The approach allows different language units to be used for modeling document similarity. The method is tested on an Estonian broadcast news transcription task. We compare words, lemmas and morphemes as basic units for similarity modeling. We observe a drop in speech recognition error rate after building an adapted language model for each news story. Morpheme-based adaptation gives a significantly larger improvement than word-based and lemma-based adaptation.
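The retrieval step described above can be illustrated with a minimal sketch of latent semantic analysis over a unit-by-document count matrix. This is not the paper's implementation: the abstract does not state the weighting scheme, the number of latent dimensions, or how the retrieved documents are then used to build the adapted model, so the raw counts, the value of `k`, the toy English tokens, and the function names `count_matrix` and `rank_by_lsa` are all assumptions made for illustration. Documents are passed in as lists of units, so the same code applies whether the units are words, lemmas or morphemes, provided the segmentation is done beforehand.

```python
import numpy as np

def count_matrix(docs, vocab):
    """Rows are units (words, lemmas or morphemes), columns are documents."""
    index = {unit: i for i, unit in enumerate(vocab)}
    m = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for unit in doc:
            if unit in index:          # units outside the vocabulary are ignored
                m[index[unit], j] += 1.0
    return m

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rank_by_lsa(docs, adaptation_text, k=2):
    """Rank docs by latent-space cosine similarity to the adaptation text.

    Hypothetical helper: raw counts and k=2 are illustrative assumptions.
    """
    vocab = sorted({unit for doc in docs for unit in doc})
    a = count_matrix(docs, vocab)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    k = min(k, len(s))
    u_k = u[:, :k]
    doc_vecs = vt[:k, :].T * s[:k]     # each document as U_k^T d_j
    query = count_matrix([adaptation_text], vocab)[:, 0]
    query_vec = u_k.T @ query          # fold the adaptation text into the same space
    sims = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)

# Toy corpus (invented for illustration): two finance and two sports documents.
docs = [
    ["economy", "bank", "money", "bank"],
    ["bank", "money", "market"],
    ["football", "match", "goal"],
    ["goal", "match", "team", "football"],
]
ranking = rank_by_lsa(docs, ["money", "market"], k=2)
for doc_id, sim in ranking:
    print(doc_id, round(sim, 3))
```

In this toy setup the two finance documents rank above the two sports documents. In an adaptation pipeline the top-ranked documents would then serve as additional training text for the story-specific language model; that step is omitted here.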
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alumäe, T. (2008). Comparison of Different Modeling Units for Language Model Adaptation for Inflected Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2008. Lecture Notes in Computer Science, vol 4919. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78135-6_42
DOI: https://doi.org/10.1007/978-3-540-78135-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78134-9
Online ISBN: 978-3-540-78135-6
eBook Packages: Computer Science, Computer Science (R0)