Comparison of Different Modeling Units for Language Model Adaptation for Inflected Languages

Alumäe, Tanel

doi:10.1007/978-3-540-78135-6_42

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4919)

Abstract

This paper presents a language model adaptation framework for highly inflected languages, where sub-word units serve as the basic units of the language model in large vocabulary speech recognition. The proposed adaptation method uses information retrieval based on latent semantic analysis to find documents similar to a tiny adaptation corpus. The approach allows different language units to be used for modeling document similarity. The method is tested on an Estonian broadcast news transcription task. We compare words, lemmas and morphemes as basic units for similarity modeling. We observe a drop in speech recognition error rate after building an adapted language model for each news story. Morpheme-based adaptation is found to give a significantly larger improvement than word-based and lemma-based adaptation.
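
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy English tokens, the raw count weighting, the latent dimension `k=2`, and every function name below are assumptions made for illustration, and the latent space is obtained by plain power iteration rather than a library SVD so that the example needs only the standard library. Documents are lists of units, so the same code would apply whether the units are words, lemmas or morphemes.

```python
import math
from collections import Counter

def count_vector(tokens, vocab):
    """Bag-of-units count vector over a fixed vocabulary (unknown units are ignored)."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def top_eigenvectors(G, k, iters=300):
    """Top-k eigenvectors of a symmetric PSD matrix: power iteration with deflation."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy
    basis = []
    for _component in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed start keeps the run deterministic
        for _step in range(iters):
            w = matvec(G, v)
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, matvec(G, v))  # Rayleigh quotient: eigenvalue estimate
        basis.append(v)
        for r in range(n):  # deflate: remove the component just found
            for c in range(n):
                G[r][c] -= lam * v[r] * v[c]
    return basis

def lsa_rank(docs, query, k=2):
    """Rank docs by cosine similarity to the query in a k-dimensional latent space."""
    vocab = sorted({t for d in docs for t in d})
    doc_vecs = [count_vector(d, vocab) for d in docs]
    n = len(vocab)
    # Term-by-term Gram matrix A A^T; its top eigenvectors are the left
    # singular vectors of the term-document matrix A.
    gram = [[sum(d[r] * d[c] for d in doc_vecs) for c in range(n)] for r in range(n)]
    basis = top_eigenvectors(gram, k)
    project = lambda v: [dot(v, b) for b in basis]
    q = project(count_vector(query, vocab))
    sims = [cosine(q, project(d)) for d in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: -sims[i])
    return order, sims

# Hypothetical background corpus (two topics) and a one-unit adaptation corpus.
background = [
    ["election", "vote", "parliament", "vote"],
    ["parliament", "minister", "election"],
    ["match", "goal", "team", "goal", "team"],
    ["team", "coach", "match"],
]
adaptation = ["minister"]

order, sims = lsa_rank(background, adaptation, k=2)
print(order, [round(s, 3) for s in sims])
```

In this toy corpus the first background document shares no unit with the adaptation text, yet it lands close to the query in the latent space because its units co-occur with "minister" in the second document. In an adaptation pipeline of the kind the abstract describes, the top-ranked documents would then be used to estimate the adapted language model; that step is not shown here.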

Author information

Authors and Affiliations

Institute of Cybernetics at Tallinn University of Technology, Akadeemia tee 21, Tallinn, 12618, Estonia
Tanel Alumäe

Editor information

Alexander Gelbukh

About this paper

Cite this paper

Alumäe, T. (2008). Comparison of Different Modeling Units for Language Model Adaptation for Inflected Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2008. Lecture Notes in Computer Science, vol 4919. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78135-6_42

DOI: https://doi.org/10.1007/978-3-540-78135-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78134-9
Online ISBN: 978-3-540-78135-6
eBook Packages: Computer Science, Computer Science (R0)