
Comparison of Different Modeling Units for Language Model Adaptation for Inflected Languages

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4919)

Abstract

This paper presents a language model adaptation framework for highly inflected languages, in which sub-word units serve as the basic units of the language model for large vocabulary speech recognition. The proposed adaptation method uses information retrieval based on latent semantic analysis to find documents similar to a tiny adaptation corpus. The approach allows different language units to be used for modeling document similarity. The method is tested on an Estonian broadcast news transcription task. We compare words, lemmas and morphemes as basic units for similarity modeling. We observe a drop in speech recognition error rate after building an adapted language model for each news story. Morpheme-based adaptation is found to give a significantly larger improvement than word- and lemma-based adaptation.
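The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy documents, the raw-count weighting, the rank k and the function names build_term_doc_matrix and retrieve_similar are all assumptions made for illustration. Each document is a list of units, which could be words, lemmas or morphemes depending on how the text was tokenized beforehand.

```python
import numpy as np

def build_term_doc_matrix(docs):
    """Count matrix with one row per unit and one column per document."""
    vocab = sorted({unit for doc in docs for unit in doc})
    index = {unit: i for i, unit in enumerate(vocab)}
    W = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for unit in doc:
            W[index[unit], j] += 1.0
    return W, index

def retrieve_similar(docs, query, k=2, top_n=2):
    """Return indices of the top_n documents closest to the query in LSA space.

    Assumption: raw counts are used as weights; a real system would likely
    apply some term weighting before the decomposition.
    """
    W, index = build_term_doc_matrix(docs)
    # Truncated SVD: W is approximated by U_k * diag(s_k) * Vt_k.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Document coordinates in the latent space (one row per document).
    doc_vecs = Vt_k.T * s_k
    # Fold the query (the tiny adaptation corpus) into the same space.
    q = np.zeros(W.shape[0])
    for unit in query:
        if unit in index:  # units unseen in the collection are ignored
            q[index[unit]] += 1.0
    q_vec = q @ U_k
    # Cosine similarity between the query and every document.
    eps = 1e-12  # guards against division by zero
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + eps
    sims = doc_vecs @ q_vec / norms
    order = np.argsort(-sims)[:top_n]
    return [int(i) for i in order]

# Hypothetical toy collection: two politics stories and two sports stories.
DOCS = [
    ["election", "vote", "party", "vote"],
    ["party", "election", "minister"],
    ["football", "goal", "match"],
    ["match", "goal", "team", "goal"],
]
QUERY = ["vote", "minister"]  # stands in for the tiny adaptation corpus

if __name__ == "__main__":
    print(retrieve_similar(DOCS, QUERY, k=2, top_n=2))
```

In this sketch, switching the modeling unit only changes how the documents and the query are tokenized before they reach build_term_doc_matrix; the decomposition and the similarity computation stay the same.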



Editor information

Alexander Gelbukh


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Alumäe, T. (2008). Comparison of Different Modeling Units for Language Model Adaptation for Inflected Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2008. Lecture Notes in Computer Science, vol 4919. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78135-6_42


  • DOI: https://doi.org/10.1007/978-3-540-78135-6_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78134-9

  • Online ISBN: 978-3-540-78135-6

  • eBook Packages: Computer Science, Computer Science (R0)
