Comparison of Different Modeling Units for Language Model Adaptation for Inflected Languages

  • Tanel Alumäe
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4919)

Abstract

This paper presents a language model adaptation framework for highly inflected languages, for the case where sub-word units serve as the basic units of the language model in large vocabulary speech recognition. The proposed adaptation method uses information retrieval based on latent semantic analysis to find documents similar to a tiny adaptation corpus. The approach makes it possible to use different language units for modeling document similarity. The method is tested on an Estonian broadcast news transcription task. We compare words, lemmas and morphemes as basic units for similarity modeling. We observe a drop in the speech recognition error rate after building an adapted language model for each news story. Morpheme-based adaptation is found to give a significantly larger improvement than word-based and lemma-based adaptation.
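The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no details on term weighting, latent dimensionality, or segmentation, so the sketch assumes raw counts, a toy corpus with English placeholder tokens, and hypothetical helper names (`top_eigenpairs`, `rank_documents`). Each document is simply a list of units, which could be words, lemmas or morphemes produced by an upstream tool. The latent space is obtained from the document Gram matrix by power iteration, which is equivalent to a truncated SVD of the unit-by-document matrix for this small case.

```python
import math
import random
from collections import Counter


def top_eigenpairs(G, k, iters=300, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(G)
    G = [row[:] for row in G]  # copy, because deflation modifies the matrix
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # remove the component that was just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def rank_documents(docs, query, k=2):
    """Rank docs by cosine similarity to the query in a k-dimensional latent space."""
    vocab = sorted({u for d in docs for u in d})
    counts = [Counter(d) for d in docs]  # assumption: raw counts, no weighting
    qcount = Counter(query)
    n = len(docs)
    # Gram matrix G = A^T A, where A is the unit-by-document count matrix.
    G = [[sum(counts[i][u] * counts[j][u] for u in vocab) for j in range(n)]
         for i in range(n)]
    pairs = [(lam, v) for lam, v in top_eigenpairs(G, k) if lam > 1e-9]
    # Fold the query into the latent space: q^T U_k = (q^T A) V_k S_k^{-1}.
    qa = [sum(qcount[u] * counts[j][u] for u in vocab) for j in range(n)]
    q_lat = [sum(qa[j] * v[j] for j in range(n)) / math.sqrt(lam)
             for lam, v in pairs]
    scores = []
    for j in range(n):
        d_lat = [math.sqrt(lam) * v[j] for lam, v in pairs]  # row j of V_k S_k
        scores.append((cosine(q_lat, d_lat), j))
    return sorted(scores, reverse=True)


# Toy corpus: two "politics" documents and two "sports" documents.
docs = [
    ["election", "vote", "party"],
    ["vote", "party", "minister"],
    ["match", "goal", "team"],
    ["goal", "team", "coach"],
]
adaptation_text = ["minister", "election"]  # stands in for a tiny adaptation corpus
ranked = rank_documents(docs, adaptation_text, k=2)
for score, idx in ranked:
    print(idx, round(score, 3), docs[idx])
```

In an adaptation setting, the top-ranked documents would then be used to build or reweight a language model for the story at hand. Changing the unit type only changes how the token lists are produced, which is why the same retrieval code can be compared across words, lemmas and morphemes.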

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Tanel Alumäe, Institute of Cybernetics at Tallinn University of Technology, Tallinn, Estonia