Large-Scale Language Modeling with Random Forests for Mandarin Chinese Speech-to-Text

  • Conference paper
Advances in Natural Language Processing (NLP 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6233)

Abstract

In this work, the random forest language modeling approach is applied with the aim of improving the performance of the highly competitive LIMSI Mandarin Chinese speech-to-text system. The experimental setup is that of the GALE Phase 4 evaluation, which is characterized by a large amount of available language model training data (over 3.2 billion segmented words). A conventional unpruned 4-gram language model with a 56K-word vocabulary serves as a baseline that is challenging to improve upon. Nevertheless, moderate perplexity and character error rate (CER) improvements over this model were obtained with a random forest language model. Different random forest training strategies were explored to attain the maximal gain in performance, and a Forest of Random Forests language modeling scheme is introduced.
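To make the approach concrete, below is a minimal Python sketch of the random-forest language modeling principle the paper builds on: several randomized models each partition n-gram histories into equivalence classes, and the forest probability is the average of the per-model probabilities. This is a toy stand-in, not the authors' system; for simplicity it clusters bigram histories at random rather than growing randomized decision trees as in Xu and Jelinek's formulation, and all names (train_tree, forest_prob, the toy corpus) are illustrative.

```python
import random
from collections import defaultdict

def train_tree(corpus, num_clusters, seed):
    """One randomized "tree": randomly cluster history words, then
    estimate P(w | cluster(h)) from bigram counts with add-one smoothing."""
    rng = random.Random(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    cluster_of = {w: rng.randrange(num_clusters) for w in vocab}
    counts = defaultdict(lambda: defaultdict(int))  # cluster -> word -> count
    for sent in corpus:
        for h, w in zip(sent, sent[1:]):
            counts[cluster_of[h]][w] += 1

    def prob(w, h):
        c = counts[cluster_of.get(h, 0)]  # unseen histories fall back to cluster 0
        total = sum(c.values())
        return (c[w] + 1) / (total + len(vocab))  # add-one smoothing
    return prob

def forest_prob(trees, w, h):
    """Forest estimate: average the per-tree probabilities."""
    return sum(t(w, h) for t in trees) / len(trees)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
trees = [train_tree(corpus, num_clusters=2, seed=s) for s in range(10)]
print(forest_prob(trees, "sat", "cat"))  # averaged probability of "sat" after "cat"
```

Averaging over many randomized partitions smooths the probability estimates, which is the same effect that interpolating a forest of randomly grown decision-tree language models provides at full scale.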




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Oparin, I., Lamel, L., Gauvain, J.-L. (2010). Large-Scale Language Modeling with Random Forests for Mandarin Chinese Speech-to-Text. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds) Advances in Natural Language Processing. NLP 2010. Lecture Notes in Computer Science, vol 6233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14770-8_31


  • DOI: https://doi.org/10.1007/978-3-642-14770-8_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14769-2

  • Online ISBN: 978-3-642-14770-8

  • eBook Packages: Computer Science (R0)
