Statistical and Linguistic Clustering for Language Modeling in ASR

  • R. Justo
  • I. Torres
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3773)


In this work, several sets of categories obtained by a statistical clustering algorithm, as well as a linguistic set, were used to design category-based language models. The proposed language models were first evaluated, as usual, in terms of perplexity on the text corpus. They were then integrated into an ASR system and evaluated in terms of system performance. The results show that category-based language models can perform better, also in terms of word error rate (WER), when the categories are obtained through statistical clustering rather than linguistic techniques. They also show that better system performance is obtained when the language model interpolates category-based and word-based models.
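As a rough illustration of the ideas named in the abstract, the following minimal sketch builds a word-based bigram model and a category-based bigram model, combines them, and computes perplexity. Everything in it is an assumption made for illustration: the abstract does not give the model family, the smoothing, or the interpolation scheme, so the sketch uses add-one smoothed bigrams, a hard word-to-category map, and linear interpolation with a weight `lam`. The toy corpus, the category labels, and all class and function names are hypothetical.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one smoothed bigram model over any symbol inventory (assumed smoothing)."""

    def __init__(self, sequence, vocab):
        self.vocab_size = len(set(vocab))
        self.context = Counter(sequence[:-1])            # counts of each history symbol
        self.pairs = Counter(zip(sequence, sequence[1:]))  # counts of adjacent pairs

    def prob(self, prev, cur):
        return (self.pairs[(prev, cur)] + 1) / (self.context[prev] + self.vocab_size)


class CategoryBigramModel:
    """P(w | w_prev) approximated as P(w | c(w)) * P(c(w) | c(w_prev)),
    where each word belongs to exactly one category (hard clustering)."""

    def __init__(self, words, word2cat, vocab):
        cats = [word2cat[w] for w in words]
        self.word2cat = word2cat
        self.cat_lm = BigramModel(cats, set(word2cat.values()))
        self.word_counts = Counter(words)
        self.cat_counts = Counter(cats)
        # Number of vocabulary words in each category, used for smoothing.
        self.cat_sizes = Counter(word2cat[w] for w in vocab)

    def prob(self, prev, cur):
        c_prev, c_cur = self.word2cat[prev], self.word2cat[cur]
        # Add-one smoothed membership probability P(w | c).
        membership = (self.word_counts[cur] + 1) / (
            self.cat_counts[c_cur] + self.cat_sizes[c_cur]
        )
        return membership * self.cat_lm.prob(c_prev, c_cur)


def interpolated_prob(word_lm, cat_lm, prev, cur, lam):
    """Linear interpolation of the word-based and category-based estimates."""
    return lam * word_lm.prob(prev, cur) + (1 - lam) * cat_lm.prob(prev, cur)


def perplexity(prob_fn, tokens):
    """Perplexity of a token sequence under a bigram probability function."""
    log_sum = sum(math.log(prob_fn(p, c)) for p, c in zip(tokens, tokens[1:]))
    return math.exp(-log_sum / (len(tokens) - 1))


# Hypothetical toy data; the categories here are hand-assigned labels.
train = "i want a ticket to bilbao i want a ticket to madrid".split()
test = "i want a ticket to madrid".split()
vocab = sorted(set(train))
word2cat = {"i": "PRON", "want": "VERB", "a": "DET", "ticket": "NOUN",
            "to": "PREP", "bilbao": "CITY", "madrid": "CITY"}

word_lm = BigramModel(train, vocab)
cat_lm = CategoryBigramModel(train, word2cat, vocab)

# The interpolated distribution over the next word should sum to one.
total = sum(interpolated_prob(word_lm, cat_lm, "to", w, 0.5) for w in vocab)
ppl = perplexity(lambda p, c: interpolated_prob(word_lm, cat_lm, p, c, 0.5), test)
print(total, ppl)
```

In this sketch, replacing the hand-assigned `word2cat` with a map produced by a clustering algorithm would correspond to the statistical category sets, and varying `lam` between 0 and 1 moves between the pure category-based and the pure word-based model.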


Language Model, Statistical Cluster, Training Corpus, Text Corpus, Speech Recognition System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • R. Justo (1)
  • I. Torres (1)
  1. Departamento de Electricidad y Electrónica, Facultad de Ciencia y Tecnología, Universidad del País Vasco
