A Corpus Balancing Method for Language Model Construction

  • Luis Villaseñor-Pineda
  • Manuel Montes-y-Gómez
  • Manuel Alberto Pérez-Coutiño
  • Dominique Vaufreydaz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2588)


The language model is an important component of any speech recognition system. In this paper, we present a methodology for the lexical enrichment of corpora, aimed at the construction of statistical language models. This methodology comprises, on the one hand, the identification of the set of poorly represented words in a given training corpus and, on the other hand, the enrichment of that corpus through the repeated inclusion of selected text fragments containing these words. The first part of the paper describes the formal details of the methodology; the second part presents experiments and results that validate our method.
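As a rough illustration of the two steps named in the abstract, the following Python sketch first finds vocabulary words that occur too rarely in a training corpus and then repeatedly appends text fragments containing those words. It is not the authors' formal procedure: the frequency threshold `min_count`, the whitespace tokenizer, the external pool `fragments`, the `max_rounds` cap, and all function names are assumptions introduced here for illustration only.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, not from the paper).
    return text.lower().split()


def poorly_represented(corpus, vocabulary, min_count):
    """Return the vocabulary words seen fewer than min_count times in corpus."""
    counts = Counter(tok for sentence in corpus for tok in tokenize(sentence))
    return {word for word in vocabulary if counts[word] < min_count}


def enrich(corpus, vocabulary, fragments, min_count, max_rounds=10):
    """Repeatedly append fragments that contain under-represented words.

    Stops when no word is below min_count, when no fragment can help,
    or after max_rounds passes (hypothetical stopping rules).
    """
    enriched = list(corpus)
    for _ in range(max_rounds):
        critical = poorly_represented(enriched, vocabulary, min_count)
        selected = [f for f in fragments if critical & set(tokenize(f))]
        if not critical or not selected:
            break
        enriched.extend(selected)
    return enriched


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran"]
    vocabulary = {"the", "cat", "dog"}
    fragments = ["a dog barked", "the sun rose"]
    result = enrich(corpus, vocabulary, fragments, min_count=2)
    print(result)
```

In this toy run, only the fragment containing the rare word "dog" is added, and it is added repeatedly until "dog" reaches the assumed threshold. A real system would need a more careful fragment-selection criterion, since naive repetition also inflates the counts of every other word in the selected fragments.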


Language Model, Automatic Speech Recognition, Critical Word, Training Corpus, Speech Recognition System





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Luis Villaseñor-Pineda (1)
  • Manuel Montes-y-Gómez (1)
  • Manuel Alberto Pérez-Coutiño (1)
  • Dominique Vaufreydaz (2)
  1. Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico
  2. Laboratoire CLIPS-IMAG, Université Joseph Fourier, France
