Multiple Model Text Normalization for the Polish Language

  • Łukasz Brocki
  • Krzysztof Marasek
  • Danijel Koržinek
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7661)


This paper describes a text normalization program for the Polish language. The program combines rule-based and statistical approaches to text normalization. The words modelled by this solution were described in three domains: grammar features, word lemmas, and the words themselves. Each word in the lexicon was assigned a suitable element from each of these domains. Three n-gram models, operating on grammar classes, word lemmas, and individual words respectively, were then combined using weights adjusted by an evolution strategy to obtain the final solution. The tool can also produce grammar tags for words to aid in further language model creation.
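The weighted combination of the three models can be illustrated with a minimal sketch. The abstract does not state how the models are combined or which objective the evolution strategy optimizes, so everything specific here is an assumption: the combination is taken to be linear interpolation, the objective is negative log-likelihood on held-out data, and the optimizer is a simple (1+1) evolution strategy. The names `interpolate`, `neg_log_likelihood`, and `evolve_weights` are hypothetical and do not come from the paper.

```python
import math
import random


def interpolate(probs, weights):
    """Linearly interpolate per-model probabilities (assumed combination rule).

    probs   -- (p_word, p_lemma, p_grammar) for one token
    weights -- non-negative weights, normalized here so they sum to 1
    """
    total = sum(weights)
    return sum(w / total * p for w, p in zip(weights, probs))


def neg_log_likelihood(weights, heldout):
    """Assumed objective: negative log-likelihood of held-out tokens."""
    return -sum(math.log(interpolate(p, weights)) for p in heldout)


def evolve_weights(heldout, generations=200, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the parent, keep the child if not worse."""
    rng = random.Random(seed)
    parent = [1.0, 1.0, 1.0]
    parent_cost = neg_log_likelihood(parent, heldout)
    for _ in range(generations):
        # Gaussian mutation, clamped so every weight stays positive.
        child = [max(1e-6, w + rng.gauss(0.0, sigma)) for w in parent]
        child_cost = neg_log_likelihood(child, heldout)
        if child_cost <= parent_cost:
            parent, parent_cost = child, child_cost
    total = sum(parent)
    return [w / total for w in parent], parent_cost


if __name__ == "__main__":
    # Toy held-out data (invented for illustration): probabilities that the
    # word, lemma, and grammar-class models assign to each correct token.
    heldout = [(0.5, 0.1, 0.05), (0.4, 0.2, 0.1), (0.6, 0.05, 0.02)]
    weights, cost = evolve_weights(heldout)
    print(weights, cost)
```

In practice the three probabilities would come from trained n-gram models over words, lemmas, and grammar classes; the toy tuples above only stand in for them.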


Keywords: Language Model, Machine Translation, Automatic Speech Recognition, Text Corpus, Word Sequence
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Łukasz Brocki (1)
  • Krzysztof Marasek (1)
  • Danijel Koržinek (1)
  1. Polish-Japanese Institute of Information Technology, Warsaw, Poland
