Multiple Model Text Normalization for the Polish Language
This paper describes a text normalization program for the Polish language that combines rule-based and statistical approaches. The vocabulary modelled by the solution was described in three domains: grammatical features, word lemmas, and the words themselves. Each word in the lexicon was assigned a corresponding element from each of these domains. Three n-gram models, operating on grammar classes, word lemmas and individual words respectively, were then combined using weights adjusted by an evolution strategy to obtain the final solution. The tool can also produce grammar tags for words to aid in further language model creation.
Keywords: Language Model, Machine Translation, Automatic Speech Recognition, Text Corpus, Word Sequence
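The weighted combination of the three models mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the models are combined, what fitness function was used, or which evolution strategy variant was applied, so everything below is an assumption made for illustration: linear interpolation of the three probabilities, held-out log-likelihood as the fitness, and a simple (1+1) evolution strategy with Gaussian mutation. The function names (`interpolate`, `normalize`, `log_likelihood`, `evolve_weights`) and the sample probabilities are hypothetical.

```python
import math
import random


def interpolate(p_word, p_lemma, p_class, weights):
    """Linearly interpolate the probabilities given by the three models.

    Assumption: linear interpolation; the paper may combine models differently.
    """
    w_word, w_lemma, w_class = weights
    return w_word * p_word + w_lemma * p_lemma + w_class * p_class


def normalize(weights):
    """Keep weights strictly positive and rescale them to sum to 1."""
    clipped = [max(w, 1e-9) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def log_likelihood(weights, samples):
    """Sum of log interpolated probabilities over held-out samples.

    Each sample is a (p_word, p_lemma, p_class) tuple of positive
    probabilities that the three models assign to the same event.
    """
    return sum(math.log(interpolate(*s, weights)) for s in samples)


def evolve_weights(samples, generations=200, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the parent, keep the child if not worse.

    Assumption: the fitness is the held-out log-likelihood; the paper's
    actual objective and strategy parameters are not given in the abstract.
    """
    rng = random.Random(seed)
    parent = [1 / 3, 1 / 3, 1 / 3]
    parent_fit = log_likelihood(parent, samples)
    for _ in range(generations):
        # Gaussian mutation of every weight, then projection back to valid weights.
        child = normalize([w + rng.gauss(0.0, sigma) for w in parent])
        child_fit = log_likelihood(child, samples)
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, parent_fit


if __name__ == "__main__":
    # Made-up probabilities purely for demonstration, not data from the paper.
    held_out = [(0.20, 0.05, 0.01), (0.10, 0.12, 0.02), (0.30, 0.10, 0.05)]
    best_weights, best_fit = evolve_weights(held_out)
    print("weights:", best_weights)
    print("log-likelihood:", best_fit)
```

Because the child replaces the parent only when its fitness is at least as good, the returned weights never score below the uniform starting point on the held-out samples. In practice the three probabilities would come from trained n-gram models over words, lemmas and grammar classes rather than from fixed tuples.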