Analysis of EU Languages Through Text Compression
In this article, we study the differences between European languages using statistical, unsupervised methods. The analysis is conducted at several levels of language: lexical, morphological and syntactic. Our premise is that the difficulty of translating between two languages is reflected in their differences and similarities at these levels. The analyses are based on the concept of Kolmogorov complexity, which is used to compare language structure at the syntactic and morphological levels. The way the languages convey information at these levels is taken as a measure of similarity or dissimilarity between them, and the results are compared to classical linguistic classifications. The results are intended to serve as a tool in developing machine translation systems. For example, if the source language conveys more information at the morphological level and the target language conveys more at the syntactic level, the (machine) translator must be able to transfer information from one level to the other.
Keywords: Machine Translation, Word Order, Kolmogorov Complexity, Head Noun, Statistical Machine Translation
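Kolmogorov complexity itself is not computable, so compression-based studies usually substitute the size of a compressed text as a computable upper-bound proxy. The sketch below illustrates that general idea only; it is not the procedure used in the paper. The choice of zlib as the compressor, the function names `compressed_size` and `ncd`, and the use of the normalized compression distance are all assumptions made for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`.

    Assumption: this serves as a rough, computable stand-in for the
    Kolmogorov complexity of the text; zlib is an illustrative choice.
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share much structure; values near 1
    suggest they share little. This is a hypothetical similarity measure
    for illustration, not the measure defined in the paper.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Applied to parallel texts in two languages, `ncd` gives one possible dissimilarity score between them. Note that zlib only finds repetitions within a 32 KB window, so for longer texts a compressor with a larger window would be a more sensible choice.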