Analysis of EU Languages Through Text Compression

  • Kimmo Kettunen
  • Markus Sadeniemi
  • Tiina Lindh-Knuutila
  • Timo Honkela
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)


In this article, we study the differences between European languages using statistical and unsupervised methods. The analysis is conducted at several levels of language: lexical, morphological and syntactic. Our premise is that the difficulty of translation between two languages is reflected in their differences or similarities at these levels. The analyses are based on the concept of Kolmogorov complexity, which is used to compare language structure at the syntactic and morphological levels. The way the languages convey information at these levels is taken as a measure of similarity or dissimilarity between them, and the results are compared to classical linguistic classification. The results are intended to serve as a tool in developing machine translation systems. For example, if the source language conveys more information at the morphological level and the target language more at the syntactic level, the (machine) translator must be able to transfer the information from one level to another.
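Kolmogorov complexity itself is not computable, so compression-based studies usually approximate it with the output size of a real compressor. The following is a minimal sketch of that idea, not the authors' actual pipeline: it assumes zlib as the compressor and uses the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), as the similarity measure. The function names and the toy sample strings are hypothetical and chosen only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate the Kolmogorov complexity of `data` by its zlib-compressed size."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Small values suggest the two texts share structure that the compressor
    can exploit; values near 1 suggest they share very little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs (assumption: real experiments would use large parallel corpora).
sample_a = ("the cat sat on the mat and the dog sat on the log " * 20).encode("utf-8")
sample_b = ("the cat sat on the mat and the dog sat on the rug " * 20).encode("utf-8")
sample_c = ("kissa istui matolla ja koira istui tukin päällä " * 20).encode("utf-8")

print("a vs b:", ncd(sample_a, sample_b))
print("a vs c:", ncd(sample_a, sample_c))
```

On inputs like these, the two near-identical English samples should come out closer to each other than the English and Finnish samples do. With a general-purpose compressor the absolute values depend on text length and compressor settings, so only relative comparisons between pairs are meaningful.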


Keywords: Machine Translation; Word Order; Kolmogorov Complexity; Head Noun; Statistical Machine Translation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kimmo Kettunen (1)
  • Markus Sadeniemi (2)
  • Tiina Lindh-Knuutila (2)
  • Timo Honkela (2)
  1. Department of Information Studies, University of Tampere, Finland
  2. Laboratory of Computer and Information Science, Helsinki University of Technology, Finland
