Language Identification on the Web: Extending the Dictionary Method

  • Radim Řehůřek
  • Milan Kolkus
Conference paper

DOI: 10.1007/978-3-642-00382-0_29

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5449)
Cite this paper as:
Řehůřek R., Kolkus M. (2009) Language Identification on the Web: Extending the Dictionary Method. In: Gelbukh A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg

Abstract

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include identifying the language of very short texts and correctly handling texts of unknown language and texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.


Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Radim Řehůřek (1)
  • Milan Kolkus (2)
  1. Masaryk University in Brno, Czech Republic
  2. Seznam.cz, a.s., Czech Republic
