Language Identification in Multi-lingual Web-Documents

  • Thomas Mandl
  • Margaryta Shramko
  • Olga Tartakovski
  • Christa Womser-Hacker
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3999)

Abstract

Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in mono- and multi-lingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for the identification of languages in multi-lingual documents. An evaluation for monolingual texts of varied length is presented, with results for eight languages including Ukrainian and Russian. The results show that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The evaluation for multi-lingual documents is based on both short synthetic documents and real-world web documents. Our tool is able to recognize the languages present as well as the location of the language change with reasonable accuracy.
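To illustrate the general idea behind n-gram-based language identification, the following is a minimal sketch and not one of the four algorithms implemented in the tool, which the abstract does not specify. It assumes character trigrams, cosine similarity between count profiles, and two tiny made-up training samples; the names `trigram_profile`, `cosine`, `SAMPLES`, and `identify` are hypothetical.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Toy training samples (assumption: real profiles would be built from large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the text",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist die sprache des textes",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def identify(text):
    """Return the language whose trigram profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Because trigrams are counted over characters rather than whole words, even a very short input yields several features to compare. This is one plausible reason such approaches do well on short texts. For a multi-lingual document, the same function could in principle be applied to a sliding window of words to estimate where the language changes.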


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Thomas Mandl (1)
  • Margaryta Shramko (1)
  • Olga Tartakovski (1)
  • Christa Womser-Hacker (1)
  1. Information Science, Universität Hildesheim, Hildesheim, Germany
