A Comparison of Language Identification Approaches on Short, Query-Style Texts

  • Thomas Gottron
  • Nedim Lipka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5993)


In a multi-language Information Retrieval setting, knowing the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches for language detection on very short, query-style texts. The results show that an accuracy of more than 80% can be achieved even for single words; for slightly longer texts, we observed accuracy values close to 100%.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Thomas Gottron (1)
  • Nedim Lipka (2)
  1. Institut für Informatik, Johannes Gutenberg-Universität Mainz, Mainz, Germany
  2. Faculty of Media, Media Systems, Bauhaus University Weimar, Weimar, Germany
