Abstract
In a multi-language Information Retrieval setting, knowing the language of a user query is important for further processing. We therefore compare the performance of several typical approaches to language detection on very short, query-style texts. The results show that an accuracy of more than 80% can be achieved even for single words; for slightly longer texts, we observed accuracy values close to 100%.
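To make the task concrete, the following is a minimal sketch of one common family of language detection methods: a character trigram model with add-one smoothing, scored in naive Bayes fashion. It is an illustration only and is not claimed to be any of the approaches evaluated in the paper. The class name `NgramLanguageIdentifier`, the helper `char_ngrams`, and the tiny training phrases are all hypothetical and chosen purely for demonstration; a real system would train on a large corpus per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Hypothetical character n-gram classifier (illustrative, not the paper's method)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def train(self, language, texts):
        counter = self.counts.setdefault(language, Counter())
        for text in texts:
            counter.update(char_ngrams(text, self.n))
        self.totals[language] = sum(counter.values())
        self.vocab.update(counter)

    def log_score(self, language, text):
        # Add-one smoothing so unseen n-grams do not yield log(0).
        counter = self.counts[language]
        denom = self.totals[language] + len(self.vocab)
        return sum(math.log((counter[g] + 1) / denom)
                   for g in char_ngrams(text, self.n))

    def identify(self, text):
        # Pick the language whose model assigns the highest log-probability.
        return max(self.counts, key=lambda lang: self.log_score(lang, text))


# Toy training data (made up for this sketch, far too small for real use).
identifier = NgramLanguageIdentifier(n=3)
identifier.train("en", ["the weather is nice today",
                        "where is the train station",
                        "cheap flights to london"])
identifier.train("de", ["das wetter ist heute schön",
                        "wo ist der bahnhof",
                        "billige flüge nach berlin"])

print(identifier.identify("the station"))
print(identifier.identify("bahnhof"))
```

Padding each text with spaces lets the model capture word-initial and word-final character patterns, which matters most when the input is a single word, as in the query-style setting described above.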
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gottron, T., Lipka, N. (2010). A Comparison of Language Identification Approaches on Short, Query-Style Texts. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_59
DOI: https://doi.org/10.1007/978-3-642-12275-0_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12274-3
Online ISBN: 978-3-642-12275-0
eBook Packages: Computer Science, Computer Science (R0)