Language Resources and Evaluation, Volume 50, Issue 4, pp 729–766

TweetLID: a benchmark for tweet language identification

  • Arkaitz Zubiaga
  • Iñaki San Vicente
  • Pablo Gamallo
  • José Ramom Pichel
  • Iñaki Alegria
  • Nora Aranberri
  • Aitzol Ezeiza
  • Víctor Fresno
Original Paper

Abstract

Language identification, the task of determining the language in which a given text is written, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task sheds some light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study these issues within a common setting that enables results to be compared with one another.

Keywords

Language identification, Tweets, Short texts, Multilingualism, Similar languages

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Arkaitz Zubiaga (1)
  • Iñaki San Vicente (2)
  • Pablo Gamallo (3)
  • José Ramom Pichel (4)
  • Iñaki Alegria (5)
  • Nora Aranberri (5)
  • Aitzol Ezeiza (5)
  • Víctor Fresno (6)

  1. University of Warwick, Coventry, UK
  2. Elhuyar, Usurbil, Spain
  3. USC, Santiago de Compostela, Spain
  4. imaxin|software, Santiago de Compostela, Spain
  5. University of the Basque Country, Donostia-San Sebastián, Spain
  6. UNED, Madrid, Spain
