Language Resources and Evaluation, Volume 50, Issue 4, pp 729–766

TweetLID: a benchmark for tweet language identification

  • Arkaitz Zubiaga
  • Iñaki San Vicente
  • Pablo Gamallo
  • José Ramom Pichel
  • Iñaki Alegria
  • Nora Aranberri
  • Aitzol Ezeiza
  • Víctor Fresno
Original Paper

DOI: 10.1007/s10579-015-9317-4

Cite this article as:
Zubiaga, A., Vicente, I.S., Gamallo, P. et al. Lang Resources & Evaluation (2016) 50: 729. doi:10.1007/s10579-015-9317-4

Abstract

Language identification, the task of determining the language in which a given text is written, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism within a single document, and (3) identifying the language of short texts. In this paper, we describe our work on developing a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and we analyze the performance of these systems. Evaluating the results submitted by the shared task participants sheds light on the shortcomings of state-of-the-art language identification systems and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to study these issues further within a common setting that enables results to be compared with one another.

Keywords

  • Language identification
  • Tweets
  • Short texts
  • Multilingualism
  • Similar languages

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Arkaitz Zubiaga (1)
  • Iñaki San Vicente (2)
  • Pablo Gamallo (3)
  • José Ramom Pichel (4)
  • Iñaki Alegria (5)
  • Nora Aranberri (5)
  • Aitzol Ezeiza (5)
  • Víctor Fresno (6)

  1. University of Warwick, Coventry, UK
  2. Elhuyar, Usurbil, Spain
  3. USC, Santiago de Compostela, Spain
  4. imaxin|software, Santiago de Compostela, Spain
  5. University of the Basque Country, Donostia-San Sebastián, Spain
  6. UNED, Madrid, Spain
