Language Resources and Evaluation, Volume 49, Issue 4, pp 883–905

TweetNorm: a benchmark for lexical normalization of Spanish tweets

  • Iñaki Alegria
  • Nora Aranberri
  • Pere R. Comas
  • Víctor Fresno
  • Pablo Gamallo
  • Lluis Padró
  • Iñaki San Vicente
  • Jordi Turmo
  • Arkaitz Zubiaga
Original Paper

Abstract

The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial for subsequent text processing and can consequently help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we recently organized, analyze the performance achieved by the systems submitted to the challenge, and examine the characteristics of those systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. Creating this benchmark and running the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.

Keywords

Lexical normalization, Twitter, Social media, Corpus, Evaluation

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Iñaki Alegria (1)
  • Nora Aranberri (1)
  • Pere R. Comas (2)
  • Víctor Fresno (3)
  • Pablo Gamallo (4)
  • Lluis Padró (2)
  • Iñaki San Vicente (5)
  • Jordi Turmo (2)
  • Arkaitz Zubiaga (6)

  1. IXA, UPV/EHU, San Sebastian, Spain
  2. UPC, Barcelona, Spain
  3. UNED, Madrid, Spain
  4. USC, Santiago de Compostela, Spain
  5. Elhuyar, Usurbil, Spain
  6. University of Warwick, Coventry, UK
