Automatic Detection of Regional Words for Pan-Hispanic Spanish on Twitter

  • Sergio Jimenez
  • George Dueñas
  • Alexander Gelbukh
  • Carlos A. Rodriguez-Diaz
  • Sergio Mancera
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11238)


Languages such as Spanish, which are spoken by hundreds of millions of people across large geographic areas, are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated ‘tweets’. The method combines the notions of specificity (tf-idf), spatial correlation (HSIC) and neural word embeddings (word2vec) to produce a list of words ranked by their degree of regionalism, along with their meaning represented by a set of semantically related words and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users freely contribute regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example ‘tweets’.
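To make two of the ingredients named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each country is treated as one "document" for tf-idf, and that HSIC is computed with Gaussian kernels using the biased estimator trace(KHLH)/n². The function names, the toy vocabulary, the coordinates, and the kernel bandwidths are all illustrative assumptions. Latitude and longitude are treated as plain Euclidean coordinates for simplicity. The word2vec step is omitted because it requires a trained embedding model.

```python
import math
from collections import Counter


def tfidf_by_country(tokens_by_country):
    """Score words per country, treating each country as one 'document' (assumption)."""
    n_countries = len(tokens_by_country)
    counts = {c: Counter(toks) for c, toks in tokens_by_country.items()}
    # Document frequency: number of countries in which each word occurs.
    df = Counter()
    for counter in counts.values():
        df.update(counter.keys())
    scores = {}
    for country, counter in counts.items():
        total = sum(counter.values())
        scores[country] = {
            word: (n / total) * math.log(n_countries / df[word])
            for word, n in counter.items()
        }
    return scores


def gaussian_gram(points, bandwidth):
    """Gram matrix of a Gaussian kernel over a list of coordinate tuples."""
    def kernel(a, b):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-sq_dist / (2 * bandwidth ** 2))
    return [[kernel(a, b) for b in points] for a in points]


def center(gram):
    """Double-center a symmetric Gram matrix (equivalent to H K H)."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, bw_x=1.0, bw_y=1.0):
    """Biased HSIC estimate trace(K H L H) / n^2 with Gaussian kernels."""
    n = len(xs)
    kc = center(gaussian_gram(xs, bw_x))
    lc = center(gaussian_gram(ys, bw_y))
    # For symmetric matrices, trace(Kc Lc) is the sum of elementwise products.
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2


# Toy data (hypothetical, not taken from the paper's corpus).
tokens_by_country = {
    "CO": ["parce", "hola", "parce", "gracias"],
    "MX": ["chido", "hola", "gracias", "chido"],
    "AR": ["che", "hola", "gracias", "che"],
}
scores = tfidf_by_country(tokens_by_country)
top_co = max(scores["CO"], key=scores["CO"].get)

# Eight tweet locations: four near Bogotá, four near Mexico City.
locations = [(4.6, -74.1), (4.7, -74.0), (4.5, -74.2), (4.6, -74.0),
             (19.4, -99.1), (19.5, -99.2), (19.3, -99.0), (19.4, -99.2)]
clustered = [(1.0,)] * 4 + [(0.0,)] * 4   # word used only in the first cluster
scattered = [(1.0,), (0.0,)] * 4          # word used equally in both clusters
hsic_clustered = hsic(locations, clustered, bw_x=5.0)
hsic_scattered = hsic(locations, scattered, bw_x=5.0)
print(top_co, hsic_clustered > hsic_scattered)
```

In this toy setting, a word shared by every country receives a tf-idf score of zero, and a word whose usage is concentrated in one geographic cluster yields a larger HSIC value than a word used evenly across clusters. How the paper combines these scores into a single ranking is not stated in the abstract, so no combination is attempted here.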


Keywords: Spanish regionalisms; Automatic regional words detection; Regionalisms meaning; HSIC; TF-IDF; Word2vec



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Sergio Jimenez (1, corresponding author)
  • George Dueñas (1)
  • Alexander Gelbukh (2)
  • Carlos A. Rodriguez-Diaz (1)
  • Sergio Mancera (1, 2)

  1. Instituto Caro y Cuervo, Bogotá D.C., Colombia
  2. Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico
