Language Variety Identification Using Distributed Representations of Words and Documents

  • Marc Franco-Salvador
  • Francisco Rangel
  • Paolo Rosso
  • Mariona Taulé
  • M. Antònia Martí
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9283)


Language variety identification is an author profiling subtask that aims to detect lexical and semantic variations in order to classify different varieties of the same language. In this work we focus on distributed representations of words and documents obtained with the continuous Skip-gram model. We compare this model with three recent approaches (Information Gain Word-Patterns, TF-IDF graphs and Emotion-labeled Graphs) as well as several baselines. We evaluate the models on the Hispablogs dataset, a new collection of Spanish blogs from five countries that we introduce in this work: Argentina, Chile, Mexico, Peru and Spain. Experimental results show state-of-the-art performance in language variety identification. In addition, our empirical analysis provides insights into the use of the evaluated approaches.


Keywords: Author profiling, Language variety identification, Distributed representations, Information Gain Word-Patterns, TF-IDF graphs, Emotion-labeled Graphs





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Marc Franco-Salvador (1)
  • Francisco Rangel (1, 2)
  • Paolo Rosso (1)
  • Mariona Taulé (3)
  • M. Antònia Martí (3)
  1. Universitat Politècnica de València, Valencia, Spain
  2. Autoritas Consulting S.A., Madrid, Spain
  3. Universitat de Barcelona, Barcelona, Spain
