Spanish corpora for sentiment analysis: a survey

  • Survey
Language Resources and Evaluation

Abstract

Corpora play an important role in training machine learning systems for sentiment analysis. However, Spanish is underrepresented among them, as most consist primarily of English texts. This paper describes 20 Spanish-language text corpora collected to support different tasks related to sentiment analysis, ranging from polarity to emotion categorization. We present a new framework for the characterization of corpora, which includes a number of features to help analyze resources at both corpus level and document level. Besides depicting the overall landscape of corpora in Spanish, this survey supports sentiment analysis practitioners in selecting the most suitable resources.

Notes

  1. https://www.tripadvisor.es/.

  2. http://blade10.cs.upc.edu/freeling-old/doc/tagsets/tagset-es.html.

  3. http://clic.ub.edu/corpus/es/node/106.

  4. http://johnrbto.com/hopinion/.

  5. http://sinai.ujaen.es/coar/.

  6. These are the numbers given in the corpus description; however, the data itself differs slightly: 34,615 tweets (17,311 negative and 17,304 positive).

  7. http://sinai.ujaen.es/cost/.

  8. http://masquemedicos.com.

  9. The source website asks for most and least positive aspects of users’ experiences.

  10. http://sinai.ujaen.es/copos-2/.

  11. http://hlt.isti.cnr.it/trip-maml/.

  12. https://www.mimedicamento.es.

  13. http://sinai.ujaen.es/dos/.

  14. http://alt.qcri.org/semeval2016/.

  15. http://www.bcnrestaurantes.com/, http://www.restaurantes-zaragoza.es.

  16. http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools.

  17. http://metashare.ilsp.gr:8080/repository/search/?q=semeval+2016+Spanish.

  18. https://gplsi.dlsi.ua.es/gplsi13/es/node/344.

  19. http://www.muchocine.net.

  20. http://www.lsi.us.es/~fermin/corpusCine.zip.

  21. http://www.ciao.es/.

  22. http://www.sfu.ca/~mtaboada/download/downloadCorpusSpa.html.

  23. http://sinai.ujaen.es/sfu-review-sp-neg/, http://clic.ub.edu/corpus/es/node/171.

  24. http://www.evall.uned.es/.

  25. http://nlp.uned.es/replab2013/.

  26. http://ow.ly/uQWEs.

  27. https://permid.org/.

  28. http://dbpedia.org/.

  29. http://sabcorpus.linkeddata.es/.

  30. http://mascorpus.linkeddata.es/.

  31. https://zenodo.org/record/1293493#.W3O_V-gzbIU.

  32. TASS, an annual Spanish workshop on sentiment analysis, releases datasets for different tasks every year, some newly built for that year and others reused from previous editions. The tagged datasets come from different editions of this workshop. To be granted access to the TASS corpora, a Research/Non-Commercial License Agreement must be signed and sent to the organizers; more information can be found on the website of each edition (where the schema for reading the XML files is also provided).

  33. http://www.sepln.org/workshops/tass/tass_data/download.php.

  34. http://www.sepln.org/workshops/tass/2017/.

  35. http://www.sepln.org/workshops/tass/2013/corpus.php.

  36. http://www.sepln.org/workshops/tass/2015/tass2015.php#corpus, http://www.sepln.org/workshops/tass/2016/tass2016.php#corpus.

  37. It is considered freely available unless explicitly stated otherwise.

  38. http://catalogo.retele.linkeddata.es.

Author information

Corresponding author

Correspondence to María Navas-Loro.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work has been partially funded by a Predoctoral Grant from the I+D+i program of the Universidad Politécnica de Madrid and by Project Datos 4.0 (TIN2016-78011-C4-2-R).

About this article

Cite this article

Navas-Loro, M., Rodríguez-Doncel, V. Spanish corpora for sentiment analysis: a survey. Lang Resources & Evaluation 54, 303–340 (2020). https://doi.org/10.1007/s10579-019-09470-8
