Abstract
Corpora play an important role when training machine learning systems for sentiment analysis. However, Spanish is underrepresented in these corpora, as most of them consist primarily of English texts. This paper describes 20 Spanish-language text corpora collected to support different tasks related to sentiment analysis, ranging from polarity to emotion categorization. We present a new framework for characterizing corpora, which includes a number of features to help analyze resources at both the corpus level and the document level. Besides depicting the overall landscape of corpora in Spanish, this survey supports sentiment analysis practitioners in selecting the most suitable resources.
Notes
These are the numbers given in the corpus description; however, the data itself differs slightly: 34,615 tweets (17,311 negative and 17,304 positive).
The source website asks for most and least positive aspects of users’ experiences.
TASS, an annual Spanish workshop on sentiment analysis, releases datasets for different tasks every year; some are newly built for that year and others are reused from previous editions. The tagged datasets come from different editions of this workshop. To be granted access to TASS corpora, a Research/Non-Commercial License Agreement must be signed and sent to the organizers; more information can be found on the website of each edition (where the schema for reading the XML files is also provided).
A corpus is considered to be freely available unless explicitly stated otherwise.
This work has been partially funded by a Predoctoral Grant from the I+D+i program of the Universidad Politécnica de Madrid and by Project Datos 4.0 (TIN2016-78011-C4-2-R).
Cite this article
Navas-Loro, M., Rodríguez-Doncel, V. Spanish corpora for sentiment analysis: a survey. Lang Resources & Evaluation 54, 303–340 (2020). https://doi.org/10.1007/s10579-019-09470-8
DOI: https://doi.org/10.1007/s10579-019-09470-8