DISCO PAL: Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels

Nowadays, there are many applications of text mining over corpora in different languages. However, most of them are based on prose texts, and applications that work with poetry are scarce. One example of text mining applied to poetry is the use of features derived from the individual words of a poem in order to capture its lexical, sublexical and interlexical meaning, and infer the General Affective Meaning (GAM) of the text. However, even though this proposal has proved useful for poetry in some languages, there is a lack of studies both for Spanish poetry and for highly structured poetic compositions such as sonnets. This article presents a study of an annotated corpus of Spanish sonnets, in order to analyse whether it is possible to build features from their individual words for predicting their GAM. The purpose is to model sonnets at an affective level. The article also analyses the relationship between the GAM of the sonnets and the content itself. For this, we consider the content from a psychological perspective, using tags to identify when a sonnet is related to a specific term. Then, we study how the GAM changes according to each of those psychological terms. The corpus contains 274 Spanish sonnets by authors from different centuries, from the 15th to the 19th. It was annotated by different domain experts with affective and lexico-semantic features, as well as with domain concepts that belong to psychology. Thanks to this, the corpus of sonnets can be used in different applications, such as poetry recommender systems, personality text mining studies of the authors, or the use of poetry for therapeutic purposes.


Introduction
Text mining techniques aim to extract insights from a text and discover patterns within it using different kinds of information from that text. As an example, the information contained in a text could be related to its syntactical structure, to its semantical meaning, or it can even consider information sources such as its affective value (e.g. if a text inspires a certain emotion when read). This is the base for many pieces of research, such as sentiment analysis. Sentiment analysis, also called "opinion mining", is the field of study that analyses people's opinions, feelings, assessments, attitudes, and emotions towards entities such as products, services, organizations, individuals, problems, events, topics and their attributes (Liu, 2012).
Thus, this is applicable to the field of text mining, where these feelings can be inferred using input information derived from the text itself. This is the case of the inference of the General Affective Meaning (GAM) (Aryani et al., 2016) of the text. In a text, GAM can be obtained from its semantic information, the affective information of the individual words which compose it, the type of text used, its syntactic characteristics, and so on. That initial information generates features that represent a text, and those features serve as input for a function whose output corresponds to GAM tags. Examples of such features are valence and arousal (Russell, 2003; Tsur, 1992; Watson & Tellegen, 1985; Wundt, 1874).
Regarding those functions, they can use prior information about the GAM tags (supervised) or not (unsupervised). An example for the first case is the usage of Supervised Machine Learning (ML) algorithms. Here, we need to know the GAM of some texts in order to train the supervised ML model with the input features so as to be able to obtain the GAM for the texts where it is not known.
However, those approaches are not applicable for the unsupervised scenario where there are no GAM tags available.
GAM tags are not the only type of variable that can be used to model the global meaning of a text. Other variables may be considered, including words related to the semantic meaning of the text, such as the relation between definitions and their associated words in a dictionary (Noraset et al., 2017).
All the different kinds of information contained within a text (semantic, syntactic, affective...) depend on the type of text considered. Because of this, the approach will be different depending on whether the text is, for example, prose or verse. It will also depend on the language used in the text.
That said, there are not many corpora available to perform data mining tasks on poetry texts, and even fewer for the Spanish language. There are fewer options still related to GAM and poetry. There are available corpora for Spanish poetry, such as the DISCO corpus (Ruiz et al., 2018), but the annotations included within it do not provide information that can be used directly for text modelling tasks, such as obtaining the GAM. The reason is that DISCO only includes metadata about authors, sonnet scansion, rhyme scheme and enjambment. For this reason, the present research extends the available corpora for text mining tasks with Spanish poetry by presenting a corpus of Spanish sonnets from different time periods, annotated with both affective and lexico-semantic labels, in order to contribute to text mining research in both areas. The article presents DISCO PAL, Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels (released together with this paper), a corpus annotated by POSTDATA experts in both literature and digital humanities. The POSTDATA project states that making "poetry available online as machine-readable data will open a great world of possibilities of linking, indexing and extracting new information". This corpus includes binary labels for a group of concepts, depending on whether each concept appears within the text or not. The concepts used all belong to the psychological domain.
Overall, the main contributions of this article are to:
-Define a methodology for unsupervised GAM modelling of a corpus of Spanish sonnets, based on previous works of GAM modelling for poetry in other languages. The proposal uses as input data sources public lexicons with the affective meaning of individual words in Spanish, in order to build affective and semantic features that infer the GAM for the whole text.
-Validate the unsupervised GAM proposal by using a corpus of Spanish sonnets (DISCO PAL) annotated by different domain experts. This corpus contains annotations for the same features generated by the GAM modelling. The annotation values depend on the intensity of each variable within each sonnet.
-Analyse how the content influences the GAM generated. For this, the experts also annotated values for labels of psychological concepts that are expressed through each sonnet.
-Provide the DISCO PAL corpus for future research, highlighting possible ways to use it for data mining on poetry, mainly through the affective and semantic modelling of texts.
The structure of this article is as follows. After the Introduction presented in this first Section, the second Section summarizes the state of the art (SOTA) for the areas relevant to this article: first, the SOTA related to data mining of affective information in poems; then, the SOTA related to the affective modelling of the Spanish language using public lexicons that model individual words. After that, the third Section presents the DISCO PAL corpus annotated by POSTDATA experts in digital humanities, analysing the agreement between the annotators and the reliability of the corpus. It also follows the research applied to poetry in other languages in order to build features based on the affective and lexico-semantic values of individual words (using public lexicons for affective and lexico-semantic word modelling in Spanish). Then, we check whether the generated features capture the GAM of the sonnets by comparing the inferred values against the ones annotated by the POSTDATA experts. The last Section discusses the potential lines of research that could be carried out thanks to this corpus. It also includes a summary of the conclusions of this article.

Related Work
This Section presents a brief review of the related work. As we mentioned in the Introduction, a text contains information related to different areas such as semantics, syntax or affective states. This is applicable to all kinds of texts, including poetic ones. Since this article presents research on the affective modelling of Spanish poetry, the main area covered in this Section is the data mining of affective information in poetry. We complement this with a subsection describing some public lexicons for the affective modelling of individual Spanish words.

Data mining of affective information in poetry
As previously indicated, texts in general, and particularly poetic ones, contain affective information that can be extracted using different techniques. An example of this is aggregating the affective values of the individual words present in the text. It is important to quantify this affective contribution of poetic texts in order to work with them computationally. Thus, the task consists of detecting which elements of the text are especially relevant in order to calculate through them the affective contribution of the whole poem. The articles discussed below analyse different ways of extracting and quantifying affective aspects from poetic texts.
In order to model the GAM of a poetic text, the poem needs to be expressed through a set of relevant features that are linked to the GAM by a relationship expressed as a mathematical function. From here, there are two possibilities. The first arises when the GAM value is known for some poems: the function can then be fitted to generalize the relationship between those values and the features extracted from the poems. This can be used later on to infer the GAM of poems for which there is no prior information about it (and only the extracted features are available). This scenario is approached in (Sreeja & Mahalakshmi, 2018) with supervised ML models. Here, the authors provide a corpus of 736 English poems annotated with 9 affective labels (including love, anger, hate, sadness, joy and surprise), and use it to train an ensemble of supervised ML models. They begin by extracting a set of relevant features from the poems related to semantic, linguistic and orthographic aspects, as well as some statistical features (term frequency and inverse document frequency). They also use poetic features extracted with rule-based methods, which include information related to similes and metaphors. Those features are used together with the annotated affective labels in order to train ML models that can predict the GAM value for new poems. It is also worth mentioning that the authors state that this article is "the first attempt to identify emotions from English poems".
A similar approach was considered before for Arabic poetry in (Alsharif et al., 2013). Here, the authors built a corpus of Arabic poems annotating them with different emotions. Then, they extracted a set of relevant features from the poems based on the occurrences of different words within them (unigrams). With these feature vectors, they trained different supervised ML models (Support Vector Machines, Naïve Bayes, Voting Features Intervals and Hyperpipes) to predict the emotion values.
In (A. M. Jacobs et al., 2017), the authors first use a Quantitative Narrative Analysis tool and 11 questions to find relevant features that model the 154 sonnets of Shakespeare. These features include affective variables (such as the valence or the mood potential). Then, they use an ML model trained with these features to classify the sonnets into two categories, "young man" poems or "dark lady" ones.
There are some available corpora of sonnets that include affective annotations by experts that can be used for tasks like the ones mentioned before. In (Sreeja & Mahalakshmi, 2019), the authors built a benchmark corpus for poetry named PERC (Poem Emotion Recognition Corpus). This corpus includes poems from Indian authors in English. For those poems, they include 9 emotions that are annotated in the corpus by several experts (Love, Sadness, Joy, Fear, Hate, Courage, Anger, Surprise and Peace). The corpus provided by the authors includes 1850 poems from 10 authors from 1850 to the present day. Similarly, in (Haider et al., 2020), the authors provide an annotated corpus that includes 158 German poems along with 64 poems in English. The annotations include 9 affective features (Vitality, Uneasiness, Suspense, Sadness, Nostalgia, Humor, Beauty/Joy, Awe/Sublime and Annoyance).
Beyond these supervised proposals, other authors have tackled the problem in an unsupervised manner. In (Barros et al., 2013), the authors obtain the GAM by counting how many instances of words such as fear or joy appear within a set of Quevedo's poems. Therefore, no prior annotated values are used to infer the GAM of a poem. The final extracted GAM values are used to automatically annotate that corpus.
This last paper, in fact, deals with the GAM extraction from Spanish poems. However, there are no more corpora beyond this one to the best of our knowledge. In fact, more recent research studies of the topic for data mining with poetry, such as (Kaur & Saini, 2017) or (Sreeja & Mahalakshmi, 2019), only list that corpus of Quevedo's poems annotated with sentiment labels according to the presence of certain words as Spanish corpora sources for data mining and GAM modelling.
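The counting idea behind this unsupervised approach can be illustrated with a minimal sketch. It is only one possible reading of the description above, not the procedure of (Barros et al., 2013): the seed word lists in `EMOTION_WORDS`, the tokenization and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical seed words per emotion (illustrative only, not taken from any cited work).
EMOTION_WORDS = {
    "fear": {"miedo", "temor"},
    "joy": {"alegría", "gozo"},
}


def count_emotion_words(poem: str) -> Counter:
    """Count how many tokens of the poem belong to each emotion word list."""
    tokens = re.findall(r"\w+", poem.lower())
    counts = Counter()
    for emotion, words in EMOTION_WORDS.items():
        counts[emotion] = sum(1 for token in tokens if token in words)
    return counts


def dominant_emotion(poem: str):
    """Return the emotion with the most matches, or None if nothing matched."""
    emotion, n = count_emotion_words(poem).most_common(1)[0]
    return emotion if n > 0 else None
```

No annotated labels are needed here: the tag is derived from the text alone, which is what makes the approach unsupervised.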
Features can be obtained by modelling the whole text or by modelling the individual stanzas of the poem. For the case of affective features, they can be inferred using as input the individual affective values of the words that appear in the text as long as there is an available lexicon that contains those individual affective values, such as BAWL in (Ullrich et al., 2017). The authors perform data mining of affective content in poetic texts for German language. The authors explore how the features of a poetic text (at sub-lexical, lexical and inter-lexical level) influence the GAM that is perceived. Thus, their research serves as an example to see which affective features are relevant to a text based on how related they are to the GAM. To calculate those features they use the BAWL database for German words. This database contains affective values for individual words that belong to German, and they aggregate these individual values into a global value that models the GAM for the whole poem.
Their poem corpus consists of 57 poems from the German author H. M. Enzensberger. These poems are annotated by a group of readers with the following features:
1. Rating on a 7-point scale for the valence, where -3 would be very negative, 0 neutral and 3 very positive.
2. Rating on a 5-point scale for the arousal (level of excitement of the text of the poem, which goes from texts that inspire peacefulness or calmness, to others that seek to motivate or are more exciting), where 1 is very calm and 5 very exciting.
3. Rating on a scale of 1 to 5 for the level of friendliness, where 1 indicates that the text is not friendly and 5 that it is very friendly.
4. Rating on a scale of 1 to 5 for the level of sadness, where 1 would be that the text is not sad, and 5 that it is very sad.
5. Rating from 1 to 5 for the level of malevolence.
6. Rating from 1 to 5 indicating if they liked the poem a lot or not (5, a lot).
7. Rating from 1 to 5 for the level of poeticity, where 5 would indicate that the poem is very poetic and 1 that it is not very poetic.
8. Rating from 1 to 5 for the level of onomatopoeia (level that quantifies the use of this literary resource), where 5 would indicate a very frequent usage.
These global annotations by readers are used to analyse their correlation with different features derived from the individual words that appear within the text (not considering stopwords). The purpose of this study is to check whether the features could be used for predicting the GAM of the poem. As mentioned before, the features belong to three different levels: sub-lexical, lexical and inter-lexical. The lexical level captures the average valence and arousal values of the words present in the text, the inter-lexical level quantifies peaks, ranges and changes within the lexical affective content, and the sub-lexical level considers aspects such as the phonological information of the poems. All this defines 55 affective features (using the 3 levels described above). Approximately 50 percent of the explained variance is reached using only the lexical features, and together with the inter-lexical ones, the explained variance reaches 75 percent. This indicates that the best predictors would be the ones related to these two levels, particularly the average valence and the average arousal derived from the individual words.
Of course, considering only these two features would indicate that the order of the words in the text is irrelevant for the affective impact, and that is not the case; the order matters, and experiencing crescendos or affective decrescendos is something fundamental, so the span of the level of excitation is another key aspect to take into account. Together with that, the article also considers how the valence and arousal levels evolve during the poem. This is important because, for example, poems are generally perceived as sadder when the valence of words is decreasing (more negative) and when the arousal at the end is lower, and poems are perceived as friendlier when the valence of the last words of the text is more positive. In this way it is important to consider the correlation coefficient between the vector of affectivity (arousal / valence) of the individual words with the vector of their positions in the text.
We find this article particularly relevant for our studies since it presents a thorough methodology for GAM extraction that concludes in good results.
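The lexical and inter-lexical features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of (Ullrich et al., 2017): the lexicon is assumed to be a plain dictionary from word to a hypothetical (valence, arousal) pair, and the feature names are our own.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns None if it is undefined (too few points or no variance)."""
    if len(xs) < 2:
        return None
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def affective_features(tokens, lexicon):
    """Aggregate word-level (valence, arousal) values into poem-level features.

    tokens: list of words in reading order (stopwords assumed already removed).
    lexicon: dict mapping word -> (valence, arousal); hypothetical format.
    """
    rated = [(i, lexicon[t]) for i, t in enumerate(tokens) if t in lexicon]
    if not rated:
        return None
    positions = [i for i, _ in rated]
    valences = [v for _, (v, _a) in rated]
    arousals = [a for _, (_v, a) in rated]
    return {
        # Lexical level: averages over the rated words.
        "valence_mean": mean(valences),
        "arousal_mean": mean(arousals),
        # Inter-lexical level: span of arousal and trends along the text.
        "arousal_span": max(arousals) - min(arousals),
        "valence_position_corr": pearson(positions, valences),
        "arousal_position_corr": pearson(positions, arousals),
    }
```

A negative `valence_position_corr` would correspond to a poem whose words become more negative towards the end, the pattern that the study associates with poems perceived as sadder.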
It is important to remember that poetry is a huge genre with many different styles, and the style will influence the affective modelling and the GAM extracted. This is indicated in the work of (Obermeier et al., 2013), which studies how poetry influences affective states through certain aesthetic and emotional elements, such as the meter of the poem and its rhyme. Thus, the starting hypothesis is that meter and rhyme have an impact on aesthetic perception, emotional involvement and valence. This indicates that the GAM of a poem will vary depending on whether the poem's style includes meter and rhyme or not. To verify this, the authors analyse the influence of meter and rhyme on the aesthetic and emotional perception of poetry, as well as their interaction with the lexicon, using the stanzas of the poems as references for the study. For that, they work with a group of 60 adults who listened to audio recordings of German poems (100 poems from the 19th and 20th centuries). The poems had stanzas of 4 verses, and there were sets of poems with lexical differences (for instance, real words vs pseudowords; pseudowords were modified words that kept the vowels but changed some consonants, ensuring that they were still pronounceable). Poems were also divided depending on whether they had rhymes or not, and on whether they had accent or not. With this, the participants rated four dimensions for the poems that they were listening to: liking (aesthetic appreciation), intensity (power of emotional response), perceived emotion (emotion that was expressed within the stanza) and felt emotion (emotion experienced by the participants).
The results are as follows:
-Liking: aesthetic ratings were better for poetry with meter, as well as for stanzas with rhyme compared to those without it.
-Intensity rating: for all kinds of poems, the results were better with the stanzas that contain real words and not pseudowords.
-Perceived emotion: influence of lexicon, meter and rhyme (especially the last two); best score for stanzas with pseudowords if they do not have meter versus those that do have it. This last difference does not appear for poems with only real words.
-Felt emotion: the main influence is the rhyme. There is also a triple interaction between lexicon, meter and rhyme. When there is rhyme, the emotion felt is stronger.
Thus, meter and rhyme reinforce the perceived emotion of a poem, which is expressed through the GAM. This serves as a basis to consider sonnets as good candidates for our studies regarding GAM extraction, since they are structured poems with rhyme and meter. Due to this, we will focus our analyses not only on Spanish poetry, following some of the steps of (Ullrich et al., 2017), but particularly on sonnets, as they always guarantee the metrical structure that enhances the GAM of the text.
As a last comment, however, the literature indicates some caveats and difficulties regarding the affective modelling of poetry. This appears in (Eastman, 2015), where the authors propose a solution for affective computing in relationship to poetry. This article addresses two relevant issues in this regard. On the one hand, it reminds us how poetry widely uses metaphors and figurative language (words open to many meanings and interpretations). This makes the extraction of affective information not always as obvious as simply assigning to each word a value contained in a repository and then composing all the individual values. Metaphors are also interpreted in a large part from the subjectivity of the reader and from their personal experience, so it is not trivial and immediate to incorporate all the possible information. On the other hand, it also mentions that the understanding of the words of a poetic text should not only be done based on the text itself; a poem by a given author can be understood in greater depth when compared with other poems by that author or with poems of other authors. Due to this, it is important to note that the comprehension of a text, and hence the context for the individual words, is best achieved if the words are understood not only within the context of a specific poem or a specific author but in a bigger context that includes poems from other authors. This is something important in any text comprehension task, but it is even more critical for poems where the language used is sometimes full of metaphors and other stylistic figures not so easily understood. The proper comprehension of the text is important for both the semantic modelling of the poem as well as for the affective one, which means that the GAM extraction will be influenced by the context considered.
These previous works show how GAM extraction for poetry is tackled with both supervised and unsupervised approaches, covering poems from many different languages. However, there are few studies regarding Spanish poetry, with no references to sonnets in particular. Whereas there are works that both analyse and provide an annotated poetry corpus with GAM values for German, Arabic and English texts, there is no equivalent, to the best of our knowledge, for Spanish. Therefore, we find a research need regarding both the GAM extraction process for Spanish poetry and the provision of an annotated corpus for future research. Due to this, we focus our analysis on Spanish poetry, using sonnets in particular because of their stable structure and the presence of meter and rhyme. We follow the steps of (Ullrich et al., 2017), since they reach good results in the GAM extraction process while also referring to an annotated corpus. We will extract the GAM for sonnets in an unsupervised manner, and check the quality of those GAM values by comparing the results against their counterpart values annotated by different experts.

Lexicons for affective word modelling in Spanish
Just like BAWL, as mentioned in (Ullrich et al., 2017), is a lexicon used as source information for the affective modelling of individual words, there are similar lexicons for Spanish vocabulary. Some of these lexicons are described below.
In (Ferré et al., 2017), the authors provide 2267 Spanish words (along with their English translations) with the following variables:
-Spanish Word: word in Spanish.
-English Translation: English translation of that word.
-Hap Mean: mean value associated with this affective state (happiness) for that word from the individual ratings given by the subjects.
-Hap SD: standard deviation associated with this affective state (happiness) for that word from the individual ratings given by the subjects.
In (Guasch et al., 2016), the authors provide 1400 Spanish words with the following variables:
-ID: auto-incremental field.
-Word: word in Spanish.
-English Trans.: English translation of the word.
-POS: Part of Speech (POS) tag for that word.
-VAL M: mean value of the valence for that word from the individual ratings given by the subjects.
-VAL SD: standard deviation of the valence for that word from the individual ratings given by the subjects.
-VAL N: number of subjects used to obtain the valence values.
Regarding the concepts used, Concreteness is defined as the degree of specificity of the word, with 1 representing when the word is very abstract and 7 when it is very concrete. Words like 'object' are more abstract than others like 'table'.
Imageability is defined as the easiness or difficulty of constructing a mental image associated with that word, with 1 representing when the word is very difficult to imagine and 7 when it is very easy. It is easier to imagine something with words like 'flag' than with others like 'charity'.
Context availability is defined as the easiness or difficulty in associating that word with a context in which it could appear, with 1 representing when the word is very difficult to associate with a context and 7 when it is very easy. It is easier to construct sentences or search for usage examples for words like 'table' than for others like 'citizenship'.
Familiarity is defined as the degree of familiarity, with 1 representing when the word is not very familiar and 7 otherwise. A word like 'fish' is more familiar than others like 'quark'.
-ValenceMean: mean value of the valence for that word from the individual ratings given by the subjects.
-ArousalMean: mean value of the arousal for that word from the individual ratings given by the subjects.
-ValenceSD: standard deviation of the valence values for that word from the individual ratings given by the subjects.
-ArousalSD: standard deviation of the arousal values for that word from the individual ratings given by the subjects.
-% ValenceRaters: percentage of the total subjects that have given a value to the valence.
-% ArousalRaters: percentage of the total subjects that have given a value to the arousal.
This last work is complemented with (Stadthagen-González et al., 2018), where the authors provide a larger lexicon (10491 words) that includes, together with valence and arousal variables (mean and deviation), the mean and standard deviation values for the following affective states: happiness, disgust, anger, fear, sadness. The authors also include a column called "Few Raters" that indicates whether the number of subjects used for that word is small or not, together with the dominant POS associated to that word.
Similarly, (Hinojosa et al., 2016) also includes affective values for valence, arousal, happiness, disgust, anger, fear, sadness but for 875 words. This last lexicon also includes the value for concreteness.
In (Alonso et al., 2015), the authors describe, for 7040 words, other characteristics such as the average age at which a word is usually learned (averageAoA, Age of Acquisition), the minimum (Min) and maximum (Max) age and the deviation in these age data (SD), as well as the literary frequency with which it is usually found.
Finally, in (Pérez-Sánchez et al., 2021) the authors provide the affective values for 1286 words. The lexicon includes some of the variables already seen, like the affective states in terms of valence, arousal, happiness, disgust, anger, fear and sadness. Some of these words already appeared in previous lexicons, like (Stadthagen-González et al., 2018), and the authors reuse the affective values from those sources. When that happens, they indicate it with a specific column (e.g. "Fear Source"). Other words are new and did not appear before. Along with these variables, they include new ones, like the AoA and the dominant emotion associated with that word (e.g. "happiness" for "amar" (to love)). They also include a new field called "emotionality" that quantifies how strongly that word denotes an emotion.
These are, to the best of our knowledge, the main lexicons of affective values for Spanish words that can serve as an equivalent to BAWL. We consider these lexicons because the affective values associated with the words were obtained from a general public of different ages, as opposed to more recent lexicons like (Sabater et al., 2020), where the people involved were children and adolescents.
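Since these lexicons overlap only partially, one practical question is how to combine them into a single word lookup. The following is a minimal sketch under assumptions of our own: each lexicon is assumed to be already loaded as a dictionary from word to a (valence, arousal) pair, and the priority rule (earlier lexicons win) and the function names are hypothetical.

```python
def merge_lexicons(lexicons):
    """Merge several word -> (valence, arousal) dictionaries.

    Lexicons are given in priority order: when a word appears in more than one,
    the value from the earliest lexicon is kept (an assumed policy).
    """
    merged = {}
    for lexicon in lexicons:
        for word, values in lexicon.items():
            merged.setdefault(word.lower(), values)
    return merged


def coverage(tokens, lexicon):
    """Fraction of tokens that have an entry in the lexicon."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)
```

Checking `coverage` on a sonnet gives a quick indication of how much of its vocabulary can actually contribute to word-based features, which may be relevant for texts written in older Spanish.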

Poetry and Psychology
As we mentioned before, poetry contains an affective dimension that may evoke different sentiments, which can be quantified by inferring its GAM. But the affective dimension is not the only one present in a poem. Poems are also a way to express the psychological state of the author, as indicated in (Czernianin, 2016). Here, the article shows how poetry is used as a way to discharge the mood of the authors. In fact, they analyse several poems to see how some of their content reflects psychological states such as suffering, happiness or hedonism. Following this, the psychological state of the author is reflected in the poem, and that also evokes a particular psychological state in the reader, as mentioned in (Kao & Jurafsky, 2012). Here, the authors mention both how poetry is used as a way to explore and express emotions, and how it causes psychological states such as catharsis in the readers. In fact, (Parastoo et al., 2016) conduct a study in which they analyse how reading poetry can be used as a therapy to treat psychotic patients. Thus, poetry can influence the reader's state to the point that it can even be used as a therapy to change or mitigate a particularly pernicious psychological state. Complementing this, (Shapiro & Rucker, 2003) show how including poetry within a medical student program enhances dimensions such as empathy, altruism, compassion, and caring toward patients. The connection between psychology and emotions also appears in (A. Jacobs et al., 2016), where the authors indicate how "psychological pleasure" is connected to the beauty of the text, which is often expressed through several emotions that are provoked in the reader. Indeed, in this work the authors conduct research on Elementary Affective Decisions, and analyse how the decision process is influenced by basic emotions (e.g. happiness or disgust) within the context of Neurocognitive Poetics. Following this, (A. M. Jacobs, 2019) shows that the connection between psychological and affective states can also appear within the characters of a literary work. Here, even though it is studied for prose texts, the authors model the affective state of characters (valence, arousal...), as well as the personality profile through a model inspired by the Big-5 (friendly, affectionate, hostile...).
Thus, it is interesting to know not only what affective states and sentiments are evoked by a poem (captured in the GAM), but also what psychological state the poem evokes, in order to contribute to its usage within all those contexts aforementioned. However, to the best of our knowledge there are no corpora that identify different groups of poems according to the psychological states that they reflect. Due to this, we find a research need in providing an annotated corpus of poems that identifies different subsets according to some psychological states, identified by tags.
Also, since poetry evokes affective and psychological states that are intertwined, it is important to quantify how the GAM changes according to the psychological state represented in its content.

Methodology
The methodology proposed consists of inferring the GAM of a sonnet based on the individual contribution of its words, and then validating it against a gold-standard labelled corpus. Thus, we define an unsupervised approach to build the GAM and then we use domain knowledge to check it. This Section first introduces the corpus included in this paper, the Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels (DISCO PAL). We begin by presenting the participants who annotated the corpus, and after that we describe the corpus itself. We conclude by introducing the methodology used, which includes the input data sources and the features built from them.

Participants
As mentioned before, the features were annotated by three experts in digital humanities, literature and linguistics, belonging to the POSTDATA project.
The experts have annotated the sonnets independently (without knowing the annotations from the other experts) and following the same sonnet order. They did not know the author or the time period of the different sonnets; they only had access to the text itself. This was done in order to mitigate bias in their judgement. They used a csv file with rows containing the sonnet texts and columns with the different variables. Each of them assigned a value within the available range in the corresponding column. Those experts have individually annotated the same 274 sonnets for all the features described below.

Materials
DISCO PAL is a subset of a larger corpus, DISCO (Ruiz et al., 2018), which consists of 4085 sonnets in Spanish from the 15th to the 19th century. From that corpus, in order to create DISCO PAL, the POSTDATA experts have annotated a subset of 274 sonnets, with 184 belonging to the 19th century, 9 belonging to the 18th century and the other 81 belonging to the interval from the 15th to the 17th century. This is a relevant fact to consider because some sonnets are written in old Spanish, something that can significantly affect all the text mining analyses applied to the poems. Also, the number of authors is 52, and only 3 of them are women (covering 12% of the total sonnets). With that, the corpus provided is rich, with many different authors belonging to different centuries, in line with the proposals of the scientific literature (Eastman, 2015).
The mean number of words per sonnet is 51.6, the standard deviation is 5.9, and the associated histogram can be seen in Figure 1.
There are three types of annotated features: affective, lexico-semantic and psychological. Affective features are detailed in Table 1 and have a range of 1 to 4, with 1 being the minimum value (the sonnet barely inspires that state) and 4 the maximum (the sonnet inspires it very much). The scale only uses integer values. The scale is the same for lexico-semantic features, which are described in Table 2. Psychological features are binary and indicate whether the sonnet is related to that concept (1) or not (0). These features are described in Table 3. The psychological features were chosen considering their relevance in the literature (García Franco & Manjarrés Riesco, 2016). All these annotations can be used for calculating different metrics in the retrieval of poems (such as precision, for example).
We show below a sonnet example along with its English translation and the median annotations by the experts:
Alta Hoguera te eriges, que así amas afectos recogiendo enamorado, que el Pecho, en sacro amor, todo abrasado, hoguera es elevada, en que te inflamas.
Rare Phoenix of Love, that in living flames, immortal splendor you have achieved, aroma logs are the ones you have gathered in the smell of virtues that you pour out.
High Bonfire you erect, that is how you love collecting affections in love, that the chest, in sacred love, all burned, bonfire is high, in which you get inflamed.
In the rays from Sun Christ, magnificent bird, from the heart the wings, swiftly you bat, to see you in fire reborn.
Phoenix I consider you, in Burning Pyre, that in his death he is born to new life, and it's your Sunset on Earth, to Heaven, East.
This sonnet (sonnet no. 18 in the corpus) belongs to Juan de Aguilar (19th century). The median annotations across the three annotators appear in Table 4.

Procedure
The methodology is divided into two parts. First, we build the GAM values from the individual words within the sonnets. This is done by using several external lexicons to assign affective or lexico-semantic values to the individual Spanish words. These lexicons include some of the ones already introduced in the Related Work Section. We use: (Hinojosa et al., 2016), (Ferré et al., 2017), (Guasch et al., 2016), (Stadthagen-González et al., 2018) and (Pérez-Sánchez et al., 2021). These lexicons are combined into one lexicon. When there are several possible values for the same word, we use the median value between them. We use those affective or lexico-semantic values at a word level and aggregate them to obtain features at the sonnet level.
Then, we define the features to be annotated in the DISCO PAL sonnet corpus in order to analyse the quality of the inferred GAM values. These features are the ones described in the Materials Section. Thus, we compare every feature associated to anger, sadness, disgust, arousal, valence, concreteness, imageability and context availability to the value annotated by the experts. We also analyse these comparisons considering the psychological tags. For that, we consider separately the sonnets that belong to a particular psychological category and analyse the GAM against the annotated values of the affective or lexico-semantic features for that subset of sonnets only.
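The first part of the procedure can be illustrated with a minimal sketch. It assumes each lexicon is a plain word-to-value mapping for a single feature, and that the sonnet-level aggregates are the mean, maximum, minimum and standard deviation of the matched word values; the function names, the toy values and the exact set of aggregates are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, median, pstdev


def combine_lexicons(lexicons):
    """Merge several word -> value lexicons, taking the median on conflicts."""
    collected = {}
    for lexicon in lexicons:
        for word, value in lexicon.items():
            collected.setdefault(word, []).append(value)
    return {word: median(values) for word, values in collected.items()}


def sonnet_features(words, lexicon):
    """Aggregate word-level values into sonnet-level statistics.

    Words absent from the lexicon are skipped; returns None if nothing matches.
    """
    values = [lexicon[w] for w in words if w in lexicon]
    if not values:
        return None
    return {
        "mean": mean(values),
        "max": max(values),
        "min": min(values),
        "sigma": pstdev(values),
    }


# Toy valence values, invented for illustration (not taken from any lexicon).
lex_a = {"fuego": 6.0, "muerte": 2.0}
lex_b = {"fuego": 7.0, "vida": 8.0}
lex_c = {"fuego": 6.5}

combined = combine_lexicons([lex_a, lex_b, lex_c])
features = sonnet_features(["fuego", "muerte", "vida", "fenix"], combined)
```

Here "fenix" has no lexicon entry and is simply ignored, which mirrors the situation of archaic or rare words that the source lexicons do not cover.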

Evaluation
The evaluation steps are the following ones:
- Study the reliability of the DISCO PAL corpus annotated by the POSTDATA experts.
  - We analyse the bivariate correlation between annotators for the affective features in order to see if they are logical. For instance, "valence" in one of the annotators should be positively correlated with the value of "valence" from another expert. Also, a variable like "valence" should be positively correlated with another one that represents positive affective states, such as "happiness".
  - Then, we check the level of agreement between the annotators in order to see if there are significant discrepancies between them. If the level of agreement is sufficient, we can proceed to the next point.
- Analyse the relationship between the features annotated by the experts and the ones obtained through the GAM inference methodology shown before. This analysis is carried out at three levels.
  - First, we analyse the bivariate correlation between the GAM inferred features and their annotated counterparts. We check if that correlation is over a minimum threshold taken from the literature (Schober et al., 2018).
  - Then, we analyse the partial correlation between those same GAM inferred features and the annotated ones. This is done by building a regression model over the GAM features (independent variables) and each label at a time (dependent variable). We check the level of significance using the p-value for the inferred feature, the r-squared value, and the feature coefficient.
  - The previous analysis is done considering the whole annotated DISCO PAL corpus, as well as separating it by the annotated psychological categories. We want to see if the results differ significantly.
  - Finally, in order to analyse differences in the GAM depending on the psychological category, we perform a One-way ANOVA hypothesis contrast. We compare the mean values for a particular GAM inferred feature between the subset that belongs to a specific psychological category and the remaining sonnets. We consider that there are differences if the p-value is less than 0.05.
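The bivariate correlation step can be sketched as follows. Since the annotations are ordinal, the sketch computes a rank (Spearman) correlation as the Pearson correlation of average ranks, using only the standard library; the function names and the toy data are assumptions for illustration, and the significance test is not included.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: annotated scores (1 to 4) against an inferred feature.
annotated = [1, 2, 2, 4]
inferred = [0.1, 0.5, 0.4, 0.9]
rho = spearman(annotated, inferred)
```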

Supplementary material
The materials included in this article are three csv files with the annotations made by the experts, as well as a csv file with metadata information about the annotated sonnets. This metadata csv is included in order to allow cross-referencing between DISCO PAL and the original source DISCO. The fields included in the metadata csv are:
- author: author of the sonnet.
- year: year or century of publication.
- title: title of the sonnet.
- id sonnet: unique id used by DISCO for that sonnet.
- file path: file name path to that sonnet in the per-sonnet folder in DISCO.
We also include a csv with the aggregated annotations by the three experts. All data provided is located at (Barbado et al., 2019).

Reliability and validity of DISCO PAL corpus
The first approach to study the reliability and validity of DISCO PAL is to see that the correlations between annotators and between specific features are logical. In Figure 2 we see the correlations for two features, "arousal" and "anger", considering the three annotators. We see how these two features are positively correlated. Both "arousal" and "anger" have positive correlations between the annotators. They also have positive correlations when the values are compared for the same annotator. In Figure 3 we see the correlations for "valence", "disgust", "fear", "happiness" and "sadness". For annotators 1 and 2, we see how "valence" is negatively correlated with that same feature in annotator 3. We also see how it is negatively correlated with "happiness", while having a positive correlation with the remaining features. This indicates that annotators 1 and 2 have used a reversed scale in this feature. Because of that, for further analyses, we will reverse their results for the "valence" feature. For annotator 3 we see that the correlations are correct. The remaining features also seem to have logical correlations (e.g. positive for "disgust" and "fear", negative for "happiness" and "fear").
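Reversing a scale that was used in the opposite direction is a simple arithmetic mapping. The sketch below assumes the 1 to 4 integer scale described in the Materials Section, so that 1 becomes 4, 2 becomes 3, and so on; the function name is hypothetical.

```python
def reverse_scale(values, low=1, high=4):
    """Mirror each value within [low, high]: low <-> high, low+1 <-> high-1, ..."""
    return [low + high - v for v in values]


# Toy valence annotations from an annotator who used the scale reversed.
valence_reversed = reverse_scale([1, 2, 3, 4, 4])
```

Applying the function twice returns the original values, so the transformation loses no information.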
The second step is analysing the agreement between the three annotators. This is accomplished by obtaining Krippendorff's alpha (Krippendorff, 2011), or k-alpha, between the annotations made by the 3 experts for each of the features.
K-alpha is a metric that generalizes other metrics that quantify the reliability between annotators (inter-rater reliability). It can be used for both ordinal and nominal annotations, as well as with any number of annotators. K-alpha yields a value of at most 1, where 1 represents full agreement and 0 indicates agreement no better than chance; negative values are possible when annotators disagree systematically. However, there are different criteria regarding when to consider that there is enough agreement between annotators. If the acceptance criterion is strict, expert annotations are accepted as truly valid only if there is a k-alpha of at least 0.8 (Carletta, 1996). Other laxer criteria set the minimum at 0.21, defining the following thresholds (Landis & Koch, 1977):
- K < 0: Very low
- 0 < K < 0.20: Light
- 0.21 < K < 0.40: Acceptable
- 0.41 < K < 0.60: Moderate
- 0.61 < K < 0.80: Substantial
- 0.81 < K < 1: Perfect
The k-alpha results considering the three annotators together are shown in Table 10. "k 12" represents the agreement between annotators 1 and 2 (analogous to "k 13" and "k 23"). In bold we see the k-alpha values that are below the "acceptable" threshold. For most of the features and annotator pairs, the level of agreement is at least 0.21, with some cases that reach the "substantial" level (K >= 0.61). However, there are combinations that yield levels below that 0.21 threshold, particularly for some features from annotator 3 when compared to either annotator 2 or annotator 1. For annotators 1 and 2 all the features are validated. For annotators 1 and 3, and for annotators 2 and 3, the same 7 features have a k-alpha below 0.21 (with an additional feature for the case of 1 and 3). Considering all the annotators together, the k-alpha results are all above 0.21, with the exception of the feature "happiness" (as seen in the "k all" column). This means that 97% of the features have an acceptable level of agreement (or better).
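To make the agreement computation concrete, the following sketch implements Krippendorff's alpha for the nominal case only, which would fit the binary psychological tags; the ordinal affective features require a different distance function that is not shown. The helper that maps a value to the labels listed above, and all names and toy data, are assumptions for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` holds one list of labels per annotated item; None marks a missing
    annotation. Undefined (division by zero) if only one label is ever used.
    """
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # items with fewer than two annotations carry no information
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    return 1 - observed / expected


def agreement_label(k):
    """Map an agreement value to the labels of the thresholds listed above."""
    if k < 0:
        return "Very low"
    if k <= 0.20:
        return "Light"
    if k <= 0.40:
        return "Acceptable"
    if k <= 0.60:
        return "Moderate"
    if k <= 0.80:
        return "Substantial"
    return "Perfect"


# Toy example: two annotators, four items, one disagreement.
example_units = [[1, 1], [1, 1], [2, 2], [1, 2]]
alpha = krippendorff_alpha_nominal(example_units)
label = agreement_label(alpha)
```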
In order to conduct further analyses, those three annotation sets should be combined into a single vector. One proposal for doing so is to use the median of the values given by the three experts. In that way, if there is a discrepancy between two annotators and a third one, the final value used will be the one that agrees with most of them. This median value acts as a proxy "annotator" that agrees with the three experts. Indeed, as also shown in Table 10, the agreement versus each annotator is very high. It is still above 0.21 for annotators 1 and 2, while reducing the discrepancy against annotator 3 in 4 features out of 8. Some of the annotators left some of the psychological tags empty (annotator 1 left 38 sonnets with an average of 3.6 psychological tags without annotations, annotator 2 left 4, and annotator 3 left 1). For building this proxy annotator, we filled those missing categories with 0, assuming that if the concept was not made explicit by an annotator, it does not appear. This only applies to cases where a sonnet-psychological tag pair is missing for only one of the annotators, but not for the other two. The affective and lexico-semantic features are fully annotated in all the sonnets.
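The construction of the proxy annotator can be sketched as below. The sketch assumes missing values are represented as None, fills them with a default only when exactly one of the three annotators left the value empty, and returns None otherwise; the function name and this handling of the remaining cases are assumptions.

```python
from statistics import median


def proxy_annotation(values, fill_missing=None):
    """Median of the annotators' values for one sonnet and one feature.

    A single missing value (None) is replaced by `fill_missing` when one is
    given; with no fill value, or with more than one gap, None is returned.
    """
    missing = sum(v is None for v in values)
    if missing and (fill_missing is None or missing > 1):
        return None
    filled = [fill_missing if v is None else v for v in values]
    return median(filled)


affective = proxy_annotation([2, 4, 3])                 # ordinal feature, 1 to 4
tag_kept = proxy_annotation([1, None, 1], fill_missing=0)
tag_dropped = proxy_annotation([1, None, 0], fill_missing=0)
```

With three annotators the median is always one of the values actually assigned, so the proxy never introduces a score that no expert gave.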
Using this proxy annotator, we get the number of sonnets per psychological category shown in Table 5.
In Table 5 we see that "Prejudice" and "Obsession" are the categories with the fewest sonnets (30 and 32 respectively). Although this number is smaller when compared to other categories (e.g. "Dramatisation" has 108 sonnets), it is enough from a statistical point of view based on a power analysis with an alpha of 0.05, a Cohen's d of 0.8 and the default statistical power of 0.8 (which sets the minimum at 26) (Cohen, 1992; Sullivan & Feinn, 2012). The number of sonnets per category is also higher when compared to other similar analyses, such as (Aryani et al., 2016; Ullrich et al., 2017), where the authors work with the categories "friendliness", "sadness" and "spitefulness", which are associated with 19, 21 and 17 poems respectively.
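The minimum of 26 can be approximated with a short calculation. The sketch below assumes a two-sided comparison of two independent groups and uses the normal approximation to the sample size per group, plus a common small-sample correction term; these choices and the function name are assumptions, since the text does not state which test or tool was used.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8, small_sample_correction=True):
    """Approximate per-group sample size for a two-sided two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    if small_sample_correction:
        # Correction that compensates for using the normal instead of the t distribution.
        n += z_alpha ** 2 / 4
    return ceil(n)


minimum = sample_size_per_group(0.8)
uncorrected = sample_size_per_group(0.8, small_sample_correction=False)
```

With d = 0.8, alpha = 0.05 and power = 0.8, the uncorrected approximation gives 25 and the corrected one gives 26, which matches the minimum quoted above.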

Analysis of DISCO PAL corpus for individual affective word modelling
As previously mentioned, the original corpus consists of 4085 sonnets in Castilian language from 15th to 19th century, collected from the corpus DISCO from POSTDATA (UNED). From there, we have selected 274 sonnets, which have been annotated with specific affective features, inspired by the literature, in particular (Ullrich et al., 2017).
That article describes how the authors modelled the GAM for the poems by using the BAWL lexicon as input source. This lexicon contains 6000 words in German. In order to associate feature values with individual words (for modelling the GAM later on), they use the different words available in the poems. To increase the number of words that match the entries in these tables, the words of the poems are lemmatized, and stopwords are removed. In this way, it is possible to find a match for 90% of the words that appear in the poems with a word within the BAWL lexicon. The remaining 10% of words that do not appear in these tables are usually proper names.
With that, we consider in this paper how many words from DISCO PAL appear in the input lexicons used for assigning the affective or lexico-semantic value to the individual words. Table 6 shows how many unique words there are across all the sonnets within DISCO PAL, as well as in each of the subsets associated with a psychological category. It also shows how many unique words there are when they are lemmatized or stemmed (with the SnowBall stemming algorithm (Porter, 2001)). As expected, both lemmatization and stemming reduce the number of words (with stemming reducing them more than lemmatization). Table 7 shows the words of the DISCO PAL corpus that match the ones in the different source lexicons, and Table 11 shows the same when the words from DISCO PAL, as well as the words from the source lexicons, are either stemmed or lemmatized. It can be seen that using lemmatization or stemming improves, as expected, the number of words that match the input lexicons. Since stemming improves the matching the most, we use this technique for the subsequent analyses. Lemmatization and stemming scenarios also include the elimination of stopwords. Tables 7 and 11 show that when all the lexicons are combined, the percentage of matching words increases and, for lemmatization and stemming, is above 50% both for the whole of DISCO PAL and for the individual psychological categories.
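The matching percentage discussed above can be computed as the share of unique corpus words that also occur in the lexicon after both sides go through the same normalisation. In this sketch the normaliser is a deliberately crude plural stripper standing in for a real stemmer such as SnowBall; it, the function names and the toy word lists are assumptions for illustration, and stopword removal is omitted.

```python
def coverage(corpus_words, lexicon_words, normalize=str.lower):
    """Fraction of unique (normalised) corpus words found in the lexicon."""
    corpus = {normalize(w) for w in corpus_words}
    lexicon = {normalize(w) for w in lexicon_words}
    if not corpus:
        return 0.0
    return len(corpus & lexicon) / len(corpus)


def toy_plural_stripper(word):
    """Hypothetical stand-in for a stemmer: lowercases and strips 'es' or 's'."""
    word = word.lower()
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


corpus_words = ["llamas", "amores", "Fénix", "hoguera"]
lexicon_words = ["llama", "amor", "hoguera"]

raw_match = coverage(corpus_words, lexicon_words)
stemmed_match = coverage(corpus_words, lexicon_words, normalize=toy_plural_stripper)
```

In this toy case the match rises from one word in four to three in four once inflected forms are collapsed, which is the effect reported for lemmatization and stemming.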
Considering lemmatization, the percentage of matching words is 56% (when all the source lexicons are combined) versus the 90% from (Ullrich et al., 2017). Table 8 shows the top 14 most common missing words from the DISCO PAL corpus in all of the source lexicons.
This scenario will probably hinder the results from the GAM in comparison to (Ullrich et al., 2017) since there are more absent words in the source lexicons, even after removing stopwords and performing stemming (where the matching percentage is 68%, still below 90%).

GAM analysis
Following a similar approach to (Ullrich et al., 2017), the source lexicons mentioned are used as an input source in order to infer the GAM values. As mentioned previously, the evaluation is carried out against the median value derived from the three annotators. The words from the source lexicons, as well as the words from DISCO PAL, are stemmed. Using the source lexicons, we build the features described in the Methodology Section for each of the sonnets. Since a stemmed word can appear multiple times in the source lexicons (e.g. "bees" and "bee" will be the same word after stemming), the final value assigned to that word is the average over all the words with the same stem. These features represent the inferred GAM for that sonnet.
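The averaging of lexicon entries that share a stem can be sketched as follows. The stemmer is passed in as a function; the one used in the example is a toy rule that only strips a final "s", chosen to reproduce the "bees"/"bee" case, and the values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_by_stem(lexicon, stem):
    """Group lexicon entries by stem and average the values within each group."""
    grouped = defaultdict(list)
    for word, value in lexicon.items():
        grouped[stem(word)].append(value)
    return {stemmed: mean(values) for stemmed, values in grouped.items()}


def toy_stem(word):
    """Hypothetical stemmer for the example: drop a single trailing 's'."""
    return word[:-1] if word.endswith("s") else word


stemmed_lexicon = collapse_by_stem({"bee": 5.0, "bees": 6.0, "honey": 7.0}, toy_stem)
```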
Considering that, Figure 4 shows the Spearman's bivariate correlation between the annotated features and their inferred GAM counterparts. We see that "arousal mean" has a significant correlation (albeit a weak one). "valence mean" has a moderate correlation. There are other features, such as "max arousal", "sigma aro", "min valence", and "sigma val" that also have weak correlations. Figure 5 shows the bivariate correlations for the remaining features. 5 out of the 8 features inferred have significant correlations with respect to their annotated counterparts. For "sadness", the correlation is moderate, and for "happiness", "anger", "fear" and "disgust", the correlation is weak (though close to moderate). It is also interesting to see that some inferred GAM features have significant correlations with other annotated features, as could be expected. This is the case of "sadness mean", "anger mean", "fear mean" and "disgust mean". They all have significant and positive correlations when compared to annotated features such as "fear", "anger", "sadness" or "disgust". They also have significant and negative correlations when compared to "happiness". The remaining features ("concreteness", "imageability" and "context availability") do not yield significant bivariate correlations (they are all below 0.1). Thus, the lexico-semantic features are the ones that have the lowest correlations.
Table 8: Most common words in DISCO PAL corpus missing in the source lexicons (excluding stopwords). It shows how many times that word appears.
Following this analysis, Tables 9 and 12 show the partial dependence between each GAM feature and its counterpart annotated by the experts. As mentioned before, a linear regression model is trained over all sonnets, using all GAM features as independent variables, and using one of the annotated features as dependent variable. Then, we get the p-value of the corresponding GAM feature, and check whether it is significant using a threshold of 0.05 (p < 0.05 meaning it is significant). When the p-value is below 0.05, we indicate it in the "sign" column with a "yes". We also check that the coefficient of that feature is > 0 (a negative coefficient would mean that even if the model is fitted properly, the relationship between both features is not coherent). We also include the adjusted r-squared value in order to see if the model is well-fitted. In Table 9 we see the results for the whole DISCO PAL corpus. All the features have a p-value below the threshold and a positive coefficient, while also having a high adjusted r-squared. In Table 12, we also see that in all the subsets of sonnets belonging to each of the psychological categories, the features' coefficients are positive and the features are significant for predicting the corresponding annotated value. This indicates that the GAM can be inferred whether we use the whole corpus or only a subset for a specific psychological category.

If we compare the results of our GAM extraction process against the ones in (Ullrich et al., 2017), we need to focus on the subset of features that appear in both papers. Those features are Valence and Arousal. Thus, we can compare the annotated GAM value for those features against their inferred counterparts, as well as against other features related to them, like CorAro or ValenceSpan. For Valence, the bivariate correlation of the inferred Valence value in (Ullrich et al., 2017) is 0.65. For Arousal, it is 0.54. In both cases the partial correlation analysis shows statistical significance when using those inferred features as predictors for the annotated one. In our case, the bivariate correlation values are smaller, but they are still significant. This is also true for some other related features, as previously mentioned. Regarding the partial dependence analysis, we have validated those features.
Finally, we analyse if there are significant differences in the GAM (using the annotated value) between subsets depending on whether they refer to a specific psychological label or not. We perform a One-way ANOVA hypothesis contrast for each combination of a GAM feature and a psychological label. The results for those combinations that had p-values less than 0.05 are included in Table 13. That table also includes the mean value for the GAM considering the sonnets annotated with that psychological tag, M (=1), and the other ones, M (=0). As we can see, among the 210 possible combinations, 127 yielded significant differences in the GAM depending on the subset considered.
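The statistic behind the One-way ANOVA contrast can be sketched with the standard library. The function below returns the F statistic and its degrees of freedom for any number of groups; obtaining the p-value requires the F distribution, which is not in the standard library (SciPy's `scipy.stats.f_oneway` returns both the statistic and the p-value). The function name and the toy data are assumptions for illustration.

```python
def one_way_anova_f(*groups):
    """Return (F statistic, between-groups df, within-groups df)."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for group, m in zip(groups, group_means) for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy GAM values for sonnets with a tag (=1) and without it (=0).
with_tag = [2, 3, 4]
without_tag = [1, 2, 3]
f_stat, df_between, df_within = one_way_anova_f(with_tag, without_tag)
```

With only two groups, as in the tag versus no-tag comparison, this F statistic equals the square of the pooled two-sample t statistic.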

Limitations of our Approach
There are several limitations within our approach. The first one is that the analysis is limited to the 274 sonnets from DISCO PAL. It would be interesting to perform it over a bigger corpus of annotated sonnets. However, as we mentioned earlier, we think that the corpus size is big enough for carrying out these analyses and obtaining statistically meaningful results, since our corpus has more poems per independent category than other corpora from the literature (Ullrich et al., 2017; Aryani et al., 2016; Obermeier et al., 2013; Haider et al., 2020). Also, the threshold found using the statistical power serves as another reason to support the statistical analyses carried out.
Another limitation is that the analysis is applied only over a group of sonnets in Castilian from the 15th to 19th century. Those sonnets contain many archaic words, and that reduces the matching between them against the source lexicons used to assign the affective or lexico-semantic value for individual words. In fact, the ratio of words in those lexicons, as already mentioned, is lower than other analyses within the literature, influenced in part by this aspect.
Also, though there is an acceptable agreement between the annotators for most of the features, the agreement is not perfect. This is something that also influences the results obtained.
Finally, there could be a possible bias due to the fact that the expert annotators have a profile specialized in digital humanities. If the annotators were experts in psychology, for instance, the results may differ.

Conclusion and Future Work
This Section concludes with a final reflection based on the results of the analyses carried out, as well as indicating possible lines of research that can be pursued.

Conclusions
This article presents a methodology to infer GAM feature values for Spanish poetry, using available lexicons that contain affective or lexico-semantic feature values for individual words. This GAM methodology is unsupervised, needing no prior information about the sonnets themselves.
The proposal is evaluated using a subset of sonnets annotated by domain experts. This article includes a corpus of 274 sonnets with annotated features. The sonnets are from Spanish authors from different time periods (from the 15th to the 19th century). These sonnets are annotated using both affective or lexico-semantic features that indicate the intensity level of that feature within the sonnet, and concepts that belong to the psychological domain, indicating whether the content of a sonnet is related to that concept or not. They were annotated by three domain experts who belong to the POSTDATA project (UNED). This corpus is shared as part of this article.
Then, we conduct an analysis on the level of agreement of the features annotated by the three experts. The result is that at least 97% of the features have an adequate level of agreement. The results improve when we use the median value for the three annotators.
Using the median vector, we validate that it is feasible to model the GAM of a sonnet through several affective or lexico-semantic features built from their individual words. This is checked by analysing the bivariate correlation, as well as the partial dependence, between those features and their annotated counterparts. The results are particularly good for valence, arousal, happiness, sadness, fear, anger and disgust.
Finally, after considering results for all the sonnets together, we analyse whether the GAM modelled for each of the subsets of sonnets that belong to the different psychological categories differs significantly. We saw significant differences for some features and some psychological categories between the GAM of the sonnets that belong to a category and the remaining ones.

Future Work
This subsection details the possible lines of research that can be pursued following the results presented in this article. There are two main groups of research lines that are considered at this point. One is related to the improvement of the data quality involved in the GAM methodology, and the other is related to the applications of the DISCO PAL corpus.
Regarding data quality, there are three areas for improvement. First, all the source lexicons used for the feature values of the individual words lack many of the archaic words that are present in the sonnets. It would be useful to enrich those lexicons with these missing words in order to check if there is an improvement over the results shown in this paper. Second, as shown in the agreement analysis between annotators, there are some discrepancies in the values assigned for the features, something that potentially affected the results obtained in this paper. Though we proposed using the median value and this yielded robust results for some features, it would be interesting to see other proposals to combine those annotations and mitigate the differences. Finally, the analysis could be enriched if the corpus of sonnets is increased, as well as if the annotations also include sentence-level or stanza-level annotations.
Regarding the usage of the DISCO PAL corpus itself, there are two possible approaches. First, there are research lines that can be pursued related to the psychological tags provided. As we mentioned before, to the best of our knowledge there are no poetry corpora that include annotations regarding psychological states evoked by the poems. This article provides a curated corpus that may help research on the usage of poetry for therapeutic purposes. This corpus may also help in studying the relationship between figurative language (e.g. metaphors) and its contribution to emotions. Even though the presence of figurative language is not explicitly annotated (though it could be included to enhance the corpus), the lexico-semantic features could act as a proxy for it.
The other approach is related to the affective modelling of poetry. DISCO PAL includes 10 affective labels that can be used for studying how to infer the GAM of a Spanish sonnet. This could be accomplished by using Machine Learning models that predict the GAM labels based on the semantic vector of the sonnet.