Introduction

For ages, the spirit of satire has sought to offer, through humor and mockery, constructive criticism of society. Satire has the ability to make the weak strong and the strong weak. No public figure is immune to the sharp comments of satirists, who have not hesitated to parody kings, pontiffs, or politicians. In this way, satire exposes the vices of society and raises people's awareness of deeply rooted social conventions that are taboo and difficult to denounce by means other than humor. Beyond the traditional satirical press, the rise of social media has allowed individuals to impersonate and parody public figures with the same critical sense that satire possesses. It is important, however, not to confuse satire with fake news and propaganda, whose objective is to confuse people and influence political opinion with half-truths or hoaxes.

According to Poe's law,Footnote 1 popular in Internet culture, parody and extreme views are impossible to tell apart without clear indicators of the writer's intentions. This happens very frequently when satirical news, real news, and hoaxes are shared on social media by users or even by other news outlets, which get confused and take satirical news and hoaxes as real. If satire is hard to identify for humans, its automatic identification is an even more challenging task. Satirists rely on figurative language, in which words move away from their conventional meaning. Specifically, satire makes use of sarcasm and irony, playing with words to hide criticism. Figurative language hinders Information Retrieval (IR) tasks, which may misinterpret satire as truthful information. Moreover, resources for languages other than English need to be improved in order to advance automatic satire identification.

To the best of our knowledge, automatic satire identification in Spanish was first addressed by [3] and then by [21]. These works proposed novel datasets containing tweets from media sites and built automatic classifiers using linguistic and statistical features together with traditional machine learning approaches such as Random Forest and Support Vector Machines. Other studies have focused on identifying linguistic phenomena related to satire, such as sarcasm or irony, and new resources such as IroSVA 2019 [18] have recently been published.

Regarding the techniques involved in satire identification, the authors of [2] compared traditional machine-learning with deep-learning classifiers in order to outperform the results reported by [21]. They found, however, that modern Spanish pre-trained word embeddings do not outperform the results achieved by classical term-counting features. This suggests that the presence of certain keywords has too strong an influence on the classification, which may be indicative of biases in the corpus. Accordingly, we examined the existing datasets for satire identification and discovered some biases, most of them related to the distant supervision techniques employed and the limited number of different accounts. In order to address these issues, our contribution is twofold. On the one hand, we present the Spanish SatiCorpus 2021, a balanced dataset for satire identification with 18,207 satiric utterances; on the other, we evaluate this dataset with several deep-learning architectures and feature sets, including linguistic features and contextual and non-contextual word and sentence embeddings.

The rest of the manuscript is organised as follows. First, Sect. 2 summarises research on satire identification, also including research on automatic sarcasm and irony detection due to their close relationship with satire. Next, Sect. 3 presents and describes the compilation process of the Spanish SatiCorpus 2021. Section 4 details the pipeline used for evaluating the corpus, including the feature sets and the deep-learning architectures involved. The results achieved by each model and their subsequent analysis are described in Sects. 5 and 6, respectively. Finally, the conclusions of this research and some promising research directions are listed in Sect. 7.

Background information

First, it is important to highlight the similarities and differences between verbal irony, sarcasm, and satire, because these terms are sometimes confused. Satire and sarcasm are both forms of expression. However, whereas satire has a moralising purpose, in which vices and abuses are caricatured, sarcasm is intended to be hurtful and scathing and is used as a rhetorical device to subdue an adversary. Both satire and sarcasm make use of figurative language to this end. Examples of figurative language include understatements, similes, metaphors, personifications, and idioms, among others. Of these literary devices, verbal irony stands out. With verbal irony, the writer intends to be understood through expressions that contrast with the literal meaning of what is said. The reader can find a detailed overview of figurative language, its discriminant features, and the techniques employed for its identification in [20].

Regarding sarcasm identification, in [17] the authors present a framework based on bidirectional recurrent neural networks and term-weighted trigrams, which they refer to as the inverse gravity moment. The objective of the term-weighting strategy is to boost critical words and joint words while preserving word order. During their experimentation, the authors compare their framework with models based on word embeddings as well as supervised and unsupervised weighting functions, such as term frequency, odds ratio, balanced distributional concentration, or regularised entropy, to name a few. This framework was evaluated on three sarcasm corpora. The first is based on Twitter and annotated using distant supervision; the second is based on the Internet Argument Corpus; and the third is built from news headlines extracted from satirical news sites such as The Onion.Footnote 2

In [12], the authors proposed a BERT-based model called ViLBERT and evaluated it on a dataset composed of images and headlines from satirical web sites. This study was conducted from a multi-modal perspective, as it incorporates the analysis of the images contained in satirical news. The authors observed that these images were created using photo manipulation techniques to depict fictional and unreal scenarios. Their results suggest that the multi-modal perspective outperforms text-based solutions.

The identification of satire, verbal irony, and sarcasm has been explored in languages other than English, and other studies have focused on multi-lingual approaches. The work presented in [9], for instance, focuses on French, English, and Arabic for irony detection. It is worth mentioning that the identification of complex language phenomena, such as irony, is heavily culture dependent. Therefore, the authors compare culturally close Indo-European languages with less culturally close languages. Their proposal indicates that monolingual models trained from multilingual word representations are beneficial for irony detection in situations where annotated datasets are not available.

Novel research and datasets concerning satire, sarcasm, and irony can be found in languages other than English, such as Turkish [16], Bangla [1], or Persian [10]. To the best of our knowledge, the principal research on satire identification in Spanish was performed first by Barbieri [3] and then by Salas-Zárate [21], in which the authors evaluated satire identification on datasets based on satirical and non-satirical headlines. The approach followed by [21] was based on linguistic features extracted with the Linguistic Inquiry and Word Count (LIWC) tool [25], whereas the approach followed by [3] compared statistical features and manually crafted linguistic features. It is worth noting that the datasets from [21] were revisited in [2], where the authors evaluated term-counting features with traditional machine learning classifiers as well as word and sentence embeddings from novel Spanish pretrained models. Their results show an increase in accuracy on the Mexican Spanish dataset but not on the European Spanish dataset when employing term-counting features. The non-contextual embeddings achieved lower accuracy on the European Spanish dataset than approaches based on linguistic features or term-counting features. The reason is that the Mexican Spanish dataset was slightly biased: on the one hand, it contained fewer distinct Twitter accounts, so classifiers could focus on specific linguistic patterns of the community managers; on the other hand, as the editorial line of each medium focuses on particular topics, these topics biased the automatic classifiers. The problems identified in the compilation of satire identification datasets are taken into account in the compilation of the Spanish SatiCorpus 2021, which is described below.

Corpus

We relied on the UMUCorpusClassifier tool [6] to extract satirical and non-satirical texts from Twitter. We named this novel dataset the Spanish SatiCorpus 2021: a balanced dataset consisting of satirical and non-satirical tweets, mostly compiled from news sites from Europe and Latin America. We selected four satirical news sites from Spain, namely El Mundo Today,Footnote 3 El Jueves,Footnote 4 El Intermedio,Footnote 5 and Revista Mogolia;Footnote 6 three from Mexico, namely El Deforma,Footnote 7 El Dizque,Footnote 8 and El Univerfail;Footnote 9 and one from Venezuela, El Chigüire Bipolar.Footnote 10 In addition, to prevent bias derived from relying only on satirical news sites, we added some Twitter accounts used to impersonate and satirise real political actors and public figures.

We used distant supervision for corpus annotation, following criteria similar to those described in [21] and [3] and assuming that all documents written by satiric news media are satiric. However, we observed that several tweets from the satirical accounts were used to promote events. In order to avoid the problems previously identified in other satirical datasets, we conducted a sanity check to discard tweets that start with certain words such as entrevistaFootnote 11 or words related to merchandising. We also removed some clues from the texts, as we found that some media use hooks such as “INCREIBLE PERO CIERTO”.Footnote 12 Finally, we removed tweets with fewer than 15 words.

One of the problems identified in previous datasets is that satirical and non-satirical news may not cover the same events, which introduces bias: automatic classifiers can discern between satiric and non-satiric utterances just by looking at what the news is about. To mitigate this issue, we selected the non-satirical tweets on the basis of the satirical ones. For this, we looked for the most similar non-satiric document for each satiric document, representing texts with TF-IDF and computing their cosine similarity. We built a matrix in which the rows are the indexes of the satiric documents, the columns are the indexes of the non-satiric documents, and each cell represents their similarity. Then, we applied an iterative process to select the most similar satiric and non-satiric documents. Once a pair is selected, we remove its row and column from the matrix and repeat this process until there are no rows left, that is, until we find a match for each satirical document. A minimal sketch of this matching step is shown below.
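
As an illustration, the following Python snippet sketches this pairing step under the assumption that the tweets are available as two in-memory lists; the example texts and variable names are ours, not part of the released corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents; in practice these are the compiled tweets.
satiric = ["el gobierno anuncia que los lunes seran opcionales",
           "cientificos confirman que el cafe opina de politica"]
non_satiric = ["el gobierno anuncia nuevas medidas para los lunes",
               "un estudio analiza el consumo de cafe en el pais",
               "la bolsa cierra la semana con ligeras subidas"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(satiric + non_satiric)
# Rows: satiric documents, columns: non-satiric candidates.
sim = cosine_similarity(tfidf[:len(satiric)], tfidf[len(satiric):])

pairs = []
for _ in range(len(satiric)):
    # Pick the most similar remaining (satiric, non-satiric) pair,
    # then mask its row and column so neither document is reused.
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    pairs.append((i, j))
    sim[i, :] = -np.inf
    sim[:, j] = -np.inf

print(pairs)
```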

The first release of the Spanish SatiCorpus 2021 contains 18,207 satiric and 18,207 non-satiric tweets, published between March 2018 and June 2021. We divided the dataset into three splits, namely train, development, and test, in a 60-20-20 ratio. The dataset is available to the research community at https://pln.inf.um.es/corpora/satire/spanish-saticorpus-2021.zip. However, in accordance with the Twitter guidelines,Footnote 13 we only share the identifiers of the tweets together with the labels and the split (train, development, test), in order to preserve users' rights over their content. Table 1 contains the statistics per label and split.

Table 1 Corpus distribution per label and split

Some satiric and non-satiric examples from the dataset are depicted in Fig. 1. In the first row, we can compare two news items from Mexico. The non-satirical one reports that during Ash Wednesday (a Catholic holiday), the ash was distributed in individual bags in order to avoid crowds due to the COVID pandemic. The satirical counterpart warns citizens not to confuse these sachets with similar ones containing inhaled drugs. As can be observed, forcing the dataset to contain similar satiric and non-satiric tweets prevents machine learning models from overfitting on certain keywords and topics related to the domain. In the second row, both the satirical and the non-satirical news refer to an event debated in the Spanish Congress of Deputies concerning access to mental health treatment in Spain. An open microphone caused the comment “Go to the doctor”, uttered by a politician, to be heard throughout the chamber. This was criticised in the press because it downplays mental illness. From a satirical point of view, however, the news portrays this politician as a hero whose comment saved the life of the politician who was speaking at the time, for something that had nothing to do with mental health.

Fig. 1 Spanish SatiCorpus 2021 examples

Materials and methods

In order to evaluate the Spanish SatiCorpus 2021, we designed a deep-learning pipeline to test different feature sets and deep-learning architectures and thereby determine which features are strong indicators for automatic satire identification. Figure 2 depicts the architecture of our proposal. In a nutshell, it can be described as follows. First, we apply a text pre-processing stage to clean the dataset (see Sect. 4.1). Second, we divide the dataset into training, validation, and testing splits in a 60-20-20 ratio (see Sect. 3), using a stratified split to ensure that the ratio of satiric and non-satiric documents remains balanced; a minimal sketch of this split is shown below. Third, we conduct a feature extraction stage to obtain the linguistic features and the embedding-based features (see Sect. 4.2). Fourth, we perform a hyperparameter optimisation stage to evaluate several machine learning models that use each feature set separately and in combination (see Sect. 4.3). Finally, the best models for each feature set are evaluated on the test dataset.
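
A minimal sketch of the stratified 60-20-20 split, assuming the corpus is loaded into a pandas DataFrame with text and label columns (the column names and toy data are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; in practice it holds the 36,414 tweets and their labels.
df = pd.DataFrame({"text": [f"tweet {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})

train, rest = train_test_split(df, test_size=0.4, stratify=df["label"], random_state=0)
dev, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=0)
print(len(train), len(dev), len(test))  # 60 20 20
```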

Fig. 2 System architecture

Text pre-processing

In this stage we remove those parts of the texts that could bias the deep-learning classifiers. First, we remove social media jargon such as hyperlinks, hashtags, and mentions. We also replace digits with a token. Next, in order to work with features based on lexicons, we remove expressive lengthening, a linguistic device used to emphasise certain words, and we fix misspellings with the PSPELL tool. A minimal sketch of this stage follows.
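
The following sketch illustrates the kind of rules applied in this stage; the regular expressions are simplified approximations and spelling correction with PSPELL is omitted:

```python
import re

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)    # remove hyperlinks
    text = re.sub(r"[#@]\w+", " ", text)         # remove hashtags and mentions
    text = re.sub(r"\d+", " digit ", text)       # replace digits with a token
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)  # collapse expressive lengthening
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Holaaaa!!! mira https://t.co/x #satira @cuenta 2021"))
```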

It is important to note that we keep several versions of each document in order to extract some of the linguistic features. For example, to extract Part-of-Speech and named-entity features, we keep a normalised version in which the original case is preserved.

Feature extraction

In this section we describe the feature sets evaluated. We consider linguistic features and different kinds of embeddings, including pretrained non-contextual word and sentence embeddings from several models, as well as pretrained contextual word and sentence embeddings from different versions of BERT.

The linguistic features (LF) are extracted using the UMUTextStats tool [7, 8]. This tool is inspired by LIWC [25], but designed with Spanish in mind. It captures a total of 365 different linguistic features organised in the following categories (a minimal sketch of lexicon-based feature counting is given after the list):

  • (COR) Correction and style: This category checks correct usage in written communication. In particular, it distinguishes among (1) orthographic errors, related to the number of misspelled words or the misuse of Spanish accentuation; (2) stylistic errors, related to the presence of sentences that start with numbers or with the same word; and (3) performance, looking for duplicated words, the use of a full stop after exclamation or question marks, or the presence of common errors and redundant expressions.

  • (PHO) Phonetics: This category is related to expressive lengthening, which consists of the elongation of certain letters for emphasis. As the Spanish language contains words in which the same letter legitimately appears twice in a row, such as coordinar,Footnote 14 we only consider repetitions of three or more letters and, in the case of vowels, whether or not they carry an accent.

  • (MOR) Morphosyntax: This category includes features that represent how words and sentences are composed. Specifically, there are features that capture (1) grammatical gender, distinguishing masculine, feminine, and neutral words; (2) number, distinguishing plural and singular; and (3) affixes, capturing a fine-grained variety of suffixes (nominal, adjectivizing, verbalizing, adverbializing, augmentative, diminutive, or pejorative) and prefixes. There are also features organised by PoS category, such as nouns, verbs, adjectives, adverbs, determiners, pronouns, prepositions, conjunctions, and interjections, among others. To capture these features we apply a mixed approach based on Stanza [22] for the main categories and on lexicons for the fine-grained subcategories.

  • (SEM) Semantics: This category captures four semantic features: (1) onomatopoeia, to identify words formed according to the sound they imitate (for example, the word “achís” in Spanish refers to the sound produced when sneezing); (2) euphemisms, which are softer versions of expressions that could be considered too rude in some contexts; (3) dysphemisms, which are vulgar words used to replace others considered more neutral; and (4) synecdoche, a literary trope in which a part stands for the whole. For example, in “Se quedó con cuatro bocas que alimentar”,Footnote 15 the phrase “cuatro bocas” refers to four children.

  • (PRA) Pragmatics: This category captures the presence of figurative language devices, distinguishing among understatements, hyperboles, idiomatic expressions, rhetorical questions, verbal irony, metaphors, and similes, among others. It also contains several linguistic features that reflect how different sentences are connected by means of discourse markers. In addition, there are features for capturing typical Spanish courtesy forms.

  • (STY) Stylometry: This category captures (1) a wide variety of punctuation symbols, (2) corpus statistics, such as the Type-Token Ratio (TTR), and (3) other metrics related to the number of words, syllables, or sentences.

  • (LEC) Lexical: The aim of this category is to capture the topics of the text. For this, we analyse both abstract topics (analytical thinking, achievement, friendship, religion, or certainty, among others) and concrete topics, such as locations, organisations, animals, clothes, food, and a list of professions.

  • (PLI) Psycho-linguistic processes: This category relies on lexicons and emojis associated with sentiments (positive, negative) and emotions (anger, sadness, anxiety).

  • (REG) Register: This category captures how people use language to communicate, such as the presence of informal or cultured language. We also capture topics related to offensive speech.

  • (SOC) Social media jargon: This category captures clues that reveal the speaker's command of social media jargon, such as terminology specific to social media or the use of mechanisms such as hyperlinks, mentions, or emojis.
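
To make the nature of these features concrete, the following sketch counts a handful of lexicon-based features per document; the category names and tiny lexicons are illustrative and far simpler than the 365 features implemented in UMUTextStats:

```python
import re

# Illustrative lexicons (not the ones shipped with UMUTextStats).
LEXICONS = {
    "PLI_positive": {"feliz", "alegre", "genial"},
    "PLI_negative": {"triste", "horrible", "fatal"},
    "SOC_jargon": {"rt", "tt", "hilo"},
}

def linguistic_features(text: str) -> dict:
    tokens = re.findall(r"\w+", text.lower())
    total = max(len(tokens), 1)
    # Raw counts normalised by document length, expressed as percentages.
    return {name: 100 * sum(t in lexicon for t in tokens) / total
            for name, lexicon in LEXICONS.items()}

print(linguistic_features("Genial, otro hilo triste sobre la actualidad"))
```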

With respect to the embeddings, we evaluate non-contextual and contextual word and sentence embeddings. Embeddings are dense vectors that represent words within a latent space. They are usually learned from unsupervised generic tasks, such as next-word prediction. Word embeddings cluster together words that are semantically similar while keeping them distant from other clusters of words. Their main drawback, however, is that traditional (non-contextual) embeddings are not aware of polysemy, so the same word has the same representation regardless of its meaning in a sentence. Contextual word embeddings solve this drawback by generating the embeddings taking into account the context of a word, that is, the words that surround it. Contextual word embeddings and transformer-based models have meant a great qualitative leap in many NLP tasks, although they are computationally more demanding. This cost is partially mitigated by the use of sentence embeddings, in which sentences, rather than words, are encoded within the latent space. There are different strategies to obtain sentence embeddings, generally by averaging the word embeddings (see the sketch below). In addition, another key advantage of embeddings, whether contextual or not, is that they can be learned from generic datasets, which provides two major benefits. On the one hand, pretrained embeddings already convey general meaning, so models converge faster; on the other, they are a form of transfer learning, in which the embeddings have been learnt from concepts that may not be present in the domain.
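
A minimal sketch of the averaging strategy for non-contextual sentence embeddings; the lookup table below is a toy stand-in for the pretrained Spanish models used in this work:

```python
import numpy as np

# Toy word-vector table; in practice these come from pretrained models.
embedding = {"el": np.random.rand(300), "gobierno": np.random.rand(300)}

def sentence_embedding(tokens, dim=300):
    vectors = [embedding[t] for t in tokens if t in embedding]
    # Average the word vectors; fall back to zeros for out-of-vocabulary sentences.
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(sentence_embedding(["el", "gobierno", "anuncia"]).shape)  # (300,)
```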

The non-contextual word embeddings (WE) evaluated in this work come from pretrained models based on word2vec [14], fastText [15], and GloVe [19]. It is worth noting that word embeddings make it possible to explore specific neural network architectures, such as convolutional and recurrent neural networks, which can take advantage of the spatial and temporal dimensions of language. Convolutional neural networks can generate higher-order features by grouping adjacent words whose joint meaning differs from the one obtained individually, as happens with the words New and York. Recurrent neural networks, on the other hand, explore the temporal dimension, taking into account the order of the words. In particular, we evaluate two bidirectional recurrent neural networks based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU); a minimal sketch of such an architecture follows. The non-contextual sentence embeddings (SE) are extracted from fastText [11], whose Spanish model is trained on Wikipedia and CommonCrawl.
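
A minimal Keras sketch of one of these recurrent architectures, a bidirectional LSTM over an embedding layer; the vocabulary size, sequence length, and embedding dimension are illustrative, and in practice the embedding weights would be initialised from the pretrained Spanish models:

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB, DIM = 60, 50000, 300  # illustrative sizes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(input_dim=VOCAB, output_dim=DIM),  # pretrained weights would be loaded here
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # satiric vs non-satiric
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```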

The contextual word embeddings (BETO) are obtained from the Spanish version of BERT [4]. In addition, we compare their reliability with multi-lingual embeddings from BERT (mBERT) [5] and its distilled version (dmBERT) [24]. For all these embeddings we rely on the HuggingFace transformers library to fine-tune the models with the Spanish SatiCorpus 2021. However, these embeddings are difficult to combine with other feature sets because they are time consuming. Therefore, for the contextual sentence embeddings (BF) we extract the fixed representation of the [CLS] token, as suggested in [23], and we use this representation to combine the contextual embeddings more easily with the rest of the feature sets (see the sketch below). Our results suggest that the performance of contextual sentence and word embeddings is similar in terms of precision, recall, and accuracy, on this and other datasets.
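
A sketch of how a fixed [CLS] sentence representation can be obtained with the HuggingFace transformers library; the checkpoint name refers to the publicly available BETO model and the input text is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "dccuchile/bert-base-spanish-wwm-cased"  # public BETO checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("El gobierno anuncia nuevas medidas", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]  # [CLS] token, shape (1, 768)
print(cls_vector.shape)
```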

After the feature sets are extracted, we conduct feature normalisation and feature selection stages. First, we apply a MinMax scaler to the linguistic features, as they mix features on different scales (raw counts and percentages). Next, we select the best features using Information Gain, discarding those features that fall in the last quartile. A minimal sketch of this stage is shown below.
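
A minimal sketch of this normalisation and selection stage with toy feature values, using mutual information as the information-gain criterion:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(200, 365)           # toy matrix of 365 linguistic features
y = np.random.randint(0, 2, size=200)  # satiric / non-satiric labels

X_scaled = MinMaxScaler().fit_transform(X)
scores = mutual_info_classif(X_scaled, y, random_state=0)
keep = scores >= np.quantile(scores, 0.25)  # discard the last quartile
X_selected = X_scaled[:, keep]
print(X_selected.shape)
```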

Hyper-parameter tuning

A hyperparameter optimisation stage is conducted in order to evaluate the configuration of several neural network architectures for satire identification. Our strategy consists of training an independent neural network per feature set. We rank the models by accuracy. The hyperparameters evaluated are: (1) the number of hidden layers as well as the number of neurons per layer; (2) the dropout rate, to avoid overfitting; (3) the activation function; (4) the learning rate and the use of a time-based decay scheduler; and (5) the batch size. A minimal sketch of one search trial is shown below.
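
An illustrative random-search trial over these hyperparameters; the value ranges are examples rather than the exact grids we explored:

```python
import random
from tensorflow.keras import layers, models, optimizers

def build_model(n_layers, n_neurons, dropout, activation, lr, input_dim):
    model = models.Sequential([layers.Input(shape=(input_dim,))])
    for _ in range(n_layers):
        model.add(layers.Dense(n_neurons, activation=activation))
        if dropout:
            model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

trial = {
    "n_layers": random.choice([1, 2, 3]),
    "n_neurons": random.choice([2, 4, 8, 16, 32, 48, 64]),
    "dropout": random.choice([0.0, 0.2, 0.5]),
    "activation": random.choice(["relu", "tanh", "sigmoid"]),
    "lr": random.choice([1e-2, 1e-3, 1e-4]),
}
model = build_model(input_dim=365, **trial)
```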

The best architecture and hyperparameters for each feature set can be seen in Table 2. It can be appreciated that the best accuracy is always obtained with shallow neural networks, that is, neural networks with at most one or two hidden layers. We observe this both for individual feature sets (LF, SE, WE, BF) and when they are combined in groups of two, three, or four within the same neural network. However, the number of neurons per layer varies, with better results achieved with a larger number of neurons in the models in which LF are present. We notice that the number of neurons with WE is small, only 2, and a similar result occurs when combining SE and BF, with 4 neurons. Concerning the dropout rate, only four experiments provided better results without it: WE; BF with SE; and LF, SE, WE, and BF. With respect to the activation function, ReLU appears most often, whether the feature sets are used in isolation or combined.

Table 2 Hyperparameter evaluation over the validation dataset per feature set in isolation or combined within the same neural network

Results

For each model we evaluate the precision (see Eq. 1), recall (see Eq. 2), and F1-score (see Eq. 3) of the satirical and non-satirical classes. In addition, we include the macro-averaged scores of both labels together with the accuracy (see Eq. 4) to compare the overall performance of each model.

$$\begin{aligned} {\text {Precision}} = \frac{TP}{TP + FP} \end{aligned}$$
(1)
$$\begin{aligned} {\text {Recall}} = \frac{TP}{TP + FN} \end{aligned}$$
(2)
$$\begin{aligned} {\text {F1-score}} = 2 \times \frac{{\text {Precision}} \times {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}} \end{aligned}$$
(3)
$$\begin{aligned} {\text {Accuracy}} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(4)

Table 3 shows the results for each feature set. The best result is achieved with BF (96.78% accuracy), which largely outperforms the rest of the feature sets: LF (85.14% accuracy), WE (84.25% accuracy), and SE (82.04% accuracy). We can observe that LF are good indicators for satire identification, improving on the results of SE and WE. The precision for satiric and non-satiric sentences is similar across all feature sets, although SE shows the largest difference, with higher precision for satiric documents; a similar but slighter behaviour is observed for multi-lingual BERT. When comparing the results achieved with different versions of transformers (BETO, multi-lingual BERT, and distilled multi-lingual BERT), learning the embeddings from a specific language rather than from a multilingual corpus is beneficial, as there is an accuracy difference of 6.37% between BETO and mBERT and of 13.042% between BETO and distilled mBERT. This finding is in line with other research comparing embeddings learnt from monolingual and multilingual data sources [13, 26]. The accuracy achieved with BF is similar to that achieved with the fine-tuned BETO model (96.897% accuracy). Thus, the fixed vectors of the [CLS] token extracted from the fine-tuned model contain the relevant information of the sentence without sacrificing performance. Moreover, we can observe from Table 2 that the best results were achieved with a shallow neural network composed of one hidden layer of 48 neurons. In addition, this fixed sentence representation eases the combination with other feature sets, improving the overall performance.

Table 3 Precision, recall, F1-score of satiric and non-satiric labels by feature set separately

Next, Table 4 shows the results achieved by combining different feature sets in the same neural network, applying a knowledge integration strategy. LF improve the results when combined with SE (89.044% accuracy), WE (87.740% accuracy), and BF (97.405% accuracy). This finding indicates that the linguistic features contain information beyond what can be extracted from the embeddings. Combinations based only on embeddings, however, do not improve the results significantly, as can be observed for SE with WE (84.308%) and SE with BF (96.870%). It is worth mentioning that the accuracy achieved with the combination of LF, SE, and BF (97.32% accuracy) is lower than the one obtained by combining LF and BF. The combination of all features (LF, SE, WE, and BF) is also lower (97.268%), but it achieves very good precision and recall for both satiric and non-satiric documents.

Table 4 Precision, recall, F1-score of satiric and non-satiric labels by feature sets combined

Next, we evaluate the combination of the models by means of ensembles. These results are shown in Table 5. Specifically, three ensemble learning strategies were evaluated: (1) highest probability, (2) weighted mode, and (3) the average of probabilities. The highest-probability strategy consists of inspecting the probability reported by each model that a text belongs to the satirical or non-satirical category and selecting the highest one. As we can observe, this strategy reports the highest precision for the non-satire label and the highest recall for the satire class, so it may suit systems that focus on one of these metrics. However, both the recall of the non-satire class and the precision of the satire class are limited. Next, in the weighted mode strategy, also known as soft voting, each model emits a vote to determine whether a text is satiric or not. As not all classifiers perform equally well, we ranked them according to their results on the validation split, so the vote of BF weighs slightly more than the rest. Note that the weighted mode strategy achieves the overall best result of all the strategies evaluated, reporting a macro weighted F1-score of 95.510% and performing well regardless of the metric (precision or recall) and the label (satiric or non-satiric). The last strategy consists of averaging the predictions of all classifiers to emit the final vote. This strategy achieves slightly lower results than the weighted mode, with a macro weighted F1-score of 93.712%, but with similar performance across labels and metrics. A minimal sketch of these strategies is shown below.
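
The following sketch illustrates the three strategies on toy per-model probabilities that a document is satirical; the probabilities and weights are illustrative:

```python
import numpy as np

# One row per model (e.g. LF, SE, WE, BF), one column per document.
probs = np.array([[0.20, 0.90],
                  [0.40, 0.70],
                  [0.30, 0.80],
                  [0.10, 0.95]])
weights = np.array([1.0, 1.0, 1.0, 1.2])  # BF weighted slightly higher

# (1) Highest probability: the most confident model decides.
confidence = np.abs(probs - 0.5)
best_model = confidence.argmax(axis=0)
highest = probs[best_model, np.arange(probs.shape[1])] >= 0.5

# (2) Weighted mode: weighted vote over the per-model decisions.
votes = (probs >= 0.5).astype(float)
weighted_mode = (votes * weights[:, None]).sum(axis=0) >= weights.sum() / 2

# (3) Average of probabilities.
average = probs.mean(axis=0) >= 0.5

print(highest, weighted_mode, average)
```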

Table 5 Precision, recall, F1-score of satiric and non-satiric labels by ensemble learning strategy for LF, SE, WE, and BF

Analysis

The Spanish SatiCorpus 2021 has been evaluated with different feature sets in isolation (see Table 3) and combined, either within the same neural network or by means of ensembles (see Tables 4 and 5, respectively). The results are all very competitive. We observe a significant quality leap with contextual embeddings, whether sentence- or word-based. Moreover, the contextual embeddings based on BETO [4] significantly improve on the results achieved by embeddings learnt from a multi-lingual corpus (mBERT). We also observe that LF in isolation achieve higher accuracy than the models based on non-contextual embeddings, improving on SE (85.145% accuracy with LF vs 82.042% with SE) and WE (85.145% accuracy with LF vs 84.253% with WE). In addition, a relevant finding is that combining LF with the other feature sets results in more reliable systems, whereas combining the non-contextual embeddings with each other does not outperform the results achieved separately. We observe a significant increase when combining LF with SE (89.044% accuracy) and with WE (87.740% accuracy). The increase with BF, however, was less pronounced (from 96.787% to 97.405% accuracy).

It is worth mentioning that training the neural networks that adopt the knowledge integration strategy is time consuming. First, because we perform an extra hyperparameter optimisation process for these neural networks instead of simply combining the best individual architectures. Second, because the neural networks that rely on word embeddings (as opposed to sentence embeddings) require millions of parameters rather than the thousands used with sentence embeddings. In contrast, the ensemble learning strategy is faster, as we combine the predictions of already existing models, while its results are only slightly lower (from an accuracy of 97.323% combining LF, SE, and BF to 95.511% with an ensemble of LF, SE, WE, and BF applying the weighted mode).

As the linguistic features provide some degree of interpretability, we calculate the contribution of each linguistic category separately. This information is shown in Table 6. We can observe that stylometry (STY) is the linguistic category that provides the highest reliability in terms of precision and recall (76.313% precision, 76.126% recall). The feature categories that contain fewer linguistic features, namely phonetics (PHO) and semantics (SEM), are the ones that achieve the lowest results, with F1-scores of 34.115% and 34.366%, respectively. In fact, when observing the results per class, it can be noticed that the recall of these models was biased towards the non-satirical category, predicting this class almost 100% of the time. The reason is that these features do not appear very often in the dataset. For example, only a small proportion of the documents make use of elongation (PHO), so no model is capable of making reliable predictions with only this information.

Table 6 Ablation analysis per linguistic category

As we can observe, the results achieved by merging all the LF categories largely outperform the results achieved separately, reaching an accuracy of 85.142% (see Table 3). This indicates that the linguistic categories are complementary. For clarity, we calculate the average of all features organised by linguistic category and label (satire and non-satire) in order to observe these differences in a polar chart (see Fig. 3). It draws our attention that the average of each linguistic category is similar regardless of the class, with the notable exception of correction and style (COR), which is more prominent in non-satiric documents. As can be observed, there are no data for semantics (SEM), phonetics (PHO), and pragmatics (PRA). Regarding psycho-linguistic processes (PLI) we observe no significant difference, and only a small difference concerning stylometry (STY). With respect to the use of social media jargon (SOC) and register (REG), there are differences between satiric and non-satiric utterances, although these categories do not appear regularly in the dataset.

Fig. 3 Polar chart per linguistic category arranged by satiric and non-satiric documents

Next, we calculate the contribution of each linguistic feature to the class by computing its mutual information. We select the most discriminatory features and observe whether they appear more often in satirical or non-satirical documents. This information is depicted in Fig. 4. We can observe that features related to readability (STY), which involve the number of words, syllables, and sentences, are strongly discriminatory. In general, however, these features have a similar representation in satiric and non-satiric documents. Surprisingly, within the correction and style category (COR), we observe that orthographic errors are more common in non-satirical documents than in satirical ones. A similar finding is observed for the number of hashtags, which appear mostly in non-satirical documents. Concerning morphological features, the use of pronouns and nouns is discriminatory, with pronouns slightly more frequent in satirical documents whereas nouns are more common in non-satirical documents. This finding suggests that satirical documents make use of personification to perform the satire.

Comparison with previous datasets

In order to assess the reliability of our proposal, we compare our methods on the dataset compiled by Barbieri [3], the two datasets compiled by Salas-Zárate [21] concerning European Spanish and Mexican Spanish, and the three datasets of the IroSVA 2019 shared task [18] based on Spanish from Spain, Cuba, and Mexico. Table 7 contains the results for each dataset. We report the results of our approach based on the combination of LF and BF, applying a knowledge integration strategy, as this approach provided our best results on the Spanish SatiCorpus 2021. It is important to note that neither [21] nor [3] provided the distribution of labels for training and testing, as they rely on ten-fold cross-validation. However, in order to maintain our pipeline, we divided these datasets into training, validation, and testing in the same way as we did with the Spanish SatiCorpus 2021.

Fig. 4 Information gain of the top ranked linguistic features per label

Table 7 Comparison of accuracy and macro averaged precision, recall, and F1-score with other datasets related to satire and irony identification

The first experiments, conducted by Barbieri [3], achieved a macro F1-score of 85.200% using linguistic and hand-crafted features. Our approach achieves a macro F1-score of 90.007%. However, these results should be compared with caution, as Barbieri reports results with ten-fold cross-validation whereas we report ours on a test set of 20%. We observed that the most relevant linguistic features are related to stylometry and punctuation, including the readability and the length of the documents. We also observed that common nouns and suffixes were relevant morphosyntactic features.

When compared with Salas-Zárate [21], we can observe that for the European Spanish dataset our approach based on LF and BF largely outperforms the results achieved with LIWC, from 85.5% accuracy to 95.6477% accuracy. In this sense, the results achieved by Salas-Zárate using only linguistic features are competitive, but far from the performance that contextual embeddings provide. In the case of the Mexican Spanish dataset, our results also outperform those of Salas-Zárate, from 84% accuracy to 92.8427%. For European Spanish, we found that morphological features stand out over the stylometric clues for classification, including indicative simple present and imperative verb tenses and prepositions, but also the presence of orthographic mistakes. For Mexican Spanish, however, due to the biases of the dataset, the social media features (hashtags, mentions) are the most relevant, together with features related to misspellings, stylometry, and punctuation.

Finally, we compare our approach with IroSVA 2019. Note that as this competition is recent, the approaches submitted by the participants were very competitive, and our approach only outperforms the official best result on the Cuban dataset, with a macro F1-score of 66.336% compared with 65.960%. For the Spanish and Mexican datasets, however, our results were slightly inferior: 70.944% vs 71.670% macro F1-score for European Spanish, and 66.395% vs 68.03% macro F1-score for Mexican Spanish. We found some differences among the relevant linguistic features across datasets. First, we only found performance errors to be relevant for the Spanish dataset, whereas we could not identify any relevant correction and style feature for the Mexican and Cuban ones. The presence of psycho-linguistic processes in the Mexican and Cuban datasets also draws our attention: these features were related to positive sentiment for Mexican and to anger for Cuban. Besides, we found a small bias in the Spanish dataset, with the presence of hyperlinks as a relevant feature. The most common linguistic category across all datasets is stylometry, with features related to the length of the documents (Spanish), the standard TTR (Mexican), and the number of sentences (Cuban).

After analysing the results and comparing them with other datasets, we draw the following insights:

  • The results achieved by contextual sentence and word embeddings are similar. However, sentence embeddings are easier and faster to combine with other features.

  • Embeddings learnt with monolingual approaches are superior to those learnt with multi-lingual datasets, as the results achieved with the Spanish version of BERT largely outperform the results achieved with multilingual BERT.

  • The identification of satire by means of linguistic features is more reliable when based on stylometric clues than on other clues such as semantics, pragmatics, or sentiment.

  • Linguistic features are complementary to all kinds of embeddings, whether word- or sentence-based and whether contextual or non-contextual. However, the contribution of the linguistic features to the contextual sentence embeddings is limited, due to the high performance achieved by these embeddings in isolation.

  • No single linguistic category stands out from the rest. Although the best results for an individual category were obtained with stylometry, the overall results obtained by the LF are superior. We also observed that some categories are not relevant for satire identification, especially those related to phonetics (PHO), semantics (SEM), and pragmatics (PRA).

Conclusions and further work

In this paper we have described the compilation of the Spanish SatiCorpus 2021 dataset for satire identification. This corpus is balanced and contains a total of 36,414 documents. During its compilation, we focused on avoiding the problems identified when analysing past resources, namely the limited number of different accounts, the heterogeneity between the topics discussed in satirical and non-satirical documents, and the presence of clues within the texts that could bias the results towards identifying the author of the tweet rather than whether the document is satirical or not. As noted in Sect. 3, this dataset has been released to the scientific community. In addition, the dataset has been comprehensively analysed with different feature sets and their combinations, including linguistic features and a wide variety of embeddings.

As promising future research lines, we propose extending the Spanish SatiCorpus 2021 to include information regarding the mechanisms through which satire is performed. On the one hand, this problem is harder to solve than simple binary classification; on the other hand, these insights would provide deeper understanding of satire identification. We also propose investigating the interpretability of the linguistic features beyond model-agnostic approaches, focusing on how they behave within the neural network, in order to better understand in which cases LF and embeddings are complementary. Regarding model selection, we will consider replacing the hyperparameter optimisation stage, based on random search, with other approaches such as Bayesian optimisation. In this sense, we will evaluate other strategies to compare the models, as well as the adoption of nested cross-validation strategies to obtain models that generalise better.