Introduction

Offensive speech is defined as speech that offends someone. A text is considered offensive if it includes any form of unacceptable language, that is, if it contains insults, threats, or bad language [3]. Offensive speech varies widely, from simple profanity to much more serious types of speech [53]. One of the most problematic types of offensive language is hate-speech, since its presence on social media platforms has been shown to correlate with real-life hate crimes [39].

It is quite difficult to distinguish between offensive language and hate-speech, as there are few universal definitions [11]. However, all definitions agree that hate-speech is language that targets a person or group with the intent to be harmful or to cause social chaos. This targeting is usually done on the basis of characteristics such as race, gender, sexual orientation, nationality, or religion [52]. Offensive language, on the other hand, is a more general category containing any kind of profanity or insult. Hate-speech can, therefore, be regarded as a subset of offensive language. Zampieri et al. [59] propose guidelines for classifying offensive language, as well as its type and target. These guidelines collect the characteristics of offensive language in general, hate-speech, and other types of targeted offensive language, such as cyberbullying [25, 31].

Given the enormous amount of user-generated content on social networks, it is not feasible to rely on manual supervision to stop hate-speech. In view of this, our study aims to contribute to the detection of hate-speech in Spanish, given the growing need for research on this topic in languages other than English [15]. For this purpose, we compile the existing Spanish hate-speech corpora and analyze several classification techniques based on linguistic features, transformers, and different integration mechanisms. We selected these approaches because they have enabled significant advances in most Natural Language Processing (NLP) classification tasks [54]. To this end, we define the following research questions:

  • RQ1. Which individual features are most effective for hate-speech detection?

  • RQ2. How can features be integrated for more robust systems?

  • RQ3. Is it possible to characterize the language of the different hate-speech types by means of explainable linguistic features?

  • RQ4. Do our methods improve the results of the state-of-the-art?

The main contributions of this work are:

  1. Hate-speech detection in Spanish. We focus on Spanish, since the number of existing works in this language is small compared to English, and it is important to increase the reliability of hate-speech detection systems in other languages as well.

  2. Compilation of all existing Spanish datasets for hate-speech detection, and experimentation with transformers and state-of-the-art features. Existing studies to date work with one or two of the best-known Spanish datasets (HaterNet and HatEval). However, there are more datasets that the scientific community should be aware of and that could help advance the study of this phenomenon.

  3. Use of the UMUTextStats tool [18, 19] and fine-grained negation features [27, 28] to characterize the language present in each type of hate-speech by means of explainable linguistic features.

  4. A new method for hate-speech detection in Spanish that outperforms the state of the art. Our proposal, based on the combination of linguistic features and transformers, outperforms current solutions.

The rest of the paper is organized as follows: “State of the art” compiles novel studies and shared-tasks regarding hate-speech detection using NLP, with special attention to those oriented to Spanish. “Datasets” describes in detail the datasets involved in this study. In “Materials and methods”, the reader can find a detailed description of our pipeline, including the feature sets and the deep-learning architectures evaluated. In “Results and analysis”, we present and analyze the results of the experiments carried out to answer each of the research questions formulated. Finally, “Conclusion and further work” summarizes the insights achieved and proposes promising research directions.

State of the art

Hate-speech detection using NLP is a recent task in Spanish, so the number of existing studies is limited. However, its importance has led to an increasing number of researchers focusing on this topic. For an overview of hate-speech detection, there are two insightful surveys. On the one hand, in [52], the authors present the terminology employed to discuss this topic and analyze the methods and features used for hate-speech classification. Moreover, they give some insights on data annotation and identify context and language as challenges, because most of the research is on English data. On the other hand, Fortuna and Nunes [15] also identify the need for studies in languages other than English. In their survey, they analyze the concept of hate-speech from different perspectives and provide a helpful definition for building automatic detection systems. In addition, they compare hate-speech with some related concepts: cyberbullying, abusive language, discrimination, toxicity, flaming, extremism, and radicalization. Furthermore, they conduct a systematic literature review and analyze the works by approach, algorithm, features, and datasets, among other criteria. Finally, they identify challenges and opportunities.

Most studies on hate-speech focus on the identification of racism [51, 55], the detection of misogyny [17, 18, 47], the identification of xenophobia [7, 47], and the recognition of hate in general [3, 10, 31]. In fact, we can find a large set of shared-tasks on these topics, such as the AMI shared-task on Automatic Misogyny Identification at IberEval 2018 [14] and Evalita 2018 [6], the 2019 and 2020 editions of the HASOC track on hate-speech and offensive content identification [30, 32, 35, 36], and the HatEval shared-task on the detection of hate-speech against immigrants and women [4], among others. Regarding the origin of the analyzed data, there are different sources, such as Twitter [26], Facebook [49], YouTube [50], and Yahoo! [57], with Twitter being the most commonly used. With respect to the languages in which the studies are conducted, there are studies in Arabic [1], Croatian [33], Danish [53], Dutch [55], English [14], French [40], German [23], Hindi–English [5], Indonesian [2], Italian [6], Portuguese [16], Spanish [3], and Turkish [9], but by far the majority of them are in English. As our work focuses on Spanish, we present below a brief review of works on the automatic detection of hate-speech in Spanish.

Table 1 Features most commonly used in recent works on hate-speech

The majority of the studies found in Spanish are related to participation in the shared-tasks AMI 2018 [6, 14] and HatEval 2019 [4]. Regarding the techniques employed for hate-speech detection, some approaches are still based on traditional techniques, such as Support Vector Machines (SVM) [45, 56], as well as traditional feature sets, such as n-grams and TF–IDF. In relation to neural network models, Long Short-Term Memory networks (LSTMs) and Convolutional Neural Networks (CNNs) are the most popular architectures employed by the teams [13, 58]. As noted in [3], only a few of the participants evaluate the reliability of modern approaches based on transformers [20].

There are, however, some works outside these shared-tasks, such as those described in [3, 18, 44]. García-Díaz et al. [18] focus on misogyny identification: they compile a dataset covering three sub-types of misogyny in Spanish and evaluate a set of linguistic features and sentence embeddings. HaterNet [44] is another recent work that evaluates hate-speech detection in Spanish. We describe these datasets, as well as the techniques employed, in detail in “Datasets”, because they are the datasets used for answering the research questions proposed in this work. Last, the most recent work we are aware of is [3], in which the authors evaluate non-contextual and contextual embeddings, including multilingual and monolingual pre-trained language models such as mBERT, XLM, and BETO [8], over the HaterNet [44] and HatEval [4] datasets. They conclude that BETO outperforms mBERT and XLM, pointing out that it is necessary to train a model on Spanish, since such a system is capable of modeling the vocabulary more accurately. However, it is worth noting that the precision on the hate-speech label was higher with a simple logistic regression and TF–IDF on the HaterNet dataset, whereas it was higher with pre-trained word embeddings and a CNN on the HatEval dataset.

Finally, we would like to point out that our work is related to that presented in [3] because of its focus on the identification of hate-speech in Spanish and the evaluation of state-of-the-art transformers. However, our work differs from it in the following aspects. First, we analyze different feature sets separately, based on linguistic features and transformers, and consider knowledge integration and ensemble learning strategies to build more robust solutions. Second, we analyze the reliability of the features employed to gain insights into whether these features are common across hate-speech categories. Third, we evaluate novel Spanish BERT models such as Spanish RoBERTa and BERTIN.

Table 1 summarizes the features most commonly used in the most recent works on hate-speech. We have classified them into linguistic features or n-grams (LF), pretrained word embeddings (WE), sentence embeddings (SE), and BERT-based models (BERT based). We can observe that the majority of works employ word embeddings, and that BERT-based features are being explored in the latest works, as they have improved results in several NLP tasks.

Datasets

This section describes the Spanish hate-speech datasets involved in this study. They focus on three topics: misogyny, xenophobia, and hate in general. Table 2 summarizes the statistics of the datasets, which are (1) the Spanish MisoCorpus 2020, (2) AMI 2018, (3) HaterNet, and (4) the Spanish split of the HatEval 2019 dataset.

Table 2 Corpus statistics regarding size

The Spanish MisoCorpus 2020Footnote 1 [18] is divided into three splits: (1) VARS, considering violence towards women in politics and public media; (2) SELA, concerning the differences between misogynistic messages in Spanish from Spain and from Latin America; and (3) DDSS, which contains general traits related to misogyny. For this work, we consider the full dataset, which contains 8390 tweets. This corpus is slightly imbalanced, with more tweets labeled as non-misogyny. It was manually annotated by three human annotators. It is worth mentioning that the experiments conducted in [18] were performed on a balanced dataset, in which the authors sub-sampled the non-misogyny class. However, we compile all the tweets in order to keep the imbalance, which is, on the one hand, more realistic and, on the other hand, similar to that observed in the rest of the evaluated datasets. Moreover, the authors evaluated the reliability of their methods using ten-fold cross-validation with traditional machine learning. We consider, however, that for a correct comparison with the rest of the datasets it is better to split the Spanish MisoCorpus 2020 into training, development, and testing sets. The best result achieved by the authors was an accuracy of 82.882% with a combination of linguistic features and sentence embeddings.

Fig. 1 System architecture

The second dataset is from the shared-task Automatic Misogyny Identification [14] (AMI 2018)Footnote 2, proposed at IberEval 2018. The dataset is multilingual, with 4138 tweets written in Spanish and English. It has two subtasks: binary misogyny identification and a twofold multi-classification of misogynistic behavior. The latter involves, first, determining the trait of the misogyny (stereotypes, dominance, derailing, sexual harassment, threats of violence, and discredit) and, second, determining whether the target of the misogynistic commentary is a particular individual or a group. In the scope of our work, we focus on the binary classification problem. To solve this task, the participants of the shared-task submitted several proposals, the best result being an accuracy of 81.4681% [42], achieved with an SVM architecture that combined several features, including lexicons of abusive words, with a special focus on sexist slurs and abusive words targeting women.

The HaterNetFootnote 3 [44] dataset was compiled from Twitter. The authors started from an initial set of 2 million tweets that were filtered automatically and manually tagged by four human annotators. HaterNet is the most imbalanced of the datasets, with 1567 documents annotated as hateful and 3600 annotated as non-hateful. For the evaluation, the authors focused on the F1-score of the hateful class. In their research, the authors of the HaterNet dataset proposed a combination of recurrent neural networks and multilayer perceptrons to combine embeddings, emojis, and other statistical features, achieving an area under the curve (AUC) of 0.828.

The last dataset is HatEval 2019Footnote 4 [4], provided in a shared-task at SemEval 2019. This dataset was released for evaluating the detection of hate-speech towards immigrants and women. According to the overview of the HatEval 2019 task, “most part of the training set of tweets against women has been derived from an earlier collection carried out in the context of two previous challenges on misogyny identification”. Those datasets are AMI and EVALITA 2018. This suggests that the part of HatEval 2019 concerning misogyny is highly biased towards those datasets. Two subtasks were proposed in HatEval 2019: (1) hate-speech detection against immigrants and women, and (2) aggressive behavior and target classification, which tries to determine whether the target is an individual or a group. The Spanish split of HatEval 2019 consists of 6599 tweets divided into training, validation, and testing sets. The best result achieved in the Spanish binary subtask was a macro-averaged F1-score of 73%. For the rest of the participants, the average was 68.21% with a standard deviation of 5.21%, suggesting that most of the results were competitive. Among the participants, there was a tie for first position between [45] and [56], both using SVMs but evaluating different feature sets, including bag-of-words, linguistic features, and Part-of-Speech features, among others.

Materials and methods

To answer the research questions of this study, we implement a systemFootnote 5 based on linguistic features, transformers, and different integration mechanisms (knowledge integration with deep learning and ensemble learning). Figure 1 depicts the pipeline of our proposal. In a nutshell, it can be described as follows. First, the DatasetSelector module acts as input and is responsible for selecting one of the evaluated datasets. Second, the TextCleaner module cleans and pre-processes the texts to make them more uniform. Third, the DatasetSplitter module produces the training, validation, and testing splits. In addition, for training the models, the training split is used to fit the feature generation and feature selection processes, carried out by the FeatureGenerator and FeatureSelector modules, respectively. Fourth, the ModelResolver is the other input and is responsible for selecting the strategy used to evaluate the datasets. Next, the HyperParameterSelector module evaluates different neural network architectures and hyperparameters to obtain the most suitable combination for each feature set and dataset. Note that the ModelResolver can select one of these alternatives: (1) deep learning, which handles each feature set separately; (2) knowledge integration, which uses two or more feature sets in the same neural network architecture; and (3) ensemble learning, which combines the predictions of the best models of each feature set. In the following sections, these modules are described in detail.

DatasetSelector

It loads the datasets and normalizes them into a common format, standardizing the names of the labels, as some of these datasets use numbers as labels and others use textual categories.

TextCleaner

It generates two clean versions of the texts. In the first clean version, URLs, hashtags, mentions, emojis, and punctuation symbols are removed, digits are replaced with the token [NUMBER], and elongations of certain letters are collapsed. This version is used for extracting some linguistic features related to Part-of-Speech, especially those related to proper names, such as surnames based on toponyms (Sevilla, Madrid), professions (HerreroFootnote 6, ZapateroFootnote 7), or physical traits (RubioFootnote 8), which are common in Spanish. Next, a second clean version is generated by lowercasing the first one. This second version is used to generate the word and sentence embeddings, as explained in “FeatureGenerator”. In addition, the original version of the text is kept to obtain features regarding the usage of uppercase or misspellings.
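
As an illustration, both versions can be produced with a handful of regular expressions. The following is a minimal sketch under simplified rules (the function name is ours, and the actual implementation is more elaborate, e.g., regarding emoji ranges):

```python
import re

def clean_text(text: str):
    """Produce the two clean versions described above (simplified rules)."""
    v1 = re.sub(r"https?://\S+", " ", text)   # remove URLs
    v1 = re.sub(r"[@#]\w+", " ", v1)          # remove mentions and hashtags
    v1 = re.sub(r"\d+", " [NUMBER] ", v1)     # replace digits with a token
    v1 = re.sub(r"[^\w\s\[\]]", " ", v1)      # drop punctuation and emojis
    v1 = re.sub(r"(\w)\1{2,}", r"\1", v1)     # collapse elongations ("holaaa" -> "hola")
    v1 = re.sub(r"\s+", " ", v1).strip()
    v2 = v1.lower()                           # lowercased copy for the embeddings
    return v1, v2

print(clean_text("@user Buenooo!!! mira https://example.com #odio 123"))
# ('Bueno mira [NUMBER]', 'bueno mira [number]')
```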

DatasetSplitter

It splits each dataset into training, validation, and test sets. We adopt the following strategy: (1) if the dataset already has these three splits, we use them; (2) if the dataset contains only training and test sets, we generate a validation set by randomly selecting 20% of the training split, keeping the test set as it is; (3) if the dataset is not partitioned, we perform a random split in a 60–20–20 ratio, keeping the class proportions of the splits balanced.
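
For case (3), a stratified split keeps the class proportions stable across the three sets. A minimal sketch with scikit-learn (the helper name and the seed are illustrative):

```python
from sklearn.model_selection import train_test_split

def split_60_20_20(texts, labels, seed=42):
    """Stratified 60-20-20 split. Case (2) is the same second call,
    with test_size=0.20 applied to the training split only."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.40, stratify=labels, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```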

FeatureGenerator

This module generates the features used to represent the texts. We evaluate the following feature sets: linguistic features (LF), pretrained word embeddings (WE), sentence embeddings (SE), and fine-tuned BERT embeddings (BF). These features have been selected because they are the most commonly used in existing works on hate-speech, as can be seen in Table 1.

We extract the linguistic features (LF) with UMUTextStats [18, 19] and extend them with fine-grained negation features from [27, 28]. Concerning the negation features, we get the list of negation cues appearing in each text (simple cues, e.g., “no”/not; continuous cues, e.g., “en mi vida”/in my life; and discontinuous cues, e.g., “ni...ni”/nor...nor) and compute their total. UMUTextStats is a text analysis tool focused on Spanish. It collects a total of 365 linguistic features organized into the following categories: phonetics, morphosyntax, correction and style, semantics, pragmatics, stylometry, lexical, psycho-linguistic processes, register, and social media jargon.
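
To illustrate the negation features, the sketch below counts the three cue types in a tweet. The cue inventories shown are small illustrative samples; the actual lists from [27, 28] are far larger (121 cues in total, see RQ3):

```python
import re

SIMPLE = ["no", "nunca", "jamás", "nada", "nadie", "tampoco"]
CONTINUOUS = ["en mi vida", "en absoluto", "casi nadie"]
DISCONTINUOUS = [("ni", "ni"), ("no", "no"), ("sin", "ni")]

def negation_features(text: str) -> dict:
    t = " ".join(re.findall(r"\w+", text.lower()))
    simple = sum(len(re.findall(rf"\b{c}\b", t)) for c in SIMPLE)
    cont = sum(t.count(c) for c in CONTINUOUS)
    # A discontinuous cue fires when both of its parts occur in order.
    disc = sum(bool(re.search(rf"\b{a}\b.+\b{b}\b", t)) for a, b in DISCONTINUOUS)
    return {"simple": simple, "continuous": cont,
            "discontinuous": disc, "total": simple + cont + disc}

print(negation_features("Ni estudia ni trabaja, no hace nada"))
# {'simple': 2, 'continuous': 0, 'discontinuous': 1, 'total': 3}
```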

We extract three feature sets based on embeddings. First, we evaluate pre-trained word embeddings (WE) from word2vec [37], GloVe [43], and fastText [38]. Pre-trained word embeddings are a form of transfer learning in which the embeddings are learned from other general NLP tasks. They allow networks to converge faster, as the representation of the words starts out already clustered. Second, we obtain fixed sentence embeddings (SE) from the Spanish fastText model [22], in which every document is represented as a fixed vector of length 300. Word and sentence embeddings from pretrained models have the drawback that they do not take polysemy into account, so words have a unique representation regardless of their context. Contextual word embeddings, by contrast, take the surrounding words into account when computing the embeddings. Therefore, for the third kind of embeddings, we evaluate different BERT models: BETO [8], the Spanish adaptation of BERT [12]; multilingual BERT (m-BERT) [46]; Spanish RoBERTa [24], trained on data from the National Library of Spain; and BERTIN.Footnote 9 BERT, and consequently BETO, uses bidirectional transformers to learn contextual embeddings. Our approach to obtain these vectors is the following: we use the HuggingFace library to fine-tune BETO on each dataset separately (BF). Then, we extract the sentence embeddings, as suggested in [48], applying mean pooling on top of the contextualized word embeddings, obtaining a fixed-vector representation of length 768 for each document in the corpus. The advantage of this representation is that it is easier to combine with other feature sets while preserving performance.
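
The mean pooling step can be sketched as follows with the HuggingFace transformers library. The checkpoint name shown is, to our knowledge, the public BETO release and should be replaced by the fine-tuned model path; the attention mask ensures that padded positions do not dilute the average:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "dccuchile/bert-base-spanish-wwm-uncased"  # public BETO checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def sentence_embeddings(texts):
    """Mean pooling over contextual token embeddings, as suggested in [48]."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state     # (batch, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

vectors = sentence_embeddings(["primer tweet", "otro ejemplo"])
print(vectors.shape)  # torch.Size([2, 768])
```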

FeatureSelector

The FeatureSelector module is responsible for normalizing the features and selecting the best ones. Regarding normalization, we scale each LF individually to the range [0, 1]. We apply this strategy because the linguistic features are heterogeneous, with some features measuring percentages and others raw counts. Next, we compute the Mutual Information (MI) to observe the dependency of each feature on the label. With this information, we perform feature selection by discarding those features ranked in the last quartile. We apply this process to LF and SE. We do not apply it to BF, as we observed that feature selection is not effective on those features.
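
A minimal sketch of this module with scikit-learn (the function name is ours; note that the scaler and the MI ranking are fitted on the training split only, to avoid leakage):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif

def normalize_and_select(X_train, y_train, X_test):
    """Scale each feature to [0, 1], then drop the bottom MI quartile."""
    scaler = MinMaxScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    mi = mutual_info_classif(X_train, y_train, random_state=0)
    keep = mi >= np.quantile(mi, 0.25)   # discard the last quartile
    return X_train[:, keep], X_test[:, keep]
```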

ModelResolver

The ModelResolver selects the strategy used to train the models and the feature combinations. To address RQ1, we evaluate the feature sets separately. To answer RQ2, we evaluate two strategies. On the one hand, knowledge integration, which consists of combining several neural networks into a bigger one: each feature set is processed independently by a series of hidden layers, whose outputs are then combined to produce the final prediction or to feed further layers. On the other hand, we evaluate the combination of feature sets by means of ensembles. An ensemble combines the outputs of two or more algorithms in order to make the final prediction. Ensembles are less sensitive to the training data and usually provide better performance. Specifically, four ensemble learning strategies are considered: (1) hard voting, which selects the label with a majority vote among the individual models; (2) highest probability, which selects the highest prediction probability among all the models; (3) average probability, which averages the probabilities of each model; and (4) logistic regression, which involves training a logistic regression classifier on the probabilities produced for the training splits.
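
The four ensemble strategies reduce to a few lines over a matrix of per-model probabilities. A sketch, where probs has shape (n_models, n_samples) and holds each model's probability for the hate-speech class (function names are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hard_voting(probs):
    votes = (probs > 0.5).sum(axis=0)            # hateful votes per sample
    return (votes > probs.shape[0] / 2).astype(int)

def highest_probability(probs):
    # Flag a text if any single model exceeds the 0.5 threshold.
    return (probs.max(axis=0) > 0.5).astype(int)

def average_probability(probs):
    return (probs.mean(axis=0) > 0.5).astype(int)

def stacking_lr(train_probs, y_train, test_probs):
    # Meta-classifier trained on the training-split probabilities.
    clf = LogisticRegression().fit(train_probs.T, y_train)
    return clf.predict(test_probs.T)
```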

HyperParameterSelector

As neural networks are highly configurable, we conduct a hyperparameter optimization stage to evaluate which neural network architectures are most suitable for hate-speech recognition. For SE or LF, we rely mostly on multilayer perceptrons. Word embeddings, however, can feed several kinds of neural network architectures based on convolutional and recurrent layers. On the one hand, CNNs are most popular for solving computer vision tasks, but they can also be applied to NLP tasks such as document classification [34]. The idea behind CNNs is the usage of filters combined with pooling layers that are capable of generating high-order features. In this sense, CNNs exploit the spatial dimension of natural language, clustering adjacent words or expressions that may convey a meaning different from that of each word separately. Recurrent Neural Networks (RNNs), on the other hand, exploit the time dimension of the text, as they read the embeddings sequentially, keeping information about past, or even future, words in the case of bidirectional recurrent neural networks. Specifically, we evaluate two bidirectional recurrent networks: Bidirectional Long Short-Term Memory networks (BiLSTM) and Bidirectional Gated Recurrent Units (BiGRU).
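
As an example of the recurrent branch, a minimal Keras BiLSTM classifier over word embeddings (all sizes are illustrative; swapping the Bidirectional layer for a Conv1D plus global max pooling gives the CNN counterpart):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(vocab_size=20000, seq_len=60, embed_dim=300, matrix=None):
    emb = layers.Embedding(
        vocab_size, embed_dim, input_length=seq_len,
        weights=[matrix] if matrix is not None else None)  # pretrained or from scratch
    model = keras.Sequential([
        emb,
        layers.Bidirectional(layers.LSTM(64)),  # reads the tweet in both directions
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # binary hate / non-hate output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```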

We also evaluate the number of hidden layers and neurons. We distinguish between shallow neural networks, composed of one or two hidden layers with the same number of neurons per layer, and deep-learning architectures, with between three and eight hidden layers. For the deep-learning architectures, we test different numbers of neurons arranged in several shapes: brick, funnel, long funnel, diamond, and triangle. In addition, we evaluate dropout to avoid overfitting, and different activation functions, including ReLU, eLU, Sigmoid, Tanh, and Linear. For the learning rate, we evaluate 10^−3 and 10^−4 with a scheduler using time-based decay. Due to the size of the datasets and their slight imbalance, we decided to evaluate batch sizes of 32 and 64, including larger batch sizes (128, 256, 512) for HaterNet to ensure that every batch contains a sufficient number of hate-speech instances. All models were trained for a maximum of 1000 epochs with an early-stopping mechanism to avoid overfitting. For word embeddings, we evaluate the usage of Spanish pre-trained word embeddings from word2vec, fastText, and GloVe, as well as learning the embeddings from scratch. The hyperparameters explored are included in Table 3.
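
The shape names can be read as simple width profiles across the hidden layers. A sketch of one possible interpretation (the base width and the doubling scheme are our assumptions; the exact values explored are listed in Table 3):

```python
def hidden_widths(shape: str, depth: int, base: int = 64) -> list:
    """Neurons per hidden layer for each shape (illustrative)."""
    if shape == "brick":        # constant width
        return [base] * depth
    if shape == "funnel":       # monotonically narrowing
        return [base * 2 ** (depth - 1 - i) for i in range(depth)]
    if shape == "triangle":     # monotonically widening
        return [base * 2 ** i for i in range(depth)]
    if shape == "diamond":      # widest in the middle
        return [base * 2 ** min(i, depth - 1 - i) for i in range(depth)]
    if shape == "long funnel":  # wide plateau, then narrowing
        plateau = depth // 2
        return [base * 2 ** (depth - plateau)] * plateau + \
               hidden_widths("funnel", depth - plateau, base)
    raise ValueError(f"unknown shape: {shape}")

print(hidden_widths("funnel", 4))   # [512, 256, 128, 64]
print(hidden_widths("diamond", 5))  # [64, 128, 256, 128, 64]
```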

Table 3 Hyperparameter options for the neural networks architectures evaluated

Classification report

This module reports the results of the best model on the test split. We evaluate the precision (see Eq. 1), recall (see Eq. 2), and F1-score (see Eq. 3) of the hate-speech label (misogynous for the Spanish MisoCorpus 2020 and AMI, and hateful for HaterNet and HatEval 2019) and of the non hate-speech label (non_misogynous for the Spanish MisoCorpus 2020 and AMI, and non_hateful for HaterNet and HatEval 2019), as well as the weighted versions of precision, recall, and F1-score for the overall comparison. In addition, to compare our approach with other systems, we use the accuracy (see Eq. 4), because it is the metric used to rank the Spanish MisoCorpus 2020 and AMI 2018, and the macro F1-score, used to rank HatEval 2019.

$$\begin{aligned} {\text {Precision}}&= {\text {TP}} / ({\text {TP}} + {\text {FP}}), \end{aligned}$$
(1)
$$\begin{aligned} {\text {Recall}}&= {\text {TP}} / ({\text {TP}} + {\text {FN}}), \end{aligned}$$
(2)
$$\begin{aligned} {\text {F1-score}}&= 2 \times \frac{{\text {Precision}} \times {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}}, \end{aligned}$$
(3)
$$\begin{aligned} {\text {Accuracy}}&= \frac{{\text {TP}} + {\text {TN}} }{{\text {TP}} + {\text {TN}} + {\text {FP}} + {\text {FN}}}, \end{aligned}$$
(4)

where TP (True Positives) are those assessments where the system and the human experts agree on a label assignment, FP (False Positives) are those labels assigned by the system that do not agree with the expert assignment, FN (False Negatives) are those labels that the system failed to assign although they were given by the human expert, and TN (True Negatives) are those non-assigned labels that were also discarded by the expert.
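
These metrics correspond to scikit-learn's standard implementations. A small helper that could produce the per-class, weighted, and macro values reported in the tables (the function name and label encoding are ours):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    """Per-class, weighted, and macro metrics, matching Eqs. (1)-(4)."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
    wp, wr, wf1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    return {
        "non_hateful": {"precision": p[0], "recall": r[0], "f1": f1[0]},
        "hateful":     {"precision": p[1], "recall": r[1], "f1": f1[1]},
        "weighted":    {"precision": wp, "recall": wr, "f1": wf1},
        "macro_f1": precision_recall_fscore_support(
            y_true, y_pred, average="macro")[2],
        "accuracy": accuracy_score(y_true, y_pred),
    }
```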

Table 4 Performance of the individual features regarding hate-speech detection

Results and analysis

This section presents the results of the experiments conducted to answer the research questions formulated. In the following, each of them is addressed in a separate subsection.

RQ1. Which individual features are most effective for hate-speech detection?

The objective of this research question is to determine which feature set, in isolation, performs best in detecting hate-speech messages.

Discussion

Table 4 shows the results obtained for the individual features on each dataset. As expected, the fine-tuned versions of BETO (BF) outperform the rest of the features to a great extent. Therefore, we decided to evaluate other Spanish BERT models in order to compare the performance of different BERT-based embeddings. Table 5 presents the results of the different pretrained contextual embeddings from BERT. Specifically, we evaluate BERTIN, BETO, the Spanish RoBERTa trained on data from the National Library of Spain (BNE), and multilingual BERT (M-BERT). The best results are obtained by BETO, so we select this pretrained model to combine with the rest of the features to answer RQ2 and RQ3.

Table 5 Performance of BERT embeddings

Next, we focus on evaluating the network complexity. Table 6 shows the network architecture and the hyperparameters per feature set. It can be noticed that all the best neural network architectures are shallow, with at most one or two hidden layers and the same number of neurons in each layer. We can also observe that larger learning rates (10^−3) behave better than smaller ones (10^−4). However, there are no clear clues as to whether the learning rate is correlated with the feature set or the dataset. Regarding the activation function, ReLU is the one that most often achieves the best results, especially in AMI 2018. Tanh appears in more complex networks with a greater number of neurons. Finally, we can observe that the Spanish MisoCorpus 2020 and HaterNet share the learning rate and the activation function for LF, SE, and WE. In the case of BF, however, the learning rate is the same but the activation function is not.

Table 6 Feature hyperparameter results of the individual features regarding hate-speech detection

Response

After analyzing the performance of the evaluated feature sets, we observe that the fine-tuned embeddings from BETO (BF) outperform the rest of the feature sets. They achieve a significant increase of 4–5% over the second-best feature set (WE). In relation to the reliability of LF, they achieve competitive results on all datasets except HaterNet, where they obtain limited results regarding the recall of the hateful category. Finally, with respect to the neural network architecture, we observe that shallow neural networks with few neurons and few hidden layers behave better than deep neural networks.

Table 7 Performance of the features regarding hate-speech detection applying knowledge integration strategy
Table 8 Feature hyperparameter results of the knowledge integration strategy regarding hate-speech detection

RQ2. How can features be integrated for more robust systems?

To answer this research question, we evaluate two different strategies to combine the feature sets: on the one hand, combining them into the same neural network through a knowledge integration strategy (see “Knowledge integration strategy”) and, on the other hand, combining the results of the best model for each feature set using ensemble learning (see “Ensemble learning”).

Knowledge integration strategy

To evaluate the knowledge integration strategy, we combine the features in the same neural network and perform the hyperparameter optimization again. For this, we use the Keras functional API to develop a multi-input neural network. Each feature set is used as an input and connected to a dense layer. Then, all features are combined to produce the final prediction. Table 7 shows the results achieved by the linguistic features and the best embedding (LF, BF), all the embeddings (SE, WE, and BF), and all features (LF, SE, WE, BF).
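
A minimal sketch of such a multi-input network with the Keras functional API (input dimensions and branch widths are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_multi_input(dims):
    """One dense branch per feature set, merged before the output layer."""
    inputs, branches = [], []
    for name, dim in dims.items():
        inp = layers.Input(shape=(dim,), name=name)
        inputs.append(inp)
        branches.append(layers.Dense(64, activation="relu")(inp))
    merged = layers.concatenate(branches)   # knowledge integration point
    output = layers.Dense(1, activation="sigmoid")(merged)
    model = keras.Model(inputs=inputs, outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g., LF after feature selection, fastText SE, and fine-tuned BETO (BF)
model = build_multi_input({"LF": 280, "SE": 300, "BF": 768})
```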

For the Spanish MisoCorpus 2020, the best result is achieved with the combination of the embeddings (SE, WE, and BF), with a weighted F1-score of 90.52%. However, the results of the combination of LF and BF, as well as the combination of all features (LF, SE, WE, BF), are very similar: 90.40% and 90.44% weighted F1-score, respectively. In addition, it can be seen that precision and recall become more stable with the feature combinations than with the feature sets separately (see “RQ1. Which individual features are most effective for hate-speech detection?”). In relation to AMI 2018, the best result is achieved by combining all the feature sets, with a weighted F1-score of 84.72%. In this case, the other combinations also get similar results: 83.27% for LF and BF, and 83.39% for SE, WE, and BF. HaterNet also obtains its best result with the combination of all features (84.08%). Regarding the last dataset, HatEval 2019, the best overall result is achieved with the combination of LF and BF, with a weighted F1-score of 77.27%.

Next, we focus on evaluating the network complexity. Table 8 lists the network architecture and the hyperparameters per feature set. The combination of different feature sets within the same network achieves better results with simpler neural networks, except for AMI 2018. Although the number of hidden layers is similar, the number of neurons is considerably higher, again except for AMI 2018. For the learning rate, we can observe that larger learning rates (10^−3) behave better regardless of the feature set combination for the Spanish MisoCorpus 2020, AMI 2018, and HaterNet. However, smaller learning rates (10^−4) get better results on HatEval 2019 with the combination of all the embeddings (SE, WE, and BF) and of all the features (LF, SE, WE, BF). Regarding the activation function, it can be observed that the Spanish MisoCorpus 2020 and HaterNet perform better with the linear activation function (i.e., no non-linearity).

Table 9 Performance of the features regarding hate-speech detection applying an ensemble learning strategy over the Spanish MisoCorpus 2020
Table 10 Performance of the features regarding hate-speech detection applying an ensemble learning strategy over the AMI 2018 dataset

Ensemble learning

To evaluate the ensemble learning strategy, we exploit the best model per feature set developed to answer RQ1 (see “RQ1. Which individual features are most effective for hate-speech detection?”). Next, we generate a new prediction based on different approaches: first, using the hard voting strategy, which takes the majority vote of the models in the ensemble; second, based on the prediction of the model with the highest probability in the output layer; third, by computing the average of the predictions of the last layer; and fourth, by training a logistic regression classifier that learns to predict hate-speech based on the probabilities output by each model separately. We experiment with three combination sets: (1) linguistic features and the fine-tuned embeddings from BETO (LF, BF); (2) all the embeddings (SE, WE, and BF); and (3) linguistic features and all the embeddings (LF, SE, WE, BF). Due to the large number of datasets and strategies, we split the results across the following tables: (1) Table 9 for the Spanish MisoCorpus 2020, (2) Table 10 for AMI 2018, (3) Table 11 for HaterNet, and (4) Table 12 for HatEval 2019.

Regarding the Spanish MisoCorpus 2020 (see Table 9), the best overall result is obtained with the logistic regression strategy and all the feature sets, reaching a weighted F1-score of 90.14%. We can observe that this strategy also gets very good results with the other feature combinations evaluated. However, it is noteworthy that the highest probability strategy obtains the best precision regarding misogyny identification, but at a considerable sacrifice of recall. This strategy considers a text as hate-speech when any of its classifiers outputs a probability higher than 50%. Due to the high precision achieved, we conclude that this strategy is reliable for predicting that a text is misogynous. However, the low recall indicates that there are many FN. Therefore, it would be necessary to tune the threshold of this strategy in order to adjust the trade-off between precision and recall. On the other hand, the ensemble based on averaging probabilities performs slightly worse than the ensemble based on logistic regression, outperforming it only when combining LF and BF. Finally, the ensemble learning based on the hard voting strategy penalizes the models with LF, so the ensemble of SE, WE, and BF gets better results. When comparing these results to the ones achieved with the knowledge integration strategy (see Table 7), we can observe that the knowledge integration strategy improves on the results of the best ensemble learning strategy (logistic regression) with all the combinations: (1) 90.40% vs 89.74% with LF and BF; (2) 90.52% vs 89.96% with SE, WE, and BF; and (3) 90.44% vs 90.14% when combining all the features. However, the network architecture is more complex.

Concerning AMI 2018 (see Table 10), the best ensemble learning result is provided by the strategy based on averaging the probabilities of LF and BF, with a weighted F1-score of 83.30%. As observed for the Spanish MisoCorpus 2020 (see Table 9), the precision on the misogyny label with the ensemble based on the highest probability is high, reaching 92.90% when combining all feature sets, but with a great recall loss (53.50%). In the same line as the Spanish MisoCorpus 2020, the highest probability strategy with LF and BF achieves a reliable precision (87.10%), with a smaller drop in recall (69.90%). In contrast to the Spanish MisoCorpus 2020, however, the ensembles based on averaging probabilities get better results than the logistic regression strategy.

As for HaterNet (see Table 11), the best result corresponds to the logistic regression strategy combining the features based on embeddings (SE, WE, and BF). This model achieves a weighted F1-score of 84.34%. The same strategy, but combining all the features (LF, SE, WE, and BF), obtains a slightly lower weighted F1-score (84.25%), and 83.16% when combining LF and BF. When comparing the logistic regression strategy with the average probabilities strategy, we can observe that the weighted F1-scores are similar, but there are important differences between the precision and recall values of the hateful class. These differences were not observed in the Spanish MisoCorpus 2020 (see Table 9), the AMI 2018 dataset (see Table 10), or the HatEval dataset (see Table 12). Regarding the strategy based on the highest probability, it can be observed that the combination of all feature sets (LF, SE, WE, and BF) achieves perfect precision, but identifies only 12.60% of the hateful instances. As we observed for the other datasets, the highest probability strategy using LF and BF achieves high precision (87.10%) but limited recall (69.90%). With respect to the hard voting strategy, it reaches lower results on the hateful label, but similar precision and recall in all the ensembles with LF (67.30% precision and 65.40% recall for LF and BF, and 69.66% precision and 60.19% recall when LF is combined with the rest of the features).

Table 11 Performance of the features regarding hate-speech detection applying an ensemble learning strategy over the HaterNet dataset

With respect to HatEval (see Table 12), the best overall result is reached with the logistic regression strategy over LF and BF, with a weighted F1-score of 76.66%. Regarding the ensemble learning based on the hard voting strategy, we can observe that the precision in identifying the hateful label is limited compared with the rest of the ensemble strategies. The highest probability strategy achieves higher precision than the rest of the strategies, in a similar way as observed for the rest of the datasets. However, we can see that the balance between precision and recall is not as good as we noticed in the Spanish MisoCorpus 2020, AMI, and HaterNet. As we are able to discern which HatEval tweets focus on women and which on immigrants, we analyzed the results separately. We observed that the subset of HatEval 2019 focused on misogyny gets higher precision on the hate-speech label but limited recall, regardless of the ensemble strategy. On the other hand, the analysis of the split focused on migrants suggests the opposite: lower precision but higher recall. However, as the subset of HatEval 2019 towards women overlaps with AMI 2018, a more thorough analysis of these differences would be necessary.

Table 12 Performance of the features regarding hate-speech detection applying an ensemble learning strategy over the HatEval 2019 dataset

Response

For the combination of the features, we evaluated two strategies, one consisting of knowledge integration and the other of ensemble learning with multiple criteria. We found that the results obtained with knowledge integration are, in general, superior to those achieved with ensemble learning, although the difference is not great. However, we observed a higher complexity in the neural networks, which require more neurons than the best models of each feature set independently.

Concerning the ensemble learning study, the highest probability strategy achieves the best precision on the misogynous and hateful classes in all the datasets. However, this comes at a cost with respect to recall. We observe this especially with the HaterNet dataset, for which we obtained perfect precision but a recall of 12.60%. For systems in which precision is more important than recall, we recommend focusing on the highest probability strategy but selecting fewer feature sets, as we observe a better F1-score. In general, we can say that the strategies that provide competitive results regardless of the dataset are knowledge integration and ensemble learning based on logistic regression using LF and BF as features.

RQ3. Is it possible to characterize the language of the different hate-speech types by means of explainable linguistic features?

To address this research question, we obtain the mutual information of each linguistic feature for the different categories. In order to observe how linguistic features from different categories contribute to the identification of hate-speech, we rank those features and organize them in groups of five according to their category. Figures 2, 3, 4, and 5 present this ranking for the Spanish MisoCorpus 2020, AMI 2018, HaterNet, and HatEval 2019 datasets, respectively. Note that in some categories, such as semantics, there are fewer than five features.
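
The ranking behind Figs. 2, 3, 4, and 5 can be reproduced with a short pandas pipeline. A sketch, assuming a mapping from each feature name to its UMUTextStats category is available (the helper name is ours):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def top_five_per_category(X, y, feature_names, feature_category):
    """Rank linguistic features by MI and keep the five best per category."""
    mi = mutual_info_classif(X, y, random_state=0)
    df = pd.DataFrame({"feature": feature_names, "mi": mi})
    df["category"] = df["feature"].map(feature_category)
    return (df.sort_values("mi", ascending=False)
              .groupby("category", sort=False)
              .head(5))
```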

Discussion

Regarding the Spanish MisoCorpus 2020 (see Fig. 2), we can observe that the categories related to register are the most discriminatory. Register includes the usage of strong offensive speech, swearing, colloquialisms and, to a lesser extent, non-fluent speech. Correction and style is another relevant category, highlighting features related to orthographic errors and misspelled words. Concerning morphosyntax, the features related to grammatically feminine words, nouns, prepositions, and suffixes are strong discriminatory features. Stylometry and social media are also effective for misogyny identification, including mentions or replies that involve accounts with female names. The usage of hashtags and social media jargon also stands out. As for stylometry, we can observe that features related to readability, quotations, and the length of the text can contribute to some extent to the identification of misogynistic messages. With respect to pragmatics, discourse markers used to argue, structure, or add information are relevant features. The usage of similes in figurative language and of courtesy forms to introduce oneself into the conversation also appears as relevant. Negations appear both in misogynous and non-misogynous documents, the most relevant being ni...ni, no...no, nunca...nadie, sin...ni, and casi nadie.

Fig. 2 Mutual information of the top five ranked features per category from the Spanish MisoCorpus 2020

Next, we analyze the correlation of the linguistic features with the AMI 2018 dataset (see Fig. 3). The analysis reveals that linguistic features within the register category, specifically those related to offensive speech, are the strongest discriminatory features. AMI 2018 and the Spanish MisoCorpus 2020, which both focus on misogyny identification, share this trait. Correction and style, however, is less relevant in AMI 2018 than in the Spanish MisoCorpus 2020: in AMI 2018 there are differences in the number of misspelled words between the misogynous and non-misogynous classes, but we observe no differences in orthographic errors. Regarding morphosyntax, we can observe that verbs in the simple subjunctive or in the singular are discriminatory features, as well as words in the masculine and nominal suffixes. The lexical category is also relevant, but it differs from the Spanish MisoCorpus 2020. In AMI 2018, the most relevant topics are related to animals, female and male persons and groups, sex, and risk. In the Spanish MisoCorpus 2020, however, the relevant topics are related to locations, organizations, analytic thinking, and tentativeness. Only sex topics appear in both datasets among the most discriminating features. This finding suggests that the context in which the tweets were collected can play a relevant role. In AMI 2018, the documents were compiled using the following strategies: (1) using representative offensive words, (2) observing accounts of potential victims, and (3) observing people who explicitly declared their hate against women. The Spanish MisoCorpus 2020 shares two of the three strategies mentioned, not taking misogynist accounts into account but paying more attention to certain events, like the arrival of Greta Thunberg in Madrid for the UN Climate Change Conference, or a case of rape of a minor that occurred in Spain related to a local soccer team. Focusing on such events may boost the relevance of lexical features related to locations and organizations. It is surprising, however, that in the Spanish MisoCorpus 2020 animals is not a relevant feature for misogyny identification (in Spanish, the names of some female animals are common misogynistic insults). In the same line, male and female groups of persons are relevant features in AMI 2018 but not in the MisoCorpus. This suggests that those terms can appear in both misogynous and non-misogynous texts, so they are not good indicators of misogynous content. Another relevant difference between AMI 2018 and the Spanish MisoCorpus 2020 is social media, which has a major impact in the Spanish MisoCorpus 2020. With respect to pragmatics, mutual information in AMI 2018 suggests that figurative language plays an important role in discerning misogynistic messages through the usage of metaphors and understatements.

Fig. 3 Mutual information of the top five ranked features per category from AMI 2018

Similar to the Spanish MisoCorpus 2020 and AMI 2018, we can observe for HaterNet (see Fig. 4) that offensive speech (register) is the most discriminatory feature. However, in HaterNet the presence of swearing, SMS language, and cultisms also appears as relevant. With regard to the rest of the categories, all features behave similarly. We note common performance errors for the correction and style category, and topics related to clothes and the body for the lexical category. Intransitive verbs, as well as verbs in the indicative (simple or compound), impersonal pronouns, and articles, are also relevant features within the morphosyntactic category. Regarding pragmatics, the discourse markers related to reformulating and arguing, as well as figurative language related to similes and idioms, are relevant for hate-speech detection. In addition, we can observe that social media jargon and the usage of hyperlinks can be useful for this dataset.

Fig. 4 Mutual information of the top five ranked features per category from HaterNet

Fig. 5 Mutual information of the top five ranked features per category from HatEval 2019

Concerning HatEval 2019 (see Fig. 5), we can confirm the importance of offensive language (register), as this feature shows similar behavior in all datasets. It is worth mentioning that this analysis is biased, because some of the documents from the misogyny split of HatEval 2019 also appear in AMI 2018. Consequently, we also rank the linguistic features by mutual information using only the subset of the HatEval 2019 dataset regarding immigrants (not shown). The mutual information on the immigrants split also indicates that strong offensive speech is a relevant feature, together with the usage of colloquialisms and softer offensive language. Regarding the lexical category, the linguistic features are similar to the ones that appear for the AMI 2018 dataset, including sex, common names referring to women or groups of women, inclusive language, and exploration. When we remove from our analysis the documents towards women and focus only on hate-speech towards immigrants, we observe topics concerning sex, home, friendship, perceptual processes, and discrepancies. Some negation cues also appear as discriminatory features, including nada más, jamás, casi nadie, ni...ni, and tampoco...tan. In fact, HatEval 2019 is the only dataset of those studied in which the total number of negations appears as a discriminatory feature, although with little impact on the identification of hate-speech. In terms of morphosyntax, we can observe that suffixes, including adjectivizers and nominals, as well as prepositions and copulative verbs, are discriminatory features. With respect to pragmatics, we can observe connectors, reformulators, and words and expressions used to order the clauses. This suggests that a reflection, discussion, and/or debate occurs when speaking about women or immigrants within a conversation in a hate-speech context. Regarding psycho-linguistic processes, those related to negative sentiments, and especially anger, are discriminatory features. We can also observe the presence of negative emojis. Concerning social media usage, we observe that replying to females is a discriminatory feature. We analyzed whether this also appears in the subsets of HatEval 2019 and noticed that this feature is especially relevant to the misogyny subset. In fact, replying to males is also a relevant feature in this context.

Response

The analysis of the interpretability of the features leads to the following findings:

  1. We observe common traits in all datasets regarding the register category, such as the features related to strong offensive speech. Moreover, swearing and colloquialisms also appear as discriminatory features, but to different degrees.

  2. Regarding misogyny identification, we note that the percentage of misspelled words is relevant for the Spanish MisoCorpus 2020 and the AMI 2018 dataset. This finding does not appear in HaterNet, and is less evident in HatEval 2019, which largely shares documents with AMI 2018.

  3. Pragmatics and, specifically, discourse markers frequently appear as discriminatory features. We observe that each of these features tends to be more frequent in either the hate-speech or the non hate-speech class. We notice that argumentative markers are more common in non-misogynous texts in the Spanish MisoCorpus 2020, but more common in misogynous texts in AMI 2018. Connectors used to state a consequence are more common in non-hateful documents, as are discourse markers used for structuring the text.

  4. Linguistic features concerning the usage of social media show different behavior in the two corpora related to misogyny. The usage of mentions, hyperlinks, hashtags, and specific jargon appears as a relevant feature in the Spanish MisoCorpus 2020. However, social media features are not relevant in AMI 2018. HaterNet and HatEval show intermediate, but still not relevant, values.

  5. Topics are not shared between the datasets focused on misogyny. We observe a strong presence of topics related to locations, organizations, and analytic thinking in the Spanish MisoCorpus 2020, whereas in AMI 2018 the topics are more related to animals (as the names of some female animal species are common insults in Spanish), male and female social groups, and risk.

  6. The usage of negations is not a discriminatory feature for hate-speech identification. We conducted a deep analysis of a total of 121 negations, including simple, continuous, and discontinuous cues. However, the only dataset in which these features appear to be relevant is HatEval 2019, with more statements containing negations in the hateful class.

RQ4. Do our methods improve the results of the state-of-the-art?

To address this research question, we compare our results with the best state-of-the-art results obtained for each particular dataset. Specifically, we compare our two best strategies, consisting of knowledge integration and ensemble learning based on logistic regression over LF and BF, with the best approaches of the state-of-the-art. These models were selected because they achieved competitive results regardless of the dataset. It is worth mentioning the limitations of this comparison. First, for HaterNet and the Spanish MisoCorpus 2020, the original results were evaluated using ten-fold cross-validation, whereas in our approach we use the test set. Second, the results described in [3] regarding HaterNet use a training-test split, which is neither the same split as ours nor the one used in the original experiment by the authors, since they did not release the splits. Third, not all shared-tasks and research works focus on the same metrics: those focusing on misogyny use accuracy, HaterNet compares with the F1-score of the hateful label, and HatEval 2019 with the macro F1-score. Accordingly, we have included in Table 13 all the metrics and all the available results.

Table 13 Comparison of our approaches with the state-of-the-art for the Spanish MisoCorpus 2020, AMI 2018, HaterNet, and HatEval 2019, using accuracy (Acc), F1-score of the hate-speech class (F1_HS), and macro F1-score (M_F1)

Discussion

When comparing the results for the Spanish MisoCorpus 2020, we can observe that our proposal, grounded on the usage of linguistic features and transformers, outperforms the accuracy achieved in [18], from 85.2% to 90.4% with knowledge integration and to 89.7% with the ensemble learning based on logistic regression. It should be noted that, to the best of our knowledge, this dataset has not been evaluated in other research works, so the conclusions are limited.

Regarding AMI 2018, the best result obtained during the shared-task was an accuracy of 81.4681% by [42], slightly outperformed in [18] with 81.5217%. These results were achieved using Support Vector Machines and similar feature strategies. The results reported by our systems outperform both, but not significantly: our proposal based on knowledge integration gets an accuracy of 83.3%, and the ensemble learning based on logistic regression, 82.5%. Although our results are the best we are aware of, we consider that the novelty of the transformer-based models employed should have improved the state-of-the-art results even more.

Regarding HaterNet, we focus on the F1-score of the hateful class. In the original experiment with this dataset [44], the authors achieved an F1-score for the hateful label of 61.1%. This result was outperformed by [3] with their proposal based on BETO, with 65.8%. Our proposal based on linguistic features with a knowledge integration strategy slightly outperforms these results, achieving an F1-score of 65.9% for the hateful label, and the results are superior when applying the ensemble learning based on logistic regression, with 68.3%.

Finally, for the comparison on HatEval 2019, we rely on the macro F1-score. During the competition, the best results were achieved by [45, 56], both with a macro F1-score of 73%. These results were outperformed by [18] and [3], with macro F1-scores of 75.4% and 75.5%, respectively. Similar to what we observed in AMI 2018, our proposal slightly outperforms these results: 76.8% macro F1-score with knowledge integration of LF and BF, and 76.5% with ensemble learning based on logistic regression.

Response

Taking into account the results provided by our methods, and after comparing them with those of the state-of-the-art, we can say that our methods outperform the state of the art on all the evaluated datasets.

Conclusion and further work

In this paper, we have conducted a study on different datasets regarding hate-speech identification in Spanish, in order to determine which kinds of individual features are most effective for hate-speech detection, how these features can be combined, whether linguistic features can provide insights regarding the identification of hate-speech, and whether the methods proposed here outperform the state-of-the-art results.

As future lines of research, we plan two strategies, one related to further experimentation on the hate-speech topic and the other to an in-depth analysis of the system presented herein. On the one hand, in terms of experimentation, we will include a cross-validation strategy in our pipeline. Moreover, we will work on hate-speech related subtasks, such as determining the target, and will take into account contextual features and media features, such as images or hyperlinked content. In addition, we will also try to focus on longer documents. On the other hand, the analysis strategy will be directed towards error analysis and the use of explainability tools. First, we will perform an error analysis to determine which cases are misclassified by each of the explored feature types and why, and whether combining them improves the classification. Finally, regarding explainability, we plan to use tools like SHAP to see the contribution of each feature within the neural network. In this work, we have evaluated the reliability of using linguistic features to characterize hate-speech using model-agnostic metrics, but these features were evaluated outside the neural network.