This section presents the results of the experiments conducted to answer the research questions formulated. In the following, each of them is addressed in a separate subsection.
RQ1. Which individual features are most effective for hate-speech detection?
The objective of this research question is to determine which feature set, in isolation, performs best in detecting hate-speech messages.
Table 4 shows the results obtained for the individual features on each dataset. As expected, the fine-tuned version of BETO (BF) outperforms the rest of the features to a great extent. Therefore, we decide to evaluate other Spanish BERT models in order to compare the performance of different BERT-based embeddings. Table 5 presents the results of different pretrained contextual embeddings from BERT. Specifically, we evaluate BERTIN, BETO, the Spanish RoBERTa trained on data from the National Library of Spain (BNE), and multilingual BERT (M-BERT). The best results are obtained by BETO, so we select this pretrained model to combine with the rest of the features to answer RQ2 and RQ3.
Next, we focus on evaluating the network complexity. Table 6 depicts the network architecture and the hyperparameters per feature set. It can be noticed that all neural network architectures are shallow networks with at most one or two hidden layers, each hidden layer having the same number of neurons. We can also observe that larger learning rates (10^−3) behave better than smaller ones (10^−4). However, there are no clear clues as to whether the learning rate is correlated with the feature set or the dataset. Regarding the activation function, ReLU is the one that most frequently achieves the best results, especially in AMI 2018. Tanh appears in more complex networks with a greater number of neurons. Finally, we can observe that the Spanish MisoCorpus 2020 and HaterNet share the learning rate and the activation function for LF, SE, and WE. In the case of BF, however, the learning rate is the same but the activation function is not.
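As an illustration of how shallow these architectures are, the following sketch implements a one-hidden-layer classifier with ReLU and a sigmoid output in plain NumPy; the input dimension, hidden size, weights, and batch are illustrative placeholders, not the tuned values from Table 6:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def shallow_forward(x, w1, b1, w2, b2):
    """One hidden layer with ReLU, sigmoid output for binary hate-speech."""
    h = relu(x @ w1 + b1)
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
dim_in, dim_hidden = 300, 16          # e.g. a word-embedding input, small hidden layer
w1 = rng.normal(0, 0.1, (dim_in, dim_hidden)); b1 = np.zeros(dim_hidden)
w2 = rng.normal(0, 0.1, (dim_hidden, 1));      b2 = np.zeros(1)

x = rng.normal(0, 1.0, (4, dim_in))   # a mini-batch of 4 documents
probs = shallow_forward(x, w1, b1, w2, b2)
```

In practice these networks are trained with Keras; the point of the sketch is only the depth: a single small hidden layer between the feature vector and the prediction.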
After analyzing the performance of the evaluated feature sets, we observe that the fine-tuned embeddings from BETO (BF) outperform the rest of the feature sets, achieving a significant increase of 4-5% over the second-best feature set (WE). In relation to the reliability of LF, they achieve competitive results in all datasets except HaterNet, in which the recall of the hateful category is limited. Finally, with respect to the neural network architecture, we observe that shallow neural networks with few neurons and few hidden layers perform better than deep neural networks.
RQ2. How can features be integrated for more robust systems?
To answer this research question, we evaluate two different strategies to combine the feature sets: on the one hand, combining them within the same neural network through a knowledge integration strategy (see Sect. “Knowledge integration strategy”) and, on the other hand, combining the results of the best model for each feature set using ensemble learning (see Sect. “Ensemble learning”).
Knowledge integration strategy
To evaluate the knowledge integration strategy, we combine the features in the same neural network and perform the hyperparameter optimisation again. For this, we use the Keras functional API to develop a multi-input neural network. Each feature set is used as an input and connected to a dense layer. Then, all features are combined to produce the final prediction. Table 7 shows the results achieved by the linguistic features and the best embedding (LF, BF), all the embeddings (SE, WE, and BF), and all features (LF, SE, WE, BF).
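A minimal sketch of this multi-input design, mimicking the Keras functional API in plain NumPy: each feature set feeds its own dense branch, and the branches are concatenated before the output layer. All dimensions below are hypothetical, chosen only for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)

# Hypothetical feature-set dimensions (illustrative, not the paper's values).
dims = {"LF": 120, "SE": 512, "WE": 300, "BF": 768}
batch = 4
inputs = {name: rng.normal(size=(batch, d)) for name, d in dims.items()}

# One dense branch per feature set (32 units each), then concatenate and predict.
branches = {name: (rng.normal(0, 0.1, (d, 32)), np.zeros(32))
            for name, d in dims.items()}
merged = np.concatenate(
    [relu(inputs[n] @ branches[n][0] + branches[n][1]) for n in dims], axis=1)
w_out, b_out = rng.normal(0, 0.1, (merged.shape[1], 1)), np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(merged @ w_out + b_out)))
```

The design choice is that each feature set gets its own learned projection before merging, so heterogeneous inputs (hand-crafted linguistic features alongside dense embeddings) are brought to comparable representations before the final classifier sees them.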
For the Spanish MisoCorpus 2020, the best result is achieved with the combination of the embeddings (SE, WE, and BF), with a weighted F1-score of 90.52%. However, the results of the combination of LF and BF as well as the combination of all features (LF, SE, WE, BF) are very similar: weighted F1-scores of 90.40% and 90.44%, respectively. In addition, it can be seen that precision and recall become more stable with the feature combinations than with the feature sets separately (see Sect. “RQ1. Which individual features are most effective for hate-speech detection?”). In relation to AMI 2018, the best result is achieved by combining all the feature sets, with a weighted F1-score of 84.72%. In this case, the other combinations also obtain similar results: 83.27% for LF and BF, and 83.39% for SE, WE, and BF. HaterNet also obtains its best result with the combination of all features (84.08%). Regarding the last dataset, HatEval 2019, the best overall result is achieved with the combination of LF and BF, with a weighted F1-score of 77.27%.
Next, we focus on evaluating the network complexity. Table 8 lists the network architecture and the hyperparameters per feature set combination. The combination of different feature sets within the same network achieves better results with simpler neural networks, except for AMI 2018. Although the number of hidden layers is similar, the number of neurons is considerably higher, again except for AMI 2018. For the learning rate, we can observe that larger learning rates (10^−3) behave better regardless of the feature set combination for the Spanish MisoCorpus 2020, AMI 2018, and HaterNet datasets. However, smaller learning rates (10^−4) obtain better results in HatEval 2019 with the combination of all the embeddings (SE, WE, and BF) and all the features (LF, SE, WE, BF). Regarding the activation function, it can be observed that the Spanish MisoCorpus 2020 and HaterNet perform better with no activation function.
Ensemble learning
To evaluate the ensemble learning strategy, we exploit the best model per feature set developed for answering RQ1 (see Sect. “RQ1. Which individual features are most effective for hate-speech detection?”). Next, we generate a new prediction based on different approaches. First, using the hard voting strategy, which consists of taking the majority vote of the models in the ensemble. Second, based on the prediction of the model with the highest probability in the output layer. Third, by computing the average of the predictions of the last layer. Fourth, by training a logistic regression classifier that learns to predict whether a text is hate speech based on the probabilities output by each model separately. We experiment with three combination sets: (1) linguistic features and the fine-tuned embeddings from BETO (LF, BF); (2) all the embeddings (SE, WE, and BF); and (3) linguistic features and all the embeddings (LF, SE, WE, BF). Due to the large number of datasets and strategies, we split the results across the following tables: (1) Table 9 for the Spanish MisoCorpus 2020, (2) Table 10 for AMI 2018, (3) Table 11 for HaterNet, and (4) Table 12 for HatEval 2019.
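The four combination approaches can be sketched on toy probabilities as follows; the per-model scores and labels below are invented for illustration (in our experiments the inputs are the output-layer probabilities of the RQ1 models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy hate-speech probabilities for 5 documents from 3 base models
# (e.g. LF, WE, BF); all values are illustrative.
probs = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.4, 0.1],
    [0.6, 0.3, 0.8],
    [0.4, 0.6, 0.7],
    [0.1, 0.2, 0.3],
])
y_true = np.array([1, 0, 1, 1, 0])

# 1) Hard voting: majority of the binarised model decisions.
hard_vote = ((probs >= 0.5).sum(axis=1) > probs.shape[1] / 2).astype(int)
# 2) Highest probability: hate speech if any model outputs > 0.5.
highest = (probs.max(axis=1) >= 0.5).astype(int)
# 3) Average of the output-layer probabilities.
average = (probs.mean(axis=1) >= 0.5).astype(int)
# 4) Stacking: a logistic regression learns from the base probabilities
#    (fitted here on the same toy data; in practice, on held-out predictions).
stacker = LogisticRegression().fit(probs, y_true)
stacked = stacker.predict(probs)
```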
Regarding the Spanish MisoCorpus 2020 (see Table 9), the best overall result is obtained with the logistic regression strategy and all the feature sets, reaching a weighted F1-score of 90.14%. We can observe that this strategy also obtains very good results with the other feature combinations evaluated. However, it is noteworthy that the highest probability strategy obtains the best precision for misogyny identification, but at a considerable sacrifice in recall. This strategy considers a text as hate speech when any of its classifiers outputs a probability higher than 50%. Given the high precision achieved, we conclude that this strategy is reliable for predicting a text as misogynous. However, the low recall indicates that there are many false negatives (FN). Therefore, it would be necessary to tune the threshold of this strategy in order to adjust the trade-off between precision and recall. On the other hand, the ensemble based on averaging probabilities performs slightly worse than the ensemble based on logistic regression, outperforming it only when combining LF and BF. Finally, the ensemble learning based on the hard voting strategy penalizes the models with LF, so the ensemble of SE, WE, and BF obtains better results. When comparing these results to those achieved with the knowledge integration strategy (see Table 7), we can observe that the knowledge integration strategy improves on the best ensemble learning strategy (logistic regression) with all the combinations: (1) 90.40% vs 89.74% with LF and BF, (2) 90.52% vs 89.96% with SE, WE, and BF; and (3) 90.44% vs 90.14% when combining all the features. However, the network architecture is more complex.
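To illustrate the threshold tuning suggested above, the following sketch sweeps the decision threshold of a highest-probability ensemble over invented scores, showing how precision rises while recall falls as the threshold grows:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (hateful) class."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Illustrative max-probability scores across the ensemble, with toy labels.
scores = np.array([0.95, 0.55, 0.80, 0.45, 0.30, 0.90, 0.60, 0.20])
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])

for thr in (0.5, 0.7, 0.9):
    y_pred = (scores >= thr).astype(int)
    print(thr, precision_recall(y_true, y_pred))
```

On this toy data, raising the threshold from 0.5 to 0.9 pushes precision to 1.0 while recall drops, mirroring the precision/recall behaviour observed for the highest probability strategy.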
Concerning AMI 2018 (see Table 10), the best ensemble learning result is provided by the strategy based on averaging the probabilities of LF and BF, with a weighted F1-score of 83.30%. As we observed when analyzing the Spanish MisoCorpus 2020 (see Table 9), the precision of the misogyny label with the ensemble based on the highest probability is high, reaching 92.90% when combining all feature sets, but with a great loss in recall (53.50%). In line with the Spanish MisoCorpus 2020, the highest probability strategy with LF and BF achieves a reliable precision (87.10%), but with a drop in recall (69.90%). In contrast to the Spanish MisoCorpus 2020, however, the ensembles based on averaging probabilities obtain better results than the logistic regression strategy.
As for HaterNet (see Table 11), the best result corresponds to the logistic regression strategy combining the features based on embeddings (SE, WE, and BF). This model achieves a weighted F1-score of 84.34%. The same strategy combining all the features (LF, SE, WE, and BF) obtains a slightly worse weighted F1-score (84.25%), and 83.16% when combining LF and BF. When comparing the logistic regression strategy with the average probabilities strategy, we can observe that the weighted F1-scores are similar, but there are important differences between the precision and recall values of the hateful class. These differences were not observed in the Spanish MisoCorpus 2020 (see Table 9), AMI 2018 (see Table 10), or HatEval 2019 (see Table 12). Regarding the strategy based on the highest probability, it can be observed that the combination of all feature sets (LF, SE, WE, and BF) achieves perfect precision, but identifies only 12.60% of the hateful instances. As we observed in the other datasets, the highest probability strategy using LF and BF achieves high precision (87.10%) but limited recall (69.90%). With respect to the hard voting strategy, it reaches lower results for the hateful label, but similar precision and recall in all the ensembles with LF (67.30% precision and 65.40% recall for LF and BF, and 69.66% precision and 60.19% recall when LF is combined with the rest of the features).
With respect to HatEval 2019 (see Table 12), the best overall result is reached with the logistic regression strategy of LF and BF, with a weighted F1-score of 76.66%. Regarding the ensemble learning based on the hard voting strategy, we can observe that the precision in identifying the hateful label is limited compared with the rest of the ensemble strategies. The highest probability strategy achieves higher precision than the rest of the strategies, in line with what we observed for the other datasets. However, the imbalance between precision and recall is not as pronounced as in the Spanish MisoCorpus 2020, AMI 2018, and HaterNet. As we are able to discern which HatEval tweets focus on women and which on immigrants, we analyzed the results separately. We observed that the subset of HatEval 2019 focused on misogyny achieves higher precision for the hate-speech label but limited recall, regardless of the ensemble strategy. On the other hand, the analysis of the split focused on migrants suggests the opposite: lower precision but higher recall. However, as the subset of HatEval 2019 targeting women shares documents with AMI 2018, a more thorough analysis of these differences would be necessary.
For the combination of the features, we evaluated two strategies, one consisting of knowledge integration and the other of ensemble learning with multiple criteria. We found that the results obtained with knowledge integration are, in general, superior to those achieved with ensemble learning, although the difference is not large. However, knowledge integration requires more complex neural networks, with more neurons than the best models of each feature set trained independently.
Concerning the ensemble learning study, the highest probability strategy achieves the best precision for the misogynous and hateful classes in all the datasets. However, this comes at a cost in recall. We observe this especially with the HaterNet dataset, in which we obtained perfect precision but a recall of 12.60%. For systems in which precision is more important than recall, we recommend focusing on the highest probability strategy but selecting fewer feature sets, as this yields a better F1-score. In general, we can say that the strategies that provide competitive results regardless of the dataset are knowledge integration and ensemble learning based on logistic regression using LF and BF as features.
RQ3. Is it possible to characterize the language of the different hate-speech types by means of explainable linguistic features?
To address this research question, we compute the mutual information between each linguistic feature and the target label. In order to observe how linguistic features from different categories contribute to the identification of hate speech, we rank those features and organize them in groups of five per category. Figures 2, 3, 4, and 5 represent this classification for the Spanish MisoCorpus 2020, AMI 2018, HaterNet, and HatEval 2019 datasets, respectively. Note that some categories, such as semantics, contain fewer than five features.
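A minimal sketch of this ranking procedure using scikit-learn's mutual_info_classif on synthetic data (the two features below are invented; in our study the inputs are the linguistic feature values and the hate-speech labels):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                 # binary hate-speech label
informative = y + rng.normal(0, 0.3, n)   # feature correlated with the label
noise = rng.normal(0, 1.0, n)             # uninformative feature
X = np.column_stack([informative, noise])

# Mutual information of each feature with the label, then rank descending.
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]
```

The informative feature receives a clearly higher score and ranks first, which is the same criterion used to select the top five features per linguistic category in Figs. 2 to 5.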
Regarding the Spanish MisoCorpus 2020 (see Fig. 2), we can observe that the categories related to register are the most discriminatory. Register includes the usage of strong offensive speech, swear words, colloquialisms and, to a lesser extent, non-fluent speech. Correction and style is another relevant category, highlighting features related to orthographic errors and misspelled words. Concerning morphosyntax, features related to grammatically feminine words, nouns, prepositions, and suffixes are strongly discriminatory. Stylometry and social media are also effective for misogyny identification, including mentions or replies that involve accounts with female names. The usage of hashtags and social media jargon also stands out. As for stylometry, we can observe that features related to readability, quotations, and the length of the text contribute to some extent to the identification of misogynistic messages. With respect to pragmatics, discourse markers used to argue, structure the text, or add information are relevant features. The usage of similes in figurative language and of courtesy forms to introduce oneself in the conversation also appear as relevant. The usage of negations appears both in misogynous and non-misogynous documents, the most relevant cues being ni...ni, no...no, nunca...nadie, sin...ni, and casi nadie.
Next, we analyze the correlation between the linguistic features and the AMI 2018 dataset (see Fig. 3). The analysis reveals that linguistic features within the register category, and specifically those related to offensive speech, are the strongest discriminatory features. AMI 2018 and the Spanish MisoCorpus 2020, which both focus on misogyny identification, share this trait. Correction and style, however, is less relevant in AMI 2018 than in the Spanish MisoCorpus 2020: in AMI 2018 there are differences in the number of misspelled words between the misogynous and non-misogynous classes, but we observe no differences in orthographic errors. Regarding morphosyntax, we can observe that verbs in the simple subjunctive or in the singular are discriminatory features, as well as words in the masculine and nominal suffixes. The lexical category is also relevant, but it differs from the Spanish MisoCorpus 2020. In AMI 2018, the most relevant topics are related to animals, female and male persons and groups, and topics related to sex and risk. In the Spanish MisoCorpus 2020, however, the relevant topics are related to locations and organizations, and also to analytic thinking and tentativeness. Only sex topics appear in both datasets among the most discriminating features. This finding suggests that the context in which the tweets were collected can play a relevant role. In AMI 2018, the documents were compiled using the following strategies: (1) using representative offensive words, (2) observing accounts of potential victims, and (3) observing people who explicitly declared their hatred against women. The Spanish MisoCorpus 2020 shares two of the three strategies mentioned, not taking misogynist accounts into account but paying more attention to certain events, such as the arrival of Greta Thunberg in Madrid for the UN Climate Change Conference, or a case of rape of a minor that occurred in Spain related to a local soccer team.
Focusing on those events may force the relevance of lexical features related to locations and organizations. It is surprising, however, that in the Spanish MisoCorpus 2020, animals is not a relevant feature for misogyny identification (in Spanish, the names of some female animals are commonly used as misogynistic insults). In the same line, the usage of male and female groups of persons is a relevant feature in AMI 2018 but not in the MisoCorpus. This fact suggests that those terms can appear in both misogynous and non-misogynous texts, so they are not good indicators of misogynous content. Another relevant difference between AMI 2018 and the Spanish MisoCorpus 2020 concerns social media, which has a major impact on the Spanish MisoCorpus 2020. With respect to pragmatics, the mutual information in AMI 2018 suggests that figurative language plays an important role in discerning misogynistic messages through the usage of metaphors and understatements.
Similar to the Spanish MisoCorpus 2020 and AMI 2018, we can observe from HaterNet (see Fig. 4) that offensive speech (register) is the most discriminatory feature. However, in HaterNet the presence of swear words, SMS language, and cultisms also appears as a relevant feature. With regard to the rest of the categories, all features behave similarly. We note common performance errors for the correction and style category, and topics related to clothes and the body for the lexical category. Intransitive verbs, as well as verbs in the indicative (simple or compound), impersonal pronouns, and articles are also relevant features within the morphosyntactic category. Regarding pragmatics, the discourse markers related to reformulation and argumentation, as well as figurative language related to similes and idioms, are relevant for hate-speech detection. In addition, we can observe that social media jargon and the usage of hyperlinks can be useful for this dataset.
Concerning HatEval 2019 (see Fig. 5), we can confirm the importance of offensive language (register), as this feature shows similar behavior in all datasets. It is worth mentioning that this analysis is biased because some of the documents from the misogyny split of HatEval 2019 also appear in AMI 2018. Consequently, we rank the linguistic features with information gain, but only on the subset of the HatEval 2019 dataset regarding immigrants (not shown). The mutual information on the immigrants split also indicates that strong offensive speech is a relevant feature, together with the usage of colloquialisms and softer offensive language. Regarding the lexical category, the linguistic features are similar to those that appear for the AMI 2018 dataset, including sex, common names referring to women or groups of women, inclusive language, and exploration. When we remove from our analysis the documents targeting women and focus only on hate speech towards immigrants, we observe topics concerning sex, home, friendship, perceptual processes, and discrepancies. Some negation cues also appear as discriminatory features, including nada más, jamás, casi nadie, ni...ni, and tampoco...tan. In fact, HatEval 2019 is the only dataset of those studied in which the total number of negations appears as a discriminatory feature, although with little impact on the identification of hate speech. In terms of morphosyntax, we can observe that suffixes, including adjectivizers and nominals, as well as prepositions and copulative verbs, are discriminatory features. With respect to pragmatics, we can observe connectors, reformulators, and words and expressions used to order the clauses. This fact suggests that a reflection, discussion, and/or debate occurs when women or immigrants are addressed within a conversation in a hate-speech context. Regarding psycho-linguistic processes, those related to negative sentiments, and especially anger, are discriminatory features.
We can also observe the presence of negative emojis. Concerning social media usage, we observe that replying to females is a discriminatory feature. We analyzed whether this also holds in the subsets of HatEval 2019 and noticed that this feature is especially relevant to the misogyny subset. In fact, replying to males is also a relevant feature in this context.
The analysis of the interpretability of the features leads to the following findings:
We observe common traits in all datasets regarding the register category, such as the features related to hard offensive speech. Moreover, swear words and colloquialisms also appear as discriminatory features, but to different degrees.
Regarding misogyny identification, we note that the percentage of misspelled words is relevant for the Spanish MisoCorpus 2020 and AMI 2018 datasets. This finding does not appear in HaterNet, and is less marked in HatEval 2019, which largely shares documents with AMI 2018.
Pragmatics and, specifically, discourse markers appear frequently as discriminatory features. We observe that these features are more frequent in either the hate-speech or the non hate-speech class depending on the dataset. We notice that argumentative markers are more common in non-misogynous texts in the Spanish MisoCorpus 2020, but more common in misogynous texts in AMI 2018. Connectors used to state a consequence are more common in non-hateful documents, as are discourse markers used for structuring the text.
Linguistic features concerning the usage of social media show different behavior in the two corpora related to misogyny. The usage of mentions, hyperlinks, hashtags, and specific jargon appears as a relevant feature in the Spanish MisoCorpus 2020. However, social media features are not relevant in AMI 2018. HaterNet and HatEval 2019 show intermediate, but not relevant, values.
Topics are not shared between the datasets focused on misogyny. We observe a strong presence of topics related to locations, organizations, and analytic thinking in the Spanish MisoCorpus 2020, whereas in AMI 2018 the topics are more related to animals (as the names of some female animal species are common insults in Spanish), male and female social groups, and risk.
The usage of negations is not a discriminatory feature for hate-speech identification. We conducted an in-depth analysis of a total of 121 negations, including simple, continuous, and discontinuous cues. However, the only dataset in which these features appear to be relevant is HatEval 2019, with more statements containing negations in the hateful class.
RQ4. Do our methods improve the results of the state-of-the-art?
To address this research question, we compare our results with the best state-of-the-art results obtained for each particular dataset. Specifically, we compare our two best strategies, consisting of knowledge integration and ensemble learning based on logistic regression of LF and BF, with the best approaches of the state-of-the-art. These models are selected because they achieve competitive results regardless of the dataset. It is worth mentioning the limitations of this comparison. First, for HaterNet and the Spanish MisoCorpus 2020, the original results were evaluated using ten-fold cross validation, whereas our approach uses the test set. Second, the results described in  regarding HaterNet use a training-test split that is neither the same split as ours nor the one used in the original experiment by the authors, since they did not release the splits. Third, not all shared tasks and research works focus on the same metrics: those focusing on misogyny use accuracy, HaterNet compares with the F1-score of the hateful label, and HatEval 2019 with the macro F1-score. Accordingly, we have included in Table 13 all the metrics and all the available results.
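For reference, the three metrics involved in the comparison can be computed with scikit-learn as follows (the labels below are toy values, not results from Table 13):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions to illustrate the metrics.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)                       # misogyny shared tasks
f1_hateful = f1_score(y_true, y_pred, pos_label=1)         # HaterNet comparison
f1_macro = f1_score(y_true, y_pred, average="macro")       # HatEval 2019
f1_weighted = f1_score(y_true, y_pred, average="weighted") # used in our tables
```

Note that the macro F1-score averages the per-class F1-scores without regard to class frequency, whereas the weighted F1-score weights them by class support, which is why the two can diverge on imbalanced hate-speech datasets.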
When comparing the results for the Spanish MisoCorpus 2020, we can observe that our proposal, grounded on the usage of linguistic features and transformers, outperforms the accuracy achieved in , improving from 85.2% to 90.4% with knowledge integration and to 89.7% with ensemble learning based on logistic regression. It should be noted that, to the best of our knowledge, this dataset has not been evaluated in other research works, so the conclusions are limited.
Regarding AMI 2018, the best result obtained during the shared task was an accuracy of 81.4681% by , slightly outperformed in  with 81.5217%. These results were achieved using Support Vector Machines and similar feature strategies. The results reported by our systems outperform both, but not significantly: our proposal based on knowledge integration achieves an accuracy of 83.3%, and the ensemble learning based on logistic regression 82.5%. Although our results are the best we are aware of, we consider that the novelty of the transformer-based models employed should have improved the state-of-the-art results even further.
Regarding HaterNet, we focus on the F1-score of the hateful class. In the original experiment with this dataset , the authors achieved an F1-score for the hateful label of 61.1%. This result was outperformed by  with their proposal based on BETO, with 65.8%. Our proposal based on linguistic features with a knowledge integration strategy slightly outperforms these results, achieving an F1-score of 65.9% for the hateful label, but the results are superior when applying the ensemble learning based on logistic regression, with 68.3%.
Finally, to compare on HatEval 2019 we rely on the macro F1-score. During the competition, the best results were achieved by [45, 56], both with an accuracy of 73%. These results were outperformed by  and  with macro F1-scores of 75.4% and 75.5%, respectively. As we observed for AMI 2018, the results of our proposal slightly outperform these: a macro F1-score of 76.8% with knowledge integration of LF and BF, and 76.5% with ensemble learning based on logistic regression.
Taking into account these comparisons, we can conclude that our methods outperform the previous state-of-the-art results on all four datasets.