Introduction

With the rise of social network popularity, hate speech phenomena have significantly increased [22]. Hate speech not only harms both minority groups and the whole society, but it can also lead to actual crimes [3]. Thus, (automated) hate speech detection mechanisms are urgently needed. However, falsely accusing people of hate speech is also a problem. Many content providers rely on human moderators to reliably decide whether a given text is offensive or not, but this is a mundane and stressful job which can even cause post-traumatic stress disorder (Footnote 1). There have been many attempts to automate the detection of hate speech in social media using machine learning, but existing models lack a quantification of the reliability of their decisions.

In the last few years, recurrent neural networks (RNNs) were the most popular choice for text classification. Long Short-Term Memory (LSTM) networks, the most successful RNN architecture, were already successfully adapted for the assessment of predictive reliability in hate speech classification [7]. Recently, a neural network architecture with attention layers, called the ‘transformer architecture’ [6], has shown even better performance on almost all language processing tasks. Using transformer networks for masked language modeling produced breakthrough pretrained models, such as BERT (Bidirectional Encoder Representations from Transformers) [43]. The attention mechanism, which is a crucial part of transformer networks, became an essential part of natural language understanding with a significant impact on language applications. We aim to investigate the behavior of the attention mechanism concerning the reliability of predictions. We focus on the hate speech recognition task.

In hate speech detection, reliable predictions are needed to remove harmful content and possibly ban malicious users without harming the freedom of speech [7]. Standard neural networks are inadequate for the assessment of predictive uncertainty, and the best solution is to use the Bayesian inference framework. However, classical Bayesian inference techniques do not scale well in neural networks with high dimensional parameter spaces [8]. Various methods were proposed to overcome this problem [9]. One of the most efficient methods is called Monte Carlo Dropout (MCD) [12]. Its idea is to use dropout in neural networks as a regularization technique [13] and to interpret it at prediction time as approximate Bayesian inference that takes samples from the approximate posterior distribution.

Several authors have shown that emotional information [56] extracted from a text can improve the performance of lexical approaches and standard machine learning algorithms [1, 2, 21, 28]. The role and utility of emotional information in deep learning have not yet been established; besides, we still have only a limited understanding of the emotions in text. A series of computational models that bridge the gap between computational analysis and the human emotional perspective has evolved in a domain known as ‘Sentic Computing’ [54]. The computational initiative, named ‘SenticNet’, combines knowledge from psycholinguists, neuroscientists, and computer scientists to better understand emotions in text. We used information on affective dimensions provided by SenticNet, together with the outputs of the state-of-the-art contextual language model BERT [43], enhanced with a reliability estimation mechanism based on MCD, as input for a hate speech classifier. Concerning emotions, we follow two goals in this work: i) to test the predictive performance of emotion-enhanced BERT models in hate speech detection, and ii) to better understand the role of emotions in hate speech.

Our main contributions are:

1. We present a novel methodology for the assessment of prediction uncertainty in attention networks and in BERT models.

2. An empirical analysis of the proposed Bayesian attention networks (BANs) and MCD-enhanced BERT models shows improved calibration and prediction performance on hate speech detection tasks in several languages.

3. We combine contextual and reliability information obtained from MCD BERT with sentiment-related knowledge provided by SenticNet.

4. We demonstrate novel visualizations of prediction uncertainty for individual instances, as well as for groups of instances.

The paper consists of six more sections. In Section 2, we present related works on prediction uncertainty, hate speech detection and its relationship with sentiment analysis. In Section 3, we propose the methodology for uncertainty assessment in transformer networks using attention layers and MCD, while in Section 4, we analyze the calibration of predictions. Section 5 presents the datasets and the evaluation scenario. The obtained results are presented in Section 6, followed by conclusions and ideas for further work in Section 7.

Related Work

We present the related work categorized into four areas. In Section 2.1, we introduce work done on hate speech detection, followed by the related research on transformer architecture for text classification in Section 2.2. In Section 2.3, we describe existing approaches for the assessment of uncertainty in text classification. Finally, in Section 2.4, we relate hate speech detection with the particularities of sentic computing.

Hate Speech Detection

Analyzing sentiments and extracting emotions from texts are very useful natural language processing (NLP) applications. With the rise of social media popularity, hate speech detection has become highly needed. Hate speech is defined as written or oral communication that abuses or threatens a specific group or target [15].

Detecting abusive language for less-resourced languages is complex, and has inspired research in multilingual and cross-lingual methods [16]. These methods are especially useful when the involved languages are morphologically or geographically close [18]. In our work, we investigate hate speech detection methods for English, Croatian, and Slovene languages. The English language is well-resourced and researched [19, 22, 24]. Recently, hate speech detection studies appeared for Croatian [25, 27, 29] and Slovene [31, 33, 34].

Hate speech detection is mostly treated as a binary text classification problem. In the past, the most frequently used classifier was the Support Vector Machines (SVM) method [37]. However, deep neural networks are now the dominant technique, first through RNNs [38], and recently using pre-trained transformer networks [39, 40]. In this work, we analyze the state-of-the-art pre-trained transformer network, the (multilingual) BERT model.

Attention Networks for Text Classification

The attention mechanism is a key component of the transformer architecture proposed by [6]. Due to its power and suitability for parallelization, this architecture soon replaced LSTM networks for many NLP tasks. Recently, large pre-trained transformer models have been investigated in the context of text classification tasks. For example, [11] trained both multiplicative LSTM (mLSTM) and transformer language models on a large 40 GB text dataset [42] and transferred those models to binary and multi-class text classification problems. They concluded that the transformer model outperforms the mLSTM model, especially when fine-tuned for multidimensional emotion classification.

The BERT model [43] uses the transformer architecture and large text corpora to learn the masked language modeling and next-sentence prediction tasks. BERT and its follow-ups are able to learn and extract many language characteristics (both syntactic and semantic) and excel at many text classification tasks. Despite the short time since its conception, BERT has already attracted enormous attention from the NLP community. Hundreds of research groups research it extensively; see a recent overview by [23]. Practical guidelines on how to fine-tune the BERT model for text classification were compiled by [41].

A multilingual hierarchical attention mechanism for document classification was investigated by several authors [44,45,46]. However, different attention layers of large pre-trained models have not been tested separately or in the context of prediction reliability. Also, to the best of our knowledge, the predictive reliability of BERT outputs has not been investigated yet.

Prediction Uncertainty for Text Classification

While recent works on classification reliability mostly investigate deep neural networks, many other probabilistic classifiers were analyzed in the past [10]. For example, [30] explores the probabilistic properties of SVM predictions.

Prediction uncertainty is an important issue for black-box models like neural networks, as they do not provide interpretability or reliability information about their predictions. Most reliability scores for deep neural networks are based on a Bayesian framework. The most popular exception is the work of [26], who proposed using deep ensembles to estimate the prediction uncertainty.

An efficient approach to reliability assessment in neural networks is to mimic Bayesian inference using MCD [12]. The dropout technique was first introduced to RNNs in 2013 [47], but further research revealed a negative impact of dropout in RNNs [4]. Later, dropout was successfully applied to language modeling by [69], who applied it only to fully connected layers. [5] implemented variational inference based dropout, which can also regularize recurrent layers, and additionally provided a solution for dropout within word embeddings. The method mimics Bayesian inference by combining a probabilistic parameter interpretation with deep RNNs. The authors introduced the idea of augmenting probabilistic RNN models with prediction uncertainty estimation. Several other works investigate how to estimate prediction uncertainty using RNNs [48], e.g., Bayes by Backpropagation (BBB) [32].

Recently, a fast and scalable method called ‘SWAG’ was proposed by [36]. The main idea of this method is to randomize the learning rate and interpret it as sampling from a Gaussian distribution. SWAG fits the Gaussian distribution by capturing the Stochastic Weight Averaging (SWA) mean and a co-variance matrix, representing the first two moments of stochastic gradient descent iterations. In contrast to SWAG, we use the Gaussian distribution as a posterior over neural network weights and then perform Bayesian model averaging for uncertainty estimation and calibration.

MCD was recently used within several models and different architectures to obtain prediction uncertainty and improve classification results [49,50,51]. Transformer networks, however, have not yet been analyzed in this context.

Sentic Computing

Sentiments and emotions play an essential role in hate and offensive speech, and have been used successfully in their automatic detection. [28] used the eight basic emotions from Plutchik’s model [52], the positive and negative sentiment polarities, an indicator of the presence of a word in the Hatebase lexicon (Footnote 2), and the intensity of the anger emotion. Their combination of lexicon-based and machine learning approaches successfully predicted hate speech and showed a high utility of emotional features. Alorainy et al. [1] applied emotional analysis to suspended Twitter accounts and discovered that they contain more disgust, negative sentiment, fear, and sadness than active accounts. Using this information for hate speech detection, their machine learning models showed improved performance. [21] also used the eight basic emotions in their emotional analysis and showed that emotions can improve the clustering of Facebook posts. Finally, [2] used several different groups of features (linguistic, sentiment, and Twitter-specific features such as hashtags and a profanity lexicon) to predict hate speech. Interestingly, their results show that Twitter-specific features are the most successful, and the additional sentiment features do not improve predictive performance. All the methods mentioned above use either classical machine learning approaches, such as SVM, Naive Bayes, logistic regression, and random forest, or RNNs, such as LSTMs.

To advance approaches based on lexical keywords and frequency statistics, [54] proposed a framework for emotional computing called ‘SenticNet’ that captures semantics and latent emotional information by relying on the implicit meaning associated with commonsense concepts. The original emotion categorization model, called the Hourglass of Emotions [53], was supported by the SenticNet 4 framework [54], while its newer revised version [55] is used in the SenticNet 6 framework [17]. These models are biologically inspired and psychologically motivated. Each of the two models is based on four independent but concomitant affective dimensions, which can be combined to build more complex emotions. Based on this, the SenticNet framework can describe and explain emotional experiences by decomposing text into its underlying sentiments.

The SenticNet framework has been successfully used in sentiment classification problems. Sentic LSTM [35] integrates explicit emotional information with LSTM networks by adding a recurrent additive network that simulates sentic patterns. The recent SenticNet 6 framework [17] combines top-down and bottom-up knowledge representation: in the top-down direction, it encodes meaning using symbolic models (logic and semantic networks); in the bottom-up direction, it learns syntactic patterns from data using subsymbolic methods (biLSTM and BERT). The authors report state-of-the-art results for sentiment analysis.

In our work, we use the transformer architecture that can extract highly relevant information from texts. Concerning emotions, the question we investigate is whether adding emotional information to the distribution of predictions can improve the performance of hate speech detection. This question is particularly relevant for the current state-of-the-art BERT model [43], which is known to capture a plethora of language information, such as part-of-speech tags, dependency structure, and sentiment.

Bayesian Attention Networks

The BERT model [43] is a transformer network that has achieved state-of-the-art results in many NLP tasks, including text classification [58,59,60]. In this work, we introduce Monte Carlo Dropout to transformer networks and BERT to construct their Bayesian variants. An analysis of different amounts of dropout, different variants of BERT modifications, and their hyper-parameters would require pretraining and fine-tuning several different BERT models, which would demand substantial computational resources. For example, pretraining a single BERT model on four Tensor Processing Units (TPUs) requires more than a month of computational time. Thus, in this work, we explore two reliability extensions: i) reliability of the encoder part of the BERT architecture trained from scratch (without pretraining) on the task of interest (we refer to these models as attention networks), and ii) reliability of pre-trained BERT models, using only fine-tuning. We believe this is a reasonable setting which sheds light on an important reliability aspect of transformer networks.

In Section 3.1, we first formally define the attention network architecture, and in Section 3.2, we make it Bayesian by introducing MCD. Finally, in Section 3.3, we describe how the MCD principle can be employed in already pre-trained BERT models.

Attention Networks

The basic architecture of the attention network follows the architecture of transformer networks [6] and is shown in Fig. 1.

Fig. 1: A scheme of Attention Networks. The dropout is introduced in the blue colored layers.

The proposed architecture is similar to the encoder part of the transformer architecture. The difference is in the output part, where a single output head was added to perform binary classification using the sigmoid activation function. The main difference to BERT, which also uses just the encoder part of the transformer network, is that we do not use any pretraining. The second difference is that the attention network uses a classification head, while BERT has a language model head. In both cases, the output is composed of feed-forward layers followed by a nonlinearity, but with different dimensions. By not relying on pretraining, we are much more flexible concerning the number of layers and the number of neurons in each layer. For our tasks, we use orders of magnitude fewer parameters, e.g., a maximum of 3 million parameters (at the expense of losing the information from pretraining). The architecture can contain many attention heads, where a single attention head is computed as:

$$\begin{aligned} o_h = \text{softmax}\left(\frac{\varvec{Q} \cdot \varvec{K}^T}{\sqrt{d_k}}\right)\cdot \varvec{V}. \end{aligned}$$

The attention matrices are commonly known as the query \(\varvec{Q}\), the key \(\varvec{K}\), and the value matrix \(\varvec{V}\). The normalizing factor, \(d_k\), denotes the dimensionality of keys. The attention function can be described as mapping the query and the set of key-value pairs to the output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values. The weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
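To make this computation concrete, the following minimal PyTorch sketch implements a single attention head as defined above. The tensor names and dimensions are illustrative assumptions, not the exact implementation used in our models.

```python
import torch
import torch.nn.functional as F

def single_attention_head(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V: tensors of shape (sequence_length, d_k); in practice a batch
    dimension is added and Q, K, V come from learned linear projections
    of the token embeddings.
    """
    d_k = K.size(-1)
    # Compatibility scores between every query and every key.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax turns each row of scores into attention weights.
    weights = F.softmax(scores, dim=-1)
    # The output is a weighted sum of the value vectors.
    return weights @ V

# Example: 5 tokens, key/value dimension 8.
Q, K, V = (torch.randn(5, 8) for _ in range(3))
o_h = single_attention_head(Q, K, V)   # shape (5, 8)
```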

Intuitively, the multiplication of query and key vectors with subsequent values can be understood as the extraction of relations. The softmax activation enables each pair of considered input tokens to be represented with a single real value. It effectively introduces sparseness into the weight space – only certain token pairs emerge with high weights and are relevant for the remaining part of the considered neural network architecture. In practice, multiple such heads can be concatenated and fed into the succeeding feed-forward layer. The application of softmax has been shown to emphasize only particular parts of the parameter space, thereby making the neural network more focused.

The positional encoding, as discussed in [6], represents a matrix that encodes individual positions in a matrix of the same dimensionality as the one holding the information on sequences (input embedding). The positional encoding was introduced to account for word order. Here, relative distances between different tokens are taken into account by incorporating the position-related signal into a given token representation.

While there are, in principle, many ways in which attention networks could be extended with a Bayesian approach, we propose to use the well-established MCD.

Monte Carlo Dropout for Attention Networks

In our proposal, called Bayesian Attention Networks (BAN), we use MCD within attention networks, but contrary to the original dropout setting, the dropout layers remain active also during the prediction phase. In this way, the predictions are not deterministic but are sampled from the learned distribution, thereby forming an ensemble of predictions. The obtained distribution can, for example, be inspected for higher moment properties and can offer additional information on the uncertainty of a given prediction. During the prediction phase, the dropout layers are activated again, and the output of a proportion of randomly selected neurons in those layers is set to zero. A forward pass through such a partially activated architecture is repeated a fixed number of times, each time dropping a different randomly selected set of neurons. The results of the different passes can be combined into the final prediction or further inspected as a probability distribution.
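A minimal sketch of this sampling procedure in PyTorch is shown below, assuming a binary classifier whose architecture contains nn.Dropout layers; the model, input, and number of samples are placeholders rather than our exact training setup.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=100):
    """Mean and variance of n_samples stochastic forward passes.

    Dropout layers are kept in training mode so that each pass drops a
    different random subset of neurons; all other layers stay in eval mode.
    """
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()          # keep dropout active at prediction time
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)
```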

Monte Carlo Dropout for BERT

MCD is used in the BERT model in the same way as in BAN. MCD can provide multiple predictions of a neural network at test time, as long as dropout was used during the training phase [61]. Training a neural network with dropout distributes the captured information across the network and makes the trained network robust during prediction. Using the dropout principle, a new prediction is obtained in each forward pass, and a sufficiently large set of such predictions can be used to estimate the prediction reliability. The BERT model is by default trained with 10% dropout in all of its layers and thus allows multiple predictions using the described principle. We call this model ‘MCD BERT’. A limitation of this approach is that a single dropout rate of 10% is used during training, while other dropout probabilities might be more suitable for reliability estimation. We leave this analysis for further work.
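The idea can be sketched with the HuggingFace transformers library as follows; the model name, the number of samples, and the use of model.train() to keep the built-in 10% dropout active at prediction time are illustrative assumptions, not the exact fine-tuning setup used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # in practice a fine-tuned checkpoint

def mcd_bert_scores(text, n_samples=1000):
    """Sample hate speech probabilities from BERT with active dropout."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    model.train()                       # training mode keeps dropout layers active
    with torch.no_grad():
        return [
            torch.softmax(model(**inputs).logits, dim=-1)[0, 1].item()
            for _ in range(n_samples)
        ]                               # mean = prediction, variance = uncertainty
```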

Calibration of Probabilistic Classifiers

The quality of reliability scores returned by probabilistic classifiers (such as BAN and MCD BERT) is assessed with calibration measures. A classifier is calibrated if its output scores are close to actual probabilities in the sense that a class predicted with the score p is correct with the actual probability p, i.e., in \(p \cdot 100\) percent of cases. Without special calibration approaches, most neural networks are overconfident and overestimate their probabilities. The calibration of a model can be visualized using a calibration plot, where the model’s prediction accuracy (true probabilities) is plotted against the predicted probabilities (i.e., output scores). Perfect calibration manifests itself as a diagonal in the calibration plot (see an example of a calibration plot in Fig. 6).

Since classifiers are typically not perfectly calibrated, we investigated different methods to improve the calibration of used neural networks. We compared several existing calibration methods with a novel approach that combines existing techniques with a method for threshold adaptation. In Section 4.1, we describe the existing calibration methods, followed by the proposed threshold adaptation in Section 4.2.

Existing Calibration Methods

We first formally describe how to obtain calibrated predictions from the reliability scores. Let (X, Y) be the input space, where X represents the set of predictive variables, and Y is the binary class variable (either 0 or 1). Let f be the predictor (e.g., a neural network) with \(f(X)= (\hat{Y},\hat{P})\), where \(\hat{Y}\) is the binary class prediction, and \(\hat{P}\) is its associated confidence score, i.e., the probability score of a correct prediction. The calibration of the model f is expressed as:

$$\begin{aligned} P(\hat{Y}=Y|\hat{P}=\hat{p})=p, \end{aligned}$$
(1)

where \(\hat{p}\) is the prediction score from the [0, 1] interval, obtained from the predictor f. We interpret this score as the probability of a specific outcome, assigned by the model f. Probability p is the model’s confidence or true probability that the model f predicts correctly. If a model predicts a certain outcome with a high probability, it is desirable that the confidence of this prediction being correct is also high. In the ideal case of perfect calibration, \(\hat{p} = p\).

Based on Equation (1), there are two ways to reduce the calibration error: either to obtain calibrated predictions \(\hat{p}\) or to manipulate the prediction threshold in such a way that the predicted outcome \(\hat{Y}\) is better calibrated. To assess the quality of the produced reliability scores, we compare them to the results of two calibration methods, Platt’s method and isotonic regression.

Platt’s method [30] learns two scalar parameters \(a, b \in \mathbb {R}\) in such a way that the prediction \(\hat{q} = \sigma (a \hat{p} + b)\) represents a calibrated probability for the predicted score \(\hat{p}\), where \(\sigma\) is the sigmoid function. To find good values of a and b, a separate calibration dataset is typically used. Isotonic regression is a non-parametric form of regression in which we assume that the function is chosen from the class of all isotonic (i.e., non-decreasing) functions [62]. Given the predictions \(\hat{p}\) from our classifier and the true target y, the calibrated prediction returned by isotonic regression is:

$$\hat{q} = m(\hat{p}) + \epsilon$$

where m is a non-decreasing function.
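A minimal scikit-learn sketch of both calibration methods is given below, assuming raw_scores are the uncalibrated prediction scores on a held-out calibration set and y_cal the corresponding labels (the toy values are illustrative only).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Uncalibrated scores p_hat on a calibration set and their true labels.
raw_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
y_cal = np.array([0, 0, 1, 1, 1, 0])

# Platt's method: fit a sigmoid of a*p_hat + b via logistic regression.
platt = LogisticRegression().fit(raw_scores.reshape(-1, 1), y_cal)
calibrated_platt = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a non-decreasing mapping m(p_hat).
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, y_cal)
calibrated_iso = iso.predict(raw_scores)
```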

Adaptive Threshold

We explored the adaptive threshold (AT), which we apply to classification with BANs. During learning, after each weight update phase, we assess the performance of the BAN. For each instance in the validation set, we perform multiple forward passes with unfrozen dropout layers and store the average of the returned scores as the probability estimate. Once the probability estimates for the validation set are collected, we test several decision thresholds and determine the prediction for each instance. The best-performing threshold w.r.t. a given performance metric (in our case, the classification accuracy) is stored together with its performance and the weights of the neural network. The obtained performance estimate can also be used for early stopping in the learning phase. When we apply the model to new instances, we use the best threshold from the training phase (instead of the default value of 0.5). The purpose of AT is to automatically find the threshold with the best performance. To summarize, we employ the following procedure (a code sketch of the threshold search follows the list):

1. During training and after each weight update, we generate the probability distribution with MCD. The mean of the distribution is considered the probability score of a given instance being assigned to the positive class.

2. Using the validation set, we test a range of possible thresholds that determine the instances’ labels. We tested thresholds between 0.1 and 0.9 in increments of 0.001.

3. If the accuracy obtained with the default threshold (0.5) was improved by any other threshold, we stored both the current parameter set and the threshold value used to obtain the improved performance on the validation set.

4. The weights of the best performing model and the matching threshold are returned as the final prediction model.
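The threshold search in steps 2 and 3 can be sketched as follows, assuming val_scores are the mean MCD prediction scores on the validation set and val_labels the true 0/1 labels (both names are placeholders).

```python
import numpy as np

def best_threshold(val_scores, val_labels, default=0.5):
    """Return the decision threshold that maximizes validation accuracy."""
    best_t = default
    best_acc = np.mean((val_scores >= default).astype(int) == val_labels)
    for t in np.arange(0.1, 0.9, 0.001):
        acc = np.mean((val_scores >= t).astype(int) == val_labels)
        if acc > best_acc:          # keep the threshold only if it improves accuracy
            best_t, best_acc = t, acc
    return best_t, best_acc
```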

Evaluation Settings

In this section, we present the evaluation settings, and in Section 6, we report the results. Starting with Section 5.1, we describe the used hate speech datasets, followed by the affective dimensions of the Hourglass of Emotions method in Section 5.2. The implementation details of the used prediction models are presented in Section 5.3. In Section 5.4, we present the evaluation measures for the predictive performance, and in Section 5.5, the measures used in the evaluation of calibration.

Hate Speech Datasets

To test the proposed methodology in the multilingual context, we used hate speech datasets in three languages, English, Croatian, and Slovene. The summary of datasets is available in Table 1.

1. The English dataset (Footnote 3) is extracted from the hate speech and offensive language detection study of [22]. The subset of data we used consists of 5,000 tweets: we took 1,430 tweets labeled as hate speech and randomly sampled 3,670 tweets from the remaining 23,353 tweets.

2. The Croatian dataset was provided by the Styria media company within the EMBEDDIA project (Footnote 4). The texts consist of user comments from the news portal Večernji list (Footnote 5). The original dataset contains 9,646,634 comments, from which we selected 8,422: half of the instances were labeled as hate speech by human moderators, and the other half was chosen randomly from non-problematic comments.

3. The Slovene dataset was produced in the Slovenian national project FRENK (Footnote 6). The dataset used in the experiment is a combination of two different studies of Facebook comments [33]: the first group of comments was collected on LGBT homophobia topics, while the second was collected from anti-migrant posts. In our final dataset, we used all 2,182 hate speech comments and randomly sampled the same number of non-hate speech comments.

Table 1 Characteristics of the datasets used in the experiments

The Hourglass of Emotions Affective Dimensions

To test if emotional information extracted from text can complement the information extracted by BERT models, we used the English tweets dataset and affective dimensions obtained with two versions of the Hourglass of Emotions model; the affective dimensions of the original model can be extracted using the SenticNet 4 framework [53], and the affective dimensions of its revision [55] are available in the SenticNet 6 framework.

SenticNet 4

We used the SenticPhrase interface to obtain the original Hourglass of Emotions affective dimensions from the SenticNet 4 framework [14]. For each sentence, we extracted four affective dimensions (pleasantness, attention, sensitivity, and aptitude). Within SenticNet 4, verb and noun concepts are linked to primitives, and in this way, most concept inflections can be captured by the verb concepts of the knowledge base. The implementation is freely accessible via a Python API (Application Programming Interface) in the Python sentic package (Footnote 7).

To gain a better understanding of the four affective dimensions, [57] presented the following example:

1. The user is happy with the service provided (pleasantness).

2. The user is interested in the information supplied (attention).

3. The user is comfortable with the interface (sensitivity).

4. The user is disposed to use the application (aptitude).

Hate speech texts usually express unhappiness with the current situation and unwillingness to hear or consider different opinions. Hence, the nature of hate speech is opposite to the nature of pleasantness and aptitude, while it can be correlated with attention.

The distributions of the affective dimensions for English tweets, shown separately for non-hate speech and hate speech instances, are presented in Fig. 2. While the distributions differ among the variables, the differences between the hate speech and non-hate speech distributions are not pronounced. This indicates that these variables are not strong indicators of hate speech if used independently, but they might still be useful in combination with textual features extracted by neural networks.

Fig. 2: Distributions of the four affective dimensions from the original Hourglass of Emotions model, obtained from the SenticNet 4 framework for the dataset of English tweets. The left-hand side shows non-hate speech tweets and the right-hand side shows hate speech tweets.

SenticNet 6

The revisited Hourglass of Emotions model [55] is based on empirical evidence obtained in the context of sentiment analysis. Each of the four proposed baseline affective dimensions gives the positive and negative perspective of one emotion:

1. Introspection - joy versus sadness;

2. Temper - calmness versus anger;

3. Attitude - pleasantness versus disgust; and

4. Sensitivity - eagerness versus fear.

The dataset of affective dimensions was obtained using the senticnet Python library (Footnote 8). We used the publicly available word-level API to obtain the affective dimension values for each token separately. We averaged the affective dimension and polarity values at the level of each tweet/comment.
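The aggregation step can be sketched as follows. The token-level lookup function senticnet_lookup is a hypothetical placeholder for the word-level SenticNet 6 API, and the dimension names follow the revisited model.

```python
import numpy as np

DIMENSIONS = ["introspection", "temper", "attitude", "sensitivity", "polarity"]

def tweet_affect_vector(tokens, senticnet_lookup):
    """Average word-level affective dimensions over the tokens of one tweet.

    senticnet_lookup(token) is assumed to return a dict with the values
    listed in DIMENSIONS, or None when the token is not in the knowledge base.
    """
    values = [senticnet_lookup(t) for t in tokens]
    values = [v for v in values if v is not None]
    if not values:
        return np.zeros(len(DIMENSIONS))   # no covered tokens in this tweet
    return np.array([[v[d] for d in DIMENSIONS] for v in values]).mean(axis=0)
```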

We show the distributions of these new dimensions for English tweets in Fig. 3. As with the SenticNet 4 framework, the distributions for hate speech and non-hate speech tweets are similar.

Fig. 3: Distributions of the four affective dimensions from the revisited Hourglass of Emotions model, obtained from the SenticNet 6 framework for the dataset of English tweets. The left-hand side shows non-hate speech tweets and the right-hand side shows hate speech tweets.

Implementation of Prediction Models

We used three types of neural network architectures. As a baseline, we used MCD LSTM networks [7], which include reliability information obtained with MCD. We compared this model with the newly proposed BAN and MCD BERT. As shown in the right-most column of Table 1, the inputs to MCD LSTM are pre-trained word embeddings: a sentence encoder for English [20], and fastText embeddings (Footnote 9) for Slovene and Croatian. For the implementation of BAN, we used the Keras tokenizer (Footnote 10), and for MCD BERT, we used BERT’s tokenizer.

We implemented the proposed BANs (Footnote 11) and MCD BERT (Footnote 12) with the PyTorch library. The main hyper-parameters of the BAN architecture are the number of attention heads and the number of attention layers. The adaptive classification threshold (described in Section 4.2) is computed every time we evaluate the performance on the validation set. When a network makes a prediction, we deactivate all layers except the dropout layers; in this way, we maintain the variance of predictions. Each final prediction consists of a set of results obtained by several forward passes.

Other parameters are set as follows. We use the Adamax optimizer [63], a variant of Adam based on the infinity norm, and the binary cross-entropy loss function. To automatically stop training, we use a stopping patience of 10 steps: if the performance on the validation set does not improve after 10 optimization steps, the training stops.

We explored the following hyper-parameter tuning space. The validation percentage (size of the validation set) was varied between 5% and 10%; the rationale for testing different validation set sizes is the relatively small size of the datasets, which makes it difficult to strike a good balance between the training and validation sets. Given enough data, the validation set should be at the upper margin. The number of epochs was either 30 or 100, and the number of hidden layers and attention heads was 1 or 2. The maximum padding of the input sequences was 32, 48, or 64. The learning rate was either 0.001 or 0.0005, and AT was either enabled or disabled.

MCD LSTM networks consist of an embedding layer, an LSTM layer, and a fully connected layer, used with the word2vec [64] and ELMo [65] embeddings. To obtain the best architectures for the LSTM and MCD LSTM models, we tested different numbers of units, batch sizes, dropout rates, etc.

For BERT, we used the BERT base model for English and the multilingual BERT variant for Croatian and Slovene, relying on the HuggingFace implementation (Footnote 13). To combine the information from MCD BERT and SenticNet, we generated 1000 MCD BERT predictions for each instance and merged them with the four sentic variables described in Section 5.2, thus obtaining 1004 variables. This data was passed as input to an SVM model. The process used 5-fold cross-validation.
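A sketch of this feature construction is shown below, assuming mcd_preds holds the 1000 MCD BERT prediction scores per instance and sentic_feats the four affective dimensions; the variable names, sorting step, and default SVC settings are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def build_features(mcd_preds, sentic_feats):
    """Concatenate sorted MCD BERT scores with sentic features.

    mcd_preds: array of shape (n_instances, 1000); sentic_feats: (n_instances, 4).
    Returns an (n_instances, 1004) feature matrix.
    """
    sorted_preds = np.sort(mcd_preds, axis=1)        # scores in ascending order
    return np.hstack([sorted_preds, sentic_feats])

# Example usage (labels y assumed to be 0/1):
# X = build_features(mcd_preds, sentic_feats)
# scores = cross_val_score(SVC(), X, y, cv=5)        # 5-fold cross-validation
```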

Prediction Performance Evaluation Measures

Depending on the purpose of the prediction model, we might optimize different evaluation measures, such as classification accuracy, precision, recall, or the \(F_1\) score. In hate speech detection, we want to avoid false accusations of hate speech. To that end, we maximize precision on the validation set during training. As this could negatively affect other measures, we alter the decision threshold to achieve a good balance between precision and accuracy. In Fig. 4, we present the accuracy-precision trade-off.

Fig. 4: Trade-off between precision and accuracy across various hyper-parameter settings of the BAN model. Each curve shows one set of hyper-parameters, and each color depicts one decision threshold (0, 0.25, 0.5, 0.75, or 1.0). The hyper-parameters comprise the number of heads, maximum padding, number of layers, number of epochs, and validation set ratio.

Calibration Quality Measures

To measure the quality of the computed calibration scores, we use the expected calibration error (ECE) [66]. To compute ECE, we split all n predictions into M equally spaced bins \(B_1, B_2, \ldots , B_M\) that contain instances with prediction scores in the given bin. We sum the differences between the actual prediction accuracy and the average predicted score over all the bins, weighting each bin by the proportion \(|B_m|/n\) of instances it contains:

$$\begin{aligned} ECE= \sum _{m=1}^{M}\frac{|B_m|}{n}|\text {accuracy}(B_m) - \text {score}(B_m)| \end{aligned}$$
(2)

This measure produces lower scores for better calibrated models (lower calibration error).
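A minimal sketch of Equation (2) is given below; binning by the confidence of the predicted class is one common implementation choice and an assumption on our part, as is the default of 10 bins.

```python
import numpy as np

def expected_calibration_error(scores, labels, n_bins=10):
    """ECE for binary predictions, with scores interpreted as P(class = 1)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    predictions = (scores >= 0.5).astype(int)
    confidences = np.where(predictions == 1, scores, 1 - scores)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(scores)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            accuracy = np.mean(predictions[mask] == labels[mask])
            avg_score = np.mean(confidences[mask])
            ece += (mask.sum() / n) * abs(accuracy - avg_score)
    return ece
```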

Results

In this section, we present results of five sets of experiments. In Section 6.1, we report calibration of different prediction models, and in Section 6.2, their prediction performance. The comparison between the reliability of BERT and MCD BERT is presented in Section 6.3, while the impact of sentic features is discussed in Section 6.4. Finally, we present different visualizations of models’ uncertainty in Section 6.5.

Calibration of BAN and BERT

Figure 5 shows how the calibration of prediction scores changes during the training of BAN. The red line represents the performance of the fully trained network. It is apparent that additional calibration is necessary, as perfect calibration corresponds to the dotted line. Surprisingly, some of the intermediate training iterations show better calibrated scores than the final model. This is the motivation for AT, presented in Section 4.2.

Fig. 5: Calibration plot for the English BAN model after each epoch (green), based on the validation set and the best performing architecture. The transparency of the green calibration lines decreases with the number of epochs (i.e., initial stages are the most transparent). The final calibration is shown in red, and the dotted line shows perfect calibration.

In Tables 2, 3, and 4, the calibration results for different calibration settings of BAN are presented: no calibration, isotonic regression, and Platt’s method, each either combined with AT or not. For all three languages, both calibration methods improve the ECE score, and Platt’s method produces the best calibration scores. AT slightly improves the ECE score for the uncalibrated (raw) results; this is especially true for the Slovene comments, where the ECE score was reduced from 0.794 to 0.621. We conclude that the AT heuristic is beneficial when used on its own, but not when combined with the established calibration techniques (isotonic regression and Platt’s method).

Table 2 The calibration scores of BAN with different calibration approaches on the English tweets dataset. We present average classification accuracy and F1 score with their standard deviations, computed using 5-fold cross-validation
Table 3 The calibration scores of BAN with different calibration approaches on the Croatian user news comments dataset
Table 4 The calibration scores of BAN with different calibration approaches on the Slovene Facebook comments dataset

To compare the calibration of MCD BERT with different BAN calibrations, we plot their ECE scores in Fig. 6. It can be observed that the calibration methods substantially improve the BAN score. However, the MCD BERT model is better calibrated even without the use of an explicit calibration method.

Fig. 6: Calibration plots based on English test set performance for MCD BERT and BAN with different calibration algorithms.

Prediction Performance

We compare the predictive performance of four neural network architectures in Table 5. MCD LSTM and BERT serve as the baselines for comparison with the proposed BAN and MCD BERT. The MCD BERT model provides the best results for all three languages. BERT models are pre-trained on large amounts of text, which makes a significant difference compared to LSTM and BAN. MCD BERT is slightly better than BERT due to its better performance on the instances where BERT is uncertain; here, multiple predictions reduce the prediction variance. MCD LSTM is more stable than BAN (see the standard deviation of \(F_1\) scores in Table 5). We attribute this to the larger number of parameters in BAN and the insufficient number of training instances. BERT and MCD BERT models compensate for this problem with large-scale pre-training.

Table 5 Predictive performance of compared models. We present the average classification accuracy and \(F_1\) score with their standard deviations (in brackets), computed using 5-fold cross-validation. The best accuracy for each language is typeset in bold

Reliability of BERT and MCD BERT

As established in Section 6.1, BERT models are already well-calibrated. In this section, we test whether the proposed MCD BERT extension is useful beyond its advantage in predictive performance, and analyze the ability of MCD BERT to detect problematic predictions. For each classifier (BERT and MCD BERT), we split the tested instances into two groups, uncertain and certain, based on the computed prediction scores. As BERT and MCD BERT return most of the predictions close to 0 or 1, we used the following criteria for the certainty of prediction scores. For MCD BERT, a tested instance is declared uncertain if the variance computed on its 1000 dropout predictions is greater than 0.1; otherwise, it is declared certain. As BERT returns a single prediction score, we chose the same number of uncertain instances as for MCD BERT, based on the criterion that their prediction scores are farthest away from 0 or 1, i.e., they are least certain to be either hate speech or not.
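The two selection criteria can be sketched as follows; mcd_preds (an array of 1000 MCD BERT scores per instance), bert_scores (a single BERT score per instance), and the threshold value are placeholders matching the description above.

```python
import numpy as np

def uncertain_mcd(mcd_preds, var_threshold=0.1):
    """Uncertain if the variance of the 1000 MCD predictions exceeds the threshold."""
    return mcd_preds.var(axis=1) > var_threshold       # boolean mask

def uncertain_bert(bert_scores, n_uncertain):
    """Mark the n_uncertain instances whose scores are farthest from 0 and 1."""
    distance_from_extreme = np.minimum(bert_scores, 1 - bert_scores)
    uncertain_idx = np.argsort(distance_from_extreme)[-n_uncertain:]
    mask = np.zeros(len(bert_scores), dtype=bool)
    mask[uncertain_idx] = True
    return mask
```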

Table 6 The number and ratio of predictions where classifiers are correct/incorrect is very different for instances where BERT and MCD BERT are certain/uncertain. We use three datasets, English (ENG), Croatian (CRO), and Slovenian (SLO)

In Table 6, we show the number of predictions where the classifiers are correct or incorrect, separately for instances with certain and uncertain predictions, for each of the three languages. The ratio of incorrectly to correctly classified instances is significantly different between the certain and uncertain groups, which is a strong indication that both BERT and MCD BERT correctly recognize uncertain predictions. This ratio is also much larger for MCD BERT than for BERT on the English and Croatian datasets, which indicates that the reliability of MCD BERT predictions is better. The ratio is similar for the Slovene dataset, where BERT also has a good ratio.

Using the Chi-square statistical test, we assessed the difference in correct/incorrect classifications between the certain and uncertain groups. For the English dataset, this difference is highly significant for both BERT and MCD BERT (\(p=\)1.384e-11 and 2.2e-16, respectively). For the Croatian dataset, the p-values are 1 and 8.348e-16, meaning that we cannot rely on BERT scores to detect uncertain classifications, while the distribution returned by MCD BERT is very informative. The p-values for BERT and MCD BERT on Slovene are 0.0037 and 0.0002, respectively. Again, MCD BERT is much better at detecting unreliable classifications.

The observed difference in the assessment of reliability can have important practical consequences. For example, if we are faced with a re-annotation task to improve the quality of predictions, MCD BERT would select borderline instances much better than BERT.

Combining Emotional Information with MCD BERT

As the experiments in Section 6.2 show, MCD BERT is superior to the other tested models on the hate speech detection task. In this section, we test whether additional emotional information obtained from the SenticNet framework can complement the information about hate speech extracted by the MCD BERT model and further improve its performance. We merge the affective dimensions computed based on SenticNet 4 and SenticNet 6 with the output vector of MCD BERT predictions, described in Section 3.3. Additionally, we investigate whether the emotional information can help in the interpretation of trained hate speech detectors.

Fig. 7: A diagram of merging MCD BERT predictions with the emotional information based on the SenticNet 4 and SenticNet 6 frameworks. The concatenated vector is the input to the final SVM classifier that predicts hate speech.

For the evaluation, we use 5-fold cross-validation. In each iteration, we combine the predictions from MCD BERT (1000 of them, sorted in ascending order) with the affective dimensions from the original and revised variants of the Hourglass of Emotions models, as depicted in Fig. 7. We obtain four affective dimensions from the original Hourglass of Emotions model (pleasantness, aptitude, sensitivity, and attention) and four from the revisited model (introspection, sensitivity, temper, and attitude). Using the dataset obtained in this way, we train an SVM model to predict hate speech. According to the results in Table 7, the additional information does not improve hate speech detection. The same conclusion can be drawn from Fig. 8, where we plot the scores assigned to the used features by the random forest algorithm [68]; this learning algorithm can detect feature dependencies that affect the prediction variable. Thus, the results show that SVM and random forest cannot detect any pronounced interactions between the affective dimensions and MCD BERT predictions that would impact the hate speech classification.

The results show that introducing knowledge about emotional content after the predictions are made cannot improve the performance. However, according to the authors of the revisited Hourglass of Emotions model [55], the full-sentence model introduced in SenticNet 6 [17] can provide superior text classification results on problems involving emotions. Thus, the layers that capture emotional information from the text should be built into the prediction model architecture. Introducing an uncertainty component into such an architecture remains an interesting direction for further research.

Table 7 Predictive performance of the MCD BERT model and the SVM model trained on the output features of MCD BERT and affective dimensions from the two Hourglass of Emotions models for the English tweets dataset
Fig. 8: Feature importance scores according to the random forest algorithm. We show the scores of the 8 affective dimensions extracted from the SenticNet 4 and SenticNet 6 frameworks, as well as the five most important attributes generated by the MCD BERT model.

To better understand the emotions involved in the hate speech problem, we further investigated the relation between the affective dimensions of the two Hourglass of Emotions models (original and revisited) and the hate speech prediction probabilities of MCD BERT, separately for the non-hate speech and hate speech English tweets.

The top line of graphs in Fig. 9 shows results for the affective dimensions of the original Hourglass of Emotions model (pleasantness, attention, sensitivity, and aptitude). The upper parts of the graphs show that the linear regression lines (in orange) for hate speech are almost horizontal, so there is no significant correlation between the predicted probability of hate speech obtained with MCD BERT and the affective dimensions. In contrast, the correlation between the predicted probability and the affective dimensions for the non-hate speech tweets is significant, as shown by the blue regression lines in the lower parts of the graphs in the top line. Both attention and sensitivity have a positive correlation with the hate speech prediction probability. This is in accordance with the conclusions of the original Hourglass of Emotions model that high attention and sensitivity lead to aggressiveness (Fig. 5 in [53]).

Fig. 9: Relationship between the prediction probability of MCD BERT and the Hourglass of Emotions affective dimensions. The original affective dimensions are shown in the top line of graphs, while the revisited dimensions are shown in the bottom line.

The bottom line of graphs in Fig. 9 shows the affective dimensions of the revisited Hourglass of Emotions model (sensitivity, temper, attitude, and introspection). These affective dimensions are all negatively correlated with the prediction probabilities for the non-hate speech tweets. There is also a slight negative correlation between the affective dimensions and the hate speech probabilities, especially for sensitivity and temper. Thus, tweets that contain dominantly positive emotions have a low probability of being hate speech, which is in accordance with the results presented by [55].

Visualization of Uncertainty

Obtaining multiple predictions for a specific instance can improve understanding of the final prediction. We use the mean of the distribution to estimate the probability, while the variance informs us about the spread and certainty of the predictions. We can inspect the actual distribution of prediction scores with histogram plots, as illustrated in Fig. 10 for a few correctly classified instances from the English dataset, and in Fig. 11 for a few misclassified instances. We analyze the distributions produced by the MCD LSTM baseline, BAN with 10% and 30% dropout, and MCD BERT.

Histograms in Figs. 10 and 11 visually display the prediction certainty for specific instances. We notice that MCD BERT’s predictions are always close to 0 or 1, especially when the model seems certain of the prediction. BAN with 10% dropout provides a similar spread of values as MCD BERT, which is expected, as BERT is also pre-trained with 10% dropout. However, 30% dropout in BAN results in a much larger spread of predictions for instances where BAN is uncertain. Note that the results of MCD BERT are concentrated in a much narrower interval compared to MCD LSTM and BAN.

Fig. 10: Distributions of prediction scores for a few correctly classified English instances. We show histograms for MCD LSTM (first row), BAN with \(30\%\) dropout (second row), BAN with \(10\%\) dropout (third row), and MCD BERT (fourth row). Each tweet is shown in a separate column.

Fig. 11: Distributions of prediction scores for a few incorrectly classified English instances. We show histograms for MCD LSTM (first row), BAN with \(30\%\) dropout (second row), BAN with \(10\%\) dropout (third row), and MCD BERT (fourth row). Each tweet is shown in a separate column.

Fig. 12: Visualization of 100 test tweets projected into two-dimensional space by the UMP method. Tweets whose classifications seem certain are colored blue, while tweets with uncertain classification are shown in orange. We can observe a clustering of uncertain tweets.

Fig. 13: Visualization of the probability space for 100 tweets from the test set. The instances are colored green, yellow, or red, depending on the mean probability of the 1000 predictions. Predictions with high confidence form an isolated part of the probability space.

While visualizations of prediction distributions for individual instances (see Figs. 10 and 11) are useful in the assessment of their prediction reliability, we also aggregate results over multiple instances to understand more general reliability phenomena. Following [7], we visualize the embeddings of the prediction distributions. The idea of this visualization is to detect and identify clusters of certain and uncertain classifications. First, we obtain many predictions (1,000 in our experiments) for each instance. The space of prediction distributions across instances is embedded into two dimensions by the Uniform Manifold Projections (UMP) method [67]. In this way, we obtain a two-dimensional space corresponding to the initial 1,000-dimensional space of prediction distributions. Next, we use Gaussian kernel density estimation to identify regions of similar density and connect them with closed curves. Finally, the shapes and sizes of individual predictions are chosen based on their classification error and the certainty of the predictions. The goal of this visualization is to discover structures within the space of probability distributions, possibly offering insights into the drawbacks and limitations of the analyzed classifier. The resulting visualizations are shown in Figs. 12 and 13. In Fig. 12, the plot displays the position of certain and uncertain test set instances in the embedded space of distributions, while in Fig. 13 the differences are based on the mean of the predicted probability scores.
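A sketch of this visualization pipeline using the umap-learn library and a Gaussian kernel density estimate is given below; the library choice, grid resolution, and plotting details are assumptions and not necessarily the exact tooling used to produce Figs. 12 and 13.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import umap

def plot_prediction_space(pred_distributions, uncertain_mask):
    """Embed (n_instances, 1000) MCD score matrix into 2D and plot density contours."""
    embedding = umap.UMAP(n_components=2).fit_transform(pred_distributions)
    # Density contours over the embedded space.
    kde = gaussian_kde(embedding.T)
    xs, ys = np.mgrid[embedding[:, 0].min():embedding[:, 0].max():100j,
                      embedding[:, 1].min():embedding[:, 1].max():100j]
    density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
    plt.contour(xs, ys, density, levels=5, colors="gray")
    # Certain vs. uncertain instances in the embedded space.
    plt.scatter(*embedding[~uncertain_mask].T, c="tab:blue", label="certain")
    plt.scatter(*embedding[uncertain_mask].T, c="tab:orange", label="uncertain")
    plt.legend()
    plt.show()
```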

In both Figs. 12 and 13, the probability space is distinctly separated into two components, indicating that there are predictions for which the neural network is certain (and which were correctly classified). However, for some predictions, especially non-hate speech instances, the model is less certain (albeit still correct). The two visualizations demonstrate how the probability space is split into distinct components for a trained neural network. The visualizations also show problematic predictions, allowing their identification and potentially facilitating the debugging process for developers (e.g., an inspection of convergence).

Conclusions and Future Work

In real-world scenarios, automatic detection of hate speech requires high precision and reliable decisions. Wrong classifications can lower the level of democratic debate and damage freedom of speech. In technological terms, NLP is witnessing a switch from RNNs with pre-trained word embeddings (such as LSTM with fastText) to large pre-trained transformer models (such as BERT).

We proposed to use MCD in the attention layers of transformer neural networks and to keep the dropout layers unfrozen also during the prediction phase. This resulted in two new architectures, BAN and MCD BERT. The BAN models are transformer networks trained from scratch, using dropout in both the training and prediction phases. MCD BERT uses a pretrained BERT model with dropout active during both fine-tuning and prediction. We have shown that these approaches are useful for the estimation of prediction uncertainty. MCD BERT significantly improves the prediction performance on the hate speech detection task. Its pre-training extracts useful information about language use that can be successfully exploited in fine-tuning to a specific problem; BANs, trained from scratch, are not competitive with this. We also empirically investigated the calibration of BAN and MCD BERT. The results show that MCD BERT is much better calibrated than BAN.

Multiple predictions obtained from MCD BERT not only produce better predictive performance compared to BERT, but also provide better reliability information. The visualizations based on them enable detection of less certain decisions and can help moderators or annotators to focus on uncertain instances.

In line with recent research showing that the affective information available in the SenticNet 6 framework provides favorable results in sentiment analysis [55], we tested this information on the hate speech detection task. We combined affective dimensions from the original and revisited Hourglass of Emotions models with predictions generated by the MCD BERT model. While our results do not show any improvement in predictive performance, we believe affective information should be incorporated within the prediction model itself, together with the possibility of obtaining prediction uncertainty. Thus, we see an opportunity for further work in this area by introducing BERT-based uncertainty estimates into the full-sentence models of the SenticNet 6 framework. Nevertheless, the predictions of the MCD BERT model confirm the findings of the Hourglass of Emotions model: its affective dimensions are correlated with the non-hate speech probabilities returned by MCD BERT and can potentially explain the emotions involved in hate speech. Breaking down complex offensive language into fundamental emotions can bring interesting insights into the hate speech problem.

In future work, we aim to adapt other Bayesian approaches, such as SWAG, to transformer networks. Reliability-enhanced classifications could be used in many other domains, such as machine translation. One of the tasks where Bayesian text classification can be particularly useful is semi-supervised learning, which iteratively expands an initial small set of manually labeled instances with the most reliably classified instances. Data re-annotation is another example where reliability scores can be of great use. An initial pilot study on Croatian comment filtering showed that human annotators decide mostly based on observed keywords and lack the time to detect more subtle expressions of offensive content. These circumstances result in low-quality datasets and demand their re-annotation. Using the reliability scores of the proposed MCD BERT, one could significantly reduce the amount of re-annotation and focus on genuinely difficult and borderline cases where prediction models may err.