To BAN or not to BAN: Bayesian Attention Networks for Reliable Hate Speech Detection

Hate speech is an important problem in the management of user-generated content. In order to remove offensive content or ban misbehaving users, content moderators need reliable hate speech detectors. Recently, deep neural networks based on transformer architecture, such as (multilingual) BERT model, achieve superior performance in many natural language classification tasks, including hate speech detection. So far, these methods have not been able to quantify their output in terms of reliability. We propose a Bayesian method using Monte Carlo Dropout within the attention layers of the transformer models to provide well-calibrated reliability estimates. We evaluate and visualize the introduced approach on hate speech detection problems in several languages. From the experiments performed it was observed that our approach significantly improve the hate speech detection that can not be trusted. Our approach not only improves classification performance of the state-of-the-art multilingual BERT model, but the computed reliability scores also significantly reduce the workload in the inspection of offending cases and in reannotation campaigns. The provided visualization helps to understand the borderline outcomes.


Introduction
With the rise of the social network popularity, hate speech phenomena has significantly increased [10]. Hate speech not only harms both minority groups and the whole society but it can lead to actual crimes [3]. Hence, automated hate speech detection mechanisms are urgently needed. On the other hand, falsely accusing people of hate speech is also a problem. Many content providers rely on human moderators to reliably decide if a given context is offensive or not but this is a mundane and stressful job which can even cause post-traumatic stress disorders 1 . There are many attempts to automate detection of hate speech in the social media using machine learning, but existing models lack quantification of reliability for their decisions.
In the last few years, recurrent neural networks (RNNs) were the most popular choice in text classification. The LSTM networks, the most successful RNN architecture, were already successfully adapted for assessment of predictive reliability in hate speech classification [37]. Recently, neural network architecture with attention layers, called transformer architecture [53], shows even better performance on almost all language processing tasks. Using transformer networks for the task of masked language modelling produced breakthrough pretrained models such as BERT [11]. Hence, the attention mechanism seems to be at the forefront of natural language understanding with potentially huge impact on language applications. We aim to investigate the behavior of the attention mechanism concerning the reliability of predictions. We focus on the hate speech recognition task.
In hate speech detection, reliable predictions are needed to remove harmful contents and possibly ban malicious users without harming the freedom of speech [37]. Standard neural networks are inadequate for assessment of predictive uncertainty, and we have to use the Bayesian inference framework. However, classical Bayesian inference techniques do not scale well in neural networks with high dimensional parameter space [21]. Various methods were proposed in order to overcome this problem [41]. One of the most efficient method is called Monte Carlo Dropout (MCD) [16]. Its idea is to use dropout in neural networks as a regularization technique [49] and interpret it as a Bayesian optimisation approach that samples from the approximate posterior distribution.
We propose a Bayesian Attention Network (BAN) architecture that combines the attention mechanism in transformer networks with the MCD based Bayesian inference in order to estimate the reliability of predictions. Our main contributions are: 1. Methodology for assessment of prediction uncertainty in attention networks (AN) and BERT model. 2. Empirical analysis of the proposed Bayesian Attention Networks and BERT models regarding their calibration within the multilingual hate speech detection tasks.
3. Visualization of prediction uncertainty level for individual instances as well as for their groups.
The paper consists of six more sections. In Section 2, we present related works on prediction uncertainty and hate speech detection. In Section 3, we propose the methodology for uncertainty assessment using attention layers and MCD while in Section 4 we describe calibration of predictions. Section 5 presents the data sets and the evaluation scenario. The obtained results are presented in Section 6, followed by conclusions and ideas for further work in Section 7.

Related Work
We present three areas of related work concerning the three pillars of the paper. First, we introduce work done on the hate speech detection, followed by the related research on transformer architecture for text classification. We end the section with description of existing approaches for assessment of uncertainty in text classification.

Hate Speech Detection
Analyzing sentiments and extracting emotions from the text are one of the most usefully applications of the natural language processing [52]. From the wide rage of applications where machines tend to understand human sentiments hate speech detection is becoming more and more needed with the rise of the social media popularity. Hate speech is defined as written or oral communication that abuses or threatens specific group or target [56].
Depending on the geographical and political influences in the specific country or region the hate speech target group may vary [47]. One of the most targeted groups in Europe based report done by British Institute of Human Rights [23] are the race, national and LGBT minorities. For each of those three target groups substantial research regarding how to stop the hate speech was done. Works by [4,58] define and propose law medications racist hate speech, in the research done by [1,6] national hate speech was investigated and the abuse based on the sexual orientation was explored by [2,24]. As our interest are also the less represented European languages our focus in this paper was on those three target groups.
Detecting abusive language for lowresource languages is a complex task, hence multilingual and cross-lingual methods are used to improve the results [50]. This can especially the case when the languages have morphological and geographical similarities [43]. In our work we investigate and compare results of the hate speech detection for English, Croatian and Slovenian. As most widely used, English language has the most studies done [10,31,57], however, last years similar studies were done for Croatian [26,28,32] and Slovenian [13,29,54] languages. Hate speech detection, a particular case of text classification, can be formulated as a binary classification problem, therefore it is usually approached by using supervised learning methods. In past the most frequently used classifier was the Support Vector Machines (SVM) method [48]. However, deep neural networks are now dominant technique, first through recurrent neural network language models [35], and recently using the pre-trained transformer networks [40,59]. In this work we used pre-trained (mulilingual) BERT model in order to obtain state-of-the-art results for hate speech detection.

Attention Networks for Text Classification
With intention to replace existing LSTM model, Attention networks introduced within the Transformer model were first time proposed by [53]. As it was shown that this change improve results for many NLP tasks, researcher from various fields started experimenting with Transformer model. Recently, several works investigated large pre-trained transformer models that use the attention mechanism for text classification [22]. In thier work Kant at al. [22] train both mLSTM and Transformer language models on a large 40GB text dataset [33] and then they transfer those models to the both binary and multi-class text classification problems. The compared the model performance on this tasks with both large academic datasets, and on an original text dataset. The conculded that the Transformer model generally out-performs the mLSTM model, especially when fine-tuning on multidimensional emotion classification and that fine-tuning the model significantly improves performance on the emotion tasks, both for the mLSTM and the Transformer model. The BERT model [11] was the first to use a large dataset combined with transformer architecture applied to the masked language model. BERT and its follow-ups learn sufficient amount of language characteristics (both syntactic and semantic) to be useful for many other language classification tasks. Despite short time since its conception, BERT has already attracted enormous attention of NLP community and has been extensively researched by hundreds of research groups; see a recent overview by Rogers et al. [46]. Also, practice guidelines about how to fine-tune the BERT model for text classification were proposed by [51].
Multilingual hierarchical attention mechanism for document classification was investigated by several authors [12,44,61]. However, different attention layers from these large pre-trained models were not tested separately or in the context of prediction reliability. Also, to the best of our knowledge, predictive reliability for results obtained by BERT was not investigated, yet.

Prediction Uncertainty for Text Classification
While recent works on classification reliability mostly investigate deep neural networks, many other probabilistic classifiers were analyzed in the past [7,20,42,64]. The well-known work of Platt [45] investigates probabilistic properties of SVM model predictions.
Prediction uncertainty is an important issue for black box models like neural networks as they do not provide any interpretability or reliability information about their predictions. Most existing reliability scores for deep neural networks are based on the Bayesian inference. The exception is the work of Lakshminarayanan et al. [27], who proposed using deep ensembles to estimate the prediction uncertainty.
One of the most popular approaches is to mimic Bayesian inference using Monte Carlo dropout [16]. The dropout technique was first introduced to RNNs in 2013 [55] but further research revealed negative impact of dropout in RNNs [5]. Later, the dropout was successfully applied to language modeling by Zaremba et al. [63] who applied it only on fully connected layers. Gal and Ghahramani [17] implemented the variational inference based dropout which can also regularize recurrent layers. Additionally, they provide a solution for dropout within word embeddings. The method mimics Bayesian inference by combining probabilistic parameter interpretation and deep RNNs. Authors introduce the idea of augmenting probabilistic RNN models with the prediction uncertainty estimation. Several other works investigate how to estimate prediction uncertainty within different data frameworks using RNNs [65], e.g., Bayes by Backpropagation (BBB) was applied to RNNs [14].
Recently, a fast and scalable method called SWAG was proposed by Maddox et al. [30]. The main idea of this method is to randomise learning rate and interpret this as a sampling from the Gaussian distribution. SWAG fits the Gaussian distribution by capturing the Stochastic Weight Averaging (SWA) mean and covariance matrix, representing the first two moments of SGD iterations. Different to SWAG, we use the Gaussian distribution as a posterior over neural network weights, and then perform a Bayesian model averaging for uncertainty representation and calibration.
MCD was recently used within various models and architectures in order to obtain the prediction uncertainty and improve the classification results [36,38,39]. Transformer networks were not analyzed yet.

Bayesian Attention Networks
The BERT model [11] is the representative of transformer networks and has achieved state-of-the art results in many NLP tasks, including text classification [9,19,60]. In this work, we introduce Monte Carlo Dropout to transformer networks and BERT with the intention to construct their Bayesian variants. Analysis of different amounts of dropout, different variants of BERT modifications, and their hyperparameters would require huge computational resources, e.g., training a single BERT model on four TPUs requires more than a month time. Due to limited computational resources, we explore these issues in a limited setting, first on only the encoder part of the BERT architecture, called Attention Network (AN), and then on the entire pretrained BERT model.
In the following subsections, we first formally define the Attention Network architecture, and then make it Bayesian by introducing MCD. We describe how we can introduce MCD principle into the already pretrained BERT model.

Attention Networks
The basic architecture of Attention Network follows the architecture of transformer networks [53] and is shown in Figure 1. The architecture is similar to the encoder part of the transformer architecture. The difference is in the output part where a single output head was added to perform binary classification, using the sigmoid activation function. By applying only the encoder part of transformer architecture, orders of magnitude less parameters are needed to learn a particular classification task, e.g., in this work, we used at maximum 3 million parameters. The architecture can contain many attention heads, where a single attention head is computed as: The attention matrices are commonly known as the query Q, the key K, and the value matrix V . The o h represents the output. The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The d k represents the dimensionality of the keys. The positional encoding, as discussed in [53], represents a matrix that encodes individual positions in a matrix of same dimensionality as the one holding the information on sequences (input embedding). Intuitively, the multiplication of query and key vectors with subsequent values can be understood as extraction of relations.
The softmax activation enables that each pair of considered input tokens is represented with a single real value, and effectively introduces sparseness into the weight space -only certain token pairs emerge with high weights and are as such relevant for the remaining part of the neural network architecture considered. In practice, multiple such heads can be concatenated and fed into the succeeding feed-forward layer. Application of softmax has been shown to emphasize only particular parts of the parameter space, thereby making the neural network more focused. The positional encoding, being the part of the transformer architecture, was introduced to account for word order.
Here, relative distances between different tokens are taken into account by incorporating the position-related signal into a given token representation.
While there are in principle many different ways how attention networks can be extended with Bayesian approach, we propose to use a well established Monte Carlo Dropout.

Monte Carlo Dropout for Attention Networks
In our proposal, called Monte Carlo Dropout Attention Networks (MCD AN or BAN) contrary to the original dropout setting, the dropout layers are active also during the prediction phase. In this way, predictions are not deterministic and are sampled from the learned distribution, thereby forming an ensemble of predictions. The obtained distribution can be, for example, inspected for higher moment properties and can offer additional information on the certainty of a given prediction. During the prediction phase, all layers except the dropout layers are deactivated. Forward pass on such partially activated architecture is repeated for a fixed number of samples, which can be combined to obtain the final probability, or further inspected as a distribution underlying the probability.

Monte Carlo Dropout for BERT
Monte Carlo dropout was used for the BERT predictions in the same way as for BAN. MCD can provide multiple predictions during the test time completely free, as long as the dropout was used during the training time [15]. Training neural network with dropout distributes the information contained in the neurons throughout the network. Hence, during the prediction, such trained neural network will be robust; using the dropout principle a new prediction is possible in each forward pass, and sufficiently large set of such predictions can be used to estimate the reliability. BERT model is trained with 10% of dropout in all of the layers and thus allows for multiple predictions using the described principle. We call this model MCD BERT. A possible limitation of this approach is that during training a single dropout rate of 10% is used, while other dropout probabilities might be more suitable for reliability estimation.

Calibration of Probabilistic Classifiers
The quality of reliability scores returned by classifiers (such as MCD AN and MCD BERT) are typically assessed with calibration measures. A classifier is calibrated if its output scores are true probabilities in a sense that a class predicted with the score p is correct with actual probability p, i.e. in p · 100 percent of cases. For most neural networks it is true that without any measures taken to ensure that they are calibrated, they will most likely be somewhat overconfident and will overestimate their probabilities. A model's calibration can be visualized with a calibration plot where the model's prediction accuracy (true probabilities) is plotted against the predicted output (i.e. softmax score). A perfect calibration manifests itself as a diagonal in the calibration plot (see an example of calibration plot in Fig. 4).
Since classifiers are typically not perfectly calibrated, we investigated different methods to improve calibration of predictions for a binary model. We compared various existing methods for calibration of neural networks with a novel approach that combines the existing methods with the adaptive prediction threshold. We first describe the existing calibration methods, followed by the proposed adaptive threshold.

Existing Calibration Methods
Below we formally describe how to obtain calibrated predictions from the reliability scores. Let (X, Y ) be the input space, where X present the set of predictive variables and Y is the class variable with possible values 0 and 1. Let f be the predictor (e.g., neural network) with f (X) = (Ŷ ,P ), whereŶ is the binary class prediction, andP is its associated confidence score or probability score of correct prediction. The calibration of the model f is expressed with the formula: wherep is the prediction score from [0, 1] interval, obtained from the predictor f . We interpret this score as the probability of a specific outcome assigned by the model f . Probability p is the model's confidence or true probability that model f predicted correctly. If a model predicts certain outcome with high probability, it is desirable that the confidence of this prediction being correct is also high. In the ideal case of perfect calibrationp = p. Based on Equation (1), there are two ways to reduce the calibration error: either to obtain calibrated predictionsp or to manipulate the prediction threshold in such way that the predicted outcomeŶ is better calibrated.
To asses the quality of the produced reliability scores, we compare them to results of two most popular calibration methods, Platt's method and Isotonic regression. Platt's method [45] learns two scalar parameters a, b ∈ R in such way that the predictionq = σ(ap + b) presents a calibrated probability of predicted scorep, and σ is a sigmoid function. To find good values of a and b typically a separate calibration data set is used. The isotonic regression is a non-parametric form of regression in which we assume that the function is chosen from the class of all isotonic (i.e. non-decreasing) functions [62]. Given predictions from our classifierp, and the true target y, the calibrated prediction returned by isotonic regression is: where m is a non-increasing function.

Adaptive Threshold
As a part of this work, we explored the notion of adaptive threshold (AT) which we apply to classification with MCD ANs. During learning, after each weight update phase, we asses the performance of AN. For each instance in the validation set, we do multiple forward passes with unfixed dropout layers and store the average of returned scores as the probability estimate. Once the probability estimates for the validation set are collected, we apply several decision thresholds to them and determine the final predictions of the instances. The best-performing threshold w.r.t. a given performance metric, in our case classification accuracy, is stored together with its performance and weights. This performance estimate can also be used in early stopping of the learning phase. When we apply the model to new instances, we use the best threshold from the training phase (instead of 0.5). The purpose of the adaptive threshold is to automatically find the threshold with better performance than the default value of 0.5. To summarize, we employ the following procedure: 1. During training, after each weight update, a probability distribution is generated with MCD. The mean of the distribution is considered the probability of a given instance being assigned to the positive class. 2. Using the validation set, we test a range of possible thresholds that determine the instances' labels. In this work, we tested the threshold range between the 0.1 and 0.9 in increments of 0.001. 3. If the accuracy obtained by the default threshold (0.5) was exceeded by any other threshold, the BAN network stored both the current parameter set as well as the threshold value that was used to obtain this superior performance on the validation set. 4. After the end of training, the weights of the best performing architecture and the matching threshold value are used as the final trained model.

Evaluation Settings
We evaluate the proposed novelties concerning three main aspects: i) the calibration of returned probabilities, ii) prediction performance, and iii) visualization of prediction uncertainty on the hate speech problem. In this section, we present the evaluation settings, and in Section 6 we report the results for the three aspects. Here, we start with the description of the used hate speech datasets, followed by the implementation details of used prediction models, evaluation measures for prediction performance, and evaluation measures for calibration.

Hate Speech Datasets
In order to test the proposed methodology in the multilingual context, we applied the previously presented classification models to three different datasets.
1. English dataset 2 is extracted from hate speech and offensive language detection study of Davidson et al. [10]. The subset of data we used consists of 5,000 tweets. We took 1,430 tweets labeled as hate speech and randomly sampled 3,670 tweets from the collection of remaining 23,353 tweets. 2. Croatian dataset was collected by Styria media company within the EU Horizon 2020 project EMBEDDIA 3 . The texts were extracted from the database of user comments on the website of Večernji list 4 news portal. The original data set consists 9,646,634 comments described with 11 attributes from which we selected 8,422 comments, of which 50% are labelled as hate speech by human moderators, and the other half was randomly chosen from the non-problematic comments. 3. Slovenian data set is a result of the Slovene national project FRENK 5 .
The text dataset used in the experiment is a combination of two different studies of Facebook comments [29]. First group of comments were colected on LGBT homophobia topics while the second on anti-migrants posts. In our final data set we used all of the 2,182 hate speech comments and the same number of non-hate speech comments were randomly sampled.
The summary of data set sizes is available in Table 1.

Prediction models
We used three types of prediction models. As a baseline we used MCD LSTM networks [37], which include reliability information obtained with MCD. We compared that model with MCD AN and MCD BERT. As the input to MCD LSTM, we used pretrained word embeddings: sentence encoder for English [8] and fastText embeddings 6 for Slovenian and Croatian. For MCD AN we used Keras tokenizer 7 and for the MCD BERT we used BERT's own tokenizer.

Implementation details
We implemented the proposed MCD ANs in PyTorch library 8 . The main hyper-parameters of the architecture are the number of attention heads and the number of attention layers. The adaptive classification threshold (described in Section 4.2) is computed every time we evaluate the performance on the validation set. The Monte Carlo dropout in AN is used also in the prediction phase. When a network makes a prediction, we apply a deactivation operation across the architecture, which freezes everything but the dropout layers. In this way, we maintain the variance of predictions. Each final prediction thus consists of a set of results obtained by several forward passes. The complete source code of our approaches will be available with the final version of the paper.
Other parameters are set as follows. We use the Adamax optimizer [25], a variant of Adam based on infinity norm. Binary cross-entropy loss guides the training. In order to automatically stop training, we use the stopping step of 10 -if after 10 optimization steps the performance on the validation set is not improved, the training stops.
We explored the following hyperparameter tuning space: the validation percentage (size of validation set) was varied between 5% and 10%. The rationale for testing different percentages of validation set sizes is that the data considered is small, hence considering too high validation percentages could prevent the classifier from viewing crucial instances and thus reduce its final performance. However, given enough data the percentage should be as high as possible. Number of epochs was either 30 or 100, number of hidden layers and attention heads was 1 or 2. Maximum padding of the input sequences was either 48, 32 or 64. Learning rate was either 0.001 or 0.0005 and the adaptive threshold was either enabled or disabled.
MCD LSTM networks consist of an embedding layer, LSTM layer, and a fully connected layer within the word2Vec and ELMo embeddings. In order to obtain best architectures for the LSTM and MCD LSTM models, various number of units, batch sizes, dropout rates etc. were tested. For BERT, we used the BERT base model in English and multilingual variant, using HuggingFace code 9 .

Prediction Performance Evaluation Measures
Depending on the purpose of the prediction model, we might want to maximize different evaluation measures, such as classification accuracy, precision, recall or F 1 score. In hate speech detection, our interest is to avoid falsely accusing people of hate speech, therefore we try to maximise precision by measuring it on the validation set during training. On the other hand, this could influence other measures. With that aim, we alter the decision threshold to achieve good precision vs. accuracy balance. It the Figure 2 the accuracy-precision trade-off is presented.

Calibration Quality Measures
To measure the quality of computed calibration scores, we use the expected calibration error (ECE) [18]. ECE splits all n predictions into M equally spaced bins B 1 , B 2 , . . . , B M , that contain instances with prediction scores in the given bin. It sums the weighted differences between actual prediction accuracies and predicted scores over all the bins and normalizes the result with the number of instances n.
This measure produces lower scores for better calibrated models (lower calibration error).

Results
In this section, we present results of four evaluations: calibration of different prediction models, their prediction performance, reliability of BERT and MCD BERT, and visualization of uncertainty.  In the Tables 2, 3, and 4 calibration results for different calibration approaches of MCD AN are presented: no calibration, isotonic regression, and Platt's method, combined with the adaptive threshold or not. It can be observed that for all three languages both calibration methods improve the ECE score, and Platt's method seems to produce the best calibration scores. Adaptive threshold slightly improves the ECE score for the uncalibrated (raw) results. This is especially true for the Slovenian comments where the ECE score was reduced from the 0.794 to the 0.621. Nevertheless, we can conclude that calibration using adaptive threshold heuristics is beneficial but cannot be compared with the improvements brought by proper calibration techniques.

Calibration of BAN and BERT
In order to compare the calibration results for MCD BERT with different MCD AN architectures, we plotted their ECE scores in Figure 4. It can be observed that calibration methods substantially improve the MCD BAN calibration; however, the MCD BERT model is even better calibrated.

Prediction Performance
The results that compare four different models are presented in Table 5. MCD BERT provides the best results for all three languages. BERT is pre-trained on large amount of text and that makes a significant difference compared to LSTM and MCD AN. As the MCD BERT is slightly better that BERT, we can conclude for the instances where BERT is unsure, multiple predictions reduce the variance and can influence decision in the right direction. LSTM seems to be more consistent than BAN (see the F 1 scores). We attribute this to the larger number of parameters in BAN and insufficient number of instances, which for BERT models is compensated this with pre-training. Table 5: Comparison of predictive models. We present average classification accuracy and F 1 score with their standard deviations, computed using 5-fold cross-validation. All the results are expressed in percentages and the best accuracy for each language is typeset in bold.

English Tweets
Croatian

Reliability of BERT and MCD BERT
As BERT is already well calibrated, we want to test if the proposed MCD extension is useful beyond the advantage in predictive performance. With intention to test if MCD BERT could recognize problematic predictions, we investigated instances from the testing sets. For each classifier (BERT and MCD BERT), we split tested instances into two groups, confused and certain, based on the scores provided by classifiers. As BERT and MCD BERT generally provide predictions close to 0 or 1, we used the following simple criterion for MCD BERT: the tested instance is declared confused if the variance computed from the 1000 predictions is greater then 0.1 otherwise it was declared certain. As BERT returns a single prediction, we have chosen the same number of confused instances as for MCD BERT, with the criterion that the predicted score is farthest away from 0 or 1, i.e. the least certain. In Table 6, we show results for each of the three languages, separately for correctly classified instances and for incorrectly classified instances. The ratio of incorrectly to correctly classified instances is significantly different between certain and confused groups, and much larger for MCD BERT than for BERT for English and Croatian dataset, while being similar for Slovene. Also, in order to statistically test the whether change from the BERT to MCD BERT would influence the number of correctly detected confused tweets the Chi-square test was applied. The Chi-square test for English MCD BERT returns very significant p-value= 2.2e-16 while the Chi-square test for BERT model results was found to be less significant with p-value = 1.384e-11. For CRO BERT the Chi-square test was not significant with p-value= 1. Thus, it is clear that for Croatian we can not classify tweets on certain and confused based just on the single prediction. On the other hand, for the CRO MCD BERT provide much better spit so the Chi-square test become significant with p-value= 8.348e-16. The p-values for the SLO BERT and SLO MCD BERT are 0.0037 and 0.0002 respectively. This indicates that MCD BERT is much better in detecting unreliable classification. For example, if we are faced with the reannotation task in order to improve the quality predictions, MCD BERT would choose much better borderline instances compared to BERT. Thus, the results indicate that MCD BERT provides better understanding of the how much we can trust the predictions compared to BERT.

Visualization of Uncertainty
Obtaining multiple predictions for a specific instance can improve understanding of the final prediction. The mean of the distribution is used to estimate the probability and the variance informs us about the spread and certainty of the prediction. We can inspect the actual distribution of prediction scores with histogram plots as demonstrated on Figure 5 for a few correctly classified instances from English dataset and on Figure 6 for a few misclassified instances.
Histograms presented in the Figures 5 and 6 visually display the prediction certainty for specific instances. MCD BERT's predictions are always close to 0 or 1, especially when model seems to be sure. MCD AN with 10% dropout provides similar spread of values as BERT which is expected as BERT architecture is pre-trained having 10% dropout. On the other hand, 30 % of dropout in MCD AN, results in a much larger spread of predictions in instances where BAN is uncertain. Note that the results of MCD BERT are concentrated in a much narrower interval compared to MCD LSTM and MCD AN.  While visualizations of prediction distributions for each instance separately as in Figures 5 and 6 are useful in assessment of individual prediction reliability, we want to show also aggregated results for multiple instances to understand dependencies with other instances. Following [37], we visualize the embeddings of the prediction distributions.  summarized as follows. First, a large number (1000 in our experiments) of the predictions are obtained for each instance. The space of such prediction distributions across individual instances is embedded into two dimensions by using the Uniform Manifold Projections method [34]. In this way, we obtain a two dimensional space corresponding to the initial 1000 dimensional space of prediction distributions. Next, we use Gaussian kernel estimation to identify equivalent regions connected with closed curves. Finally, the shapes and sizes of individual predictions are chosen based on their classification error and certainty of predictions. The goal of using this visualization is to discover potentially larger structures within the space of probability distributions, possibly Fig. 7: Visualization of the 100 test tweets in two dimensions. Tweets that are found to be certain are colored in blue (0) while tweets that are confused in orange (1). It can be observed that uncertain tweets get clustered. Fig. 8: Visualization of the outcome probability space for 100 tweets from the test set. The test tweets are colored in the green, yellow and red depending to which interval belongs the mean probability of the 1000 predictions. It can be observed that the predictions with very high confidence form an isolated part of the probability space.
offering insights into drawbacks and limitations of the analyzed classifier. The results of this visualization are shown in Figures 7 and 8. In Figure 7, the plot displays the position of certain and uncertain test instances in the latent space while in Figure 8 the differences based on the mean probability.
In both Figures7 and 8 the probability space is distinctly separated into two components, indicating that there are predictions for which the neural network is certain (and were correctly classified); however, for some predictions, especially of instances that are not hate speech, the network is less certain (albeit still correct). The two examples demonstrate how the space of probabilities is split into distinct components once the neural network is trained. The visualizations also indicates that some of the instances are more problematic than others, allowing their identification and potentially facilitating the debugging process for a developer (e.g., inspection of convergence).

Conclusions and Future Work
In practical setting, the automatic detection of hate speech not only requires high precision but also prediction uncertainty estimates. In times when social networks suffer from high amount of offensive messages, wrong classifications can damage the minorities, lower the level of democratic debate but also damage the freedom of speech. In technological terms, natural language approaches are witnessing a switch from recurrent neural networks with pretrained word embeddings (such as LSTM with fastText) to large pretrained transformer model (such as BERT). Monte Carlo dropout in the attention layers of transformer neural networks turned out to be a useful tool for prediction uncertainty estimation in the hate speech detection task.
We investigated possibility of obtaining uncertainty assessment for predictions with transformer neural networks. To better understand uncertainty that can be obtained for the BERT model, we investigated the part of its architecture, called attention layers, and proposed the use of Monte Carlo dropout in these layers. We empirically investigated calibration and predictive performance of MCD AN and MCD BERT. The results show that MCD BERT is much better calibrated than MCD AN. Its pre-training gathers large amount of information about language that can be successfully exploited in fine-tuning to the specific problem. MCD ANs, trained from scratch, are not competitive with this. Multiple predictions obtained from dropout layers of BERT (i.e., MCD BERT) turn out to be very useful. On one hand they produce better predictive performance compared to BERT, and on the other better separation between trusted and dubious predictions. This information can significantly reduce the amount of work in task where questionable cases are selected. Also, the proposed visualizations of reliability enhanced classifications support detection of less certain automatic decisions and help moderators or annotators focus on dubious cases. The visualizations can show either uncertainty in classification of individual instances or relationship among them.
Our further work will adapt other Bayesian approaches to transformer networks. In hate speech detection, we will investigate other languages. Besides hate speech, we plan to apply reliability enhanced classification to other domains, such as medical imaging and machine translation. One of tasks where Bayesian text classification can be particularly useful is semi-supervised learning where we iteratively expand an initial small set of manually labelled instances with the most reliably classified instances. Data re-annotation is another example where reliability scores can be of significant help. An initial pilot study of Croatian comment filtering showed that human annotators decide mostly based on the observed keywords and lack time to find more subtle expressions of offensive contents. Such decisions result in low quality of the resulting datasets and demand their reannotation. Using the proposed MCD BERT reliability scores can significantly reduce the amount of reannotation and focus the work on truly borderline cases where the prediction models may err.

Compliance with Ethical Standards
Conflict of Interest The authors declare that they have no conflict of interest.
Ethical Approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent Informed consent was not required as no humans or animals were involved.