Introduction

Nowadays, Social Media (SM) are an essential part of human life. They are used for business, entertainment, communication with friends and colleagues, presenting skills, knowledge, and abilities, and acquiring new ones. SM are closely related to all human activities. They provide a means of communication between sports players and the broadest part of their fan base. Through contact with their fans, athletes can gain extra incentives for further effort and better results. On the other hand, SM can be used by ill-intentioned people to destabilize and frustrate athletes in their efforts. It is crucial to point out that, on SM, sports players can face different kinds of aggression, such as flaming, harassment, hate, and trolling, as well as other kinds of hate speech such as insults, quarrels, swearing, invective, obscene, obscure, offensive, profane, and toxic speech, up to threats. This kind of harmful communication can negatively impact players, upset them, and lead to negative feelings or even real-life violence.

Objectionable Content (OC) is a term introduced in the USA in 1996 by the Communications Decency Act [1]. It denotes “sexual, homicide, and violent text, pornography content, drugs, weapons, gambling, violence, hatred, and bullying and hate speech” [2, 3]. Facebook [4] also uses the term to designate different aggressor-victim relationships that appear via SM and social networks (SN). Most SN, like Facebook, Twitter, Instagram, and YouTube, have strictly defined prevention policies and mechanisms for removing all kinds of OC. However, these phenomena are linguistically diverse and geographically widespread. One kind of OC is Hate Speech (HS). According to [5], HS is considered “a broad umbrella term for numerous kinds of insulting user-created content, as the most frequently used expression for this phenomenon, and is even a legal term in several countries.”

Therefore, building, using, and continually improving methods for automatically monitoring content on SN and for detecting, predicting, and mitigating the effects of OC can intensify the community’s fight against it. It is essential to address this problem for languages in which it has not yet been thoroughly studied, such as Serbian.

The paper is organized as follows. After the introduction, the problem definition is outlined in Sect. "Problem definition". Related work is reviewed in Sect. "Related work". Section "Data preparation" describes the datasets used in the study, including the data sources and the preprocessing steps: how HS examples were gathered from SM and how features were created for a method of automatic recognition of HS in the sports domain. In Sect. "Experimental setup—automatic recognition of hate speech in the sports domain", our method for the automatic recognition of HS in the sports domain is presented. Section "Results and discussion" presents the experimental results and discusses the model’s performance on different datasets. Section "Conclusion and future work" gives conclusions, summarizes the main contributions to the field of study, and highlights directions for future research.

Problem definition

Fortuna [6] highlighted definitions of HS adopted by major international regulatory bodies, institutions, and significant SM, as well as descriptions adopted by the scientific community. According to the European Union (EU) Code of conduct [7], HS is “all conduct publicly inciting to violence or hatred directed against a group of persons or a member of such a group defined by reference to race, colour, religion, descent or national or ethnic origin.”

There are more definitions of HS in the field of study which deals with automatic detection methods. According to [2], “The automatic identification of hate speech has been mostly formulated as a natural language processing problem.” So far, the scientific community has been using automatic detection methods to identify HS on online social platforms such as Facebook and MySpace [8,9,10,11], Twitter, Tumblr [8, 12,13,14,15], YouTube, Instagram, Whisper [16,17,18,19,20,21,22,23,24,25,26,27], Reddit, Slashdot [11, 28,29,30,31], or Pinterest [16]. This paper focuses specifically on HS expressed through text on SM platforms. In computational linguistics, it is known as online hate [32], cyber hate [33], or HS [34].

The primary objective of this study is to examine a particular form of HS that pertains to the sports domain and is expressed in the Serbian language. Namely, sports players and their fans are connected in many ways. According to Wasserman [35], “fans become participants, seeking to help their teams win through their cheering rituals and songs and cheers.” But Wasserman also asks: “how far does the right to engage in this expression go?”, “will fans be able to cheer and jeer using profanity?”, “Can cheering rely on sexual innuendo?”. US law grants freedom of speech and generally permits HS. At the same time, legislation in Europe tends to protect decency in communication and suppress violence, hatred, and aggression toward persons or groups defined by race, religion, ethnicity, nationality, sexual orientation, intelligence, disability, and other differences among people. The conflict between the protection of freedom of speech and the protection of a person from abuse, harassment, or threat makes detecting these phenomena challenging.

As SM platforms expand and the phenomenon becomes more varied, detecting and addressing this issue has become more complex. Therefore, building methods that generalize well across different domains and SM platforms is a central aim of the field.

The issue of expressing hatred related to sports has been studied for a long time [36,37,38], but there are few studies dealing with the recognition of HS in SM that is directed against athletes [39,40,41,42,43] and an insignificant number dealing with automatic recognition of HS in that domain [44,45,46,47,48]. They are primarily in English; therefore, insults, hatred, and even threats to athletes written in other languages cannot be straightforwardly recognized and removed from SM. We believe it is vital for sports science to be linked to other sciences (such as linguistics and computer science), which can help to successfully detect insults, hatred, and threats against all people in sports, regardless of language and cultural belonging.

There are many techniques and methods to automatically detect different kinds of HS in other languages [2, 32, 49]. However, they are all directed at vulnerable groups (race-related, ethnic, gender-related, refugees, people with disabilities, super-sized persons, and other vulnerable groups) [6, 50,51,52,53,54,55]. Different studies [36, 38, 56,57,58] conclude that athletes are a vulnerable group, too. However, as far as we know, athletes still lack the kind of comprehensive automated HS detection developed for the previously mentioned groups. The aim is to explore whether broadly used, generalized HS recognition methods can be adjusted to particular (sports) cases, as initially studied, for example, by Toraman et al. [48]. Domain transfer has been successfully explored in the HS detection and recognition field [48, 59, 60].

This paper is among the first studies to explore the automatic recognition of HS in the sports domain on SM in languages other than English. To the authors’ knowledge, this may be the first such study for Serbian.

Related work

Many studies have been conducted on HS in different languages, with a particular emphasis on English. To the best of our knowledge, only a few studies have collected datasets from SM related to the sports domain to address the automated HS detection problem.

Pavlopoulos et al. [61] created a dataset of 1.6 million user comments from the Greek sports site Gazzetta. The dataset is publicly available, and the authors used it with Deep Learning (DL) methods to classify sports comments into accepted (not hate speech) or rejected (hate speech). The best result, measured by AUC (Area Under the ROC curve), was achieved by an RNN (Recurrent Neural Network), reaching AUC = 0.80 and AUC = 0.84 on two different datasets produced from the original one.

De Pelle and Moreira [62] collected a dataset of 1,250 randomly selected comments from the Globo news site on politics and sports news in Portuguese. Three annotators reviewed and marked each comment for the presence of categories such as ‘racism,’ ‘sexism,’ ‘homophobia,’ ‘xenophobia,’ ‘religious intolerance,’ and ‘cursing.’ A binary classifier distinguishing offensive from non-offensive comments achieved a best F1 of 0.80.

Toraman et al. [48] retrieved more than 200 thousand top-level English and Turkish tweets published in 2020 and 2021 from five hate domains—religion, gender, racism, politics, and sports, where each tweet can belong to a single domain. Twenty thousand tweets in English and Turkish were related to the sports domain.

Kapil and Ekbal [63] also considered this problem in English. They discussed how the internet and SM platforms had created numerous opportunities for people to voice their opinions and how these platforms have facilitated the dissemination of hate speech. They proposed a model trained on a large dataset collected from diverse sources, including online forums, blogs, and SM platforms. It achieved high accuracy on all tasks. The authors also comprehensively analyse the model’s performance and show that it outperforms several baseline models in terms of macro-F1 and weighted-F1. Their findings suggest that distinct datasets classified into multiple subclasses help one another in the classification process. However, rather than generating a new dataset and labelling it with additional classes (which may overlap with pre-existing ones), the authors recommend focusing on data classified into two primary classes, Offensive and Non-Offensive. Furthermore, Non-Offensive posts should be considered non-hate speech, while Offensive posts can be further studied and classified into additional subclasses according to their sentiment. We also employed this approach in our research.

As can be seen from the related work presented above, the HS detection problem in the sports domain is still an active area of research that has not been fully explored or given the attention it deserves, especially for languages other than English. In this paper, we focus on Serbian. We explore whether a DL model trained on a dataset gathered from different domains can be successfully applied to detect HS in the sports domain, that is, whether a generalized model can be applied to a specific case.

We achieved the following contributions, as presented in Fig. 1:

  • We constructed a digital lexicon of HS terms and phrases because there was no publicly available resource of this type for Serbian.

  • We crawled, refined, and formatted five datasets containing 180,785 comments. Three of them were manually annotated by 33 students, and the annotations were evaluated. The comments were published over a two-year period as reactions to news and sports news on web portals and YouTube channels.

  • Two datasets are labelled automatically using a HS lexicon and a keyword-based approach. The datasets are used to learn domain-agnostic and domain-specific word embeddings. Word embeddings are used as features for generating DL models. We explore if models trained based on domain-agnostic features can be used for HS classification in the specific (sports) domain.

Fig. 1

An overview diagram illustrating the approach adopted in this study

Data preparation

Hate speech lexicon

According to Mladenović et al. [64], “to generate valuable features for automatic Cyber aggression classifiers, it is necessary to include HS lexicons, psycho-linguistic resources, semantic networks, sentiment analysis lexicons, and tools.” Therefore, one of the first steps in creating an application for automatic HS recognition is to build a HS lexicon of terms and phrases commonly used in the natural language under study. HS lexicons are important resources in automatic HS detection tasks. According to [65], “a lexicon-based approach is effective in cross-domain classification.”

To induce a contemporary HS lexicon in Serbian, we drew on scientific papers in linguistics [66,67,68], scientific conference proceedings [69, 70], conference papers [71,72,73], and lexicons published in books [74]. In the proceedings edited by Marković [69], vulgarisms in the discourse of telephone conversations were analysed in [72], the formation of obscene words through suffixation was broadly explored in [73], and the generation of derivatives from obscene words was presented by Bogdanović [71]. Aleksić [66] explored obscene words in a novel by a contemporary Serbian writer for youth. The author extracted vulgar and slang terms and collocations related to obscene meaning, swearing, and cursing. Particularly significant research [67] was conducted on a collection of 2,130 nouns used in pejorative, contemptuous, mocking, or ironic contexts. The collection was created from five dictionaries (The Dictionary of Serbo-Croatian Literary and Vernacular Language of the Serbian Academy of Sciences and Arts, The Matica srpska six-volume Dictionary, The Matica srpska one-volume Dictionary, Two-Way Dictionary of Serbian Slang by Dragoslav Andrić, Contemporary Belgrade Slang Dictionary by Borivoje and Nataša Gerzić). Another source of our HS lexicon is Rečnik opscenih reči (The dictionary of obscene words) [74], a comprehensive dictionary in this field for Serbian. We manually selected 1,209 items from this dictionary.

Recent research [70] has shown that dialectal specificity must be taken into account for a better understanding of HS and for a more generalized lexicon of obscene, vulgar, and hateful words and phrases. At this stage of our research, we do not include language dialects; this is a limitation, but the present lexicon is only the first version. Finally, our HS lexicon has 4,705 entries representing lexemes, collocations, MWEs, and sentences (Footnote 1).

We used the HS lexicon and a keyword-based approach to automatically label the training datasets. A dataset entry is automatically labelled as hate speech if a HS lexicon entry is found in it.
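A minimal sketch of this keyword-based labelling step is given below. The lexicon entries, example comments, and function names are illustrative placeholders, not the actual resources.

```python
import re

# Hypothetical excerpt of the HS lexicon; the real lexicon has 4,705 entries
# covering lexemes, collocations, MWEs, and sentences.
hs_lexicon = ["placeholder_slur", "placeholder obscene phrase"]

# Pre-compile one pattern per lexicon entry, matching on word boundaries.
patterns = [re.compile(r"\b" + re.escape(entry) + r"\b", re.IGNORECASE)
            for entry in hs_lexicon]

def label_comment(comment: str) -> int:
    """Return 1 (hate speech) if any lexicon entry occurs in the comment, else 0."""
    return int(any(p.search(comment) for p in patterns))

# Usage: label every comment of a (preprocessed) training dataset.
comments = ["an ordinary comment", "a comment containing placeholder_slur"]
labels = [label_comment(c) for c in comments]  # -> [0, 1]
```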

Datasets

Nowadays, SM are making great efforts to suppress hate speech. Still, there are YouTube channels in Serbian where one can find hateful comments. Furthermore, such comments persist even on the two most prominent Serbian news portals, blic.rs and b92.net. Therefore, we decided to use them to prepare five datasets for modelling a binary HS classifier and for exploring the efficiency of transferring a model from the general domain to a specific domain (the sports domain). The datasets are composed of comments published over a two-year period, drawn from two main sources: (1) comments from popular entertainment and sports channels on YouTube and (2) comments on news and sports news articles on the portals blic.rs and b92.net [75].

Two datasets (one from YouTube and another from blic.rs and b92.net) are prepared to be used as training sets. The first consists of comments not tied to any particular subject or domain. These comments are considered domain-agnostic, meaning they cover a wide range of topics and are not limited to a specific subject area. The other consists of comments about sports (domain-specific). Three additional datasets are constructed in a similar manner, consisting of comments published as reactions to news articles and sports news on the portals blic.rs and b92.net. These datasets capture the comments specifically related to news and sports topics on these platforms. Dataset statistics are shown in detail in Table 1.

Table 1 Datasets statistics

We used STL4NLP (Footnote 2) [76], a web application for the manual semantic annotation of corpora in Serbian, to manually annotate the test datasets. The test datasets were divided into 29 parts containing approximately the same number of comments and automatically imported into STL4NLP. In this way, 29 semantic annotation tasks were created and annotated over one month. The annotation tasks were assigned to 33 students, and each of them annotated from three to seven parts. They used three tags {‘yes’, ‘no’, ‘neu’} (Fig. 2).

After annotation, we estimated the Inter-Annotator Agreement (IAA) to evaluate the quality of the students’ annotations. For that purpose, we used Krippendorff’s α (Kalpha) [77] as the statistical measure, because there were more than two annotators on each task and some students missed annotating some comments.

Fig. 2

Annotator Danica labelled the task named Govor mržnje Sport YT (Hate speech Sport YT)

Kalpha equals 1 in the case of complete agreement and 0 when agreement is no better than chance (negative values indicate systematic disagreement). The average IAA over all 29 annotation tasks is Kalpha = 0.58. This value is below the commonly accepted threshold (α ≥ 0.67), but Kalpha is stricter than other agreement measures. Therefore, we adopted all three datasets.
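For illustration, the agreement for one annotation task could be computed with the krippendorff PyPI package as sketched below; the annotation matrix is invented, and missing annotations are encoded as np.nan.

```python
import numpy as np
import krippendorff

# Rows = annotators, columns = comments; nominal labels:
# 'no' = 0, 'yes' = 1, 'neu' = 2; np.nan marks comments a student skipped.
reliability_data = np.array([
    [1, 0, 0, 2, np.nan],
    [1, 0, 1, 2, 0],
    [np.nan, 0, 0, 2, 0],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```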

Datasets cleaning

The initial stage of data cleaning and preprocessing involved eliminating irrelevant characters, such as special characters, symbols, and emoticons, from all comments. We utilized the Natural Language Toolkit (NLTK) [78] for that task. We then used our srNLP Python library, developed for Serbian, for sentence splitting, tokenization, stop-word removal, and transliteration from Cyrillic to Latin. Namely, the Serbian language has two official scripts, Cyrillic and Latin. Therefore, one of the vital preprocessing steps is transliteration into one of these scripts; in our work, texts written in Cyrillic are transliterated to the Latin script.

Given the particularities of the Serbian language, we also encountered challenges concerning the use of diacritics in written texts. Several letters in the Serbian Latin script carry diacritics (ć, č, đ, š, and ž). However, a conspicuous feature of contemporary written Serbian on SM is the omission of diacritics. Due to the lack of an evaluated tool for diacritics restoration, we removed all diacritics during preprocessing.

For further processing, a Serbian stop-word list was prepared. It originated from [79] and contains 1,267 words.
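A simplified sketch of these preprocessing steps follows. Since the srNLP library is not reproduced here, the transliteration table, diacritics mapping, and stop-word list below are illustrative stand-ins, not the actual resources.

```python
import re

# Illustrative (incomplete) Cyrillic-to-Latin transliteration table.
CYR2LAT = {"а": "a", "в": "v", "е": "e", "ј": "j", "о": "o",
           "љ": "lj", "њ": "nj", "ш": "š", "ж": "ž"}
# Diacritics are removed because they are often omitted in SM texts.
DIACRITICS = {"ć": "c", "č": "c", "đ": "dj", "š": "s", "ž": "z"}
STOP_WORDS = {"i", "u", "je", "se", "da"}  # stand-in for the 1,267-word list

def preprocess(text: str) -> list:
    text = text.lower()
    # Transliterate Cyrillic to Latin, character by character.
    text = "".join(CYR2LAT.get(ch, ch) for ch in text)
    # Remove diacritics.
    text = "".join(DIACRITICS.get(ch, ch) for ch in text)
    # Drop special characters, symbols, and emoticons.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenize and remove stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("Ово је žurka!"))  # -> ['ovo', 'zurka']
```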

After this thorough preprocessing step, we generated two vocabularies. The vocabulary for the “News portals blic.rs and b92.net” training dataset initially consisted of 167,114 tokens; however, we retained only tokens with a minimum occurrence of 2, resulting in a vocabulary of 57,535 tokens. Similarly, we generated the vocabulary for the “YouTube” training dataset: initially it contained 104,221 tokens, subsequently reduced to 36,319 tokens by the same removal rule. The preparation process of the datasets is depicted in Fig. 3.
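The vocabulary pruning described above amounts to a simple frequency filter, sketched below; the token lists and names are ours, and only the threshold of 2 follows the text.

```python
from collections import Counter

def build_vocabulary(tokenized_comments, min_count=2):
    """Keep only tokens occurring at least `min_count` times in the corpus."""
    counts = Counter(tok for comment in tokenized_comments for tok in comment)
    return {tok for tok, freq in counts.items() if freq >= min_count}

# Usage with the output of the preprocessing step sketched earlier.
corpus = [["utakmica", "gol"], ["gol", "sudija"], ["navijaci"]]
vocab = build_vocabulary(corpus)  # {'gol'}: only 'gol' occurs at least twice
```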

Fig. 3

Dataset preparation process

Word embeddings

Building a HS classifier with big data techniques is more manageable in English than in other languages. There are powerful resources that support fast and efficient development in this field for English: embeddings (powered by Word2Vec [80], GloVe [81], and fastText [82]), datasets (for example, the ClueWeb09 [83] and ClueWeb12 [84] corpora), and tools (NLTK, LIWC [85]). Therefore, every new resource, tool, or dataset created in another language can be valuable for further research in the field.

Recent research on HS [48, 59, 60] suggests that big data techniques can be effectively applied using word embeddings, which involve learning a representation of words in a corpus such that semantically similar words have similar representations. In the last few years, embeddings have been pushing the boundaries of text classifiers. In DL techniques, they are successfully used as text-derived features.

For English and a few other languages, pre-trained embeddings are available, which has the benefit of shortening development time. However, according to Pamungkas and Patti [86], who experimented with pre-trained models (GloVe, Word2Vec, and FastText), “the result is lower compared to a self-trained model based on the training set.” Also, Saleh et al. [87] found that “domain-specific word embeddings outperform domain-agnostic word embedding models because it is more knowledgeable about the hate domain, while domain-agnostic are trained on books and Wikipedia, which rarely have hate community context.”

For these reasons, we decided to create a word embedding representation using the domain-agnostic dataset, i.e., a word embedding derived from dataset 1 (Table 1).

We used the Continuous Bag-of-Words (CBOW) model in Gensim [88]. The corpus contained over a million tokens without stop words. Embedding parameters are shown in Table 2.
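Training such an embedding with Gensim could look roughly as follows; the vector size of 200 matches the embedding dimension used for the models described below, while the window size, worker count, and placeholder corpus are illustrative assumptions.

```python
from gensim.models import Word2Vec

# `corpus` is the list of tokenized, stop-word-free comments (placeholder here).
corpus = [["utakmica", "gol", "sudija"], ["navijaci", "pesma"]]

model = Word2Vec(
    sentences=corpus,
    sg=0,             # CBOW (sg=1 would select skip-gram)
    vector_size=200,  # embedding dimension used later in the DL models
    min_count=2,      # mirrors the vocabulary pruning threshold
    window=5,         # illustrative context window
    workers=4,
)
model.wv.save("domain_agnostic.wordvectors")  # reusable KeyedVectors file
```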

Table 2 Word embedding statistics

Experimental setup—automatic recognition of hate speech in the sports domain

This section describes the experimental setup for evaluating the performance of the HS detection models, including the evaluation metrics and the training/testing procedures.

There are different approaches to cross-domain HS classification. Using a HS detection model trained on one dataset on another dataset (domain) with the same class labels is called Transfer Learning in HS detection. However, cross-domain classification has not been applied to the sports domain, although athletes are targeted by HS on SN. Recently, Toraman et al. [48] studied cross-domain classification across different domains. They explored seven transformer-based (BERT-based) language models and two neural (CNN and LSTM) models in Turkish. They found that transformer-based language models outperform conventional ones in large-scale HS detection. However, their results have shown that “while sports can be recovered by other domains,” it “cannot generalise to other domains.”

The basic idea of this study is to explore whether a model trained on a dataset from one SM platform, not related to any specific domain, can be effective for the binary classification into HS and non-HS of test sets from the sports domain. Therefore, we compared the results of two models trained on domain-agnostic and domain-specific datasets. Another relevant fact is that HS datasets usually have a high or medium level of imbalance because HS does not occur very frequently on most SM in real situations. For example, in the research of Davidson et al. [89], a dataset comprising 25,000 tweets was manually annotated using crowdsourcing. The annotation showed, with a very high IAA, that only 5% of the tweets contained HS. Other studies showed similar HS distributions [90, 91]. Zhang et al. [92] created a training dataset of 300,000 tweets and found that HS accounts for under 1% (“extremely rare”). However, the effort to find such rare data is reasonable given how significant its negative real-world influence on targeted people and groups can be. For training and testing our networks, we used the Google Colab platform [93] with the TensorFlow [94] and Keras [95] libraries (Footnote 3). Models are trained with Bi-LSTM.

A Long Short-Term Memory network (LSTM) is an RNN that contains a repeating module. A special type of LSTM is the Bidirectional Long Short-Term Memory (Bi-LSTM) network. These networks are used in NLP tasks such as language translation, text classification, and speech recognition. An RNN learns sequence patterns and uses them to make predictions on sequential data. An LSTM learns order dependence and likewise predicts sequential data; it includes a repeating module with a more complex structure than the RNN repeating module. A Bi-LSTM predicts a sequence by processing sequence information in both directions, forward (from past to future) and backward (from future to past).

We trained two models with the same DL architecture (Fig. 4) and parameters. The training set “YouTube entertainment channels” (dataset 1 from Table 1) was used to obtain the first model. The training parameters are as follows: 12,898,945 trainable parameters, a vocabulary of 57,531 tokens, 5 epochs, and dimension 200. The LSTM output size is set to 64, and the dropout rate is 50%. The model reached a training accuracy of 91%. The “News portals training set” (dataset 4 from Table 1) was trained with the same parameters and achieved an accuracy of 93%. The first training set was also used for training with 20 epochs and the remaining parameters unchanged; it reached a training accuracy of 97%. For the input representation we used two types of embedding: a BoW model with count-value vectors and a one-hot encoded vector for each word.

The model is compiled with the Adam optimizer, and the loss is set to binary cross-entropy, which is recommended for binary classification models. The output layer consists of a single unit with a sigmoid activation function.
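A sketch of a comparable Keras model under the stated settings (vocabulary of 57,531 tokens, embedding dimension 200, LSTM output size 64, 50% dropout, sigmoid output, Adam optimizer, binary cross-entropy, 5 epochs) is given below; it illustrates the stated parameters with an assumed padding length rather than reproducing the full architecture of Fig. 4.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 57531  # tokens in the training vocabulary
EMBED_DIM = 200     # embedding dimension
MAX_LEN = 100       # assumed maximum comment length after padding

model = models.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM,
                     input_length=MAX_LEN),
    layers.Bidirectional(layers.LSTM(64)),  # Bi-LSTM with output size 64
    layers.Dropout(0.5),                    # 50% dropout
    layers.Dense(1, activation="sigmoid"),  # binary HS / non-HS output
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# x_train: integer-encoded, padded comments; y_train: 0 (non-HS) or 1 (HS).
# model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)
```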

Fig. 4

Bi-LSTM learning architecture

After that, both datasets were used to train models with one-hot embedding. Because of the platform limits, we changed the training parameters: for this case, the vocabulary is 5,000 tokens and the dimension is 50. The rest of the parameters were the same as for the BoW. Training accuracy reached 98.93%.
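The reduced one-hot configuration (vocabulary of 5,000 tokens, dimension 50) can be sketched with the Keras hashing-based one_hot helper; this is an assumed way of producing the encoding, shown only for illustration.

```python
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 5000  # reduced vocabulary due to platform limits
MAX_LEN = 100      # assumed padding length

comments = ["ovo je odlicna utakmica", "placeholder hateful comment"]
encoded = [one_hot(c, VOCAB_SIZE) for c in comments]  # hashed word indices
x = pad_sequences(encoded, maxlen=MAX_LEN, padding="post")
# x then feeds an Embedding layer with output_dim=50 in the same architecture.
```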

The performance evaluation measures are Accuracy, Precision, Recall, and F1. This study reports the measures for each class (HS and non-HS) separately because, on highly imbalanced datasets, a high accuracy value does not necessarily indicate good performance on the other evaluation measures. In that case, F1 is a more reliable measure, and Precision and Recall can also provide valuable insights.
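Per-class Precision, Recall, and F1 can be obtained directly from scikit-learn, as in the brief sketch below; the labels are placeholders, and the predictions are assumed to come from thresholding the sigmoid output at 0.5.

```python
from sklearn.metrics import classification_report

# y_true: gold labels from the manually annotated test set (0 = non-HS, 1 = HS).
# y_prob: sigmoid outputs of the trained Bi-LSTM on the test set (placeholders).
y_true = [0, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.3, 0.2]

y_pred = [int(p >= 0.5) for p in y_prob]  # threshold the sigmoid output
print(classification_report(y_true, y_pred,
                            target_names=["non-HS", "HS"], digits=2))
```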

Results and discussion

The test results are presented in Table 3. Two facts should be taken into account: the training datasets are highly unbalanced toward non-HS, and both training datasets are automatically labelled using the HS lexicon and a technique that detects HS lexicon entries in a training set’s entries. Both models, trained on the “YouTube entertainment channels” and “News portals sports comments” datasets, achieved high Precision for HS classification on the “YouTube sports channels” test dataset (emphasized values in Tables 3 and 4). Unlike the non-HS class, which achieved high Recall values on all test datasets, the HS class has low Recall values. Nevertheless, the results are promising because of the notably high Precision values for the HS class in the sports domain, considering the highly unbalanced nature of the training datasets, their automatic annotation, and the fact that one of the trained models was not in the domain of the test dataset.

The overall results are rather weak, but we did not expect better ones considering the facts mentioned above, the small number of epochs, and the shallow network. This study investigates, inter alia, whether it is worth continuing research on domain transfer under the given conditions of unbalanced, automatically annotated datasets. The results show that the Precision of predicting HS is better when YouTube is used as a source for the training data. The results also indicate what needs to be improved: the Recall has to be notably higher for both sports-domain test datasets, as a large number of HS comments remain undetected. We conclude that the embedding has to be changed and that the network architecture and the automatic annotation have to be improved.

Table 3 Testing results based on the accuracy, precision, recall, and F1 on non-HS and HS classes

Table 4 shows whether training with more epochs can improve the overall BoW results. Although the accuracy was slightly enhanced, HS detection was not improved. However, the change of embedding improved all evaluation measures for the HS class.

Table 4 Testing Results on non-HS and HS classes using the YouTube training model (dataset 1) trained with different epochs and tested on the YouTube sports comments test dataset (dataset 3)

Conclusion and future work

Considering the popularity of SM and the accompanying opportunity to express an opinion on any subject freely, HS consequently emerges in different domains. As this topic has been examined thoroughly from many points of view in general, in this paper we have discussed the importance of developing datasets, a HS lexicon, and appropriate machine learning models in order to effectively apply automatic HS recognition methods to content published in SM related to one specific domain, the sports domain. Since most research deals with English, we focused on developing resources for Serbian. We constructed a digital lexicon of HS terms and phrases. We designed a dataset composed of comments on sports news on portals and YouTube sports channels, manually annotated for training and test purposes in our DL model. Then we trained two word embeddings, domain-agnostic and domain-specific (regarding sports). Word embeddings are known as valuable features for generating DL models. This paper explores whether models trained on domain-agnostic features can be used for HS classification in a specific domain. We pointed out that players have not been seen as a vulnerable group regarding hate speech. However, HS on SM can have a significant impact on players and their lives. Therefore, they must also be treated as a hate speech-targeted group. In future work, we will refine the classifier results and extend the presented datasets and resources, as well as their usage with other models.