SOLD: Sinhala Offensive Language Dataset

The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, the research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive or not offensive at both sentence-level and token-level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.


Introduction
Offensive posts on social media platforms result in a number of undesired consequences for users. They have been investigated as triggers of suicide attempts and ideation [1,2], and of mental health conditions such as depression [3,4]. Content moderation in online platforms is often applied to mitigate these serious repercussions. As human moderators cannot cope with the volume of posts online, there is a need for automatic systems that can assist them [5]. Social media platforms have been investing heavily in developing these systems, and several studies in NLP have been conducted to tackle this problem [6]. Most studies proposed a supervised approach to detect offensive content automatically using various models ranging from traditional ML approaches to more recent neural-based methods trained on language-specific annotated data [7].
Considering the importance of annotated data, there is a growing interest in the NLP community in developing datasets that can be used to train ML models to detect offensive language. These datasets focus on various kinds of offensive content such as abuse [5,8,9], aggression [10,11], cyber-bullying [12,13], toxicity [14,15], and hate speech [16,17]. Furthermore, competitive shared tasks such as OffensEval [18,19], TRAC [20,21], HASOC [22,23], HatEval [24] and AbuseEval [25] have created various benchmark datasets on the topic. Apart from a few notable exceptions, the majority of these datasets are built on English [26] and other high-resource languages such as Arabic [27,28], Danish [29], Dutch [30] and French [31]. However, offensive language in social media is not limited to specific languages. Most popular social media platforms, such as Twitter and Facebook, are highly multilingual, as users express themselves in their mother tongue [32]. There is a considerable urgency to address offensive speech in different languages, but the lack of annotated datasets limits offensive language identification in low-resource languages.
In this paper, we revisit the task of offensive language identification for low-resource languages. Our work focuses on Sinhala, an Indo-Aryan language spoken by over 17 million people in Sri Lanka. Sinhala is one of the two official languages in Sri Lanka. Most of the people who speak Sinhala are the Sinhalese people of Sri Lanka, who make up the largest ethnic group on the island. Even though Sinhala is spoken by a large population, it is relatively low-resourced compared to other languages spoken in the region. According to Kepios analysis 1, the number of social media users in Sri Lanka at the start of 2022 was equivalent to 38.1% of the total population, with users increasing by 300,000 between 2021 and 2022. Despite this growth, the spread of offensive posts on social media platforms is still a huge and largely unaddressed concern in Sri Lanka. In 2019, after the Easter bombings that targeted Christian churches 2, the government had to temporarily block all social media on the island to curtail the spread of hate speech against Muslims. Similarly, both in 2019 and 2022, the government again blocked all social media platforms on the island to control offensive speech against the government 3. These widespread social media bans not only violate rights to free speech but also limit the general public's access to authorities and health services in dire situations. Therefore, a system that can detect offensive posts and help content moderators to remove them is paramount in Sri Lanka, and we believe the datasets and research presented in this paper will be the first step toward this goal.
We collect and annotate data from Twitter to create the largest Sinhala offensive language identification dataset to date, SOLD. We first annotate the tweets at the sentence level for offensive/not-offensive labels. Most offensive language identification datasets follow a similar approach and classify whole posts. However, identifying the specific tokens that make a text offensive can assist human moderators and contribute to building more explainable models for offensive language identification. Explainable ML is a widely discussed topic in the NLP community [33]. Several English offensive language datasets have been annotated at token-level [34,35] to support explainability. Following this, we annotate SOLD both at the post and at the token-level. If a text is offensive, we label each token based on its contribution to the overall offensiveness at the sentence level. If the token adds to the offensiveness, it is annotated as offensive; otherwise, it is marked as not offensive. As far as we know, SOLD is the first non-English offensive language detection dataset with explainable tokens.
Data scarcity is a major challenge in building ML models for low-resource languages like Sinhala [36]. In this paper, we explore two approaches to overcome data scarcity.
1. We perform transfer learning. We draw inspiration from recent work that applied cross-lingual models for low-resource offensive language identification [7] and adapt it to Sinhala.
2. We perform semi-supervised data augmentation. Motivated by SOLID [37], the largest offensive language dataset available for English, we propose a similar semi-supervised data augmentation approach for Sinhala. We collect more than 145,000 Sinhala tweets and annotate them using a semi-supervised approach. We release the resource as SemiSOLD and use it to improve Sinhala offensive language detection results. As far as we know, SemiSOLD is the largest non-English offensive language online dataset annotated in a semi-supervised manner.
We believe that the findings of these two approaches will benefit many low-resource languages.
We summarise our contributions in this paper as follows:
1. We release SOLD 4, the largest annotated Sinhala Offensive Language Dataset to date. SOLD contains 10,000 annotated tweets for offensive language identification at sentence-level and token-level.
2. We experiment with several machine learning models, including state-of-the-art transformer models, to identify offensive language at sentence-level and token-level. To the best of our knowledge, the identification of offensive language at both sentence-level and token-level has not been attempted on Sinhala.
3. We explore offensive language identification with cross-lingual embeddings and transfer learning. We take advantage of existing data in high-resource languages such as English to project predictions to Sinhala. We show that transfer learning can improve the results in Sinhala, which could benefit many low-resource languages.
4. We investigate semi-supervised data augmentation. We create SemiSOLD, a larger semi-supervised dataset with more than 145,000 instances for Sinhala. We use multiple machine learning models trained on the annotated training set and combine the scores following a methodology similar to that described in [37]. We show that this semi-supervised dataset can be used to augment the training set, which improves the results of machine learning models.
5. Finally, we demonstrate the explainability of the sentence-level offensive language identification models in Sinhala using token-level annotations in SOLD. We experiment with how transfer learning and semi-supervised data augmentation affect the explainability of the models. To the best of our knowledge, the explainability of offensive language models has not been explored in low-resource languages.
With the resources released in this paper, we aim to answer the following research questions:
• RQ1-Performance: How do state-of-the-art machine learning models perform in Sinhala offensive language identification at sentence-level and token-level?
• RQ2-Data scarcity: Our second research question addresses data scarcity, a known challenge for low-resource NLP. We divide it into two parts as follows:
- RQ2.1: Do available resources from resource-rich languages, combined with transfer-learning techniques, aid the detection of offensive language in Sinhala at sentence-level and token-level?
- RQ2.2: Can semi-supervised data augmentation improve the results for Sinhala offensive language identification at sentence-level?
• RQ3-Explainability: Our third research question addresses the explainability of the machine learning models, a topic of interest for the NLP community, yet not explored in low-resource languages. We divide it into three parts as follows:
- RQ3.1: How to demonstrate explainability of the sentence-level offensive language identification using token-level annotations in Sinhala?
- RQ3.2: Does transfer-learning from resource-rich languages affect the explainability of the offensive language identification models?
- RQ3.3: Can semi-supervised data augmentation improve the explainability?
Finally, the development of SOLD and SemiSOLD opens exciting new avenues for research in Sinhala offensive language identification. We train a number of state-of-the-art computational models on this dataset and evaluate the results in detail, making this paper the first comprehensive evaluation of Sinhala offensive language online. The rest of the paper is organised as follows. Section 2 highlights the recent research in offensive language identification. Section 3 describes the data collection, annotation process and statistical analysis of the dataset. Section 5 presents the experiments at both sentence-level and token-level. Sections 6 and 7 show the various transfer learning and semi-supervised data augmentation techniques we employed, respectively. Section 8 summarises the conclusions of this study, revisiting the above RQs.

Related Work
The problem of offensive content online continues to attract attention within the AI and NLP communities. In recent studies, researchers have created datasets and trained various systems to identify offensive content in social media. Popular international competitions on the topic, such as OffensEval [18,19], TRAC [20,21], HASOC [22,23], HatEval [24], and AbuseEval [25], have been organised at conferences. These competitions attracted many participants and provided them with important benchmark datasets on which to train competitive systems [38,39].
In terms of languages, due to the availability of annotated datasets, the vast majority of studies in offensive language identification use English [40,41] and other high-resource languages such as Arabic [27,28], Dutch [30], French [31], German [42], Greek [43], Italian [44], Portuguese [45], Korean [46], Slovene [47], Spanish [48] and Turkish [49]. More recently, several offensive language online datasets have been annotated on low-resource languages such as Bengali [50], Marathi [36,51], Nepali [52], Tamazight [53], and Urdu [54]. These datasets have been annotated on coarse-grained labels such as offensive/not offensive and hate speech/non-hate speech. Some of these datasets have even been annotated on fine-grained labels. For example, offensive tweets in Urdu [54] have been further annotated as abusive, sexist, religious hate and profane, while the offensive tweets in Marathi [36] have been further annotated into targeted and untargeted offence. For Sinhala, too, there is a hate speech detection dataset [55] where Facebook posts have been annotated with three labels: hate, offensive and neutral speech. However, the dataset is limited in size as it contains only 3,000 posts, and the dataset is not publicly available. Another related Sinhala dataset for offensive language identification is Sinhala-CMCS [56], where 10,000 social media comments have been annotated for five classification tasks: sentiment analysis, humour detection, hate speech detection, language identification, and aspect identification. However, the dataset is based on Sinhala-English code-mixed texts. With the development of keyboards that support Sinhala script, such as Helakuru 5, there is an increasing number of social media users who use Sinhala script in their conversations. Therefore, a Sinhala offensive language identification dataset with Sinhala script is a research gap we address in this paper.
All the datasets we mentioned before are sentence-level offensive language identification datasets where the whole sentence is given a single label. While sentence-level offensive language datasets have been popular in the community, identifying the specific tokens that make a text offensive can be useful in many applications [57]. Furthermore, token-level annotations can be used to improve the explainability of the sentence-level models [58,59]. As a result, detecting tokens instead of entire posts has been studied in many domains, including propaganda detection [60] and translation error detection [61]. In the offensive language domain too, two datasets have been released with explainable token-level labels: HateXplain [34] and TSD [35]. Both of these datasets have sentence-level labels together with token-level labels. The TSD dataset was released for SemEval 2021 Task 5 [62] 6. While token-level offensive language identification is an important research area, as far as we know, no non-English datasets have been annotated at the token-level. With SOLD, we hope to address this gap with token-level annotations, contributing the first explainable non-English offensive language identification dataset.
In machine learning approaches, sentence-level offensive language identification has often been considered a text classification task [63,64]. Early approaches utilised classical machine learning classifiers such as SVMs with feature engineering [65] to perform sentence-level offensive language identification. With the introduction of word embeddings [66,67], different neural network architectures were used to perform offensive language identification [68]. These architectures include long short-term memory networks [69,70], convolutional neural networks [71,72], capsule networks [73,74] and graph convolutional networks [75]. With the recent development of large pre-trained transformer models such as BERT [76] and XLNET [77], several studies have explored the use of general pre-trained transformers by fine-tuning them in sentence-level offensive language tasks [39,78]. These approaches have provided excellent results and outperformed previous architectures in many datasets [38,79,80]. Going beyond fine-tuning, recent approaches such as fBERT [81] and HateBERT [82] have trained domain-specific transformer models on offensive language corpora, which have provided state-of-the-art results in many benchmarks. Finally, multilingual pre-trained transformer models such as mBERT [76] and XLM-RoBERTa [83] have enabled cross-lingual transfer learning, which makes it possible to leverage available English resources to make predictions in languages with fewer resources, helping to cope with data scarcity in low-resource languages [7,32,84].
Token-level offensive language identification has been commonly addressed as a token classification task where a machine learning model predicts whether each token is offensive or not [85]. Besides machine learning models, researchers have explored lexicon-based approaches too [86,87]. Three kinds of lexicon-based methods have been used in the past:
1. The lexicon was handcrafted by domain experts and was simply employed as a list of toxic words for lookup operations [87].
2. The lexicon was compiled using the set of tokens labelled as positive (offensive, toxic etc.) in the training set, and it was used as a lookup table [88].
3. Supervised lexicons were built with statistical analysis of the occurrences of tokens in a training set solely annotated at the sentence-level [89].
While lexicon-based approaches provide simple solutions, they are usually outperformed by machine learning approaches [86]. Therefore, they have been used merely as baselines. Many deep learning architectures have been explored at the token-level too. Long short-term memory networks [57,90] and convolutional neural networks [91,92] have been popular among them. Similar to sentence-level offensive language detection, pre-trained transformer models such as BERT [76] and XLNET [77] have provided state-of-the-art results at the token-level. These approaches either use the default token classification architecture in transformers [93,94] or use a conditional random field layer [95] on top of the transformer model [96,97]. Based on this supervised learning paradigm, several open-source frameworks such as MUDES [93] have been released to perform token-level offensive language identification.
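The second lexicon-based strategy, a lookup table compiled from tokens labelled offensive in the training set, can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the implementation from the cited work: the function names `build_lexicon` and `predict_tokens` and the toy English data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def build_lexicon(training_data: Iterable[Tuple[List[str], List[int]]]) -> Set[str]:
    """Collect every token labelled offensive (1) in the training set."""
    lexicon: Set[str] = set()
    for tokens, labels in training_data:
        for token, label in zip(tokens, labels):
            if label == 1:
                lexicon.add(token.lower())
    return lexicon


def predict_tokens(tokens: List[str], lexicon: Set[str]) -> List[int]:
    """Label a token offensive (1) if it appears in the lexicon, else 0."""
    return [1 if token.lower() in lexicon else 0 for token in tokens]


# Toy English data standing in for annotated tweets (hypothetical).
train = [
    (["what", "a", "fool"], [0, 0, 1]),
    (["have", "a", "nice", "day"], [0, 0, 0, 0]),
]
lexicon = build_lexicon(train)
predictions = predict_tokens(["you", "fool"], lexicon)
```

Because the lookup ignores context, a token is flagged wherever it occurs, which is one reason such lexicons tend to serve as baselines rather than final systems.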
In addition to supervised approaches, researchers have explored weakly supervised approaches in token-level offensive language identification [34,35], as it can be seen as a case of rationale extraction [98,99]. These approaches use an attentional binary classifier to predict the sentence label and then invoke its attention at inference time to obtain offensive tokens as in rationale extraction. This allows leveraging existing training datasets that provide gold labels at the sentence-level without providing gold labels at the token-level. In recent years, researchers have explored various techniques such as attention scores of a long short-term memory classifier [90], a long short-term memory classifier with a token-masking approach [89], SHAP [100] with a sentence-level fine-tuned transformer model [90] and LIME [58] combined with a sentence-level classifier [101]. All the approaches mentioned above used a threshold to turn the tokens' explanation scores (e.g., attention or LIME scores) into binary decisions (offensive/not-offensive tokens) [34,102]. Although token classification approaches performed better overall, these weakly supervised approaches have performed surprisingly well, too, despite having been trained on data without token-level annotations [34,35]. They have further contributed to explainable machine learning in offensive language identification.
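The thresholding step shared by these weakly supervised methods is simple to illustrate. The sketch below assumes per-token explanation scores are already available (for example from attention weights or LIME); the scores, the threshold value of 0.5, and the function name `scores_to_labels` are illustrative assumptions rather than values from the cited papers.

```python
from typing import List


def scores_to_labels(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Turn per-token explanation scores into binary decisions (1 = offensive)."""
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical explanation scores for a four-token post.
tokens = ["@USER", "what", "a", "fool"]
scores = [0.05, 0.10, 0.08, 0.77]

labels = scores_to_labels(scores, threshold=0.5)
offensive_tokens = [tok for tok, lab in zip(tokens, labels) if lab == 1]
```

In practice the threshold would be tuned on held-out data, since attention and LIME scores are on different scales.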
All the token-level methods mentioned above have been evaluated only on English data. With SOLD, we fill this gap by evaluating how these token-level offensive language detection methods perform in a low-resource language setting. Furthermore, due to the lack of suitable datasets, techniques we observed at the sentence-level, such as cross-lingual transfer learning and data augmentation, have not been explored widely at the token-level. In this paper, we will analyse the effect of transfer learning and data augmentation at the token-level for the first time.

Data Collection and Annotation
In the following subsections, we describe the data collection and annotation process of SOLD.

Data Collection
We retrieved the instances in SOLD from Twitter using its API 7 and the Tweepy Python library 8. We collect data by using predefined keywords, which is a common method in offensive language detection dataset construction [103,104]. As keywords, we use words that are often included in offensive tweets such as "you" (තෝ, උඹ) and "go" (පලයන්, පල). We also include anti-government (@NewsfirstSL) and pro-government (@adaderanasin) news accounts. The complete list of keywords that were used to collect SOLD is shown in Table 2. Sinhala is written in three ways on social media: (a) Sinhala written in Sinhala script, (b) Sinhala written in Roman script, based on pronunciation, and (c) mixed-script text that contains both Sinhala and Roman scripts. Since our goal is to construct a Sinhala offensive language identification dataset in Sinhala script, we use the Twitter API's language filter to retain only tweets written in Sinhala script. Using these keywords and the filtering strategy, we collected 10,500 tweets.
We do not collect Twitter user IDs to remove the users' personally identifiable information.We replace mentions of the usernames in the tweet with @USER tokens and URLs with <URL> tokens to conceal private information using regular expressions.
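The placeholder substitution described above can be sketched with two regular expressions. The exact patterns used to build SOLD are not given in the text, so `MENTION_PATTERN`, `URL_PATTERN` and the function `anonymise` below are assumptions chosen for illustration.

```python
import re

# Assumed patterns: a mention is "@" followed by word characters,
# and a URL is anything starting with http:// or https:// up to whitespace.
MENTION_PATTERN = re.compile(r"@\w+")
URL_PATTERN = re.compile(r"https?://\S+")


def anonymise(text: str) -> str:
    """Replace URLs with <URL> and user mentions with @USER."""
    # URLs are replaced first so that an "@" inside a URL is not
    # mistaken for a user mention.
    text = URL_PATTERN.sub("<URL>", text)
    text = MENTION_PATTERN.sub("@USER", text)
    return text


example = anonymise("@some_user look at this https://example.com/page now")
```

Running the substitution in this order keeps the two placeholders from interfering with each other.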

Annotation Task Design
We use an annotation scheme split into two levels, deciding (a) the offensiveness of a tweet (sentence-level) and (b) the tokens that contribute to the offence at sentence-level (token-level), as shown in Figure 1. In the following sections, we provide the definitions of sentence-level and token-level offensive language identification and the guidelines for each annotation task.

Sentence-level Offensive Language
Our sentence-level offensive language detection follows level A in OLID [104]. We asked annotators to discriminate between the following types of tweets:
• Offensive (OFF): Posts containing any form of non-acceptable language (profanity) or a targeted offence, which can be veiled or direct. This includes insults, threats, and posts containing profane language or swear words.
• Not Offensive (NOT): Posts that do not contain offence or profanity.
Each tweet was annotated with one of the above labels, which we used as the labels in sentence-level offensive language identification. Having broad offensive and not-offensive labels provides us with the opportunity to perform transfer learning, as the majority of offensive language datasets, such as OLID [104] (English), OGDT [43] (Greek) and MOLD [51] (Marathi), use the same labels.

Token-level Offensive Language
To provide a human explanation of labelling, we collect rationales for the offensive language. Following HateXplain [34], we define a rationale as a specific text segment that justifies the human annotator's decision on the sentence-level label. Therefore, we ask the annotators to highlight particular tokens in a tweet that support their judgement about the sentence-level label (offensive, not offensive). Specifically, if a tweet is offensive, we guide the annotators to highlight tokens from the text that support the judgement, including non-verbal expressions such as emojis and morphemes that are used to convey the intention as well. These tokens can be used to train explainable models, as is shown in recent works [34,35,59].

Data Annotation
We follow prior work in the offensive language domain [26,104,105], and we annotate our data using crowd-sourcing. We used LightTag [106] 9, a text annotation platform, to annotate the tweets. As hate speech annotation can be influenced by the bias of the annotators [26], we collected judgements from as diverse a group of annotators as possible. For the annotation task, we recruited a team of ten annotators. All of them are native Sinhala speakers aged between 25 and 40, and each had at least a bachelor's degree.
First, we provided the annotators with several in-person and virtual training sessions on LightTag. Once they completed them successfully, we conducted a pilot annotation study followed by the main annotation task. In the pilot task, each annotator was provided with 500 randomly selected tweets from the collected dataset, which had a similar keyword distribution. The annotators were required to do sentence-level annotations, and token-level annotations if a tweet was annotated as offensive. To help them clearly understand the task, they were provided with multiple examples along with the annotation guidelines. The primary purpose of the pilot task was to collect feedback from the annotators to improve the annotation guidelines and the main annotation task. Furthermore, these annotations were used to ensure the balance between offensive and not-offensive classes.
After the pilot annotation, once we had improved the annotation guidelines, we started with the main annotation task. Since the pilot annotation showed that the offensive percentage of the dataset falls within our requirements, we did not collect further tweets or keywords. The main annotation task consisted of 10,000 tweets that were not part of the pilot task. Each tweet was annotated by three annotators. To reduce the bias, we limit the maximum amount of annotation per person to 10% of the total annotations. Figure 2a shows the pairwise Fleiss' Kappa scores for each annotator in the main annotation task. As can be seen, the majority of the agreements fall between 0.7 and 0.8, indicating high agreement at the sentence-level. For the token-level, following the TSD dataset [62], we computed the pairwise Kappa by using character offsets. Figure 2b shows the calculated scores. As can be seen, the majority of the agreements fall between 0.6 and 0.7. While the inter-annotator agreement is lower than at the sentence-level, it is comparable to similar token-level datasets such as TSD, where the mean pairwise Kappa was 0.55. Therefore, we believe that this agreement is reasonably high, given the highly subjective nature of the token-level offensive language identification task.
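To make the character-offset agreement concrete, the sketch below computes a kappa score for one pair of annotators over per-character binary labels. The text does not spell out the exact computation, so using Cohen's kappa for each annotator pair is an assumption here, and `offsets_to_labels` and `cohen_kappa` are hypothetical helper names.

```python
from typing import List, Set


def offsets_to_labels(offsets: Set[int], text_length: int) -> List[int]:
    """Mark each character position as highlighted (1) or not (0)."""
    return [1 if i in offsets else 0 for i in range(text_length)]


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


text = "what a fool"
annotator_1 = offsets_to_labels({7, 8, 9, 10}, len(text))        # highlights "fool"
annotator_2 = offsets_to_labels({5, 6, 7, 8, 9, 10}, len(text))  # highlights "a fool"
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the two annotators agree on 9 of 11 characters, and the chance-corrected score is lower than that raw agreement, which illustrates why token-level kappa values tend to sit below sentence-level ones.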
To decide on the gold label, we apply majority voting. For sentence-level offensive language identification, the label chosen by at least two of the three annotators is selected as the gold label. Regarding offensive tokens, characters that at least two annotators marked as offensive are provided as the ground truth.
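The aggregation step can be sketched as follows. This is a minimal illustration that assumes "majority" means at least two of the three annotators, with token-level annotations stored as sets of character offsets; the names `majority_label` and `majority_offsets` are hypothetical.

```python
from collections import Counter
from typing import List, Set


def majority_label(labels: List[str]) -> str:
    """Return the most frequent sentence-level label (no ties with 3 binary votes)."""
    return Counter(labels).most_common(1)[0][0]


def majority_offsets(annotations: List[Set[int]], min_votes: int = 2) -> Set[int]:
    """Keep character offsets highlighted by at least `min_votes` annotators."""
    votes = Counter(offset for offsets in annotations for offset in offsets)
    return {offset for offset, count in votes.items() if count >= min_votes}


# Three annotators: two mark the tweet offensive, one does not.
sentence_gold = majority_label(["OFF", "OFF", "NOT"])
token_gold = majority_offsets([{7, 8, 9, 10}, {5, 6, 7, 8, 9, 10}, set()])
```

With three annotators and two labels a tie cannot occur at the sentence level, so no tie-breaking rule is needed in this setting.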

SOLD: Sinhala Offensive Language Dataset
Table 1 shows several examples from the dataset along with English translations. The final dataset contains 10,000 tweets, of which 4,191 tweets are annotated as offensive (41.9%). This is comparable to existing datasets for offensive language detection, where the number of offensive instances is much less than that of non-offensive instances. Furthermore, it is worth noticing that SOLD has a higher percentage of offensive instances compared to other low-resource datasets in the domain. For example, in RUHSOLD [54], only 24% of the Urdu tweets were considered offensive by the majority of the annotators, and in [49] only 19% of the Turkish tweets were labelled as offensive.
We divided the dataset into a training set and a testing set using a random split. The training set was used mainly to train the machine learning models, and the sole purpose of the testing set was to evaluate the trained machine learning models. Following the random split, 75% of the instances from the original dataset were assigned to the training set, and the rest of the instances were assigned to the testing set. The dataset is released as an open-access dataset in HuggingFace Datasets [107] 10. As can be seen in Figure 3, both training and testing sets have a similar distribution in the offensive and non-offensive classes.
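A 75/25 random split of this kind can be reproduced in a few lines. The seed value and the function name `random_split` below are illustrative assumptions; the text does not state which seed or tool was used for the released split.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    instances: Sequence[int], train_fraction: float = 0.75, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle the instances with a fixed seed and cut at `train_fraction`."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer ids stand in for the 10,000 annotated tweets.
train_set, test_set = random_split(range(10_000))
```

Fixing the seed makes the split repeatable, which matters when several models are compared on the same held-out set.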
We further analysed the length of the tweets, as the length can be a limitation in attention-based neural networks [77]. As shown in Figure 4, most tweets have between 0 and 20 tokens. Both the offensive class and the non-offensive class follow a similar pattern in the token distribution. Since the number of tokens per tweet is relatively low, attention-based neural networks can be used to model the task without truncating the texts.
Table 2 shows the keywords used to collect SOLD and the percentage of offensive tweets for each keyword in the training, testing and full datasets. As can be seen, these words are offensive depending on the context, as the majority of the offensive percentages are between 30% and 50%. Therefore, a rule-based approach that depends on keywords will not perform successfully on this dataset. In the next section, we explore machine learning models that take context into account in detecting offensive language.

OFF (@USER what a fool , a presidential candidate should speak intelligently than this)
Table 1: Four tweets from the dataset, with their sentence-level labels. Offensive tokens are highlighted in red. English translations are inside brackets.

Experiments and Evaluation
The following sections describe the experiments we conducted for sentence-level and token-level offensive language identification.

Sentence-level Offensive Language Detection
We consider sentence-level offensive language detection as a text classification task. We experimented with several ML text classifier models trained on the training set and evaluated them by predicting the labels for the held-out test set. As the label distribution is highly imbalanced, we evaluate and compare the performance of the different models using macro-averaged F1-score. We further report per-class Precision (P), Recall (R), F1-score (F1), and weighted average. The performance of the ML algorithms described below is shown in Table 3. All experiments were conducted using five different random seeds, and the mean value across these experiments is reported. Finally, we compare the performance of the models against the simple majority and minority class baselines.
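The macro-averaged F1-score weights both classes equally, which is why a majority-class baseline scores poorly on it even when its accuracy looks acceptable. The following sketch computes it from scratch on a toy example; the gold labels are invented for illustration and the helper names `per_class_prf` and `macro_f1` are hypothetical.

```python
from typing import List, Tuple


def per_class_prf(gold: List[str], pred: List[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold))
    return sum(per_class_prf(gold, pred, label)[2] for label in labels) / len(labels)


# Toy gold labels and a majority-class baseline that always predicts NOT.
gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
majority_baseline = ["NOT"] * len(gold)
score = macro_f1(gold, majority_baseline)
accuracy = sum(g == p for g, p in zip(gold, majority_baseline)) / len(gold)
```

On this toy data the baseline reaches an accuracy of about 0.67 but a macro F1 of only 0.4, because the F1 for the OFF class is zero.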

SVC
Our simplest machine learning model is a linear Support Vector Classifier (SVC) trained on word unigrams. Before the emergence of neural networks, SVCs achieved state-of-the-art results for many text classification tasks [108,109], including offensive language identification [104,110]. Even in the neural network era, SVCs provide an efficient and effective baseline.

BiLSTM
As the first embedding-based neural model, we experimented with a bidirectional Long Short-Term Memory (BiLSTM) model, which we adopted from a pre-existing model for Greek offensive language identification [43]. The model consists of (i) an input embedding layer, (ii) two bidirectional LSTM layers, and (iii) two dense layers. The output of the final dense layer is ultimately passed through a softmax layer to produce the final prediction. The architecture diagram of the BiLSTM model is shown in Figure 5. Our BiLSTM layer has 64 units, while the first dense layer has 256 units.

CNN
We also experimented with a convolutional neural network (CNN), which we adopted from a pre-existing model for English sentiment classification [111]. The model consists of (i) an input embedding layer, (ii) a one-dimensional CNN layer (1DCNN), (iii) a max pooling layer and (iv) two dense layers. The output of the final dense layer is passed through a softmax layer to produce the final prediction. For the BiLSTM and CNN models presented above, we set three input channels for the input embedding layers: pre-trained Sinhala FastText embeddings [112], Continuous Bag of Words Model for Sinhala [113] as well as updatable embeddings learned by the model during training. For both models, we used the implementation provided in the OffensiveNN Python library.

Transformers
Finally, we experimented with several pre-trained transformer models. With the introduction of BERT [76], transformer models have achieved state-of-the-art performance in many natural language processing tasks [76], including offensive language identification [7,81]. From an input sentence, transformers compute a feature vector $h \in \mathbb{R}^{d}$, upon which we build a classifier for the task. For this task, we implemented a softmax layer, i.e., the predicted probabilities are $y^{(B)} = \mathrm{softmax}(Wh + b)$, where $W \in \mathbb{R}^{k \times d}$ is the softmax weight matrix, and $k$ is the number of labels. In our experiments, we used four pre-trained transformer models available in the HuggingFace model hub [114]: mBERT [76], SinBERT-large [115], xlm-roberta-large [83] (XLM-R) and XLM-T [116]. The implementation was adopted from the DeepOffense Python library. The overall transformer architecture is available in Figure 7. For the transformer-based models, we employed a batch size of 16, the Adam optimiser with a learning rate of 2e−5, and a linear learning rate warm-up over 10% of the training data. During the training process, the parameters of the transformer model, as well as the parameters of the subsequent layers, were updated. The models were evaluated while training using an evaluation set that had one-fifth of the rows in the training data. We performed early stopping if the evaluation loss did not improve over three evaluation steps. All the models were trained for three epochs.
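The softmax classification layer used on top of the transformer feature vector, y = softmax(Wh + b), can be illustrated with a minimal NumPy sketch. The dimensions, the random values standing in for the transformer output, and the label order are assumptions made purely for illustration; this is not the DeepOffense implementation.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(h: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return y = softmax(W h + b) for a feature vector h of size d."""
    return softmax(W @ h + b)

# Toy dimensions (assumed): d = 4 features, k = 2 labels.
rng = np.random.default_rng(0)
d, k = 4, 2
h = rng.normal(size=d)        # stand-in for the transformer feature vector
W = rng.normal(size=(k, d))   # softmax weight matrix, W in R^{k x d}
b = np.zeros(k)               # bias term
y = classify(h, W, b)

labels = ["NOT", "OFF"]       # label order is an assumption
prediction = labels[int(np.argmax(y))]
```

In the actual models, h would come from the pre-trained transformer and W and b would be learned jointly with it during fine-tuning.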
As can be seen in Table 3, all models perform better than the majority baseline. As expected, neural models outperform the traditional machine learning model, SVM. Among the word embedding models, fastText [112] performed best, providing a 0.82 Macro F1 score with the CNN architecture, even outperforming language-specific transformer models such as SinBERT [115]. The success of the CNN architecture in offensive language identification is consistent with previous research in English [104]. Among the transformer models, mBERT [76] does not perform well because mBERT is not trained on Sinhala. The poor results of mBERT suggest that advanced techniques are required when pre-trained language models are applied to unseen languages [117]. Of all the models, XLM-R [83] performed best with a 0.83 Macro F1 score. This is closely followed by XLM-T [116] and CNN with fastText [112], both with 0.82 Macro F1 scores.

Token-level Offensive Language Identification
We consider token-level offensive language detection as a token classification task. We experimented with several ML token classifier models trained on the training set and evaluated them by predicting the labels for the held-out test set. For the evaluation, we used the Precision (P), Recall (R), and Macro F1 score of the offensive tokens. The performance of the ML algorithms described below is shown in Table 4. All experiments were conducted using five different random seeds, and the mean value across these experiments is reported.
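As an illustration of the evaluation described above, the following sketch computes precision, recall and F1 for the offensive token class. It pools counts over all tokens in all posts, which is an assumption; the exact averaging used for the reported scores may differ. The function name and the toy labels are hypothetical.

```python
from typing import List, Tuple

def offensive_token_prf(gold: List[List[int]],
                        pred: List[List[int]]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the offensive class (label 1), pooled over tokens."""
    tp = fp = fn = 0
    for gold_seq, pred_seq in zip(gold, pred):
        for g, p in zip(gold_seq, pred_seq):
            if p == 1 and g == 1:
                tp += 1
            elif p == 1 and g == 0:
                fp += 1
            elif p == 0 and g == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: 1 = offensive token, 0 = not offensive token.
gold = [[0, 1, 1, 0], [0, 0, 1]]
pred = [[0, 1, 0, 0], [1, 0, 1]]
p, r, f1 = offensive_token_prf(gold, pred)
```

Here two offensive tokens are found correctly, one is missed and one is a false alarm, so precision, recall and F1 are all 2/3.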

BiLSTM
As the first embedding-based neural model, we experimented with a BiLSTM model, which we adopted from a pre-existing model for the English toxic spans detection task [86]. The model consists of (i) an input embedding layer, (ii) a bidirectional LSTM layer with 64 units, followed by (iii) a linear chain conditional random field (CRF) [118]. Similar to the previous experiments, we set three input channels for the input embedding layers: pre-trained Sinhala FastText embeddings [112], Continuous Bag of Words Model for Sinhala [113] as well as updatable embeddings learned by the model during training.

Transformers
For token-level offensive language identification, we also experimented with several pre-trained transformer models. For a token classification task, transformer models add a linear layer that takes the last hidden state of the sequence as the input and produces a label for each token as the output. In this case, each token can have one of two labels: offensive and not offensive. In our experiments, we used the same four pre-trained transformer models we experimented with for sentence-level offensive language identification: mBERT [76], SinBERT-large [115], xlm-roberta-large [83] and XLM-T [116]. The implementation was adopted from the MUDES Python library. The overall transformer architecture is available in Figure 8. For the transformer-based models, we employed a batch size of 16, the Adam optimiser with a learning rate of 2e−5, and a linear learning rate warm-up over 10% of the training data. During the training process, the parameters of the transformer model, as well as the parameters of the subsequent layers, were updated. The models were evaluated while training using an evaluation set that had one-fifth of the rows in the training data. We performed early stopping if the evaluation loss did not improve over three evaluation steps. All the models were trained for three epochs.
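The early stopping rule mentioned above (stop when the evaluation loss has not improved over three evaluation steps) can be sketched as follows. The function name and the loss values are hypothetical, and treating "no improvement" as "not strictly lower than the best loss so far" is an assumption rather than a detail confirmed by the text.

```python
from typing import Iterable, Optional

def early_stopping_step(eval_losses: Iterable[float],
                        patience: int = 3) -> Optional[int]:
    """Return the evaluation step at which training would stop, or None."""
    best = float("inf")
    bad_steps = 0
    for step, loss in enumerate(eval_losses):
        if loss < best:
            # Evaluation loss improved: remember it and reset the counter.
            best = loss
            bad_steps = 0
        else:
            bad_steps += 1
            if bad_steps >= patience:
                return step
    return None

# Hypothetical evaluation losses recorded during training.
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
stop_at = early_stopping_step(losses)  # stops at step 5, before the late improvement
```

With a patience of three, the run above halts after the third consecutive evaluation that fails to beat the best loss of 0.50.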

Weakly Supervised Learning -Transformer+LIME
We utilised the binary classifiers that were trained to predict the offensive label of each post, and we employed LIME [58] at inference time to obtain offensive tokens [98,99]. In LIME, new instances are generated by randomly sampling the words that are present in the input. In other words, words are randomly left out from the input. The resulting new instances are then fed into the classifier, and a cloud of predicted probabilities is gathered. A linear model is then fitted, and its coefficients for each token are the outputs of LIME [58]. We obtain a sequence of binary decisions (offensive, not offensive) for the tokens of the post by applying a probability threshold (tuned on one-fifth of the training data) to the LIME outputs for each token. We refer to this method as Transformer+LIME. This method requires only sentence-level offensive labels and does not require token-level annotations. Therefore, it is considered a weakly supervised learning method [35]. We used the implementation provided in the lime Python library.
As can be seen in Table 4, all models perform better than the majority baseline. As expected, transformer models outperform the BiLSTM model. Among the word embedding models, fastText [112] performed best, similar to the sentence-level experiments. Additionally, we also experimented with mBERT [76]. However, the initial results showed that mBERT performs even worse than the baselines. This shows that token-level offensive language identification is a difficult task for language models when the language is unseen in the pre-training process. Of all the models, XLM-R [83] performed best with a 0.72 Macro F1 score, similar to the sentence-level results. This is closely followed by XLM-T [116] with a 0.70 Macro F1 score. It is important to note that the transformer model trained specifically on Sinhala, SinBERT [115], did not perform well compared to multilingual transformer models such as XLM-R [83].
In Table 4, we also show the weakly supervised learning results obtained with LIME [58]. Similar to the supervised models, XLM-R [83] performed best with a 0.45 Macro F1 score. Interestingly, XLM-T+LIME performs worse than SinBERT+LIME, despite the fact that the underlying XLM-T classifier is better at sentence-level (Macro F1 of 0.82) than the underlying SinBERT model (Macro F1 of 0.81). Overall, we can conclude that the weakly supervised models provided compatible results with the supervised models despite the fact that the latter are directly trained on offensive token annotations, whereas the former are trained with binary sentence-level annotations only.
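The idea behind Transformer+LIME can be illustrated with a simplified, LIME-style sketch written in plain NumPy. It is not the lime library and omits LIME's locality weighting and regularisation: it only shows random word removal, collection of classifier probabilities, a linear fit, and thresholding of per-token coefficients. The toy classifier, the function names and the threshold value are all assumptions for illustration.

```python
import numpy as np
from typing import Callable, List

def lime_style_token_scores(tokens: List[str],
                            predict_offensive: Callable[[List[str]], float],
                            n_samples: int = 500,
                            seed: int = 0) -> np.ndarray:
    """Fit a linear model on randomly perturbed inputs; return one coefficient per token."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    # Each row is a binary mask: 1 keeps the token, 0 leaves it out.
    masks = rng.integers(0, 2, size=(n_samples, n))
    masks[0] = 1  # always include the unperturbed input
    probs = np.array([
        predict_offensive([t for t, keep in zip(tokens, row) if keep])
        for row in masks
    ])
    # Ordinary least squares with an intercept column.
    X = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(X, probs, rcond=None)
    return coef[:n]

def toy_classifier(tokens: List[str]) -> float:
    """Hypothetical stand-in for a sentence-level offensive classifier."""
    return 0.9 if "BADWORD" in tokens else 0.1

tokens = ["you", "are", "a", "BADWORD"]
scores = lime_style_token_scores(tokens, toy_classifier)
threshold = 0.5  # assumed value; in the paper it is tuned on held-out training data
token_labels = [int(s > threshold) for s in scores]
```

Because the toy classifier depends only on the presence of one word, the fitted coefficient for that word is about 0.8 and the others are about zero, so only the last token is marked offensive.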
With these results, we answer RQ1: How do the state-of-the-art machine learning models perform in Sinhala offensive language identification at sentence-level and token-level? We showed that state-of-the-art machine learning models, such as XLM-R [83], perform well in identifying offence at both sentence and token levels. Furthermore, the results show that multilingual transformer models that support Sinhala, such as XLM-R [83] and XLM-T [116], outperform the language-specific transformer model, SinBERT [115], in both sentence-level and token-level offensive language identification.
We also answer RQ3.1: How to demonstrate explainability of the sentence-level offensive language identification using token-level annotations in Sinhala? We employed LIME [58] on the transformer models trained at sentence-level and evaluated it using the token-level annotations in the test set. The results show that the LIME-based weakly supervised approach provides compatible results, demonstrating the explainability of the sentence-level transformer models.

Transfer-learning Experiments
In a low-resource language such as Sinhala, creating a large number of annotated instances can be a challenge due to the limited availability of qualified annotators. This is a major limitation in improving the performance of machine learning models. The main goal of the transfer learning experiments is to improve the performance of machine learning models on SOLD using an existing dataset without annotating more instances. As shown in Figure 9, in phase 1, we train a machine learning model on an existing dataset, and when we initialise the training process for SOLD in phase 2, we start with the saved weights from phase 1. Since the majority of the existing datasets are from a different language, these experiments are usually referred to as cross-lingual transfer learning. As we discussed in Section 2, previous work has shown that a similar transfer learning approach can improve the results for Arabic, Greek, and Hindi [7,32] at sentence-level offensive language identification.
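The two-phase procedure can be illustrated with a deliberately small stand-in: a NumPy logistic regression trained first on a large "source" dataset and then further trained on a small "target" dataset, starting from the phase 1 weights. The synthetic data, model and hyperparameters are assumptions for illustration only; the experiments in the paper use transformer models, not this toy classifier.

```python
import numpy as np
from typing import Optional

def train_logreg(X: np.ndarray, y: np.ndarray,
                 w_init: Optional[np.ndarray] = None,
                 epochs: int = 200, lr: float = 0.5) -> np.ndarray:
    """Gradient-descent logistic regression; w_init resumes from saved weights."""
    w = np.zeros(X.shape[1]) if w_init is None else w_init.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Synthetic data (assumed): both datasets share the same underlying labelling rule.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X_src = rng.normal(size=(500, 3))             # large existing dataset
y_src = (X_src @ true_w > 0).astype(float)
X_tgt = rng.normal(size=(20, 3))              # small target dataset
y_tgt = (X_tgt @ true_w > 0).astype(float)

# Phase 1: train on the existing dataset and keep the weights.
w_phase1 = train_logreg(X_src, y_src)
# Phase 2: initialise from the phase 1 weights and continue on the target data.
w_phase2 = train_logreg(X_tgt, y_tgt, w_init=w_phase1, epochs=20)

target_accuracy = float((((X_tgt @ w_phase2) > 0).astype(float) == y_tgt).mean())
```

The only point of the sketch is the initialisation step: phase 2 starts from `w_phase1` instead of from scratch, which mirrors starting SOLD training from the saved phase 1 weights.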
In order to perform effective cross-lingual transfer learning, the underlying word representations in the two languages need to be in the same vector space [7]. However, the traditional word embedding models we used, such as FastText embeddings [112] and Continuous Bag of Words Model for Sinhala [113], are not in the same vector space as the word embedding models of English and other high-resource languages. Furthermore, initial experiments showed that the models based on FastText embeddings [112] and Continuous Bag of Words Model for Sinhala [113] do not improve with transfer learning. On the other hand, among the transformer models we experimented with, mBERT [119], XLM-R [83], and XLM-T [116] have shown cross-lingual properties. Therefore, we conduct the transfer learning experiments only with them. This is the first time that cross-lingual transfer learning has been experimented with in Sinhala offensive language identification. Furthermore, cross-lingual transfer learning for token-level offensive language identification has not been explored before, which can be interesting for many languages. We used different resource-rich languages and datasets for sentence-level and token-level, which we describe in the following sections.
Table 5: [...] and Labels in all initial resources used in transfer learning experiments. Level refers to whether we used the sentence-level or token-level annotations.

Sentence-level Offensive Language Detection
For the sentence-level, we used several resources as the initial dataset. As the first resource, we used OLID [104], arguably one of the most popular offensive language identification datasets in English. We specifically used the OLID [104] level A tweets, which is similar to the sentence-level of SOLD. Also, in order to perform transfer learning from a language closely related to Sinhala, we utilised a Hindi dataset used in the HASOC 2020 shared task [22]. Hindi belongs to the Indo-Aryan language family, as does Sinhala [120]. Furthermore, since both languages originated in the Indian subcontinent, they are also culturally closely related. In HASOC, instances are annotated at the sentence-level with hate-offensive and non hate-offensive [22]. We mapped the hate-offensive instances to our offensive class and non hate-offensive instances to our not offensive class, following our sentence-level annotation guidelines. We also used a recently released Sinhala code-mixed dataset (CMCS) [56] as the initial dataset in transfer-learning experiments. In CMCS, 10,000 instances have been annotated in three classes: Hate-Inducing, Abusive and Not offensive [56]. Before performing transfer learning, we mapped the Hate-Inducing and Abusive classes to a single offensive class following our definition of sentence-level offensive language labels. Mapping the offensive labels into a single offensive class has been a common approach in recent transfer learning research [7,32]. These datasets are summarised in the first row in Table 5.
Table 6: Results for offensive language detection at sentence-level after transfer learning. Type refers to the machine learning algorithm used, Model refers to the embedding model used, and Dataset refers to the initial dataset that the model was trained on. We report weighted average F1 and macro F1 for each model (best in bold). With each result, we also report the difference of the same model with respect to non-transfer learning experiments in Table 3 as a percentage. The best result from Table 3 is shown in the last row.
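The label mapping applied to the initial datasets before transfer learning can be sketched as a simple lookup. The source label strings below follow the descriptions in the text; the exact strings stored in each released dataset may differ, and the dictionary and function names are hypothetical.

```python
from typing import Dict

# Maps each source dataset's labels to the binary SOLD labels (OFF / NOT).
# Label strings are assumptions based on the class names given in the text.
LABEL_MAPS: Dict[str, Dict[str, str]] = {
    "OLID": {"OFF": "OFF", "NOT": "NOT"},
    "HASOC": {"hate-offensive": "OFF", "non hate-offensive": "NOT"},
    "CMCS": {"Hate-Inducing": "OFF", "Abusive": "OFF", "Not offensive": "NOT"},
}

def to_sold_label(dataset: str, label: str) -> str:
    """Convert a source-dataset label to the binary sentence-level SOLD label."""
    return LABEL_MAPS[dataset][label]

mapped = [
    to_sold_label("HASOC", "hate-offensive"),
    to_sold_label("CMCS", "Hate-Inducing"),
    to_sold_label("CMCS", "Abusive"),
    to_sold_label("CMCS", "Not offensive"),
]
```

Collapsing Hate-Inducing and Abusive into one class loses the finer distinction, which is acceptable here because the target task is binary.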
All the datasets we used for transfer learning experiments at sentence-level contain Twitter data, making them in-domain with respect to SOLD. However, since CMCS contains code-mixed texts, this will be the first time that transfer learning is experimented with between code-mixed Sinhala and Sinhala written in Sinhala script. As mentioned before, we conduct transfer learning experiments only with transformer models that have shown cross-lingual properties, such as mBERT [119], XLM-R [83], and XLM-T [116].
Results of the transfer learning experiments at the sentence-level are shown in Table 6. As shown in the results, transfer learning improved results in all the experiments except when XLM-R [83] was trained with CMCS [56]. The best result was given by XLM-R [83] when performing transfer learning with the Hindi dataset [22], which provided an improvement of more than 1% in Macro F1 when compared to the experiment without transfer learning. Furthermore, this is the best result achieved for the SOLD dataset at sentence-level. However, there is no clear indication from these experiments that the closely related language, Hindi, has an impact on transfer learning performance. Hindi [22] provided a bigger improvement with XLM-R [83], while English [104] provided a bigger improvement with XLM-T [116]. We believe that the performance of transfer learning depends both on the initial dataset and the underlying embeddings. It is important to note that transfer learning from the code-mixed Sinhala dataset, CMCS [56], provided smaller improvements compared to other datasets. In fact, CMCS [56] with XLM-R [83] reduced the results. We believe that the transformer models we experimented with have not seen code-mixed data in the training process, and therefore, they fail to align the embeddings between code-mixed Sinhala words and Sinhala words written in the Sinhala script. As a result, there is no advantage in using code-mixed data in transfer learning experiments.

Token-level Offensive Language Detection
For the token-level transfer learning experiments, we only used English datasets, as token-level offensive labels are not available in other languages. We specifically used the HateXplain [34] token-level annotations and TSD [62] as the initial datasets. In HateXplain [34], instances are annotated as offensive or hateful at the sentence-level. The tokens have been annotated as to whether they contribute to the sentence-level label or not. The second dataset, TSD, was released as the official dataset in the Toxic Spans Detection task at SemEval 2021 (Task 5) [62]. In TSD [62], if a post is toxic, the tokens have been annotated on whether they make the text toxic or not. Similar to our sentence-level experiments, we mapped the tokens labelled as toxic to our offensive class and all other tokens to our not offensive class. These datasets are summarised in the second row in Table 5.
The HateXplain dataset we used for transfer learning experiments at token-level contains Twitter data [34], making it in-domain with respect to SOLD. However, the TSD dataset contains instances from an archive of the Civil Comments platform [62], a commenting plugin for independent news sites, making the dataset off-domain with respect to SOLD. This is the first time that transfer learning has been experimented with in token-level offensive language identification.
We also explore how transfer learning affects the explainability of sentence-level models. To do that, we applied LIME [58] to the sentence-level models that were trained following transfer learning in the previous section and evaluated them on the token-level labels. As far as we know, this is the first time that transfer learning in offensive language identification has been explored with LIME [58].
The results for the token-level transfer learning experiments based on TSD [62] and HateXplain [34] are reported in the "Transformers" row in Table 7. The results with sentence-level transfer learning and LIME are reported in the "Transformers + LIME" row in Table 7. As shown in Table 7, transfer learning improved results in all the supervised experiments. The best result was given by XLM-R [83] when performing transfer learning with TSD [62], which provided an improvement of close to 1% in Macro F1 when compared to the experiment without transfer learning. Interestingly, TSD [62] is off-domain compared to SOLD, yet it provides the biggest improvement. Furthermore, this is the best result achieved for the SOLD dataset at token-level. Similar to sentence-level, there is no clear evidence of which initial dataset improves results the most in transfer learning experiments, as it depends both on the initial dataset and the underlying embeddings. Overall, we can conclude that transfer learning improves results in token-level offensive language identification for Sinhala.
While transfer learning improved results in supervised token-level offensive language identification models, transfer learning did not improve weakly-supervised models in the majority of the experiments. As shown in Table 7, the token-level results dropped in several weakly-supervised models after performing transfer learning. For example, in Table 6, we observed that XLM-R [83] with transfer learning performed from OLID [104] provided the strongest model for sentence-level offensive language identification. However, when the same model was employed with LIME [58] to predict token-level labels, the results dropped by 1% in Macro F1. This is an interesting observation: the underlying transformer model gets stronger with transfer learning, but this does not necessarily improve the explainability of the models.
With the findings in this section, we answer RQ2.1: Do available resources from resource-rich languages combined with transfer-learning techniques aid the detection of offensive language in Sinhala at sentence-level and token-level? We performed transfer learning from different datasets and showed that transfer learning improves results in the majority of the experiments at sentence-level and all the experiments at token-level. The best results at both sentence-level and token-level that we have seen so far in SOLD were achieved after performing transfer learning in this section. These findings will be beneficial for many low-resource languages where the training data is scarce.
We also answer RQ3.2 regarding explainability: Does transfer-learning from resource-rich languages affect the explainability of the offensive language identification models? We employed LIME [58] on sentence-level models that resulted after transfer learning to predict token-level offensive language in a weakly-supervised approach. The results indicate a performance drop in most of the models, suggesting that transfer learning does not necessarily improve the explainability of the models. A large body of recent research has used transfer learning to improve the results in sentence-level offensive language identification [7,51]; however, researchers need to be aware that transfer learning does not always improve the explainability. This finding will create a new direction in explainable ML research in offensive language identification.

Semi-supervised Data Augmentation
As we mentioned before, in a low-resource language such as Sinhala, creating a large number of annotated instances is a challenge, and therefore, it is a major limitation in building ML models to detect offensive language. The second approach we propose to avoid this limitation is semi-supervised data augmentation, which is also known as democratic co-learning [121]. This technique is used to create large datasets with noisy labels when provided with a set of diverse models trained in a supervised way. Semi-supervised data augmentation has improved results in multiple tasks, including English offensive language identification [37], sentiment analysis [122], and time series prediction [123].
In our work, we collected an additional 145,000 Sinhala tweets using the same methods described in Section 3. Rather than labelling them manually, we used the ML models trained in Section 5 to label them. For each tweet in the unannotated dataset, each ML model in Section 5 predicts the confidence for the offensive class, resulting in eleven confidence values for each tweet. We release this dataset, SemiSOLD, as an open-access dataset in HuggingFace Datasets [107].
In the following sections, we detail how SemiSOLD was used in sentence-level and token-level experiments. As far as we know, this is the first time that semi-supervised data augmentation has been applied in Sinhala. Furthermore, semi-supervised data augmentation has not been explored before with explainable tokens, which can be interesting for many languages.

Sentence-level Offensive Language Detection
For the sentence-level, we used a filtering technique to filter the unannotated instances because the benefits of data augmentation can be hampered by noise in initial model predictions. We selected the three best sentence-level offensive language detection models from Section 5: XLM-R [83], XLM-T [116], and CNN with fastText [112]. For each instance in SemiSOLD, we calculated the standard deviation of the confidences of these three models for the positive class, which corresponds to the uncertainty of the models. We used different threshold values for model uncertainty to filter the data from SemiSOLD. For the labels, we compute a single aggregated prediction based on the average of the confidences predicted by the above-mentioned models. If the average is greater than 0.5, we label the instance as offensive, and not offensive otherwise.
We used three threshold values: 0.05, 0.1 and 0.15. For each threshold value, we filter the instances in SemiSOLD and add them to the training set of SOLD. We train the same ML models we experimented with in Section 5 on the augmented training set. We evaluated the results on the testing set of SOLD. The results are shown in Table 8.
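The filtering and labelling steps described above can be sketched in NumPy. The confidence values are invented for illustration, and two details are assumptions not fixed by the text: the use of the population standard deviation (NumPy's default) and a strict "less than" comparison against the threshold.

```python
import numpy as np
from typing import Tuple

def filter_and_label(confidences: np.ndarray,
                     threshold: float) -> Tuple[np.ndarray, np.ndarray]:
    """confidences has shape (n_instances, n_models): offensive-class confidences."""
    uncertainty = confidences.std(axis=1)    # disagreement between the models
    average = confidences.mean(axis=1)       # aggregated prediction
    keep = uncertainty < threshold           # drop instances the models disagree on
    labels = np.where(average > 0.5, "OFF", "NOT")
    return keep, labels

# Hypothetical confidences from three models for three tweets.
conf = np.array([
    [0.95, 0.90, 0.92],   # models agree: offensive
    [0.10, 0.05, 0.08],   # models agree: not offensive
    [0.90, 0.20, 0.55],   # models disagree, so the instance is filtered out
])
keep, labels = filter_and_label(conf, threshold=0.1)
augmented_labels = labels[keep]   # labels of the instances added to the training set
```

A larger threshold keeps more instances but admits more of the disagreement cases, which is the trade-off examined with the 0.05, 0.1 and 0.15 settings.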
As shown in the results, all the models benefitted from semi-supervised data augmentation. The best result was produced by XLM-R with a 0.1 threshold. We make two key observations from the results. (1) Models only improve with the 0.05 and 0.1 thresholds. Despite adding more instances, the 0.15 threshold does not improve the results in many ML models. This is mainly because the 0.15 threshold adds a large number of uncertain, noisy instances to the training set, and ML models find it difficult to learn from these instances.
(2) Smaller and lightweight models such as BiLSTM and CNN show notable improvements with data augmentation compared to large transformer models. This is similar to the previous experiments in data augmentation [37], where the results do not improve when the machine learning classifier is already strong. We can assume that the transformer models are already well trained for SOLD, and adding further instances to the training process would not improve the results for the transformer models.
With this finding, we answer RQ2.2: Can semi-supervised data augmentation improve the results for Sinhala offensive language identification at sentence-level? We showed that data augmentation could improve the results for ML models. However, it is important to find an optimal uncertainty threshold. As we demonstrated in the results, having too many noisy instances with a larger uncertainty threshold can lead to reduced performance in ML models.
The performance improvement of lightweight models can be an important research direction in knowledge distillation research. Knowledge distillation aims to extract knowledge from a top-performing large model into a smaller yet well performing model [124]. The smaller model is less demanding in terms of memory footprint and computing power and has a lower prediction latency, encouraging green computing. Knowledge distillation has been explored in several NLP topics such as neural machine translation [125], language modelling [126], and translation quality estimation [127]. Therefore, the development of SemiSOLD can open new avenues for knowledge distillation in low-resource offensive language identification.

Token-level Offensive Language Detection
For the token classification tasks, the semi-supervised data augmentation technique we used with democratic co-learning and model uncertainty does not readily apply. While sentence-level seeks to minimise the divergence between the outputs of different models, for token classification, the number of label combinations grows exponentially with respect to the sequence length. Extracting model knowledge as if each combination is a different label category would be largely inefficient [128].
Considering this, we do not train supervised token-level models on the augmented data. Instead, we used the sentence-level models trained on augmented data in Section 7.1 to predict token-level labels using LIME, as we discussed in previous sections. The results are shown in Table 9.
As can be seen in Table 9, data augmentation improved the results of weakly supervised models in token-level offensive language detection. The best F1 score for "Transformers + LIME" was achieved with XLM-R [83] and 0.1 model uncertainty. Similar to the previous section, we notice a drop in the results with 0.15 model uncertainty. This is mainly because the noisy instances in the 0.15 threshold have weakened the sentence-level models, as we saw in Table 8, and therefore, they do not provide better results with LIME. Overall, 0.48 is the best result obtained for "Transformers + LIME" with SOLD.
With this finding, we answer RQ3.3: Can semi-supervised data augmentation improve the explainability of the sentence-level models? As we experimented with LIME and transformers, we showed that data augmentation could improve explainability. However, it is important not to follow a greedy approach with data augmentation and to augment only less noisy instances. Adding more noisy instances can lead to a weakened sentence-level model, which could impact the explainability.
Several large offensive language datasets with sentence-level annotations are publicly available for many languages. For the languages that do not have large offensive language datasets, it is straightforward to collect more data following a methodology similar to the one we used to collect SemiSOLD. As we showed, the weakly supervised offensive token detector, "Transformers + LIME", can, in principle, perform even better if the underlying binary classifier is trained on a larger dataset. Therefore, this finding can be a major step towards explainable offensive language detection in many languages.

Conclusion and Future Work
In this paper, we presented a comprehensive evaluation of Sinhala offensive language identification along with two new resources: SOLD and SemiSOLD. SOLD contains 10,000 tweets annotated at sentence-level and token-level, making it the largest manually annotated Sinhala offensive language dataset to date. SemiSOLD is a larger dataset of more than 145,000 instances annotated with semi-supervised methods. Both these resources open exciting new avenues for research on Sinhala and other low-resource languages.
Our results show that state-of-the-art ML models can be used to identify Sinhala offensive language at sentence-level and token-level (answering RQ1). With respect to RQ2, addressing data scarcity in low-resource languages, we report that (1) transfer learning techniques from both English and Hindi result in performance improvement for Sinhala in sentence-level and token-level offensive language detection (answering RQ2.1), and (2) the use of the larger dataset SemiSOLD combined with SOLD results in performance improvement for sentence-level offensive language identification, particularly for lightweight models such as BiLSTM and CNN (answering RQ2.2). With respect to RQ3, addressing explainability, we report that (1) transformer models trained at sentence-level combined with LIME can be used to predict offensive tokens, demonstrating their explainability (answering RQ3.1), (2) sentence-level transfer learning from resource-rich languages does not necessarily improve explainability despite producing a strong sentence-level model (answering RQ3.2), and (3) semi-supervised data augmentation at sentence-level can improve the explainability (answering RQ3.3). We believe that these results shed light on offensive language identification applied to Sinhala and other low-resource languages as well.
In future work, we would like to extend SOLD's annotation to type and target annotations in offensive posts. This would allow us to identify common targets in Sinhala offensive social media posts and prevent targeted offence towards certain individuals and groups. We would also like to extend the dataset to other platforms, such as YouTube comments and news media comments. Finally, we would like to use the knowledge and data obtained from our work on Sinhala and expand it to Indo-Aryan languages closely related to Sinhala, such as Dhivehi.

Sachith Suraweera, Chandika Udaya Kumara and Ridmi Randima, the team of volunteer annotators that provided their free time and efforts to help us produce SOLD.

Figure 1: A translated example of SOLD. If an annotator marked a tweet as offensive, they were asked to highlight which tokens of the tweet justify their decision.

Figure 3: Class distribution in SOLD. (a) Class distribution in training set. (b) Class distribution in testing set.

Figure 4: Token frequency distribution in SOLD. (a) Token frequency distribution in training set. (b) Token frequency distribution in testing set.

Table 2: The keywords used to collect SOLD and the percentage of offensive tweets for each keyword in training, testing and full datasets. Keywords are sorted by the offensive percentage in the full dataset.

Table 3: Results for offensive language detection at sentence-level. Type refers to the machine learning algorithm used, and Model refers to the embedding model used. We report Precision (P), Recall (R), and F1 for each model/baseline on all classes (OFF, NOT) and weighted averages. Macro-F1 is also listed (best in bold).

Table 4: Results for offensive language detection at token-level. Type refers to the machine learning algorithm used, and Model refers to the embedding model used. We report Precision (P), Recall (R), and F1 scores for the offensive tokens for each model/baseline (best in bold).

Table 7: Results for offensive language detection at token-level after transfer learning. Type refers to the machine learning algorithm used, Model refers to the embedding model used, and Dataset refers to the initial dataset that the model was trained on. We report Precision (P), Recall (R), and F1 for each model. With each result, we also report the difference of the same model with respect to non-transfer learning experiments in Table 4 as a percentage. The best result from Table 4 is shown in the last row.

Table 8: Results for offensive language detection at sentence-level after data augmentation. STD shows the uncertainty threshold and Inst. is the number of total unlabelled instances augmented. Type refers to the machine learning algorithm used, and Model refers to the embedding model used. We report weighted average F1 and macro F1 for each model (best in bold). With each result, we also report the difference of the same model with respect to non-transfer learning experiments in Table 3 as a percentage. The best result from Table 3 is shown in the last row.
aims to extract knowledge from a top-performing large model into a smaller

Table 9: Results for offensive language detection at token-level after data augmentation. STD shows the uncertainty threshold and Inst. is the number of total unlabelled instances augmented. Type refers to the machine learning algorithm used, and Model refers to the embedding model used. We report Precision (P), Recall (R), and F1 for each model/baseline (best in bold).