DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification, for a total of more than 60,000 YouTube comments. It consists of around 44,000 comments in Tamil-English, around 7000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena, since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.


Introduction
Sentiment analysis is the classification task of mining sentiments from natural language, which finds use in numerous applications such as reputation management, customer support, and moderating content in social media (Agarwal et al. 2011; Thavareesan and Mahesan 2019, 2020a). Sentiment analysis has helped industry to compile a summary of human perspectives and interests derived from feedback or even just the polarity of comments (Pang and Lee 2004; Thavareesan and Mahesan 2020b). Offensive language identification is another classification task in natural language processing (NLP), where the aim is to moderate and minimise offensive content in social media. In recent years, sentiment analysis and offensive language identification have gained significant interest in the field of NLP.
Social media websites and product review forums provide opportunities for users to create content in informal settings. Moreover, to improve user experience, these platforms ensure that users can communicate their opinions in a way they feel comfortable with, either using their native language or switching between one or more languages in the same conversation. However, most NLP systems are trained on languages in formal settings with proper grammar, which creates issues when it comes to the analysis of ''user-generated'' comments (Chanda et al. 2016; Pratapa et al. 2018). Further, most of the developments in sentiment analysis and offensive language identification systems have been performed on monolingual data for high-resource languages, while user-generated content in under-resourced settings is often mixed with English or other high-resource languages (Winata et al. 2019; Jose et al. 2020).
Code-mixing or code-switching is the alternation between two or more languages at the level of the document, paragraph, comment, sentence, phrase, word or morpheme. It is a distinctive aspect of conversation or dialogue in bilingual and multilingual societies (Barman et al. 2014). It is motivated by structural, discourse, pragmatic and socio-linguistic reasons (Sridhar 1978). Most social media comments are code-mixed, while the resources created for sentiment analysis and offensive language identification are primarily available for monolingual texts.
Code-mixing is a common phenomenon in all kinds of communication among multilingual speakers, including both speech and text-based interactions. Code-mixing refers to the way a bilingual or multilingual speaker changes his or her utterance into another language. The vast majority of language pairs are under-resourced with regard to code-mixing tasks (Jose et al. 2020).
In this paper, we describe the creation of a corpus for Dravidian languages in the context of sentiment analysis and offensive language detection tasks. Dravidian languages are spoken mainly in the south of India (Chakravarthi et al. 2020c). The four major literary languages belonging to the language family are Tamil (ISO 639-3: tam), Telugu (tel), Kannada (kan) and Malayalam (mal). Monolingual datasets are available for Indian languages for various research aims (Agrawal et al. 2018; Thenmozhi and Aravindan 2018; Kumar et al. 2020). However, there have been few attempts to generate datasets for Tamil, Kannada and Malayalam code-mixed text (Chakravarthi et al. 2020b, c; Chakravarthi 2020; Chakravarthi and Muralidaran 2021). We believe it is essential to come up with approaches to tackle this resource bottleneck so that these languages can be equipped with NLP support in social media in a way that is both cost-effective and rapid. To create resources for Tamil-English, Kannada-English and Malayalam-English code-mixed scenarios, we collected comments on various Tamil, Kannada and Malayalam movie trailers from YouTube.
The contributions of this paper are:
1. We present a dataset for three Dravidian languages, namely Tamil, Kannada, and Malayalam, for sentiment analysis and offensive language identification tasks.
2. The dataset contains all types of code-mixing. This is the first Dravidian language dataset to contain all types of code-mixing, including mixtures of native scripts and the Latin script. The dataset consists of around 44,000 comments in Tamil-English, around 7000 comments in Kannada-English, and around 20,000 comments in Malayalam-English.
3. We provide an experimental analysis of logistic regression, naive Bayes, decision tree, random forest, SVM, BERT, DistilBERT, ALBERT, RoBERTa, XLM, XLM-R and Character BERT on our code-mixed data for classification tasks, in order to create a benchmark for further research.

Related work
Sentiment analysis helps to understand the polarity (positive, negative or neutral) of the audience towards a content item (comment, tweet, image, video) or an event (Brexit, presidential elections). This data on polarity can help in understanding public opinion. Furthermore, the inclusion of sentiment analysis can improve the performance of tasks such as recommender systems (Krishna et al. 2013; Musto et al. 2017) and hate speech detection (Gitari et al. 2015). Over the last 20 years, social media networks have become a rich data source for sentiment analysis (Clarke and Grieve 2017; Tian et al. 2017). Extensive research has been done on sentiment analysis of monolingual corpora such as English (Hu and Liu 2004; Wiebe et al. 2005; Jiang et al. 2019), Russian (Rogers et al. 2018), German (Cieliebak et al. 2017), Norwegian (Maehlum et al. 2019) and Indian languages (Agrawal et al. 2018; Rani et al. 2020). In initial research works, n-gram features were widely used for the classification of sentiments (Kouloumpis et al. 2011). Recently, however, due to readily available data on social media, these traditional techniques have been replaced by deep neural network techniques. Patwa et al. (2020) conducted sentiment analysis on code-mixed social media text for the Hindi-English and Spanish-English language pairs. However, sentiment analysis in Dravidian languages is under-studied. The use of aggressive, hateful or offensive language online has proliferated in social media posts for various technological and sociological reasons. This downturn has encouraged the development of automatic moderation systems. These systems, if trained on proper data, can help detect aggressive speech, thus moderating spiteful content on public platforms. The collection of such data has become a crucial part of social media analysis.
To facilitate the researchers working on these problems, there have been shared tasks conducted on aggression identification in social media and on offensive language identification (Zampieri et al. 2019), providing the necessary datasets. As English is a commonly used language on social media, a significant amount of research goes into the identification of offensive English text. However, many internet users prefer the use of their native languages. This has given rise to the development of offensive language identification datasets in Arabic, Danish, Greek, and Turkish (Zampieri et al. 2020). Inspired by this, we developed resources for offensive language identification in Dravidian languages.
In the past few years, cheaper internet and increased use of smartphones have significantly increased social media interaction in code-mixed native languages. Dravidian language speakers (who are often bilingual with English, as it is an official language in India), with a population base of 237 million, contribute a large portion of such interactions. Hence, there is an ever-increasing need for the analysis of code-mixed text in Dravidian languages. However, the freely available code-mixed datasets (Ranjan et al. 2016; Jose et al. 2020) are still limited in number, size, and availability. Sowmya Lakshmi and Shambhavi (2017) developed a Kannada-English dataset containing English and Kannada text with word-level code-mixing. They also employed a stance detection system to detect stance in Kannada-English code-mixed text (on social media) using sentence embeddings. Shalini et al. (2018) used distributed representations for sentiment analysis of Kannada-English code-mixed texts through neural networks, with three tags: Positive, Negative and Neutral. However, the dataset for Kannada was not readily available for research purposes. To motivate further research, we conducted a shared task (Chakravarthi et al. 2020a, d; Mandl et al. 2020) that provided Tamil-English, Kannada-English, and Malayalam-English code-mixed datasets, using which participants trained models to identify the sentiments (task A) and offensive classes (task B) in these languages.
Most of the recent studies on sentiment analysis and offensive language identification have been conducted on high-resourced languages from social media platforms. Models trained on such highly resourced monolingual data have succeeded in predicting sentiment and offensiveness. However, with the increased social media usage of bilingual users, a system trained on under-resourced code-mixed data is needed. In spite of this need, no large datasets for Tamil-English, Kannada-English and Malayalam-English are available. Hence, inspired by Severyn et al. (2014), we collected and created a code-mixed dataset from YouTube. In this work, we describe the process of corpora creation for under-resourced Dravidian languages from YouTube comments. This is an extension of two workshop papers (Chakravarthi et al. 2020b, c) and shared tasks (Chakravarthi et al. 2020d). We present the DravidianCodeMix corpora for Tamil-English (40,000+ comments), Kannada-English (7000+ comments) and Malayalam-English (nearly 20,000 comments) with manually annotated labels for sentiment analysis and offensive language identification. We used Krippendorff's alpha to calculate agreement amongst annotators. We made sure that each comment was annotated by at least three annotators, and we made the labelled corpora freely available for research purposes. For benchmarking, we provide baseline experiments and results on the 'DravidianCodeMix' corpora using machine learning models.

Raw data

Online media, for example Twitter, Facebook or YouTube, contain quickly changing data produced by millions of users that can drastically alter the reputation of an individual or an association. This raises the significance of automatic extraction of sentiments and offensive language used in online social media. YouTube is one of the most popular social media platforms in the Indian subcontinent because of the wide range of content available on the platform, such as songs, tutorials, product reviews, trailers and so on.
YouTube allows users to create content and other users to comment on the content. It allows for more user-generated content in under-resourced languages. Hence, we chose YouTube to extract comments to create our dataset. We chose movie trailers as the topic to collect data because movies are quite popular among the Tamil, Malayalam, and Kannada speaking populace. This increases the chance of getting varied views on one topic. Figure 1 shows an overview of the steps involved in creating our dataset. We compiled the comments from different film trailers in Tamil, Kannada, and Malayalam from YouTube in the year 2019. The comments were gathered using the YouTube Comment Scraper tool. We utilized these comments to make the datasets for sentiment analysis and offensive language identification with manual annotations. We intended to collect comments that contain code-mixing at various levels of the text, with enough representation for each sentiment and offensive language class in all three languages. It was a challenging task to extract the necessary text that suited our intent from the comment section, which was further complicated by the presence of remarks in other, non-target languages. As part of the preprocessing steps to clean the data, we utilized the langdetect library to tell different languages apart and eliminate the unintended languages. The langdetect library, however, is a script detection library that filters out languages based on certain scripts. This has serious limitations, as it misses a number of languages written in non-conventional scripts. This explains why we still get data from other languages despite using this library. Examples of code-mixing in the Tamil, Kannada and Malayalam corpora are shown in Figs. 2, 3, and 4, along with their translations in English. Keeping data privacy in mind, we made sure that all user-related information was removed from the corpora.
As part of the text preprocessing, we removed redundant information such as URLs.
Since we collected the corpora from social media, they contain different types of real-world code-mixed data. Inter-sentential switching is characterised by a change of language between sentences, where each sentence is written or spoken in one language. Intra-sentential switching occurs within a single sentence, say when one clause is in one language and another clause is in a second language. Our corpora contain all forms of code-mixing, ranging from purely monolingual texts in native languages to mixing of scripts, words, morphology, and inter-sentential and intra-sentential switches. We retained all the instances of code-mixing to faithfully preserve the real-world usage.

Methodology of annotation
We created our corpora for two tasks, namely sentiment analysis and offensive language identification. We anonymized the data gathered from YouTube in order to protect user privacy.

Annotation process
In order to find volunteers for the annotation process, we contacted students at the Indian Institute of Information Technology and Management-Kerala for Malayalam, and at the Indian Institute of Information Technology-Tiruchirapalli and Madurai Kamaraj University for Tamil. For Kannada, we contacted students at Visvesvaraya College of Engineering, Bangalore University. The student volunteer annotators received a link to a Google Form and did the annotations on their personal computers. The authors' family members also volunteered to annotate the data. We created Google Forms to gather annotations from annotators. Information on gender, education background and medium of schooling was collected to gauge the diversity of the annotators. The annotators were cautioned that the user comments may contain hostile language. They were given a provision to discontinue the annotation process in case the content was too upsetting to deal with. They were asked not to be partial to a specific individual, circumstance or occasion during the annotation process. Each Google Form was set to contain up to 100 comments, and each page was limited to ten comments. The annotators were instructed to confirm that they understood the scheme before they were allowed to proceed further. The annotation setup involved three stages. To begin with, each sentence was annotated by two individuals. In the second step, the data was included in the collection if both annotations agreed. In the event of contention, a third individual was asked to annotate the sentence. In the third step, in the uncommon case that all three of them disagreed, two additional annotators were brought in to label the sentences. Each form was annotated by at least three annotators.
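The staged agreement procedure above amounts to a majority vote over an ordered list of annotations. The function below is an illustrative sketch of that logic, not the authors' pipeline; the label strings are placeholders.

```python
# Sketch of the three-stage label resolution described above: accept when
# the first two annotators agree, otherwise take the majority among the
# three (or, rarely, five) annotators consulted.
from collections import Counter


def resolve_label(labels):
    """Return the final label for a comment, or None if no majority emerges."""
    if len(labels) >= 2 and labels[0] == labels[1]:
        return labels[0]  # stage 1: first two annotators agree
    winner, count = Counter(labels).most_common(1)[0]
    # stages 2-3: require a strict majority among all annotators consulted
    return winner if count > len(labels) / 2 else None
```

For example, `resolve_label(["Positive", "Negative", "Positive"])` resolves to `"Positive"`, while three mutually disagreeing labels resolve to `None` and would trigger the extra annotators described above.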

Sentiment analysis
For sentiment analysis, we followed the methodology of Chakravarthi et al. (2020c) and involved at least three annotators to label each sentence. The following annotation schema was given to the annotators in English and in the Dravidian languages.
-Positive: Comment contains an explicit or implicit clue suggesting that the speaker is in a positive state.
-Negative: Comment contains an explicit or implicit clue suggesting that the speaker is in a negative state.
-Mixed feelings: Comment contains clues of both positive and negative sentiment.
-Neutral state: Comment does not contain an explicit or implicit indicator of the speaker's emotional state.
-Not in intended language: The comment is not in the intended language. For example, for Tamil, if the sentence does not contain Tamil written in Tamil script or Latin script, then it is not Tamil. These comments were discarded after the data annotation process.

Offensive language identification
We constructed an offensive language identification dataset for Dravidian languages by adapting the work of Zampieri et al. (2019). We reduced the three-level hierarchical annotation scheme of that work to a flat scheme with five labels to account for the types of offensiveness in the comments; a sixth label, Not in intended language, accounts for comments written in a language other than the intended one, for example comments written in other Dravidian languages using the Roman script. To simplify the annotation decisions, each comment was classified into one of the following six categories:
-Not Offensive: Comment does not contain offence or profanity.
-Offensive Untargeted: Comment contains offence or profanity not directed towards any target. These are comments containing unacceptable language without targeting anyone.
-Offensive Targeted Individual: Comment contains offence or profanity which targets an individual.
-Offensive Targeted Group: Comment contains offence or profanity which targets a group or a community.
-Offensive Targeted Other: Comment contains offence or profanity which does not belong to either of the previous two categories (e.g. a situation, an issue, an organization or an event).
-Not in intended language: The comment is not in the intended language. For example, in the Tamil task, if the sentence does not contain Tamil written in Tamil script or Latin script, then it is not Tamil. These comments were discarded after the data annotation process.
Examples of the Google Forms in English and in the native language for the offensive language identification task are given in Figs. 7, 8, and 9. Once the Google Form was ready, we sent it out to an equal number of males and females to enquire about their willingness to annotate. We got varied responses, so the distribution of male and female annotators involved in the task is uneven. From Table 1, we can see that only two female annotators volunteered to contribute for Tamil, while there were more female annotators for Malayalam and Kannada. For offensive language identification, we can see from Table 2 that there is a balance in gender. The majority of the annotators have received a postgraduate level of education. We were not able to find volunteers of non-binary gender to annotate our dataset. All the annotators who volunteered to annotate the Tamil-English, Kannada-English and Malayalam-English datasets had bilingual proficiency in the respective code-mixed pairs, and they were prepared to take the task seriously. From Tables 1 and 2, we can observe that the majority of the annotators' medium of schooling is English, even though their mother tongue is Tamil, Kannada or Malayalam. For the Kannada and Malayalam languages, only one annotator from each language received their education through the medium of their native language. Although the medium of education of the participants was skewed towards English, we ensured that all of them were fully proficient in using their native language so that this would not affect the annotation task. We were aware that there could be other factors affecting the annotation decisions on offensive language, such as the annotators' age, their field of education and their ideological stance. Due to the privacy issues involved, we did not collect this information from the annotators.
A sample form (first assignment) was annotated by experts and a gold standard was created. The experts were a team of NLP researchers with experience in creating annotation standards and guidelines. We manually compared the gold standard annotations with the volunteer submissions. To control the quality of annotation, we eliminated the annotators whose label assignments in the first form were not good. For instance, if annotators showed an unreasonable delay in responding, labelled all sentences with the same label, or got more than fifty annotations in a form wrong, we eliminated their contributions. A total of 22 volunteers and 23 volunteers were involved in the process for the sentiment analysis and offensive language identification tasks, respectively. Each Google Form contained 100 sentences; once an annotator completed a form, the next Google Form with another set of 100 sentences was sent to them if they offered to volunteer more, and in this way each volunteer chose to annotate as many sentences from the corpus as they wanted. We sent out the same comment forms to annotators for both tasks, but some of the forms were incomplete, so we discarded them. Hence there is some difference between the sentiment dataset and the offensive dataset. However, more than 98% of the comments overlap between the two datasets.

Inter-annotator agreement
Inter-annotator agreement is a measure of the extent to which the annotators agree in their rating. This is necessary to ensure that the annotation scheme is consistent and that different raters are able to assign the same sentiment label to a given comment. There are two questions related to inter-annotator agreement: How do the annotators agree or disagree in their annotation? How much of the observed agreement or disagreement among the annotators might be due to chance? While the percentage of agreement is fairly straightforward, answering the second question involves defining and modelling what chance is and how to measure the agreement due to chance. There are different inter-annotator agreement measures that are intended to answer this in order to measure the reliability of the annotation. We utilized Krippendorff's alpha (α) (Krippendorff 1970) to gauge the agreement between annotators because of the nature of our annotation setup. Krippendorff's alpha is a rigorous statistical measure that accounts for incomplete data and, consequently, does not require every annotator to annotate every sentence. It is also a measure that considers the level of disagreement between the anticipated classes, which is critical in our annotation scheme. For example, if the annotators differ between the Positive and Negative classes, this difference is more serious than when they differ between Mixed feelings and Neutral state. α is sensitive to such disagreements. α is characterized by

α = 1 − D_o / D_e,

where D_o is the observed disagreement between sentiment labels assigned by the annotators and D_e is the disagreement expected when the coding of sentiments can be attributed to chance rather than to an inherent property of the sentiment itself.
The two disagreement terms are computed as

D_o = (1/n) Σ_c Σ_k o_ck · δ²_metric(c, k) and D_e = (1/(n(n−1))) Σ_c Σ_k n_c · n_k · δ²_metric(c, k),

where o_ck, n_c, n_k and n refer to the frequencies of values in the coincidence matrices, and metric refers to any level of measurement such as nominal, ordinal, interval, ratio and others. Krippendorff's alpha applies to all these metrics. We used the nominal and ordinal metrics to calculate inter-annotator agreement. The range of α is between 0 and 1, 0 ≤ α ≤ 1. When α is 1 there is perfect agreement between the annotators, and when it is 0 the agreement is entirely due to chance. Care should be taken in interpreting the reliability of the results shown by Krippendorff's alpha, because reliability basically measures the amount of noise in the data; however, the location of the noise and the strength of the relationship measured will interfere with the reliability of the estimate. It is customary to require α ≥ 0.800. A reasonable rule of thumb that allows for tentative conclusions to be drawn requires 0.67 ≤ α ≤ 0.8, while α ≥ 0.653 is the lowest conceivable limit. We used nltk to calculate Krippendorff's alpha (α). The results of inter-annotator agreement between our annotators for different languages on both the sentiment analysis and offensive language identification tasks are shown in Table 3. Tables 4 and 5 show the text statistics (number of words, vocabulary size, number of comments, number of sentences, and average number of words per sentence) for sentiment analysis and offensive language identification for Tamil, Malayalam and Kannada. The Tamil dataset had the highest number of samples on both tasks, while Kannada had the least. On average, each comment contained only one sentence. Tables 6 and 7 show the class distribution across Tamil, Malayalam and Kannada for the sentiment analysis and offensive language identification tasks. Furthermore, the tree-maps in Figs. 10 and 11 depict a comparative analysis of the distribution of sentiment and offensive classes across languages.
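Krippendorff's alpha as used above can be computed with nltk's agreement module, which handles the incomplete-annotation setting where not every annotator labels every comment. The annotator names, comment ids and labels below are toy placeholders, not data from the corpus.

```python
# Sketch of computing Krippendorff's alpha with nltk, assuming toy data.
# Each triple is (annotator, item, label); items need not be covered by
# every annotator, matching the setup described above.
from nltk.metrics.agreement import AnnotationTask

triples = [
    ("a1", "c1", "Positive"), ("a2", "c1", "Positive"), ("a3", "c1", "Positive"),
    ("a1", "c2", "Negative"), ("a2", "c2", "Negative"), ("a3", "c2", "Positive"),
    ("a1", "c3", "Neutral state"), ("a2", "c3", "Neutral state"),
]

task = AnnotationTask(data=triples)  # nominal (binary) distance by default
alpha = task.alpha()
print(f"Krippendorff's alpha: {alpha:.3f}")
```

An ordinal variant, as mentioned above, would pass a custom distance function to `AnnotationTask` instead of relying on the default binary distance.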
Figure 10 illustrates that there are more samples labelled ''Positive'' than any other class in all the languages. While the disparity between ''Positive'' and the other classes is large in Tamil, this is not the case for Malayalam and Kannada. In Malayalam, ''Neutral state'' is the second-largest class in terms of distribution; the 6502 comments labelled ''Neutral state'' could mean that most of the comments in Malayalam are vague remarks whose sentiment is not evident. Our datasets are stored in tab-separated files. The first column of the tsv file contains the comment from YouTube and the second column contains the final annotation.

Difficult examples
The social media comments that form our dataset are code-mixed, showing a mixture of Dravidian languages and English. This poses a few major difficulties when annotating the sentiment and offensive language categories on our dataset. Dravidian languages are under-resourced, and the mixing of scripts makes the annotation task difficult, since the annotators must have learned both scripts, be familiar with how English words are adapted to native phonology, and know how certain English words take on a different meaning in the given local language. Reading and understanding code-mixed text, often with non-standardised spelling, is difficult unless the annotator is well versed in both languages (Sridhar and Sridhar 1980). This created difficulty in finding volunteer annotators who were fluent in both languages. Moreover, we created the annotation labels with the help of volunteer annotators for three languages, not just one. It is challenging and time-consuming to collect this amount of data from bilingual volunteer annotators from three different language groups.
While annotating, it was found that some of the comments were ambiguous in conveying the right sentiment of the viewers, which made the annotation task difficult. Some examples:
-Enakku iru mugan trailer gnabagam than varuthu - All it reminds me of is the trailer of the movie Irumugan. It is not clear whether the speaker enjoyed the Irumugan trailer, disliked it, or simply observed the similarities between the two trailers. The annotators found it difficult to identify the sentiment behind the comment consistently.
-Rajini ah vida akshay mass ah irukane - Akshay looks more amazing than Rajini. It is difficult to decide whether this is disappointment that the villain looks better than the hero or positive appreciation for the villain actor. Some annotators interpreted the sentiment as negative while others took it as positive.
-Ada dei nama sambatha da dei - I wonder, is this our Sampath? Hey! Conflict between neutral and positive.
-Lokesh kanagaraj movie naalae.... English Rap....Song vandurum - If it is a movie of Lokesh Kanagaraj, it always has an English rap song. Ambiguous sentiment.
-Ayayo bigil aprm release panratha idea iruka lokesh gaaru - Oh dear! Are you even considering releasing the movie Bigil, Mr. Lokesh? This comment has a single word, 'garu', a non-Tamil, non-English word borrowed from Telugu, where it is a politeness marker. However, in this context the speaker uses the word sarcastically to insult the director because of the undue delay in releasing the movie. The annotators were inconsistent in interpreting this as offensive or not-Tamil.
-No of dislikes la theriyudhu, idha yaru dislike panni irrupanga nu - It is obvious from the number of dislikes who would have disliked this (trailer). This comment appeared below the trailer of a movie that talks about caste issues in contemporary Tamil society. Based on the content of the trailer, the speaker offensively implies that scheduled caste people are the ones who would have disliked the movie and not other people. Recognising the offensive undercurrent in a seemingly normal comment is difficult, and hence such examples complicate the annotation process.
According to the instructions, questions about the music director, the movie release date, and comments containing the speaker's remarks about the date and time of watching the video should be treated as belonging to the neutral class. However, the above examples show that some comments about the actors and movies can be ambiguously interpreted as neutral, positive or negative. We found annotator disagreements in such sentences. Below, we give similar examples from Malayalam.
• Realistic bhoothanghalil ninnu oru vimochanam pratheekshikkunnu - Hoping for a deliverance from realistic demons. No category of audience can be pleased simultaneously. The widespread opinion is that the Malayalam film industry is advancing with more realistic movies; therefore, a section of the audience who are fonder of action or non-realistic movies are not satisfied with this culture of realistic movies. In this comment, the viewer is not insulting this growing culture but expecting that the upcoming film is of his favourite genre. Hence we labelled it non-offensive.
• Ithilum valiya jhimikki kammal vannatha - There was an even bigger 'pendant earring'. 'Jhimikki kammal' was a trending song from a movie of the same actor mentioned here. The movie received huge publicity even before its release because of the song, but it turned out to be a disappointment after its release. Thus the annotators were confused about whether the comment was meant as an insult or not. We concluded that the viewer is not offending the present trailer but marks his opinion as a warning for the audience not to judge a book by its cover.
• Ithu kandittu nalla tholinja comedyaayi thonniyathu enikku mathram aano? - Am I the only person here who felt this was a stupid comedy? The meaning of the Malayalam word corresponding to 'stupid' here varies across the regions of Kerala; hence the disparity in opinion between annotators who speak different dialects of Malayalam was evident. Though in a few regions it is offensive, it is generally considered a byword for 'bad'.
• aa cinemayude peru kollam. Ithu Dileep ne udheshichanu, ayale mathram udheshichanu - The name of that movie is good. It is named after Dileep and intended only for him. There is obviously a chance of imagining several different movie names based on the subjective predisposition of the annotator. As long as the movie name is unknown here, apparently no insult can be proved, and there is no profane language used in the sentence either.
• Kanditt Amala Paul Aadai Tamil mattoru version aanu ennu thonnunu - It looks like another version of Amala Paul's Tamil movie Aadai. Here the viewer suspects that the Malayalam movie 'Helen' is similar to the Tamil movie 'Aadai'. Though the movie 'Aadai' was positively received by viewers and critics, we cannot generalise and assume that this comment is also positive only because of this comparison. Hence we added it to the 'Mixed feelings' category.
• Evideo oru Hollywood story varunnilleee. Oru DBT. - Somewhere there is a Hollywood storyline... one doubt. This is also a comparison comment on the same movie 'Helen' mentioned above. Nevertheless, here the difference is that the movie is compared with the Hollywood standard, which is well known worldwide and is generally considered positive. Hence it was marked as a positive comment.
• Trailer pole nalla story undayal mathiyarinu. - It would be enough if the story is as good as the trailer. Here the viewer mentions two aspects of the movie, viz. the trailer and the story. He appreciates the trailer but doubts the quality of the story at the same time. We considered this comment positive because it is clear that he enjoyed the trailer and conveys strong optimism for the movie.

Benchmark systems
In this section, we report the results obtained in the three languages for both tasks on the corpora introduced above. Like many earlier studies, we approach both tasks as text classification. To provide simple baselines, we applied several traditional machine learning algorithms, namely Logistic Regression (LR), Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), K-Nearest Neighbours (KNN), Decision Trees (DT) and Random Forests (RF), separately for sentiment analysis and offensive language detection on the code-mixed datasets. We also conducted experiments with BERT, CharacterBERT, DistilBERT, ALBERT, RoBERTa, XLNet and XLM-R on our code-mixed data to establish strong baselines (Tables 8 and 9).

Experiments setup
We used a randomly sampled 90-5-5% split of the data into training, development and test sets for all experimental setups. All duplicate entries were removed from the dataset before the split so that the development and test data are truly unseen. All models were tuned on the development set and evaluated on the test set.
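The deduplicate-then-split procedure can be sketched as follows with scikit-learn; the `comments` and `labels` variables below are hypothetical stand-ins for the dataset columns, and the exact sampling used in the paper may differ.

```python
# Minimal sketch of a 90-5-5 train/dev/test split with deduplication.
from sklearn.model_selection import train_test_split

# Toy data standing in for the real code-mixed comments.
comments = [f"comment {i}" for i in range(100)]
labels = ["Positive" if i % 2 == 0 else "Negative" for i in range(100)]

# Remove duplicate comments first so that dev/test items are truly unseen.
seen, deduped = set(), []
for text, label in zip(comments, labels):
    if text not in seen:
        seen.add(text)
        deduped.append((text, label))
texts, ys = zip(*deduped)

# 90% train, then split the remaining 10% evenly into dev and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, ys, test_size=0.10, random_state=42, stratify=ys)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest)
```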

Logistic regression (LR):
LR is one of the baseline machine learning algorithms: a probabilistic classifier used for the task of data classification (Genkin et al. 2007). It is essentially a transformed version of linear regression using the logistic function (Park 2013). It takes real-valued features as input, multiplies each by a weight, and feeds the weighted sum z to the sigmoid function σ(z), also called the logistic function, to obtain the class probability (Shah et al. 2020): σ(z) = 1 / (1 + e^(-z)). The decision is then made based on a threshold value. Logistic regression has a close relationship with neural networks, as the latter can be viewed as a stack of several LR classifiers (de Gispert et al. 2015). Unlike Naive Bayes, which is a generative classifier, LR is a discriminative classifier (Ng and Jordan 2002). While Naive Bayes makes strict conditional independence assumptions, LR is evidently more robust to correlated features (Jin and Pedersen 2018): if there are several perfectly correlated features, say F1, F2 and F3, it will divide the weight W among them as W1, W2 and W3 respectively. We evaluated the logistic regression model with L2 regularization to reduce overfitting. The input features are the term frequency-inverse document frequency (TF-IDF) values of word n-grams of length up to 3. With this approach, the model is trained only on this dataset without using any pre-trained embeddings.
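A TF-IDF plus L2-regularised logistic regression baseline of the kind described above can be assembled with scikit-learn roughly as follows; the toy Tamil-English comments and labels are invented for illustration and are not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical code-mixed training comments and sentiment labels.
train_texts = ["padam nalla irukku super", "movie waste da mokka",
               "trailer semma mass nalla", "romba mokka very bad"]
train_labels = ["Positive", "Negative", "Positive", "Negative"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),              # word n-grams up to length 3
    LogisticRegression(penalty="l2", max_iter=1000),  # L2 regularization
)
clf.fit(train_texts, train_labels)
prediction = clf.predict(["nalla padam super"])[0]
```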

Support vector machine (SVM):
The Support Vector Machine is a powerful supervised machine learning algorithm used mainly for classification and also for regression. The goal of an SVM is to find the hyperplane in an N-dimensional space that most distinctly separates the data points (Ekbal and Bandyopadhyay 2008). That is, the algorithm draws a decision boundary between the data points that belong to a particular category and those that do not. This is applicable to any kind of data encoded as a vector; therefore, if we can produce appropriate vector representations of the data in hand, we can use an SVM to obtain the desired results (Ekbal and Bandyopadhyay 2008). Here the input features are the same as for LR, namely the term frequency-inverse document frequency (TF-IDF) values of word n-grams of length up to 3. We evaluate the SVM model with L2 regularization.
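A corresponding sketch with a linear SVM, under the same assumptions as before (invented toy data, scikit-learn's `LinearSVC` standing in for the SVM implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical code-mixed training comments and sentiment labels.
train_texts = ["nalla padam super", "mokka movie waste",
               "semma trailer mass", "very bad boring padam"]
train_labels = ["Positive", "Negative", "Positive", "Negative"]

svm = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # same TF-IDF features as for LR
    LinearSVC(penalty="l2"),              # L2-regularised linear SVM
)
svm.fit(train_texts, train_labels)
prediction = svm.predict(["semma nalla trailer"])[0]
```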

Multinomial naive bayes (MNB)
This is a Bayesian classifier that works on the naive assumption of conditional independence between features. That is, each input feature is assumed to be independent of the others, which is unrealistic for real data; nevertheless, the assumption simplifies several complex tasks and hence justifies its use.
We evaluate a Naive Bayes classifier for multinomially distributed data, which is derived from Bayes' theorem, giving the probability of a future event in terms of an observed one. MNB is a specialized version of Naive Bayes designed primarily for text documents. Whereas simple Naive Bayes would model a document only by the presence or absence of particular words, MNB explicitly models the word counts and adjusts the underlying calculations to deal with them. The input text is therefore treated as a bag of words: only the count of occurrences (frequency) of each word is considered, and word positions are ignored.
Laplace smoothing is performed with α = 1 to solve the zero-probability problem, and the MNB model is then evaluated with TF-IDF vectors.
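The effect of Laplace smoothing can be seen in a small sketch: with α = 1, a word unseen in one class no longer forces that class's probability to zero. The toy data is hypothetical, and raw counts are used here for clarity, whereas the actual features above are TF-IDF vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["good movie good", "bad movie", "good trailer", "bad boring"]
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()
X = vec.fit_transform(texts)  # bag-of-words counts; word positions ignored

mnb = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace (add-one) smoothing
mnb.fit(X, labels)

# 'good' never occurs in the 'neg' class, yet smoothing keeps both
# class probabilities strictly positive.
probs = mnb.predict_proba(vec.transform(["good movie"]))[0]
```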

K-nearest neighbour (KNN)
KNN can be used for both classification and regression problems but is mostly used for classification. The KNN algorithm stores all available data and classifies a new data point on the basis of similarity; as new data arrives, it can be conveniently assigned to the best-suited group. The algorithm assumes that new data is related to the available cases and places a new case into the category most similar to the existing ones. KNN is a non-parametric algorithm, as it makes no assumptions about the underlying data (Nongmeikapam et al. 2017). It is often referred to as a lazy learner because it does not learn a model from the training set in advance; instead, it stores the dataset and performs its computation only at classification time. During training, the algorithm merely stores the dataset; when it encounters new data, it classifies it into the group to which it is closest.
We use KNN for classification with 3, 4, 5, and 9 neighbours by applying uniform weights.
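A sketch of the KNN setup with uniform weights; the 2-D feature vectors below are invented for illustration, whereas the actual experiments would operate on the TF-IDF vectors described earlier.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D feature vectors forming two clusters.
X_train = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],   # class "a" cluster
           [1.0, 1.1], [1.1, 1.0], [0.9, 1.0]]   # class "b" cluster
y_train = ["a", "a", "a", "b", "b", "b"]

# Lazy learner: fit() essentially just stores the training data.
knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
knn.fit(X_train, y_train)

# A new point is assigned the majority label of its 3 nearest neighbours.
prediction = knn.predict([[0.1, 0.1]])[0]
```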

Decision tree (DT)
The decision tree builds classification or regression models in the form of a tree structure. A dataset is broken down into smaller and smaller subsets while an associated decision tree is incrementally built at the same time. The final product is a tree with decision nodes and leaf nodes: each internal node corresponds to a feature, the branches correspond to feature values, and the leaves represent the classification labels. Each node is recursively split by sequentially choosing alternative decisions, and the resulting classifier defines a set of rules to predict the outcome. Decision trees can accommodate high-dimensional data and perform classification without needing much computation, and a decision tree classifier generally has reasonable accuracy. As for drawbacks, they are vulnerable to errors in classification problems with many classes and a comparatively small number of training examples. Moreover, growing a decision tree is computationally expensive: the candidate splits must be sorted at each node before the best split can be found, some algorithms use combinations of features and must search for optimal combination weights, and pruning can also be costly since multiple candidate sub-trees must be built and compared. Here, the maximum depth was 800 and the minimum number of samples per split was 5 for DT; the splitting criteria were Gini impurity and entropy.
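The decision tree configuration reported above (maximum depth 800, minimum samples per split 5, Gini criterion) can be sketched as follows on a hypothetical numeric example.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D data where the class flips at x = 0.5.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

dt = DecisionTreeClassifier(criterion="gini", max_depth=800,
                            min_samples_split=5, random_state=0)
dt.fit(X_train, y_train)

# The learned rule (split near x = 0.5) routes new points to a leaf label.
prediction = dt.predict([[0.25]])[0]
```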

Random forest (RF)
Random forest is an ensemble classifier that makes its prediction by combining different decision trees, each trained on a dataset of the same size as the training set, called a bootstrap, created by random resampling of the training set itself (Breiman 2001). Once a tree is constructed, the bootstrap samples that do not include a particular record from the original dataset [out-of-bag (OOB) samples] are used as a test set. The error rate of classification over all such test sets is the OOB estimate of the generalization error. RF has shown important advantages over other methodologies in its ability to handle highly non-linearly correlated data, its robustness to noise, its simplicity of tuning, and its opportunities for efficient parallel processing. Moreover, RF has another important characteristic: an intrinsic feature selection step, applied prior to the classification task, which reduces the variable space by assigning an importance value to each feature. Because RF follows specific rules for tree growing, tree combination, self-testing and post-processing, it is robust to overfitting and is considered more stable in the presence of outliers and in very high-dimensional parameter spaces than other machine learning algorithms (Caruana and Niculescu-Mizil 2006). We evaluate the RF model with the same features as DT.
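The bootstrap/OOB mechanism and the intrinsic feature importances can be sketched as follows; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))             # 60 samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only the first two features matter

# oob_score=True evaluates each tree on the bootstrap samples it never saw,
# giving a built-in estimate of the generalization error.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_           # OOB estimate of accuracy
importances = rf.feature_importances_  # intrinsic per-feature importance
```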

BERT
BERT is a language representation model that conditions on both left and right context with a Masked Language Model training objective in a self-supervised way (Devlin et al. 2019). These deep contextual representations can be extended with a classification head to fine-tune BERT on downstream NLP tasks. We use BERT with a classification head for classification and fine-tune all parameters in an end-to-end fashion. We used the huggingface library for our experiments.

CharacterBERT
Due to BERT's success, many language representation models have adopted the transformer architecture as their fundamental building block. Interestingly, the wordpiece tokenization in BERT works well on most NLP tasks, but it also ties the model to a fixed subword vocabulary, which is a limitation in the case of a specialized or noisy domain. CharacterBERT instead builds token representations from their characters rather than from wordpieces, which makes it more robust to the spelling variation that is common in code-mixed social media text.

DistilBERT
DistilBERT is a smaller, cheaper variant of BERT with 40% fewer parameters that retains 95% of BERT's performance. Sanh et al. (2019) leveraged knowledge distillation during pre-training together with a smaller language model, achieving similar performance on downstream NLP tasks with less inference time. Knowledge distillation is a compression technique based on a student-teacher setup, in which the student (the small model) learns the behaviour of the teacher (the large model) with the help of a distillation loss.

ALBERT
ALBERT (Lan et al. 2019) is a transformer model with fewer parameters than BERT, trained with a self-supervised loss. The model is based on two parameter-reduction techniques. The first is factorized embedding parameterization, in which the large vocabulary embedding matrix is decomposed into smaller matrices. The second is cross-layer parameter sharing, which further reduces the overall number of parameters. We included ALBERT in our experiments to study whether the claimed performance gain over BERT is observed in our case.

RoBERTa
RoBERTa (Liu et al. 2019), unlike BERT, is not trained with the next-sentence prediction objective. Instead, larger mini-batches and learning rates are used while training the language model with the Masked Language Modelling objective alone. With these optimised design choices, RoBERTa exceeds the performance of BERT on downstream tasks.

XLNet
XLNet uses autoregressive (AR) language modeling to estimate the probability distribution of a text corpus while avoiding the [MASK] token and the concurrent, independent predictions of masked language modelling. This is accomplished via AR modeling, which provides a natural way to apply the product rule for factorizing the joint probability of the predicted tokens.

XLM-R
XLM-RoBERTa was proposed as an unsupervised cross-lingual representation model that considerably outperforms multilingual BERT on a number of cross-lingual benchmarks (Conneau et al. 2020). XLM-R was trained on filtered CommonCrawl data covering 100 languages and is fine-tuned for evaluation and inference on a variety of downstream tasks.

Results and discussion
The results of the experiments with the classifiers described above, for both sentiment analysis and offensive language detection, are shown in terms of precision, recall, F1-score and support in Tables 10, 11, 12, 13, 14 and 15. We used sklearn to develop the models. A macro-average computes the metrics (precision, recall, F1-score) independently for each class and then averages them; it therefore treats all classes equally and does not take class imbalance into account. A weighted average takes the per-class metrics as in the macro-average, but weights the contribution of each class by the number of examples available for it. The number of comments belonging to the different classes in both tasks is listed as the support values in the respective tables.
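The difference between the two averages can be illustrated with a small, invented example using sklearn's metrics: when the majority class is predicted well, the weighted average exceeds the macro average.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical imbalanced labels: 8 "pos" and 2 "neg" examples.
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 8 + ["pos", "neg"]  # one "neg" misclassified as "pos"

# Macro: unweighted mean over classes (ignores class imbalance).
_, _, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
# Weighted: per-class scores weighted by the support of each class.
_, _, weighted_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
```

Here the well-predicted majority class pulls the weighted F1 above the macro F1, which is exactly why the two numbers diverge on imbalanced datasets such as ours.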
For sentiment analysis, the performance of the various classification algorithms on the code-mixed dataset ranges from inadequate to average. Logistic regression, random forest and decision tree classifiers fared comparatively better across all sentiment classes. To our surprise, SVM performs poorly, with more heterogeneous per-class results than the other methods. Precision, recall and F1-score are highest for the ''Positive'' class, followed by the ''Negative'' class; all other classes performed very poorly. One reason is the nature of the dataset, as the ''Mixed feelings'' and ''Neutral state'' classes are challenging for annotators to label, owing to the problematic examples described before. As can be observed from Table 12, the highest weighted average precision for sentiment analysis is 0.68, from Multinomial Naive Bayes (MNB); CharBERT and XLM share the highest recall of 0.62; and the highest weighted F-score of 0.59 is achieved by several classifiers (BERT, CharBERT, XLM).
For offensive language detection, all the classification algorithms perform equally poorly, with logistic regression and random forest doing relatively better than the others. Precision, recall and F1-score are highest for the ''Not Offensive'' class, followed by the ''Offensive Targeted Individual'' and ''OL'' classes. The reasons for the poor performance on the other classes are the same as for sentiment analysis. From the tables, we see that the classification algorithms performed better on sentiment analysis than on offensive language detection; one of the main reasons could be the difference in class distributions between the two tasks. For the offensive language task, the highest weighted average precision (0.78), recall (0.76) and F-score (0.74) were obtained by MNB, RF/RoBERTa/XLM and DistilBERT respectively. In the Kannada sentiment analysis dataset, out of the 7671 sentences in total, 46% and 19% belong to the ''Positive'' and ''Negative'' classes respectively, while the other classes account for 9%, 11% and 15%. This distribution is more balanced than in the Kannada dataset for offensive language detection, where 56% of the comments are ''Not Offensive'' while the remaining classes account for only 4%, 8%, 6%, 2% and 24%. Although the distribution of offensive and non-offensive classes is skewed in all the languages, an overwhelmingly higher percentage of comments belong to the non-offensive class in the Tamil and Malayalam datasets than in Kannada: 72.4% of comments in Tamil and 88.44% in Malayalam are non-offensive, against only 55.79% in Kannada. This explains why the precision, recall and F-score values for identifying the non-offensive class are consistently higher for the Tamil and Malayalam data than for Kannada.
Since we collected the posts from movie trailers, we obtained more positive sentiment than the other classes: people who watch trailers are more likely to be interested in the movies, which skews the overall distribution. However, as the code-mixing phenomenon is not incorporated in earlier models, this resource can serve as a starting point for further research, and there is significant room for improvement in code-mixed research with our dataset. In our experiments we utilized only standard machine learning methods, but further information, such as linguistic features or hierarchical meta-embeddings, could also be exploited.

Conclusion
This work introduced a code-mixed dataset for the under-resourced Dravidian languages. The dataset comprises more than 60,000 comments annotated for sentiment analysis and offensive language identification. To advance research on the under-resourced Dravidian languages, we created an annotation scheme and achieved a high inter-annotator agreement in terms of Krippendorff's α from voluntary annotators on contributions collected using Google Forms. We created baselines with the gold-standard annotated data and presented our results for each class in terms of precision, recall and F-score. We expect this resource to enable researchers to address new and exciting problems in code-mixed research. In future work, we intend to investigate whether these corpora can be used to build resources for other under-resourced Dravidian languages.