1 Introduction

The social media platforms of the twenty-first century have evolved into the nerve center of divisive viewpoints, claims, and conflicts [1, 2]. The convenience of accessing information on social media not only contributes to the proliferation of good conversations but also makes phenomena such as cyberbullying and hate speech possible [3]. Despite the progress that has been made around LGBTQ+ rights, the internet continues to be an unwelcoming place for LGBTQ+ people. The increasing frequency, severity, and complexity of hate crimes committed online are mirrored in the offline world: hate crimes directed at LGBTQ+ people and their allies have risen sharply in the past three years.Footnote 1 In 2020, the UK LGBTQ+ anti-violence organization Galop published a report on online hate crimes motivated by homophobia, biphobia, and transphobia.Footnote 2 The organization polled 700 people who identified as LGBTQ+ and circulated the survey through online community networks of LGBTQ+ activists and individuals [4]. The findings offer cause for concern: in the past five years, eight out of ten respondents had been exposed to hate speech online, and one out of five reported being the target of online abuse at least 100 times. The proportion of transgender people who encounter online harassment (93%) is significantly higher than that of cisgender people (70%). It is particularly concerning that 18 percent of respondents indicated that online abuse was associated with offline incidents [5]. These numbers paint a troubling picture of the reality that LGBTQ+ persons face on a daily basis.

Homophobia/transphobia is a type of abuse that can take the form of physical violence such as murder, mutilation, or beating; explicit sexual violence such as rape, molestation, or penetration; or a breach of privacy in the form of the disclosure of personal information [6,7,8]. The comment “Gays ought to be shot dead” is one example. Other examples of homophobic/transphobic comments include “Gays should be stoned,” “Someone should rape that lesbo to make her into straight,” “You should kill yourself,” “You lesbos, I know where you live, I will visit you tonight,” and “Knock the gay out of him”; all of these comments have been directed at socially vulnerable LGBTQ+ individuals.

Fig. 1 Process flow diagram. IAA (inter-annotator agreement) calculation

Automatic recognition of homophobic and transphobic terminology on the internet could make it simpler to block damaging anti-LGBTQ+ content and move the internet toward equality, diversity, and inclusion. While considerable effort has been devoted to identifying aggression [9], misogyny [10, 11], and racism [12], homophobic or transphobic verbal abuse has received significantly less attention than racist or other hate speech. The lack of annotated homophobic and transphobic data has hindered the creation of homophobic and transphobic speech detection systems. As in the rest of the world, socially vulnerable LGBTQ+ individuals in India are subjected to various kinds of online abuse that perpetuate and legitimize homophobic attacks, the inferior social standing of LGBTQ+ individuals, sexual assault and mistreatment, and contempt toward them [13,14,15,16,17,18]. Moreover, the online harassment of these vulnerable persons may evolve into systematic bullying campaigns launched on social media to target and, in some cases, brutally threaten LGBTQ+ individuals who are prominent on social media [19,20,21].

In this study, we introduce a dataset for the identification of homophobia/transphobia not only in English but also in under-resourced Tamil (ISO 639-3: tam) and code-switched Tamil–English languages.

  • We propose the identification of homophobia/transphobia in online social media comments to remove hate speech toward socially vulnerable LGBTQ+ individuals.

  • We apply the schema to create a multilingual homophobia/transphobia dataset for the identification of hate speech toward socially vulnerable LGBTQ+ individuals. This is a new large-scale dataset of English, Tamil, and Tamil–English (code-mixed) YouTube comments with high-quality annotation.

  • We perform an experiment on our homophobia/transphobia dataset using different state-of-the-art machine and deep learning models to create benchmark systems.

  • We also organize a shared task at the LTEDI-ACL 2022 workshop to promote research on homophobia/transphobia detection in online comments. We present the results obtained and the methodologies adopted by the international researchers who used our data.

2 Related works

Online social media platforms are becoming increasingly infested with hate speech, especially homophobic and transphobic speech. Hate speech is distinguished from other types of speech on social media by the fact that it is directed toward a particular group of people; from this point of view, it is also distinct from offensive speech, which consists solely of the use of language that is considered vulgar or otherwise inappropriate [22,23,24]. The detrimental impacts of online homophobic and transphobic speech on individual psychological well-being, as well as on wider intergroup relations, have been the subject of empirical studies conducted by social scientists [25, 26]. The emotional toll that exposure to excluding and homophobic/transphobic speech takes on socially vulnerable LGBTQ+ people is significant [27,28,29]. As the toxicity of the content consumed online increases, higher levels of bias may be observed alongside growing desensitization among the populations consuming it. In the context of online conversations, we define toxicity as comments that provoke immediate toxic responses, although these toxicity triggers may vary by group and issue due to varying linguistic norms and usages [30]. Furthermore, widespread hate speech, in the long run, adds to an increased possibility of radicalization directed toward socially vulnerable LGBTQ+ groups.

In the body of academic research, the definition of online hatred, homophobia, and transphobia has frequently been studied through a variety of theoretical perspectives and conceptual frameworks, including social psychology, human–computer interaction, politics, and aspects of legislation and regulation [31,32,33,34]. The identification of online hate speech toward LGBTQ+ people in large-scale interactions, using methods that can be scaled up accordingly, constitutes a significant challenge from a computational point of view. Recent developments in machine learning and natural language processing (NLP) have led to significant advancements in automated hate speech identification [35]. For instance, deep learning strategies have been effectively employed in cutting-edge methods for identifying hate speech and have been successful in adequately accounting for the complex linguistic traits that characterize hate speech online [35, 36]. Based on a particular case study and/or the application of baseline datasets, the various machine learning approaches each employ their own terminology to describe what constitutes hatred. Lexicon- and embedding-based approaches have also been created through work that is more theoretically motivated; these approaches are especially applicable for investigating the targets of hate speech [37].

In general, solutions to these challenges are found through the application of machine learning-based text categorization algorithms. These methods range from supervised learning to transfer learning and from traditional shallow machine learning to deep learning [38]. Simple approaches can determine whether an instance of communication constitutes hate speech based on whether it contains a potentially hateful keyword. However, such algorithms are unable to identify content that is only implicitly hatefulFootnote 3 and does not employ certain keywords [39,40,41]. In addition, some of these keywords may be used mockingly and are not always considered offensive (e.g., swine, trash, etc.), so these procedures produce a large number of false positives [42, 43], as illustrated by the sketch below. Apart from a few exploratory studies, there is a lack of development and testing of models using data from multiple social media platforms, despite the fact that hate has been observed as a problem on many online social media platforms, such as Reddit, YouTube, Wikipedia, Twitter, and so on. Several different datasets have been developed to locate instances of hate speech [44], racial bias in hate speech [45], countering hate speech [46], hierarchically labeled hate speech [47], and abusive language [48]; detect bullying [49]; and identify offensive language [50, 51].
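As a concrete illustration of this limitation, the following minimal Python sketch flags any comment containing a term from an invented keyword list; it marks mocking uses as hate speech while missing implicit hate, which is precisely the false-positive and false-negative behavior described above.

```python
# Minimal sketch of a keyword-based hate-speech filter (the keyword list is
# invented for illustration). It flags any comment containing a listed term,
# so mocking or quoted usages become false positives, while implicit hate
# that avoids the keywords is missed entirely.
KEYWORDS = {"swine", "trash"}  # illustrative only

def is_flagged(comment: str) -> bool:
    # Flag a comment if any token matches a keyword, ignoring trailing punctuation.
    tokens = comment.lower().split()
    return any(token.strip(".,!?") in KEYWORDS for token in tokens)

print(is_flagged("He jokingly called me a swine"))             # True: a likely false positive
print(is_flagged("They don't deserve the same rights as us"))  # False: implicit hate is missed
```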

While there is a substantial body of work in the field of NLP that deals with binary gender prejudice, hate speech, and abusive and offensive language in general, the research landscape assessing the harms suffered by members of the LGBTQ+ community online is relatively limited. Wu et al. [52] investigated the linguistic behavior found in LGBTQ+-generated Chinese texts and demonstrated that standard methods trained to discern gender from text fail in more complicated dimensions. Ljubešić et al. [53] developed emotion lexicons for Croatian, Dutch, and Slovene and then utilized these lexicons to search for texts that include or exclude socially undesirable speech on the subjects of migration and LGBTQ+ people. Because research on this subject is still in its infancy, it is plagued by a number of shortcomings linked both to the particular goals and nuances of offensive language expressing homophobia and transphobia and to the nature of the classification task in general, which prevents systems from achieving ideal results. One of the most significant difficulties is the inherent difficulty of defining offensive language, as well as the pervasive ambiguity in the usage of similar phrases (such as abusive, toxic, harmful, hateful, or violent language), which vary from culture to culture and are susceptible to highly subjective interpretations depending on the individual. Studies of the label “sissy boy,” one of the cruelest labels ever coined and one that gives rise to the perception that a transgender person suffers from a psychiatric disorder, illustrate the relevance of this vital but basic contrast between sexism and androcentrism for explaining transphobia. How researchers and members of society view homosexuality and homophobia at the time a study is produced also affects the measures of homophobia that are used.

3 Homophobia and transphobia

Homophobia refers to negative attitudes and reactions toward homosexuals. Homophobia has been variously described by authors as a cultural phenomenon, a set of attitudes, and a psychological characteristic [8]. It is believed that cultural “homophobia” serves to preserve traditional sex role disparities. In terms of attitudes, homophobia is characterized as a collection of established, unfavorable attitudes toward homosexual people [54]. As a personality dimension, “homophobia” is associated with rigidity, authoritarianism, conservatism, and intolerance for ambiguity and deviation [55]. Fear, ignorance, and a lack of knowledge of and tolerance for sexual choice have given rise to a second group of beliefs about homosexuality. One is that gays molest children; the misconception that only gays would engage in such relationships further restricts such conduct [56]. Another example is the notion that homosexuals are promiscuous (having many sexual encounters with multiple partners). As many people hold moral objections to promiscuity, associating homosexuality with promiscuous behavior reinforces traditional beliefs regarding ethical sexual activity. In this instance, homosexuality is subsumed under broader religious objections to promiscuity. Such beliefs obfuscate the fact that many homosexuals engage in long-term partnerships.

Homophobia (prejudice toward lesbian and gay individuals) is distinguished from transphobia (prejudice against transgender individuals) based on the perceived social status challenges posed by lesbian and gay individuals versus transgender individuals [57]. Sexual orientation refers to the person(s) to whom one feels a strong sexual attraction, and the gender identification of one’s sexual partners may influence one’s own sexual orientation [58]. Transgender individuals are people who live with a gender identity that differs from traditional heteronormative definitions and who may or may not seek gender affirmation surgery. While gay and lesbian individuals are defined by their sexual orientation, transgender individuals have a gender identity that differs from traditional heteronormative definitions [59]. These people do not adhere to the accepted conventions of gender identities and gender roles, or they cross over from one gender to another. Transphobia, thus, focuses on non-heteronormative gender identity and possibly non-heteronormative gender roles, whereas homophobia focuses on non-heteronormative gender identity and sexual orientation [60]. Transphobia differs from homophobia in that it encompasses revulsion toward and irrational fear of not only transgender and transsexual individuals but also cross-dressers, feminine men, and masculine women. That is, it is concerned with gender roles and gender identity and not necessarily sexual orientation [61].

Both homophobia and transphobia are terms that refer to negative attitudes toward people who identify as homosexual or transgender, respectively [7]. Transphobia, the fear of and/or hatred toward transgender people, is a serious issue that impacts the lives of a great number of people. People who identify as transgender are typically excluded from and ignored in homosexual communities as well as in heterosexual societies. Due to ignorance and animosity, many transgender people are prevented from coming out or identifying themselves as trans, which further obscures the community of transgender people. We devised a hierarchical taxonomy with two levels of classification. At the first level, we differentiate between content that is homophobic, content that is transphobic, and content that is not anti-LGBTQ+.

3.1 Homophobic content

Homophobic content can be described as “an attitude of animosity against male or female homosexuals.” Lesbophobia, gayphobia, and biphobia are all phobias that target different subgroups of the LGBQI+ community; thus, there is a difference between general homophobia and these more specific forms. Under the umbrella term of homophobia, this article addresses lesbophobia, gayphobia, and biphobia. Homophobic content is a type of harassment that involves the use of pejorative labels (such as “fag” or “homo”) or denigrative phrases (such as “don’t be a homo,” “that’s so gay,” or “that’s so lesbo”) directed against people who are gay, lesbian, bisexual, queer, or gender non-conforming. Content that supports, promotes, urges, or incites violence against LGBQ+ individuals or groups, or that suggests a purpose or desire to damage or harm LGBQ+ individuals, is considered homophobic content in our paper.

3.2 Transphobic content

“Transphobia” refers to hostile responses to people who are perceived to be “trans.” The term “trans” is typically used to describe people whose designations of their gender are independent from either their assigned gender or from the administrative sex category listed on their original birth certificate [62]. Transphobia may be defined as a feeling of repulsion toward those who do not comply with the gender standards of society. It manifests in the form of prejudice, discrimination, harassment, and, sometimes, acts of violence directed toward transgender people [63]. Although it is impossible to determine the whole scope of the problem, many people have been on the receiving end of acts of discrimination, aggression, victimization, and sexual assault that were motivated by the victim’s gender identification. The brutal killing of hundreds of transgender people all around the world is perhaps the most horrifying manifestation of transphobia [64]. Pejorative terms that are used to degrade transgender individuals in a vulnerable state are known as transphobic pejoratives. Transphobic content includes idioms that indicate implicit animosity or fury against transgender people, such as “she-male,” “it,” and “9,” as well as phrases that are openly insulting and derogatory, such as “tranny,” “trannie,” “cross-dresser,” or “drag.” In a similar vein, phrases such as “not man enough,” “not woman enough,” “will never be a complete man,” and “will never be a full woman” convey the same transphobic sentiment. Transphobic content also includes not only declaring the desire to take action against transgender persons but also expressing preferences for how they should be treated, which may include using threatening language, engaging in physical violence, engaging in sexual assault, or invading someone’s privacy.

3.3 Non-anti-LGBTQ+ content

This refers to content that does not contain any homophobic or transphobic slurs, pejoratives, or threats of the kind defined in the previous sections. Most of the time, such content has nothing to do with exploitation or with socially vulnerable LGBTQ+ persons at all; for example, it may encourage people to like the video, subscribe to the channel, or like a comment left on the video. On the other hand, it may include other types of abusive words that are not anti-LGBTQ+.

4 Dataset construction

YouTube is popular across the Indian subcontinent as a result of the vast amount of content on the platform that can be accessed on the internet. The content found on YouTube includes music, courses, product evaluations, trailers, and other similar videos. YouTube enables people to upload their own content, which may then be discussed by other users. As a result, it allows more content to be developed by users in languages that have few resources. This also applies to vulnerable members of the LGBTQ+ community, who view videos and leave comments on the videos to which they relate. We decided to compile our data from the social media comments posted on YouTube,Footnote 4 which is the most widely used platform throughout the world for voicing one’s opinion on a specific video.

In India, a country where LGBTQ+ people do not have equal marriage rights, vulnerable young people in the LGBTQ+ community have been described as an “invisible” minority and one of the most significant “at-risk” groups of adolescents, a description that is unsurprising. These people have no way to locate persons with comparable experiences other than to search for them on social media. We did not utilize any comments from personal coming out stories by LGBTQ+ persons because they contained private information that we did not want to disclose. Instead, we compiled a collection of videos from well-known users on YouTube that explain LGBTQ+ issues, in the hope that more people will develop a positive outlook. To guarantee that our dataset contains an adequate amount of homophobic and transphobic abuse, we started by selecting certain prank videos uploaded to YouTube by users with usernames such as “Gay Prank,” “Transgender Prank,” and “Legalizing Homosexuality.” Some videos discussed the positive aspects of being transgender; nevertheless, the majority of the videos from both popular channels and news channels portrayed transgender individuals as persons who take advantage of others and start disputes. It was challenging to locate a video on YouTube that discussed LGBTQ+ concerns in Tamil, as the topic is still taboo, marriage equality is not legal, and until recently, homosexuality was criminalized in India [65].

For the purpose of collecting the comments, the YouTube Comment Scraper toolFootnote 5 was utilized. We then leveraged these comments to create our manually annotated datasets. The gathering of Tamil comments was one of our primary objectives; however, we found that the text contained a significant amount of English as well as a mixture of other languages.

Table 1 Raw dataset statistics by language based on langdetect
Fig. 2 Word cloud for English training data

Fig. 3 Word cloud for Tamil and Tamil–English training data

Code-mixing is a common and natural phenomenon among Tamil speakers as a result of their bilingual and multilingual language use. Users of social media write in the Roman alphabet for convenience (phonetic typing), which increases the likelihood of code-mixing with a language that uses the Roman alphabet. Although the Tamil language has a native script with its own Unicode block, users still choose to write in the Roman alphabet instead [66].

We had a hard time extracting the most relevant information from the comment sections for our goal, a task complicated by responses in languages other than our target languages. As part of preparing the data for cleaning, we used the langdetect libraryFootnote 6 to separate and identify languages. Based on the langdetect output, we split the data into monolingual Tamil and English groups and kept the remaining comments as code-mixed Tamil–English. In accordance with data privacy laws, we removed all user information from the corpus, and in preparing the text, we also removed extraneous information such as URLs. Most of the comments were written in Roman letters and used either Tamil or English grammar mixed with Tamil words. All the comments were gathered between August 2020 and February 2021. After cleaning, the size of our data was reduced. The final dataset ready for annotation is shown in Table 1. The word clouds in Figs. 2 and 3 present the distribution of words in the training data for the shared task; a minimal sketch of the language-separation step follows.
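The sketch below illustrates such a language-separation step with the langdetect library; the grouping rule (anything detected as neither Tamil nor English is treated as code-mixed Tamil–English) is our assumption for illustration, not the exact published pipeline.

```python
# Sketch of a langdetect-based separation of comments into Tamil, English,
# and code-mixed Tamil-English buckets (grouping rule assumed for illustration).
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's predictions deterministic

def bucket(comment: str) -> str:
    try:
        lang = detect(comment)
    except LangDetectException:   # empty or undetectable text
        return "tamil-english"
    if lang == "ta":
        return "tamil"
    if lang == "en":
        return "english"
    return "tamil-english"        # romanized Tamil usually lands here

for text in ["இது ஒரு நல்ல வீடியோ",             # Tamil script
             "This is a great video",             # English
             "Intha video romba nalla iruku"]:    # romanized Tamil (code-mixed)
    print(bucket(text))
```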

4.1 Annotation

After pre-processing and cleaning the text, we reviewed it thoroughly to verify that it did not include any personally identifiable information. When developing training datasets for sensitive issues such as ours, producing high-quality annotations is a huge challenge. Several variables influence the production of such annotations, including the cultural influences and personal biases of the annotators. As LGBTQ+ affairs are a controversial subject, it was difficult to locate annotators. For English, we utilized members of our in-house LGBTQ+ groups. It was difficult to locate willing Tamil annotators; even Tamil speakers in culturally progressive regions such as Europe, the UK, and the USA did not want to be linked with this project out of fear that they would be considered part of the LGBTQ+ community. After a lengthy search, we found volunteer annotators who identify as LGBTQ+ or as LGBTQ+ allies. We trained them by showing them YouTube videos explaining what LGBTQ+ means and that being LGBTQ+ is normal.

Table 2 Dataset statistics after annotation
Table 3 Dataset distribution at class level

We collected each annotator’s email address using Google Forms so that they could annotate only once. Each Google Form allowed a maximum of 100 comments. For comments in the English language, we circulated the Google Form among members of the LGBTQ+ organization of the National University of Ireland, Galway. For Tamil and the code-mixed language, we emailed the Google Form to several LGBTQ+ societies in Tamil Nadu, but we received very few answers. In the end, seven annotators offered to annotate the combined Tamil and Tamil–English data; all of them are multilingual in Tamil and English and were ready to take the work seriously. A further three people responded for English. Because we contacted people through college societies, every annotator was either a graduate or a postgraduate, and each annotator identifies as LGBTQ+ or as an ally of the LGBTQ+ community. At least three annotators annotated every comment. If a majority of the annotators agreed, the label was accepted; otherwise, the comment was marked as a dispute (a sketch of this aggregation rule is given below). The annotators and the authors of this work examined the disputes under the direction of the author who established the annotation taxonomy, was familiar with the literature on homophobia and transphobia, and is fluent in Tamil and English. This facilitator was responsible for promoting discussion among the annotators and ensuring that the final labels adhered to the taxonomy. The discussion took place online using Google Meet, and each dispute was discussed until the annotators reached consensus on the final label. If no consensus could be reached, those comments were ultimately removed from the study’s dataset. We paid 300 rupees per hour to Indian annotators; the typical wage in India is around Rs. 16,000 per month (Rs. 533 per day), according to moneymint.com.Footnote 7 The final data statistics after annotation are shown in Table 2, and Table 3 shows the class-wise distribution of the dataset. From Fig. 4, we can see that there is less transphobic and homophobic content in English compared to Tamil and Tamil–English.
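A minimal sketch of this aggregation rule is shown below; the function and label names are ours, and the rule (a strict majority accepts a label, otherwise the comment becomes a dispute for the facilitated adjudication round) follows the description above.

```python
# Sketch of per-comment label aggregation: a label is accepted when a strict
# majority of the (at least three) annotators agree; otherwise the comment is
# marked as a dispute and sent to the facilitated adjudication round.
from collections import Counter

def aggregate(labels):
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) / 2:   # strict majority
        return label
    return "DISPUTE"              # resolved later over Google Meet, or removed

print(aggregate(["Homophobic", "Homophobic", "Non-anti-LGBTQ+"]))   # Homophobic
print(aggregate(["Homophobic", "Transphobic", "Non-anti-LGBTQ+"]))  # DISPUTE
```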

4.2 Ethical concerns

Data from social media is very sensitive, especially when it has to do with the LGBTQ+ community. We took great care to reduce the risk of people being identified in the data by removing personal information such as names, but we did not remove celebrity names. However, to look into equality, diversity, and inclusion (EDI), we had to keep track of information on race, gender, sexual orientation, ethnicity, and philosophical views. Annotators could only see anonymous posts and promised not to get in touch with the person who made them. Researchers who want to use the dataset for research will only be able to do so if they agree to follow ethical rules.

Fig. 4 Dataset distribution at class level

4.3 Inter-annotator agreement (IAA)

We sought agreement from the majority of the annotators when aggregating the annotations on homophobic/transphobic comments; the comments that did not receive majority agreement in the first round were collected and placed in a second Google Form so that more annotators could annotate them. Following the last round of annotation, we computed the inter-annotator agreement. Using Krippendorff’s alpha \((\alpha )\), we measure the clarity of the annotation and report the inter-annotator agreement. Krippendorff’s alpha is a statistical measure of annotator agreement that reveals how well the resulting data conform to the underlying data [67]. Although Krippendorff’s alpha is computationally expensive, it is the most relevant measure in our case, as more than two annotators annotated the comments and not all comments were annotated by the same annotators. Further, it is unaffected by missing data; permits flexibility in sample sizes, categories, and the number of raters; and may be applied to any measurement level, including nominal, ordinal, interval, and ratio. Krippendorff’s alpha is obtained by the following:

$$\begin{aligned} \alpha = 1 - \frac{D_o}{D_e} \end{aligned}$$
(1)

where \(D_o\) is the observed disagreement between the homophobic/transphobic labels given by the annotators, and \(D_e\) is the disagreement that is predicted when the coding of homophobic/transphobic may be ascribed to chance rather than to an intrinsic quality of the homophobic/transphobic label itself.

$$\begin{aligned} D_o = \frac{1}{n}\sum _{c}\sum _{k}o_{ck}\;{}_{metric}\delta ^2_{ck} \end{aligned}$$
(2)
$$\begin{aligned} D_e = \frac{1}{n(n-1)} \sum _{c}\sum _{k}n_c \cdot n_{k}\;{}_{metric}\delta ^2_{ck} \end{aligned}$$
(3)

Here, \(o_{ck}\), \(n_c\), \(n_k\), and n refer to the frequencies of values in the coincidence matrices, and metric refers to any metric or level of measurement, such as nominal, ordinal, interval, or ratio.

The value of \(\alpha \) lies in the range \(0 \le \alpha \le 1\). When \(\alpha \) is “1,” the annotators are in complete agreement with one another, whereas when it is “0,” any agreement between the annotators is the result of chance alone. The standard requirement is \(\alpha \ge 0.80\), while \(0.67 \le \alpha \le 0.8 \) is an acceptable rule of thumb that allows preliminary inferences to be drawn, and \(\alpha \ge 0.653\) is the lowest feasible limit. We used nltkFootnote 8 to compute \(\alpha \); a toy example of this computation is given below. Using the nominal measure to determine the level of agreement between our annotations, we obtained Krippendorff’s alpha values of 0.67, 0.76, and 0.54 for English, Tamil, and Tamil–English, respectively. We show the details of the dataset in Table 2 and the class-wise distribution in Table 3.
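The toy example below shows how \(\alpha \) can be computed with nltk’s AnnotationTask, which accepts (coder, item, label) triples; the triples and the use of the nominal (binary) distance are illustrative assumptions.

```python
# Toy illustration of the inter-annotator agreement computation with NLTK.
# AnnotationTask expects (coder, item, label) triples; binary_distance gives
# the nominal-level version of Krippendorff's alpha used in this work.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import binary_distance

triples = [
    ("a1", "c1", "Homophobic"),      ("a2", "c1", "Homophobic"),      ("a3", "c1", "Homophobic"),
    ("a1", "c2", "Transphobic"),     ("a2", "c2", "Transphobic"),     ("a3", "c2", "Non-anti-LGBTQ+"),
    ("a1", "c3", "Non-anti-LGBTQ+"), ("a2", "c3", "Non-anti-LGBTQ+"), ("a3", "c3", "Non-anti-LGBTQ+"),
]

task = AnnotationTask(data=triples, distance=binary_distance)
print(f"Krippendorff's alpha: {task.alpha():.3f}")
```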

5 Benchmark experiments

To examine our dataset, baseline models were created. We constructed three corpora: monolingual Tamil, monolingual English, and a code-mixed version of Tamil and English. As the data was taken from social media, the text in the corpus contains a lot of noise; the YouTube comments included various punctuation marks, tags, and symbols such as emojis and @ signs. To clean the data, pre-processing procedures, namely the removal of punctuation, stop words, and tags, were applied. We employed stratified sampling with K-folds to divide the dataset into groups so that every group had the same percentage of labels, which allowed us to compare the results of our analysis more accurately; we chose stratified sampling because of the imbalance in our dataset. For cross-validation, we divided the data into five folds (a minimal example of this setup follows). Several baseline models were constructed by employing various distinct feature collections and learning techniques. Machine learning models with different text representations, namely TF-IDF, count vectorizer, BERT [68] embeddings, and fastText embeddings [69], were utilized for both the monolingual and code-mixed datasets. Further, classifiers such as logistic regression, naive Bayes, random forest, support vector machines, and decision trees were utilized to construct baseline models with the aforementioned embeddings [70].
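The following sketch shows one such baseline configuration (TF-IDF features with logistic regression under 5-fold stratified cross-validation); the file name and column names are placeholders for our annotated data, and the specific hyperparameters are illustrative.

```python
# One baseline configuration as a sketch: TF-IDF features, logistic regression,
# and stratified 5-fold cross-validation scored with macro F1.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical file with a 'text' column (comment) and a 'label' column
# (Homophobic / Transphobic / Non-anti-LGBTQ+ content).
df = pd.read_csv("tamil_english_train.csv")

pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, df["text"], df["label"], cv=skf, scoring="f1_macro")
print("Macro F1 per fold:", scores.round(3), "mean:", scores.mean().round(3))
```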

Table 4 Results for English dataset

The deep learning model was constructed using a bidirectional LSTM (BiLSTM) layer. The model consisted of an embedding layer in which the input vectors are vectorized using BERT embeddings, followed by a BiLSTM layer, a flatten layer, and two dense layers. The model was developed with Keras layers [71]; a sketch of this architecture is given below. Using a fully connected layer with the softmax activation function, the probability distribution across the classification classes was generated, and the class with the highest probability was selected as the final label. All of our machine learning and deep learning models were trained using Google Colab Pro.Footnote 9
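A minimal Keras sketch of this architecture is given below; the sequence length, hidden sizes, and the assumption that the input consists of pre-computed BERT token embeddings (768 dimensions) are ours rather than exact reported settings.

```python
# Sketch of the BiLSTM classifier: pre-computed BERT token embeddings in,
# a bidirectional LSTM, a flatten layer, two dense layers, and a softmax output
# over the three classes (layer sizes and sequence length are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, EMB_DIM, NUM_CLASSES = 128, 768, 3

model = models.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, EMB_DIM)),                       # BERT token embeddings
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),   # BiLSTM layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),                 # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```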

Once the technique and architecture of the classifier were chosen and constructed, the performance of the classification model had to be evaluated to determine whether it could correctly place unseen data into the appropriate classes. To evaluate the efficacy of the classification algorithm, we used several metrics, namely accuracy, precision, recall, and the F1 score, which are defined as follows:

$$\begin{aligned} Recall\;(R)=\frac{TP}{TP+FN} \end{aligned}$$
(4)
$$\begin{aligned} Precision\;(P)=\frac{TP}{TP+FP} \end{aligned}$$
(5)
$$\begin{aligned} Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(6)
$$\begin{aligned} F1=\frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(7)
$$\begin{aligned} P_{\textrm{mac}}=\frac{1}{L} \sum _{i=1}^{L} P_i \end{aligned}$$
(8)
$$\begin{aligned} R_{\textrm{mac}}=\frac{1}{L} \sum _{i=1}^{L} R_i \end{aligned}$$
(9)
$$\begin{aligned} F1_{\textrm{mac}}=\frac{1}{L} \sum _{i=1}^{L} 2 \times \frac{P_{\textrm{mac}} \times R_{\textrm{mac}}}{P_{\textrm{mac}}+R_{\textrm{mac}}} \end{aligned}$$
(10)
$$\begin{aligned} P_{\textrm{weighted}}=\sum _{i=1}^{L}\left( P_i \times Weight_i\right) \end{aligned}$$
(11)
$$\begin{aligned} R_{\textrm{weighted}}=\sum _{i=1}^{L}\left( R_i \times Weight_i\right) \end{aligned}$$
(12)
$$\begin{aligned} F1_{\textrm{weighted}}=\sum _{i=1}^{L}\left( F1_i \times Weight_i\right) \end{aligned}$$
(13)

where TP, TN, FP, and FN refer to True Positive, True Negative, False Positive, and False Negative, respectively.
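In practice, the per-class, macro-averaged, and weighted-averaged scores defined in Eqs. (4)–(13) can be obtained with scikit-learn's classification report; the labels below are toy values used only to show the call.

```python
# Computing the metrics of Eqs. (4)-(13) with scikit-learn (toy labels).
from sklearn.metrics import classification_report, f1_score

y_true = ["Homophobic", "Non-anti-LGBTQ+", "Transphobic", "Non-anti-LGBTQ+", "Homophobic"]
y_pred = ["Homophobic", "Non-anti-LGBTQ+", "Non-anti-LGBTQ+", "Non-anti-LGBTQ+", "Homophobic"]

# Per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred, zero_division=0))
print("Macro F1:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```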

Table 5 Results for Tamil dataset
Table 6 Results for Tamil–English Dataset

Tables 4, 5, and 6 illustrate the classification performance of several machine and deep learning models paired with a variety of features on the three-class dataset. For English, the accuracy ranges anywhere from 0.63 to 0.94. The overall macro-averaged scores for precision, recall, and F1 are lower than 0.4 across the three classes. Because the macro average penalizes models that do not perform well on minority classes and our dataset is severely imbalanced, we believe there is ample room for future study on identifying an ideal model for homophobia/transphobia detection.

Fig. 5 Results for English data

Fig. 6 Results for Tamil data

Fig. 7 Results for Tamil–English data

Fig. 8 Results for benchmarking systems

For Tamil, the accuracy ranges between 0.61 and 0.92. The data make it clear that fastText with RF is the most advantageous feature/model combination for the Tamil language. We observe that, of all the model and feature combinations tested, random forest with BERT embeddings achieves the greatest weighted F1 score in both the English and Tamil–English code-mixed settings, while the combination of random forest with fastText embeddings yields the greatest weighted F1 score for Tamil. As the Tamil–English code-mixed comments make use of romanized writing and English words, they are practically indistinguishable from standard English. Based on the outcomes of our experiments with all three languages and the three class labels, we find that a combination of deep learning and machine learning worked significantly better than either deep learning or machine learning alone. For the deep learning settings, we conducted experiments with only BiLSTM and multilingual BERT. In some of the configurations, multilingual BERT's performance was inferior to that of every other classifier, so we left mBERT out of the accuracy comparison for the benchmarking performance. We have presented the experimental results of our benchmark on the new dataset, and this study could serve as a basis for creating new resources for detecting homophobia and transphobia in other under-resourced languages such as Kannada, Hindi, and Malay.

6 Task setting and evaluation setting

The core objective of this task is to analyze a dataset compiled from social media comments in Tamil, English, and Tamil–English and to search for homophobic and transphobic utterances. The work involves classifying comments and posts at the comment/post level: a system must decide whether a comment is homophobic, transphobic, or non-anti-LGBTQ+ content. Even though a single remark or post in the dataset may be composed of many sentences, the average number of sentences per comment across the corpus is one. Annotations at the level of comments and posts are included in the corpus. The participants were provided with datasets in Tamil, English, and Tamil–English for the purposes of creating, training, and testing homophobia/transphobia detection models.

The participants were provided with English, Tamil, and Tamil–English training, development, and test datasets. In the first phase, the training and development (validation) data were made available to the participants so that they could train and develop homophobia/transphobia detection systems for any of the three languages. Participants had the option of performing cross-validation on the training data or using the validation dataset for early assessments and hyperparameter tuning. The objective of this phase was to verify that all participant-developed systems were ready for evaluation prior to the release of the test data. The submissions were then evaluated to create the ranking list, with the accuracy of the predictions measured against gold-standard labels.

All the datasets have an imbalanced distribution of the homophobia and transphobia classes. Most comments in the Tamil–English code-mixed dataset belong to the non-anti-LGBTQ+ content class (5385), indicating a class imbalance, as seen in Table 3. In the monolingual datasets, the non-anti-LGBTQ+ content class (Tamil: 3205 and English: 4657) also emerged as the majority class compared to the other two categories. To account for this disparity, the macro-averaged F1 score (F) was selected as the official evaluation metric, as the task exhibits significant variance in the number of instances in different classes; macro-averaging gives the same weight to all classes, irrespective of their size. We utilized the Scikit-learn classification report tool.Footnote 10 Participants submitted up to five test runs, with one of them serving as the official run that would be scored and shown on the leaderboard. If no official run was specified, the most recent submission from each team was assumed to be official. In their papers, we allowed teams to explore the distinctions between their systems; the goal was to compare the effectiveness of various setups on the test set.

Table 7 Rank list for Tamil language

6.1 Participants methodology

A total of 98 people signed up to take part in this shared task. In the end, we received 10 submissions for the Tamil language, 13 submissions for the English language, and 11 submissions for the Tamil–English language. The methodologies followed and the results obtained by the participants are outlined below; the papers listed should be consulted for additional in-depth information.

ABLIMET [74] utilized a method that focuses on fine-tuning a pre-trained language model. The model processes the target data and then normalizes the output with a layer normalization module, followed by two fully connected layers. They utilized the RoBERTa-base model for the English subtask and a Tamil RoBERTa model for the Tamil and Tamil–English subtasks; all of these are pre-trained language models.

bitsa_nlp [75] used well-known transformer-based models together with a data augmentation approach for oversampling the English, Tamil, and Tamil–English datasets. They implemented various pre-trained language models based on transformer architectures, namely BERT, multilingual BERT (mBERT), XLM-RoBERTa, IndicBERT, and HateBERT, to detect homophobic and transphobic content.

For experiments with the code-mixed datasets, SSNCSE_NLP [77] used a mix of word embeddings, classifiers, and transformers. They used TF-IDF and count vectorizers with models such as SVM, MLP, random forest, and K-nearest neighbors, and simple transformers such as LaBSE, tamillion, and IndicBERT to extract features.

To vectorize comments, NAYEL [72] experimented with TF-IDF and bigram models. They then applied a set of classification algorithms, namely support vector machine, random forest, passive aggressive classifier, Gaussian naive Bayes, and multilayer perceptron. Among these models, they chose the support vector machine as the best because it was the most accurate.

Nozza [78] used fine-tuned models for the classification task, choosing two large language models, BERT and RoBERTa. They selected HateBERT because it was more accurate than the other models and provided better results than BERT. The team also tried ensemble modeling with a meta-classifier that uses each machine learning classifier’s predicted label as a vote for the final prediction, and they offered two ways to combine the votes: majority voting and weighted voting.

Sammaan [76] constructed their classifier from a collection of transformer-based models and placed second for English, eighth for Tamil, and tenth for Tamil–English. Experimentation was conducted using BERT, RoBERTa, HateBERT, IndicBERT, XGBoost, random forest, and Bayesian optimization.

UMUTeam [73] combined contextual and non-contextual sentence embeddings with linguistic features extracted from a self-developed tool, using neural networks. This team placed seventh in English, third in Tamil, and second in Tamil–English.

Table 8 Rank list for English language
Table 9 Rank list for Tamil–English dataset

6.2 Results and discussion of shared task

A total of 98 individuals registered for this shared task, and fourteen teams presented final results for the Tamil, English, and Tamil–English datasets. Tables 7, 8, and 9 provide the rank lists for Tamil, English, and Tamil–English, respectively. We ranked the teams using the macro-averaged F1 score, which computes the F1 score for each label and takes their unweighted average; the runs were placed in decreasing order of macro F1 score. By fine-tuning a pre-trained language model, the ABLIMET team achieved the best results on the English dataset, where they used the RoBERTa-base model; based on the macro F1 score, this transformer model performed well in comparison to the other models. However, the team's performance on the Tamil and Tamil–English subtasks was much weaker, ranking fifth in Tamil and sixth in Tamil–English because their models were less accurate there. Teams that performed data balancing before running their models achieved better outcomes than other teams on these tasks. The ARGUABLY team did well on the Tamil and Tamil–English classification challenges, utilizing machine and deep learning architectures. Other teams also fared well on this task, particularly those that adopted a fine-tuning strategy with pre-trained transformer models such as BERT [68], mBERT, XLM-RoBERTa [79], IndicBERT [80], HateBERT [81], etc. Several teams also used TF-IDF and count vectorizers, among other techniques, to extract features from the datasets.

7 Conclusion

We propose a dataset containing high-quality, expert categorization of homophobic and transphobic content taken from comments posted in several languages on YouTube. In comparison to the numerous other annotated datasets utilized for various classification tasks, the one produced in this work is comparatively small. Nevertheless, to the best of our knowledge, this is the first dataset that has been developed for the purpose of analyzing homophobia and transphobia in multilingual comments written in Tamil, English, and Tamil–English. Within a supervised classification framework, we carried out an extensive empirical investigation in which we evaluated a wide variety of feature approaches combined with machine and deep learning methods. We also conducted a shared task to encourage research on homophobia and transphobia detection systems. The results of our research show that detecting homophobic and transphobic language in multilingual and multicultural contexts is a difficult challenge that still needs to be tackled.