1 Introduction

Sexism, defined as stereotyping, prejudice, or discrimination based on a person’s sex, occurs in various overt and subtle forms, permeating personal as well as professional spaces. While men and boys are also harmed by sexism, women and girls suffer the brunt of sexist mindsets and the resultant wrongdoings. As more and more people share recollections of sexism they have experienced or witnessed, the automatic classification of these accounts into well-conceived categories of sexism can help fight this oppression, as it can better equip authorities formulating policies and researchers of gender studies to analyze sexism.

The detection of sexism differs from and can complement the classification of sexism. In a forum where instances of sexism are mixed with other posts unrelated to sexism, sexism detection can be used to identify the posts on which to perform sexism classification. Moreover, we observe the distinction between sexist statements (e.g., posts whereby one perpetrates sexism) and the accounts of sexism suffered or witnessed (e.g., personal recollections shared as part of the #metoo movement). We also note the prior work on detecting or classifying personal stories of sexual harassment and/or assault [10, 21]. In this paper, we focus on classifying an account (report) of sexism involving any set of categories of sexism.

Most of the existing research on sexism classification [4, 19, 20] considers at most five categories of sexism. Further, the majority of prior approaches associate only one category of sexism with an instance of sexism. Having mutually exclusive categories of sexism is unreasonable and limiting, as the following example substantiates: ‘A colleague once saw me washing my coffee mug before leaving the office and “joked” if I was practicing for my “home duties”. It’s sad that he doesn’t see the problem with men not bearing half the load of household work’. The categories associated with this account of sexism are ‘Role stereotyping’ (false generalizations about some roles being more suitable for women; also applies to similar mistaken notions about men), ‘Moral policing’ (the promotion of discriminatory guidelines for women under the pretense of morality; also applies to statements that feed into such narratives) and ‘Hostile work environment’ (sexism suffered at the workplace; also applies when sexism perpetrated by a colleague elsewhere makes working worrisome for the victim).

To the best of our knowledge, Parikh et al. [28] is the only work that explores the multi-label categorization of accounts of sexism using machine learning and considers more than three categories of sexism. It provides the largest dataset of such accounts, drawn from the ‘Everyday Sexism Project’, where experiences of sexism are shared from all over the world. The textual accounts are annotated using 23 categories of sexism formulated with the help of a social scientist. However, they perform sexism classification among 14 categories derived by merging some sets of categories. This precludes distinguishing within category pairs such as {moral policing, victim blaming} and {motherhood-related discrimination, menstruation-related discrimination}. We overcome this limitation by carrying out a fine-grained (23-class) classification building on the same labeled dataset.

Most existing approaches for the categorization of sexism are entirely supervised in nature (they use no unlabeled data). The biggest labeled dataset available for sexism classification [28] comprises around 13,000 accounts. In contrast, the accounts of sexism narrated on just the ‘Everyday Sexism Project’ website comfortably number in the hundreds of thousands. Effectively tapping this large volume of data has the potential to enhance the classification performance by overcoming weaknesses stemming from the limited training data, especially given the fine-grained nature of the problem. As far as we are aware, the only existing approach that uses unlabeled data for sexism classification does so merely for fine-tuning a pre-trained model used for computing sentence representations [28]. We formulate the first set of methods that utilize unlabeled instances in a more involved manner for sexism classification. The proposed data augmentation techniques broadly follow the workflow of self-training, a semi-supervised learning paradigm. We augment the existing labeled data through the selective addition of pseudo-labeled (multi-label) unlabeled samples.

We develop our semi-supervised methods keeping in mind that, unlike in single-label multi-class (or binary) classification, a single instance can be tagged with up to 23 categories in our multi-label multi-class setup. We identify textual diversity with reference to (a subset of) the original labeled set as a useful constituent in the evaluation of candidate unlabeled instances and construct mechanisms for building it into our approach. We also seek to incorporate the desirable quality of low class imbalance in ways suited for multi-label semi-supervised classification. We present multiple procedures for combining these elements into a unified semi-supervised learning approach. We further add unlabeled instances that are similar to those original labeled samples that are hard to classify. For every pseudo-labeled sample selected through these semi-supervised data augmentation methods, we generate a confidence score for each label, since the probabilities predicted by the classifier carry useful information about the applicability of each label. We formulate a loss function that uses these label confidence scores as weights.

We also propose a neural architecture for the multi-label classification of accounts of sexism, which integrates a domain-adapted BERT model, referred to as tBERT, with biLSTM and attention in an end-to-end trainable setup. As the domain-adapted tBERT model is tuned on unlabeled accounts of sexism, it may generate better sentence representations for sexism classification than those produced by the original BERT model. These representations are complemented by sentence representations built from word embeddings as a function of trainable neural network parameters. To produce the final representation, the concatenated sentence representations are passed through a biLSTM followed by attention.

As labeled data is scarce for a few of the 23 categories of sexism, we devise a multi-level training approach for multi-label sexism classification in which we train models sequentially at different categorization levels. The scarcity problem could be alleviated by reducing the number of categories via appropriate merging, but doing so would not serve our purpose of fine-grained classification. Hence, we use training on reduced category sets as supervised pre-training steps for the final 23-class classification, benefiting from the higher sample-to-category ratios associated with these pre-training steps. We initialize (most of) the weights of the model at each level with the corresponding weights of the model trained at the previous level (with fewer categories). We integrate the proposed neural model with multi-level training and our loss function to form a method that outperforms numerous baselines by a clear margin.

Our key contributions are summarized below.

  • To the best of our knowledge, this is the first work to consider as many as 23 categories for sexism classification. We introduce a set of semi-supervised methods to augment the labeled data that are tailor-made for multi-label multi-class sexism classification.

  • We devise mechanisms aimed at enhancing the textual diversity in the resultant expanded labeled set, alleviating the skew in the original class distribution and favoring samples which are hard to classify.

  • We propose a neural architecture combining biLSTM and an attention mechanism with a domain-tailored BERT model that allows for end-to-end training for multi-label sexism classification.

  • We propose a loss function that makes use of the label confidence scores associated with each pseudo-labeled sample in the augmented data.

  • We develop a multi-level training method for multi-label sexism classification wherein we train models sequentially at different levels.

  • Several proposed methods outperform numerous baselines, including the existing state-of-the-art, across various established metrics.

The rest of our paper is structured as follows. Section 2 describes related prior work. Section 3 discusses the semi-supervised data augmentation methods that we propose for multi-label sexism classification. Section 4 describes the proposed multi-label sexism classification approaches and also the loss function. Experimental results and observations are provided in Sect. 5. We conclude with a summary in Sect. 6.

2 Related Work

In this section, we review the work on the classification of sexism after noting some distantly related work. Though our work involves classifying accounts of sexism, prior work on the classification of sexist or misogynous statements (e.g., tweets wherein one perpetrates sexism or misogyny) is also included in this review. We also present work relating to the identification and classification of hate speech, as some of it applies to our work to a certain degree in the detection of sexist hate.

Melville et al. [25] apply topic modeling to data gathered from the ‘Everyday Sexism Project’ and map the semantic relations between the topics. ElSherief et al. [16] study user involvement with posts related to gender-based violence and their language variations. Warner and Hirschberg [39] detect hate speech using an SVM classifier with features based on Brown clusters, n-grams and word occurrences. Corazza et al. [11] examine multilingual hate speech detection across datasets in different languages, building models such as biLSTM and SVM. We note that sexism detection can complement sexism classification by preceding it to remove posts unrelated to sexism. The detection of sexism is performed by some hate speech classification approaches that include sexism as a category of hate [12]. [5] explore various deep learning approaches such as fastText, RNN and CNN to classify a given tweet as racist, sexist or neither. Gao et al. [18] perform hate speech detection in a weakly supervised fashion. Waseem and Hovy [40] classify tweets as sexist, racist or neither using character n-grams along with extra-linguistic features. Zhang and Luo [46] explore word embeddings with a combination of GRU and CNN as well as a skipped CNN to classify tweets as sexism, racism, both or non-hate. Qian et al. [33] provide a hierarchical conditional variational autoencoder model for fine-grained hate speech classification. Rodríguez-Sánchez et al. [34] detect sexism in Spanish tweets using biLSTM and BERT models. Chiril et al. [9] introduce a method for distinguishing reports/denunciations of sexism from actual sexist material that is specifically addressed to a target or describes a target. [8] create a dataset for sexism detection in French tweets and explore several deep learning architectures such as BERT, CNN, CNN-LSTM and biLSTM with attention. Frenda et al. [17] present an approach for detecting sexism and misogyny in tweets. Plaza-Del-Arco et al. [32] perform hate speech detection in Spanish tweets for the partially overlapping domains of xenophobia and misogyny by comparing different machine learning and deep learning approaches.

Burnap and Williams [6] build a data-driven model of cyberhate targeting disability, race and sexual orientation using bag-of-words features, a dictionary and typed dependencies extracted with a text parser. Schrading et al. [35] identify pieces of text discussing domestic abuse on Reddit. Nobata et al. [27] extract linguistic, character n-gram, semantic and syntactic features to detect abusive comments and analyze abusive language over time in a corpus of Yahoo Finance and News comments. Van Hee et al. [37] use n-gram and sentiment lexicon features to identify and classify cyberbullying. Agrawal and Awekar [3] explore cyberbullying detection across social media platforms using deep learning methods including biLSTM with attention. Zhong et al. [47] detect cyberbullying in comments posted on Instagram images through features obtained from captions and images. Unlike these papers, we seek to categorize sexism and misogyny, which include hate speech directed at women but are not limited to hate.

Karlekar and Bansal [21] investigate RNN, CNN and a combination of RNN and CNN for categorizing personal experiences of sexual harassment into one or more of three classes. For the classification of personal stories of sexual harassment, [42] uses a density matrix encoder inspired by quantum mechanics. Khatua et al. [22] explore deep learning methods to classify sexual violence into one of four categories. In Anzovino et al. [4], tweets identified as misogynist are classified as discredit, stereotype and objectification, dominance, derailing, or sexual harassment and threats of violence using features involving n-grams, part-of-speech (POS) tags and text embeddings. A four-class categorization of sexist tweets is carried out by Jafarpour et al. [19], who deal with threats and harassment by improving the training data using knowledge graphs such as ConceptNet. Suvarna and Bhalla [36] identify victim-blaming language on Twitter using a transfer learning-based classification method. Chowdhury et al. [10] detect personal recollections of sexual harassment from Twitter posts. In Jha and Mamidi [20], tweets are classified as hostile, benevolent or non-sexist using biLSTM with attention, fastText and SVM. While this categorization of sexism relates to how it is stated, our work concentrates on aspects such as where sexism occurs, who perpetrates it and what an instance of sexism involves.

Parikh et al. [28] address the multi-label categorization of accounts of sexism. They create the largest such dataset and provide a state-of-the-art classifier for sexism classification. The classifier combines sentence embeddings generated using a BERT [13] model with those generated from ELMo [31] and GloVe [30] embeddings using biLSTM with attention and CNN. Abburi et al. [1] explore a multitask approach for semi-supervised sexism classification that deploys three auxiliary tasks, namely estimating the topic proportion distribution, predicting the cluster label and detecting an account of sexism, without incurring any manual labeling cost. They also explore objective functions that make use of label correlations present in the training data. As far as we know, our work presents the first semi-supervised data augmentation approaches for the multi-label classification of accounts describing any type(s) of sexism.

3 Semi-Supervised Data Augmentation for Multi-label Sexism Classification

This section presents proposed methods that employ semi-supervised learning for classifying an account of sexism (also referred to as a post henceforth) such that the categories can co-occur. We begin by laying the groundwork for the description of our methods.

3.1 Basic Self-Training

Self-training [2, 44] is a semi-supervised learning approach that helps augment the set of labeled instances by selectively adding unlabeled samples. For performing a task such as classification (or regression) in the presence of unlabeled and labeled data, a typical self-training method first trains a model (e.g., a classifier) on the labeled instances. Next, it applies the model to the unlabeled instances and identifies a subset of them to be added to the training set, along with the predicted labels, based on criteria such as the confidence scores associated with the model’s predictions. After expanding the training set by adding this pseudo-labeled subset, a new classifier is trained on the augmented set. This process is repeated until some stopping criterion, such as the stabilization of model parameters or a maximum number of iterations, is satisfied.
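The sketch below illustrates this generic workflow in Python; train_fn, predict_fn and select_fn are illustrative placeholders (not the exact components used in this work), and per-label probabilities are rounded at 0.5 to obtain pseudo-labels.

```python
import numpy as np

def self_train(train_fn, predict_fn, select_fn, X_lab, Y_lab, X_unlab, max_iters=3):
    """Generic multi-label self-training loop (illustrative).

    train_fn(X, Y)       -> fitted model
    predict_fn(model, X) -> per-label probabilities of shape (n_samples, n_labels)
    select_fn(probs)     -> boolean mask of unlabeled samples to pseudo-label
    """
    for _ in range(max_iters):
        model = train_fn(X_lab, Y_lab)                    # train on the current labeled set
        if len(X_unlab) == 0:
            break
        probs = predict_fn(model, X_unlab)                # apply the model to the unlabeled pool
        keep = select_fn(probs)                           # confidence-based selection
        if not keep.any():
            break
        pseudo = (probs[keep] >= 0.5).astype(int)         # round per-label probabilities
        X_lab = np.concatenate([X_lab, X_unlab[keep]])    # augment the labeled set
        Y_lab = np.concatenate([Y_lab, pseudo])
        X_unlab = X_unlab[~keep]                          # remove the added samples from the pool
    return train_fn(X_lab, Y_lab), X_lab, Y_lab
```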

3.2 Base Model

We employ a deep learning model that generates the output probabilities for all the labels through the sigmoid nonlinearity on its last (dense) layer as our base classifier. The loss function used for training the model is a weighted mean of label-wise binary cross-entropy values.

3.3 Proposed Approach

While there exists some prior work on the classification of textual records involving sexism, most methods are supervised. We observe that the accounts of sexism reported on the ‘Everyday Sexism Project’ website alone vastly outnumber those provided in the biggest existing labeled dataset for sexism classification [28]. The performance of a sexism classification method can be improved by leveraging this sizable pool of unlabeled data. We explore semi-supervised techniques based on self-training to utilize unlabeled accounts of sexism. We devise multiple methods of expanding the set of labeled data using unlabeled instances, befitting the multi-label nature of instances of sexism.

We first formulate a basic method based on self-training tailor-made for the multi-label problem configuration. We also propose other methods built on top of it with a view to (1) improving the proportions of positive (relevant) samples across categories, (2) improving the class balance keeping in mind the mutual non-exclusivity of category labels, (3) encouraging textual diversity in the newly labeled (pseudo-labeled) data relative to the original training set and (4) favoring the samples which are hard to classify. We also develop some combinations involving these proposed methods. The augmented data generated by any of our methods can be used for training any supervised classification model. We now describe the proposed semi-supervised methods in depth.

3.3.1 Basic Self-Training for Multi-label Classification

The most fundamental, indispensable factor in determining which unlabeled instances should be considered for addition to the original labeled set during self-training is the confidence of correctness associated with each prediction. In single-label (multi-class) classification, one can simply treat the classification probability corresponding to the one predicted class as this confidence. This procedure is inapplicable to the multi-label case, wherein the base classifier outputs a probability of applicability for each label. We note that our base multi-label classifier generates probabilities through sigmoid (as opposed to softmax) nonlinearities and that predictions are made by rounding the per-label (per-sample) probabilities. Since this implies that the probability linked with the prediction for unlabeled sample \(u_k \in U\) for label \(l_j\) is either \({\hat{p}}_{kj}^{\sigma }\) or \(1 - {\hat{p}}_{kj}^{\sigma }\), we mandate that at least one of these two quantities exceeds a threshold (hyper-parameter T) for each label for \(u_k\) to qualify for being added to the labeled set.
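As a minimal sketch, this qualification check can be written as follows, where probs holds the sigmoid outputs \({\hat{p}}_{kj}^{\sigma }\) for the unlabeled pool and T is the confidence threshold:

```python
import numpy as np

def confident_mask(probs, T):
    """A sample qualifies only if, for every label, the confidence of the rounded
    prediction, max(p, 1 - p), exceeds the threshold T.
    probs: (n_unlabeled, n_labels) sigmoid outputs."""
    confidence = np.maximum(probs, 1.0 - probs)
    return (confidence > T).all(axis=1)
```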

3.3.2 Improving Positive Sample Proportions Across Categories (IPSPC)

In addition to the basic confidence-based check, this method subjects unlabeled instances to another qualifying test relating to the number of predicted labels using the base classifier. The intuition is that the higher the number of predicted labels, the greater the number of labels for which relevant (positive) samples are contributed. Since the number of positive samples is outweighed by the negative counterpart by a substantial margin across most labels, we attempt to counter this skew by picking the unlabeled instances with a certain minimum number of predicted labels (hyper-parameter \(P_{min}\)). In order to maximize the label correctness of the chosen pseudo-labeled set, we also avoid candidate samples with an unreasonably high number of predicted labels (hyper-parameter \(P_{max}\)).
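Continuing the sketch above, the IPSPC check simply bounds the number of predicted labels per candidate:

```python
def ipspc_mask(probs, p_min, p_max):
    """Keep candidates whose number of predicted labels (probabilities rounded
    at 0.5) lies within [p_min, p_max]."""
    n_pred = (probs >= 0.5).sum(axis=1)
    return (n_pred >= p_min) & (n_pred <= p_max)

# Candidates passing both checks:
# qualified = confident_mask(probs, T) & ipspc_mask(probs, P_min, P_max)
```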

3.3.3 Favoring Low-Support Labels

While the previous method seeks to improve the per-category ratios of positive to negative sample counts generally, this method attempts to correct the class imbalance between categories while creating the augmented dataset. We present two methods that order unlabeled samples (adhering to the checks proposed earlier) based on notions of support that we design for multi-label classification. We then pick the lowest-support \(Top_p\) percent of the samples as the pseudo-labeled set in each iteration, where hyper-parameter \(Top_p\) is empirically determined. For the first method (Support.uniform), support for an unlabeled sample \(u_k \in U\) is defined as

$$\begin{aligned} support\_uniform(u_k)= \frac{\sum _{j \in {P_k}^+}\sum _{i=1}^{M^*}y_{ij}}{|{P_k}^+|}, \end{aligned}$$
(1)

where \({P_k}^+\) is the set of labels predicted for \(u_k\), U denotes the unlabeled data, \(M^*\) is the number of labeled samples in the current iteration, and \(y_{ij}\) is 1 if category \(l_j\) is given for sample \(x_i\) in the labeled data and 0 otherwise.

Since this method considers the average coverage in the labeled data across all predicted labels for a sample, a sample linked with some weak labels (labels with low frequencies in the labeled data) and some extremely strong labels may get rated lower than one linked with no weak label and some moderately strong labels. In Support.weakest, we explicitly take into account only the weak labels for the notion of support. In each iteration, we determine weak labels based on the coverage (frequency) of each label in the current labeled data. Specifically, a label \(l_j\) is weak if \(\sum _{i=1}^{M^*}y_{ij} < S_m\) and strong otherwise, where \(S_m\) is a hyper-parameter. We disregard all strong labels while calculating the support for an unlabeled sample with at least one weak predicted label. For samples with predictions involving no weak classes, we resort to the previous support computation. For the rest, we compute the support as follows.

$$\begin{aligned} support\_weakest(u_k)= \frac{\sum _{j \in {P_k}^+}v_j}{|\{z \mid \sum _{i=1}^{M^*}y_{iz} < S_m\text {,}\ z \in {P_k}^+\}|}, \end{aligned}$$
(2)

where \(v_j = \sum _{i=1}^{M^*}y_{ij}\) if \(\sum _{i=1}^{M^*}y_{ij} < S_m\) and 0 otherwise.
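The two support notions (Eqs. 1 and 2) can be sketched as follows, assuming Y_lab is the binary label matrix of the current labeled set and pred_labels holds the indices of the labels predicted for a candidate:

```python
import numpy as np

def support_uniform(pred_labels, Y_lab):
    """Eq. (1): mean labeled-data coverage over the labels predicted for a candidate."""
    coverage = Y_lab.sum(axis=0)             # per-label positive counts in the labeled set
    return coverage[pred_labels].mean()

def support_weakest(pred_labels, Y_lab, S_m):
    """Eq. (2): average coverage over only the weak predicted labels (coverage < S_m);
    falls back to support_uniform when no predicted label is weak."""
    coverage = Y_lab.sum(axis=0)
    weak = [j for j in pred_labels if coverage[j] < S_m]
    if not weak:
        return support_uniform(pred_labels, Y_lab)
    return coverage[weak].sum() / len(weak)
```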

3.3.4 Seeking Textual Diversity

We identify the utility of selecting pseudo-labeled data such that it complements the existing labeled data, as opposed to being an expanded version of it, especially in terms of linguistic characteristics. This family of methods aims to introduce greater textual diversity, with respect to (a subset of) the current labeled data, in the set of samples being added in each iteration. Candidate unlabeled samples (meeting the qualifying criteria proposed in the first two methods) are ranked as per how far they are from existing labeled samples in terms of the corresponding vector representations. To generate the embedding for the text of a given sample, we create a variant of the state-of-the-art deep learning model for the original labeled dataset given in Parikh et al. [28]. Cosine distance is used as the distance metric. The highest-ranked \(Top_p\) percent of the samples are added to the labeled data in each iteration. In Diversity.uniform, diversity for a sample \(u_k \in U\) is given by,

$$\begin{aligned} diversity\_uniform(u_k)= \frac{\sum _{i=1}^{M^*} cos\_dist(post\_rep(u_k), post\_rep(x_i))}{M^*}, \end{aligned}$$
(3)

where \(post\_rep\) refers to the vector representation for a sample. Diversity.uniform picks the most distinct unlabeled samples w.r.t. the current labeled set in a label-independent manner. We develop the Diversity.label method to incorporate per label diversity, avoiding indiscriminate comparisons against all labeled samples. For a candidate sample, for each label predicted for it, we compute the average of the distances against only the labeled samples bearing that label. Each candidate is scored using the average of these label-wise averages. Our formulation can be expressed as,

$$\begin{aligned} diversity\_label(u_k)= \frac{\sum _{j \in {P_k}^+}\frac{\sum _{i=1}^{M^*}y_{ij} cos\_dist(post\_rep(u_k), post\_rep(x_i))}{\sum _{i=1}^{M^*}y_{ij}}}{|{P_k}^+|} \end{aligned}$$
(4)
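Both diversity scores (Eqs. 3 and 4) can be sketched as below, assuming lab_reps holds the post representations of the current labeled samples and Y_lab is their binary label matrix:

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance between two post representations."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def diversity_uniform(u_rep, lab_reps):
    """Eq. (3): mean cosine distance of a candidate to all current labeled posts."""
    return float(np.mean([cos_dist(u_rep, x) for x in lab_reps]))

def diversity_label(u_rep, pred_labels, lab_reps, Y_lab):
    """Eq. (4): average, over the predicted labels, of the mean distance to the
    labeled posts bearing that label."""
    per_label_means = []
    for j in pred_labels:
        members = lab_reps[Y_lab[:, j] == 1]      # labeled posts tagged with label j
        per_label_means.append(np.mean([cos_dist(u_rep, x) for x in members]))
    return float(np.mean(per_label_means))
```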

3.3.5 Combining Previously Proposed Methods

We develop two ways of integrating our methods favoring low-support labels and seeking greater textual diversity to explore if their individual strengths combine well.

  1. Score computation: We calculate the combined score for a candidate \(u_k\) from the unlabeled data U as follows.

$$\begin{aligned} score(u_k) = \frac{ diversity\_uniform(u_k) ~\text { (or }~ diversity\_label(u_k)\text {)}}{support\_uniform(u_k) ~\text { (or }~ support\_weakest(u_k)\text {)}} \end{aligned}$$
(5)

The ‘or’ in the equation above simply indicates that we consider all four (2 × 2) combinations stemming from the two previously proposed families of methods. From the pseudo-labeled candidate instances that pass the screenings described previously, the \(Top_p\) percent of instances with the highest combined scores are chosen for labeled data augmentation.

  2. Intersection: In each iteration, we employ one method favoring low-support labels and one method seeking greater textual diversity to each pick a set of pseudo-labeled samples. We then intersect the sets of pseudo-labeled samples selected by these two methods and augment the current labeled set with the resultant set of samples. In this way of combining the previously proposed methods too, we explore pairing each of the two methods favoring low-support labels with each of the two methods seeking greater textual diversity.
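Reusing diversity_label and support_weakest from the earlier sketches, the two combination schemes can be expressed as follows (the small constant in the denominator is only a numerical guard, not part of Eq. 5):

```python
def combined_score(u_rep, pred_labels, lab_reps, Y_lab, S_m):
    """Eq. (5), instantiated here with diversity_label and support_weakest."""
    return diversity_label(u_rep, pred_labels, lab_reps, Y_lab) / \
        max(support_weakest(pred_labels, Y_lab, S_m), 1e-9)

def intersection_selection(ids_by_support, ids_by_diversity):
    """Intersection variant: keep only candidates chosen by both criteria."""
    return set(ids_by_support) & set(ids_by_diversity)
```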

3.3.6 Favoring Hard Samples

In this family of methods, we augment the labeled dataset with pseudo-labeled samples that are similar to labeled samples that we deem hard to classify. We define the notion of hard samples in a label-specific manner as well as generically. Each method orders unlabeled samples (that adhere to the qualifying criteria proposed in the first two methods) based on their vector representation similarities with (a subset of) the labeled samples identified as hard based on the corresponding notion of hardness. Then, we choose the most similar \(Top_p\) percent of the samples as the pseudo-labeled set in each iteration, where \(Top_p\) is a hyper-parameter.

We hold out a part of the training data and train a new classifier (following the same architecture as the base model) on the rest. We identify the hard samples from this held-out data using the new classifier’s predictions on it. For each sample in the held-out data, for each label, we compute the probability of the correct binary (applicable or not) prediction from the classifier-produced probabilities. For sample \(x_h\) from the held-out data and label \(l_j\), this probability \({\hat{p}}_{hj}^{c}\) is calculated in the following manner.

$$\begin{aligned} {\hat{p}}_{hj}^{c} = {\left\{ \begin{array}{ll} {\hat{p}}_{hj}^{\sigma } \text { , if } y_{hj} = 1\\ 1-{\hat{p}}_{hj}^{\sigma } \text { , otherwise} \end{array}\right. } \end{aligned}$$
(6)

In a proposed method which we name hard.uniform, we deem a held-out data sample \(x_h\) hard if the mean of the probabilities \({\hat{p}}_{hj}^{c}\) across all labels is less than a threshold (hyper-parameter \(T_{hu}\)). We also experiment with replacing the threshold-based check for picking the hard labeled samples; we sort all held-out data samples \(x_h\) by the means of the probabilities \({\hat{p}}_{hj}^{c}\) across all labels and deem a certain number of top samples hard. The threshold-based method outperforms this variant. Once the hard labeled samples are chosen, we compute the average similarity of an unlabeled sample \(u_k\) with them using cosine similarity as follows.

$$\begin{aligned} hard\_uniform\_similarity(u_k)= \frac{\sum _{h=1}^{H^*} cos\_similarity(post\_rep(u_k), post\_rep(x_h))}{H^*}, \end{aligned}$$
(7)

where \(H^*\) is the total number of hard labeled samples identified and \(post\_rep\) refers to the vector representation for a sample computed using the method mentioned in Sect. 3.3.4.

In our hard.label method, we identify hard labeled samples from the held-out data per label. Sample \(x_h\) is identified as hard for label \(l_j\) if \({\hat{p}}_{hj}^{c}\) is below a threshold (hyper-parameter \(T_{hl}\)). We experiment with a variant aimed at removing the threshold-based check in this case too; for each label \(l_j\), we sort all held-out data samples \(x_h\) by the probabilities \({\hat{p}}_{hj}^{c}\) and deem a certain number of top samples hard. This variant under-performs the threshold-based method. We make use of the label-wise sets of hard samples as follows. For an unlabeled sample \(u_k\), for each label predicted for it with a non-empty hard sample set, we compute the average of the cosine similarities to the hard samples for that label. The final similarity score is the average of these label-wise averages. In the event that the hard sample sets for all the labels predicted for \(u_k\) are empty, it is assigned \(hard\_uniform\_similarity(u_k)\).

$$\begin{aligned} hard\_label\_similarity(u_k)= \frac{\sum _{j \in \{j \mid H_j^*> 0, j \in {P_k}^+\}}\frac{\sum _{h=1}^{H_j^*} cos\_similarity(post\_rep(u_k), post\_rep(x_h))}{H_j^*}}{|\{j \mid H_j^* > 0, j \in {P_k}^+\}|}, \end{aligned}$$
(8)

where \(H_j^*\) is the number of hard labeled samples identified for label \(l_j\).
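A sketch of the hard.uniform variant (Eqs. 6 and 7), reusing cos_dist from the diversity sketch; probs_held and Y_held denote the classifier probabilities and the true binary labels for the held-out data:

```python
import numpy as np

def correct_prediction_probs(probs_held, Y_held):
    """Eq. (6): probability assigned to the correct binary decision for each label."""
    return np.where(Y_held == 1, probs_held, 1.0 - probs_held)

def hard_uniform_ids(probs_held, Y_held, T_hu):
    """hard.uniform: a held-out sample is hard if its mean correct-prediction
    probability across labels falls below the threshold T_hu."""
    p_correct = correct_prediction_probs(probs_held, Y_held)
    return np.where(p_correct.mean(axis=1) < T_hu)[0]

def hard_uniform_similarity(u_rep, hard_reps):
    """Eq. (7): mean cosine similarity of a candidate to the hard held-out samples."""
    return float(np.mean([1.0 - cos_dist(u_rep, h) for h in hard_reps]))
```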

3.3.7 Generating Label Confidence Scores

For each unlabeled sample \(u_k\) that is added to the training data by any of the data augmentation methods, a confidence score is computed. The confidence score \(c_{kj}\) for each predicted probability \({\hat{p}}_{kj}^{\sigma }\) of such a sample is computed as follows,

$$\begin{aligned} c_{kj} = {\left\{ \begin{array}{ll} {\hat{p}}_{kj}^{\sigma },&{} \text {if } {\hat{p}}_{kj}^{\sigma } \ge 0.5 \\ 1-{\hat{p}}_{kj}^{\sigma }, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

These confidence scores can be used during the fully supervised classification.
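In vectorized form this amounts to the following (a sketch; probs holds the sigmoid outputs for the added samples):

```python
import numpy as np

def label_confidence_scores(probs):
    """Eq. (9): c_kj = p if p >= 0.5, else 1 - p, computed per sample and per label."""
    return np.where(probs >= 0.5, probs, 1.0 - probs)
```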

4 Proposed Multi-Label Sexism Classification Approaches

In this section, we detail our approaches for carrying out the multi-label classification of accounts of sexism. We begin the section with a description of the proposed neural architecture. Next, we discuss the multi-level training approach that we develop for multi-label classification by organizing the categories of sexism into a meaningful hierarchy. We conclude the section by specifying the details of the proposed loss function utilizing label confidence scores.

4.1 Proposed Sexism Classification Architecture

Figure 1 shows our proposed sexism classification architecture. Each account of sexism (raw input text) is represented as multiple 3-D tensors, each of which embeds each word of each sentence. The ELMo and GloVe word embedding methods generate \({\mathcal {R}}^{|S| \times |W| \times d_e}\) and \({\mathcal {R}}^{|S| \times |W| \times d_g}\) tensors, respectively, where |S| denotes the maximum number of sentences per post (account of sexism), |W| is the maximum number of words per sentence, and \(d_e\) and \(d_g\) represent the embedding dimensions for ELMo and GloVe, respectively. We also employ a domain-adapted BERT variant named tBERT for embedding words; it produces a \({\mathcal {R}}^{|S| \times |W| \times d_b}\) tensor, where \(d_b\) is the tBERT embedding dimension. tBERT is created by further training a pre-trained, generic BERT model in an unsupervised manner using unlabeled accounts of sexism, with a view to producing more effective representations for sexism classification than those produced by off-the-shelf BERT models. We incorporate tBERT into our end-to-end training for sexism classification (the weights of tBERT are updated during training).

Fig. 1 Proposed Sexism Classification Architecture

We construct vector representations for the word-embedded sentences in the three 3-D tensors using a bidirectional LSTM and an associated attention mechanism [43]. For each sentence, the biLSTM layer produces |W| h-dimensional hidden states (one for each of the |W| time steps). These hidden states are aggregated into a vector representation by the attention layer (for each sentence). Overall, this results in three \({\mathcal {R}}^{|S| \times h}\) tensors (one corresponding to each of the three 3-D word embedding tensors), where h is the biLSTM output length. These three 2-D tensors of sentence representations are concatenated to produce a \({\mathcal {R}}^{|S| \times 3h}\) tensor. The sequence of sentence vectors in this tensor is then passed to a biLSTM followed by attention, resulting in the post representation. Finally, a fully connected layer with the sigmoid nonlinearity generates the output probabilities for all the labels.
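A condensed PyTorch-style sketch of this encoder stack is given below. To keep it short, the ELMo, GloVe and tBERT lookups are abstracted away as precomputed input tensors (in the actual model, tBERT is part of the end-to-end training), the attention pooling is simplified, and all layer names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """biLSTM over a sequence followed by a simplified attention pooling layer."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, seq_len, in_dim)
        h, _ = self.lstm(x)                     # (batch, seq_len, 2 * hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        return (weights * h).sum(dim=1)         # attention-weighted sum: (batch, 2 * hidden)

class SexismClassifier(nn.Module):
    """Per-embedding sentence encoders, concatenation, post-level encoder, sigmoid head."""
    def __init__(self, d_elmo, d_glove, d_bert, hidden, n_labels):
        super().__init__()
        self.sent_elmo = BiLSTMAttention(d_elmo, hidden)
        self.sent_glove = BiLSTMAttention(d_glove, hidden)
        self.sent_bert = BiLSTMAttention(d_bert, hidden)
        self.post_enc = BiLSTMAttention(3 * 2 * hidden, hidden)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, elmo, glove, bert):       # each: (batch, |S|, |W|, d_*)
        def encode(enc, t):
            b, s, w, d = t.shape
            sent = enc(t.reshape(b * s, w, d))  # encode every sentence independently
            return sent.reshape(b, s, -1)
        sents = torch.cat([encode(self.sent_elmo, elmo),
                           encode(self.sent_glove, glove),
                           encode(self.sent_bert, bert)], dim=-1)
        post = self.post_enc(sents)             # post representation
        return torch.sigmoid(self.out(post))    # per-label probabilities
```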

As the cross-entropy loss used for single-label multi-class classification problems is not applicable to the multi-label classification setup, we employ the extended binary cross-entropy (EBCE) loss used in Parikh et al. [28]. It is formulated as a weighted mean of label-wise binary cross-entropy values, where the weights are meant to neutralize the class imbalance.

Fig. 2 Category hierarchy for multi-level training for multi-label sexism classification

4.2 Multi-level Training Using a Category Hierarchy

We devise a coarse-to-fine training approach for multi-label sexism classification wherein we sequentially train models using categories of sexism of different levels of granularity. The motivation behind this coarse-to-fine training is the scarcity of labeled data for some categories in the 23-class fine-grained categorization scheme. The scarcity problem can be alleviated by simply merging categories appropriately, but it does not serve the fine-grained classification objective. Hence, we use the training on a reduced category set (created through category merging) as a supervised pre-training step. The intuition is that the higher sample to category ratio could lead to better weights and initializing the final model with (most of) those weights (as opposed to randomly) could therefore yield a superior fine-grained classifier. Moreover, this supervised pre-training may also in turn benefit from another instance of supervised pre-training involving even fewer categories. Therefore, we carry out a sequence of training steps at different categorization levels. The architectures of the models at all levels are the same except for the final, dense layer. At each level, the dense layer of each model has the same number of output units as the category set size.

In order to perform the multi-level training, we create a category hierarchy from the original 23 fine-grained categories of sexism under the direction of a social scientist. Figure 2 specifies this three-level hierarchy of categories of sexism. m() denotes the merging of categories. At each level, the merged categories are shown in a different color. The category hierarchy consists of 8 categories at the most abstract level (level 1) and 15 categories in the middle level.

First, we train a model using the level 1 categories with a final, dense (fully connected) layer with 8 output units. The values of the weights of this trained level 1 model except those of the final, dense layer are used to initialize the corresponding weights in the level 2 model. We then train this model using the training data modified using the 15 middle-level categories. The same is repeated for the third, final, fine-grained level. Except for the final layer, the level 3 model’s weights are initialized with the values of the weights of the trained level 2 model. We then train the 23-category level 3 model, which we perform the desired fine-grained sexism classification with.
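The weight transfer between levels can be sketched as below (PyTorch-style); the final-layer name "out" follows the earlier architecture sketch and is an assumption rather than the exact naming used in our implementation.

```python
def transfer_weights(prev_model, new_model, final_layer_prefix="out."):
    """Initialize a model at the next level with the previous level's weights,
    excluding the final dense layer, whose output size differs between levels."""
    prev_state = prev_model.state_dict()
    new_state = new_model.state_dict()
    for name, tensor in prev_state.items():
        if not name.startswith(final_layer_prefix) and name in new_state:
            new_state[name] = tensor
    new_model.load_state_dict(new_state)
    return new_model

# level 1 (8 classes) -> level 2 (15 classes) -> level 3 (23 classes), e.g.:
# model_l2 = transfer_weights(trained_l1, SexismClassifier(..., n_labels=15))
# model_l3 = transfer_weights(trained_l2, SexismClassifier(..., n_labels=23))
```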

4.3 Proposed Confidence-Modified Binary Cross-Entropy Loss

We formulate a variant of the extended binary cross-entropy (EBCE) loss that is aimed at utilizing the label confidence scores. These per-label confidence scores may be generated by a pseudo-labeling data augmentation method. The proposed loss, which we refer to as Confidence-modified Binary Cross-Entropy (CBCE), also involves weights through which we seek to correct the class imbalance. We formulate the CBCE loss as follows.

$$\begin{aligned} CBCE = - \frac{1}{L} \sum _{i=1}^{L} \frac{1}{N} \sum _{j=1}^{N} w_{jy_{ij}} c_{ij} \left\{ y_{ij} \log ({\hat{p}}_{ij}^{\sigma }) + (1 - y_{ij}) \log (1 -{\hat{p}}_{ij}^{\sigma }) \right\} \end{aligned}$$
(10)

Here, L is the number of samples and N is the number of classes. \({\hat{p}}_{ij}^{\sigma }\) is the predicted probability of label \(l_j\) being applicable to sample \(x_i\). \(c_{ij}\) is the confidence score for label \(l_j\) for sample \(x_i\). \(y_{ij}\) is defined as follows.

$$\begin{aligned} y_{ij} = {\left\{ \begin{array}{ll} 1, \text { if label } l_j \text { is applicable to sample } x_i\\ 0, \text { otherwise} \end{array}\right. } \end{aligned}$$
(11)

The weights for correcting class imbalance \(w_{jz}\) are computed as follows.

$$\begin{aligned} w_{jz} = \frac{L}{2|\{x_i \mid y_{ij} = z, 1 \le i \le L\}|} \end{aligned}$$
(12)
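A small PyTorch-style sketch of Eqs. (10)–(12); the clamping and the guards against empty classes are numerical-safety additions, not part of the formulation above.

```python
import torch

def class_balance_weights(Y):
    """Eq. (12): w_{j1} = L / (2 * #positives of label j), w_{j0} = L / (2 * #negatives).
    Y: (L, N) float tensor of ground-truth labels."""
    L = Y.shape[0]
    pos = Y.sum(dim=0).clamp(min=1.0)            # guard against labels with no positives
    neg = (L - Y.sum(dim=0)).clamp(min=1.0)
    return L / (2 * pos), L / (2 * neg)

def cbce_loss(probs, y, conf, w_pos, w_neg, eps=1e-7):
    """Eq. (10): confidence-modified binary cross-entropy.
    probs, y, conf: (L, N) tensors; w_pos, w_neg: (N,) class-balancing weights."""
    probs = probs.clamp(eps, 1.0 - eps)                        # numerical safety
    bce = y * torch.log(probs) + (1 - y) * torch.log(1 - probs)
    weights = torch.where(y == 1, w_pos, w_neg)                # picks w_{j y_ij}
    return -(weights * conf * bce).mean()                      # mean over samples and labels
```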

5 Experiments

This section discusses the experimental results of the proposed methods in comparison to several baseline methods and presents analyses related to our methods. Our code and all the hyper-parameter values used are publicly available online.

5.1 Dataset

Parikh et al. [28] introduce a dataset comprising 13,023 accounts of sexism, each labeled with at least one of 23 categories. They formulated the 23 categories of sexism under the direction of a social scientist by considering gender-related discourse and campaigns [14, 15, 24, 26] as well as the possible impact on public policy. Most of the 10 annotators involved had studied topics related to gender and/or sexuality formally. Moreover, a three-phase method of annotation was pursued to ensure that the categorization of each account of sexism in the final dataset involved labeling by at least two annotators. Table 1 provides the descriptions of the categories. We use this original labeled dataset (only) to train all supervised baselines.

Table 1 Descriptions of the categories of sexism used in the dataset [28]
Table 2 Linguistic analysis of the labeled dataset

We provide a linguistic analysis of this labeled data using Linguistic Inquiry and Word Count (LIWC), a text analysis tool [29]. We focus on the LIWC scores for the Work, Money, Religion and Body categories. Details concerning how the LIWC scores are computed can be found in [29]. Table 2 shows the scores for these LIWC categories for all the categories of sexism in the dataset. For each class of sexism, we compute the LIWC scores for all posts tagged with that class label and take the mean of those scores to obtain the category-level scores reported. We observe the highest scores for the Work and Money LIWC categories for Pay gap. As expected, the maximum score for the Religion LIWC category is found for Religion-based sexism. For the Body LIWC category, we find the highest score for Body shaming. Table 2 also lists 4-grams from the textual accounts of sexism associated with each category of sexism.

In this paper, we also devise semi-supervised methods to automatically expand this dataset. Unlabeled instances of sexism are obtained from the ‘Everyday Sexism Project’, which hosts several hundred thousand accounts of sexism from observers and survivors. We shortlist the 70,000 shortest unlabeled instances of sexism containing a minimum of 7 words each. Short instances are selected in order to maximize the similarity to the labeled data [28]. Our data augmentation methods select a subset of these 70,000 accounts of sexism for augmenting the training data.

5.2 Evaluation Metrics

Evaluation metrics for multi-label classification problems differ from the standard metrics used in cases where the classes cannot co-occur. We report results for a number of established metrics, namely Subset Accuracy (SA), instance-based F1 (\(F_{ins}\)), instance-based accuracy (Acc), F1 macro (\(F_{mac}\)) and F1 micro (\(F_{mic}\)) [28, 45]. Subset Accuracy, which measures the fraction of the exact matches, is the strictest metric.

These metrics are mathematically expressed as follows.

$$\begin{aligned} F_{ins}&= \frac{2 P_{ins} R_{ins}}{P_{ins} + R_{ins}}, \text { where}\nonumber \\ P_{ins}&= \frac{1}{P} \sum _{i=1}^{P} \frac{|\mathbf {y_i} \cap {\hat{\mathbf {y}}_i}|}{|{\hat{\mathbf {y}}_i}|}, \nonumber \\ R_{ins}&= \frac{1}{P} \sum _{i=1}^{P} \frac{|\mathbf {y_i} \cap {\hat{\mathbf {y}}_i}|}{|\mathbf {y_i}|} \end{aligned}$$
(13)
$$\begin{aligned} F_{mac}&= \frac{1}{Q} \sum _{j=1}^{Q} F(TP_j, FP_j, FN_j), \nonumber \\ F_{mic}&= F\left( \sum _{j=1}^{Q} TP_j, \sum _{j=1}^{Q} FP_j, \sum _{j=1}^{Q} FN_j\right) , \end{aligned}$$
(14)
$$\begin{aligned} \text {where } TP_j&= |\{x_i \mid l_j \in (\mathbf {y_i} \cap {\hat{\mathbf {y}}_i}), 1 \le i \le P\}|,\nonumber \\ FP_j&= |\{x_i \mid l_j \in ({\hat{\mathbf {y}}_i}-\mathbf {y_i}), 1 \le i \le P\}|,\nonumber \\ FN_j&= |\{x_i \mid l_j \in (\mathbf {y_i} - {\hat{\mathbf {y}}_i}), 1 \le i \le P\}|, \nonumber \\ F(TP^*, FP^*, FN^*)&= \frac{2 TP^*}{2 TP^* + FN^* + FP^*} \end{aligned}$$
(15)
$$\begin{aligned} Acc= \frac{1}{P} \sum _{i=1}^{P} \frac{|\mathbf {y_i} \cap {\hat{\mathbf {y}}_i}|}{|\mathbf {y_i} \cup {\hat{\mathbf {y}}_i}|} \end{aligned}$$
(16)
$$\begin{aligned} SA= \frac{1}{P} \sum _{i=1}^{P} 1_{\mathbf {y_i} = {\hat{\mathbf {y}}_i}} \end{aligned}$$
(17)

Here, P denotes the number of posts, \(\mathbf {y_i}\) is the set of true labels applicable to post \(x_i\), and \({\hat{\mathbf {y}}_i}\) is the set of predicted labels for post \(x_i\). Q denotes the number of classes, and \(l_j\) denotes the \(j^{th}\) of the Q labels.
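For reference, the metrics above can be computed as in the following sketch, where Y_true and Y_pred are binary (0/1) integer matrices of shape (P, Q):

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Subset Accuracy, instance-based accuracy/F1, and macro/micro F1 (Eqs. 13-17)."""
    inter = (Y_true & Y_pred).sum(axis=1)
    union = (Y_true | Y_pred).sum(axis=1)
    sa = float((Y_true == Y_pred).all(axis=1).mean())               # Subset Accuracy
    acc = float((inter / np.maximum(union, 1)).mean())              # instance-based accuracy
    p_ins = (inter / np.maximum(Y_pred.sum(axis=1), 1)).mean()
    r_ins = (inter / np.maximum(Y_true.sum(axis=1), 1)).mean()
    f_ins = 2 * p_ins * r_ins / max(p_ins + r_ins, 1e-12)           # instance-based F1
    tp = (Y_true & Y_pred).sum(axis=0)
    fp = ((1 - Y_true) & Y_pred).sum(axis=0)
    fn = (Y_true & (1 - Y_pred)).sum(axis=0)
    f_mac = float((2 * tp / np.maximum(2 * tp + fp + fn, 1)).mean())             # F1 macro
    f_mic = float(2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1))     # F1 micro
    return {"SA": sa, "Acc": acc, "F_ins": float(f_ins), "F_mac": f_mac, "F_mic": f_mic}
```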

5.3 Baselines

All the below-mentioned deep learning architectures end with a dense layer with the sigmoid activation and are trained with the EBCE loss.

Random

For each test sample, labels are selected randomly as per their normalized frequencies in the training data.

Traditional Machine Learning (TML)

We experiment with logistic regression (LR), support vector machine (SVM) and random forest (RF) classifiers. All the classifiers are applied to two feature sets, namely TF-IDF on word unigrams and bigrams (word-ngrams) and the average of the ELMo vectors [31]. This gives rise to six combinations: word-ngrams with LR, word-ngrams with SVM, word-ngrams with RF, ELMo with LR, ELMo with SVM and ELMo with RF.

Deep Learning (DL): LSTM-based Architectures

  • biLSTM: The word embeddings corresponding to each post are fed through a bidirectional LSTM.

  • biLSTM-Attention and Hierarchical-biLSTM-Attention: The biLSTM-Attention is similar to biLSTM, but with the attention scheme from Yang et al. [43]. Hierarchical-biLSTM-Attention is similar to Yang et al. [43], but GRUs are replaced with LSTMs. For each post, the word embeddings are passed through the biLSTM-attention to create a sentence representation. The sentence representation is then fed to another biLSTM-attention.

  • USE-biLSTM-Attention and BERT-biLSTM-Attention: Sentence embeddings are generated using USE [7] and BERT via bert-as-service [41] separately and passed through a biLSTM with attention.

CNN-biLSTM and CNN-based Architectures

  • C-biLSTM: This architecture is somewhat similar to the approach of [21] and to a variant of the C-LSTM architecture [48]. After applying the convolution operation on each post’s word vectors, the feature maps are stacked along the filter dimension in order to generate a series of window vectors, which are then passed through a biLSTM.

  • CNN-biLSTM-Attention: This architecture is similar to Wang et al. [38], where word embeddings of each sentence are fed to convolutional and max-over-time pooling layers. These sentence representations are then fed to a biLSTM with attention.

  • CNN-Kim: Word vectors of a post are passed through convolutional and max-over-time pooling layers similar to Kim [23].

Semi-supervised Classification Methods

  • tBERT-biLSTM-Attention: This architecture is similar to BERT-biLSTM-Attention except that the pre-trained BERT model is fine-tuned using unlabeled instances of sexism [28].

  • Opti-DL: This is the best-performing model by [28]. The neural model concatenates sentence representations obtained using a BERT [13] model tuned using unlabeled instances of sexism with those generated from ELMo [31] and GloVe [30] embeddings separately using biLSTM with an attention scheme. The combined sentence vectors are passed through biLSTM with attention to produce the post representation.

Data augmentation methods

  • Random Sampling + Opti-DL: We randomly choose the same number of unlabeled samples as those selected by our best data augmentation method (Diversity.label \(\cap \) Support.weakest) using the proposed architecture (PA) as a base classifier, label them using the Opti-DL model, augment the original labeled set with them, and then train Opti-DL on the expanded set.

  • Mean-based self-training + Opti-DL: We adapt the basic self-training approach [44] to multi-label classification. Iteratively, we train the Opti-DL model to pseudo-label unlabeled samples and add a sample to the training set only if the mean of the model-given probabilities for its predicted class labels exceeds a certain threshold.

5.4 Results

For all deep learning approaches, the pre-processing steps we perform involve eliminating some non-alphanumeric characters and extra spaces, lower-casing, and zero-padding input tensors as necessary. When splitting a post into sentences, each sentence of more than 35 words is split into several sentences, maintaining a maximum sentence length of 35 words.

From the original labeled data, 15% is used for validation and 15% for testing. During the testing phase, the validation set is merged into the training set. After augmenting the original labeled data, the base classifier and the proposed architecture are also trained on the union of the augmented data and the validation set for the semi-supervised methods. For all the deep learning methods, the mean of the results obtained over three runs is reported for all the metrics. For each proposed data augmentation method, data augmentation is carried out over 3 iterations.

Table 3 provides the results produced by various traditional machine learning baselines. Among all the combinations of features and classifiers experimented with for traditional machine learning, averaged ELMo embeddings with logistic regression perform best on most of the metrics.

Table 3 Results for traditional machine learning baselines
Table 4 Results for Deep Learning (DL) and semi-supervised baselines as well as different sets of proposed methods

Table 4 provides the sexism classification results for the random, various deep learning and semi-supervised baselines as well as for different sets of proposed methods. The random baseline performs very poorly, as expected, illustrating the complexity of the fine-grained multi-label classification. For the deep learning baselines, we find ELMo to be better than GloVe for word embeddings across multiple baselines and hence report ELMo-based results with the EBCE loss. Hierarchical-biLSTM-Attention is the best deep learning baseline across all the metrics, and it outperforms its traditional machine learning counterpart. Overall, the best baseline is the semi-supervised Opti-DL method. The BERT model tuned on domain-specific unlabeled instances of sexism (tBERT) works better than the vanilla BERT counterpart and USE. Augmented data generated by the mean-based self-training approach using Opti-DL as the base classifier performs worse than Opti-DL alone. Randomly selecting the same number of samples from the unlabeled accounts of sexism as our best data augmentation method does and labeling them using Opti-DL also worsens the performance to a degree.

Several proposed methods outperform all baselines across all the metrics. The maximum improvement in performance is observed for subset accuracy. For all the proposed data augmentation methods, the best classification baseline (Opti-DL) is used both to generate the augmented data and as the final classifier (to predict the labels). Our best-performing data augmentation method is Diversity.label \(\cap \) Support.weakest, which prioritizes samples that are the most distinct from the existing labeled data and have the lowest-support weak (low-coverage) predicted label sets. Among the methods that seek greater textual diversity, Diversity.uniform, which selects the samples that are most distinct in a label-independent manner, produces the best results for most metrics. Support.uniform, which picks the samples with low average label coverage, shows the best performance among the variants favoring low-support labels. Among the methods favoring hard samples, Hard.uniform performs best for most of the metrics. We report all the combinations involving our diversity-based and support-based methods through the computation of combined scores and intersection, where S() and \(\cap \) denote the score-based and intersection-based integration, respectively. S(Diversity.label, Support.uniform) performs best among the score-based integration combinations.

Among the proposed supervised classification methods, the Proposed Architecture (PA), in which the sentence representations obtained from ELMo and GloVe are concatenated with the tBERT (BERT tuned on unlabeled instances of sexism) representations, leads to the best results. PA without pre-training BERT worsens the performance to a degree, as the model is then not tuned on the domain-specific data. The performance of the hierarchical multi-level training approach with PA decreases slightly compared to PA when trained on the original labeled data. Further, we consider the best data augmentation method (Diversity.label \(\cap \) Support.weakest), the best classification method (PA) and the best baseline (Opti-DL) and try different combinations. Among the combined methods, using PA both to generate the augmented data and as the final classifier (PA-aug) performs best. On top of PA-aug, the CBCE loss and hierarchical multi-level training are applied separately, and each improves the performance over PA-aug. Finally, the best performance is observed with a combined proposed method involving multi-level PA-aug along with our proposed CBCE loss function.

We analyze the performance of the data augmented by our best data augmentation method (Diversity.label \(\cap \) Support.weakest) and the best data augmentation baseline (mean-based self-training), using PA as the base classifier, across different neural classifiers. Figure 3 depicts the \(F_{ins}\), \(F_{mac}\) and SA for five deep learning baseline classifiers and the proposed architecture. The figure demonstrates the relative efficacy of the data augmented by our best augmentation method across different classifiers for all three metrics. For the CNN-based architectures, it is clearly visible that the data augmented by our best augmentation method yields better performance by a good margin than the data augmented by the best baseline.

Fig. 3 Performance of our best method (Diversity.label \(\cap \) Support.weakest) and best baseline (mean-based self-training) for data augmented (using PA as the base classifier) with different neural classifiers

Fig. 4 Coverage of positive and negative samples per label for the original data and data added by our best method (Diversity.label \(\cap \) Support.weakest) using PA as a base classifier

Figure 4 highlights the coverage of positive and negative samples for each of the 23 labels in the original labeled data and contrasts it against the improved positive sample proportion in the data contributed (added data) by our best method (Diversity.label \(\cap \) Support.weakest) using PA as a base classifier. The ratio of the standard deviation to the mean of the (positive) label coverage for the original data is 1.074, whereas its added-data counterpart is 1.022, indicating that our data augmentation method also reduces the class imbalance to a degree.

Fig. 5 Performance of our best classifier (PA) and the best classification baseline (Opti-DL) on data augmented by different methods using Opti-DL as a base classifier

Fig. 6 Performance of the best classification baseline (Opti-DL) on data augmented by different methods using Opti-DL and PA as base classifiers

Figures 5, 6 and 7 portray the \(F_{ins}\), \(F_{mac}\) and SA for four data augmentation methods with different combinations of PA and Opti-DL at different stages. In Fig. 5, we render the performance of our best classifier (PA) and the best classification baseline (Opti-DL) on data augmented by different data augmentation methods using Opti-DL as the base classifier. The figure shows that PA as the final classifier performs better than Opti-DL for all the methods across all the metrics. In Fig. 6, for each data augmentation method, we depict the performance of the best classification baseline (Opti-DL) on data augmented using the best classification baseline (Opti-DL) and the best classifier (PA) as base classifiers. The figure demonstrates that the data generated with PA as the base classifier performs better than that generated with Opti-DL for all the data augmentation methods. In Fig. 7, for each data augmentation approach, we show the performance of our best classification method (PA) and the best classification baseline (Opti-DL) on data augmented using PA and Opti-DL as base classifiers, respectively. The figure shows the relative efficacy of PA at both stages compared to Opti-DL. Overall, Figs. 5, 6 and 7 show that our PA performs better than Opti-DL as a base classifier, as a final classifier and at both stages (as base and final classifier) for all the data augmentation methods across all the metrics.

Fig. 7 Performance of our best classifier (PA) and the best classification baseline (Opti-DL) on data augmented by different methods using PA and Opti-DL as base classifiers, respectively

Table 5 Effect of using our best classification method (PA) as the base classifier for data augmentation and/or as the final classifier (using Diversity.label \(\cap \) Support.weakest for data augmentation)

For the best data augmentation method (Diversity.label \(\cap \) Support.weakest), we also examine the effect of using our best classification method (PA) as the base classifier and/or as the final classifier. Table 5 shows a comparison of PA and the best classification baseline (Opti-DL) at the different stages for all the metrics. It is observed that PA generates better augmented data and also performs better as a final classifier than Opti-DL.

Fig. 8 Class-wise sexism classification F-scores for the best baseline (Opti-DL) and our overall best method (multi-level PA-aug with CBCE loss)

Table 6 Accounts of sexism correctly classified with our overall best method (multi-level PA-aug with CBCE loss) but not with the best baseline (Opti-DL)
Table 7 Examples of the pseudo-labeled accounts of sexism added by our Diversity.label \(\cap \) Support.weakest using PA as the base classifier

Figure 8 compares the class-wise performance of our overall best method (multi-level PA-aug with CBCE loss) with that of the best baseline (Opti-DL). For each class, the average of the F-scores over three runs is shown for both methods. For a majority of the classes, the F-score of the proposed method exceeds that of the baseline.

Table 6 shows accounts of sexism from the test set for which our best proposed method (multi-level PA-aug with CBCE loss) makes all correct predictions, but the best baseline (Opti-DL) does not. It also provides the average cosine similarity scores w.r.t. the original labeled data and the new pseudo-labeled data produced by our best data augmentation method (Diversity.label \(\cap \) support.weakest), computed using vector representations of the posts given by the PA. It shows that the test samples are more similar to the newly added pseudo-labeled data. We also report the per-label coverage, defined as the fraction of samples bearing the label, for the two sets. The higher coverage values seen for our approach could partly account for its state-of-the-art performance.

Table 7 shows a few examples of the pseudo-labeled accounts of sexism added by our best data augmentation method (Diversity.label \(\cap \) Support.weakest) using PA as the base classifier. It also shows the categories predicted for each post by our PA. These samples suggest that our proposed method is able to predict the correct categories.

For each method, all the hyper-parameter tuning is done using the validation data. The hyper-parameter values for our proposed semi-supervised approaches are as follows. \(P_{min}\) and \(P_{max}\) are set to 4 and 7, respectively. The minimum labeled data coverage (\(S_m\)) is set to 1300 by observing the class distribution in the original labeled data. For each method, we pick the optimal confidence threshold T and \(Top_p\) based on \(F_{mac}\) on the validation set. For Diversity.label \(\cap \) Support.weakest, our best-performing method, these values are 0.75 and 0.9, respectively.

The amount of pseudo-labeled data added by each of our methods varies with the hyper-parameter values for which the best performance on the validation set is observed. For the best augmentation method (Diversity.label \(\cap \) Support.weakest) using our best classifier (PA) as the base classifier, the highest \(F_{mac}\) is seen at the third iteration, and the data generated cumulatively till that point amounts to 8994 samples, resulting in an augmented dataset consisting of 22,017 labeled samples. The added data sizes for the other data augmentation methods using PA as the base classifier with comparable hyper-parameter configurations range from about 8000 to 10,000. Our best approach thus adds less data than some of the other proposed data augmentation methods, confirming the importance of keeping the quality of the pseudo-labeled set high through effective sample selection and other mechanisms.

6 Conclusion

We investigated semi-supervised learning for the fine-grained classification of accounts of sexism using 23 categories of sexism. We proposed a set of methods, based on self-training and designed for the multi-label formulation, for capitalizing on unlabeled instances of sexism to augment the training data. We devised a loss function that capitalizes on label confidence scores computed for each pseudo-labeled sample in the augmented data. We also proposed a neural architecture involving a domain-adapted BERT model that is trained end-to-end to improve the fine-grained sexism classification performance. We further devised a coarse-to-fine training approach for multi-label sexism classification in which we train models sequentially using categories of sexism of different levels of granularity.

Our proposed methods outperform a variety of traditional machine learning and deep learning baselines across many standard metrics, including Subset Accuracy, the strictest metric. Several of the proposed semi-supervised methods that augment the labeled data with pseudo-labeled samples picked from unlabeled data yield better results than the best baseline (Opti-DL). Our best-performing data augmentation method (Diversity.label \(\cap \) Support.weakest) seeks to enhance textual diversity and reduce class imbalance. Our proposed sexism classification architecture, which combines biLSTM and attention with a domain-adapted BERT model in an end-to-end trainable manner, also outperforms all baselines. Our best combined method (multi-level PA-aug with CBCE loss) further improves the performance; it achieves an instance-based F-score (\(F_{ins}\)) of 0.757, \(F_{mac}\) of 0.583 and SA of 0.357, whereas the best baseline produces an \(F_{ins}\) of 0.714, \(F_{mac}\) of 0.546 and SA of 0.242.

A direction for future work is to tailor and extend the approaches for sexism classification for the identification of sexism, with a view to developing a pipeline in which sexism detection is first carried out to identify posts related to sexism and sexism classification is performed only on those posts. Another possible direction is to investigate neural approaches for the identification and categorization of specific forms of sexism such as sexist stereotyping (including but not limited to role stereotyping), sexual harassment and misogyny. Given the presence of social media data in languages other than English, we could also explore multi-lingual sexism detection and classification.