
1 Introduction

Text classification tasks in natural language processing (NLP) can be influenced by stereotype words, which are words associated with specific categories or groups based on emotions, politics, or demographic features [1]. Studies have shown that stereotype words such as pink and blue are often associated with girls and boys, respectively [2]. Additionally, words such as Varicella or Alzheimer are associated with specific age groups. The distribution of stereotype words across different document categories can result in stereotype bias in classification models trained on datasets containing these words [14]. This bias can have significant implications, particularly in sentiment analysis, where the results can influence decision-making processes.

Stereotype bias in text classification arises from oversimplified correlations between stereotype words and text categories. However, classification should be based on the semantic relationships of the words, not on the mere presence of specific words [15]. In this paper, we focus on the problem of detecting and alleviating stereotype bias in text classification. Three challenges need to be addressed. Firstly, identifying the set of words that can potentially introduce stereotype bias is critical [18]. Such words are usually domain-specific and hard to enumerate universally, given that document classification tasks themselves are often domain-specific, e.g., movie reviews or medical research. Secondly, accurately estimating how much a word contributes to stereotype bias in the prediction for a document is challenging, especially for complex deep models: the same word may contribute differently in different documents due to the interdependence between words, and simply removing a particular stereotype word may not eliminate the bias. Thirdly, aside from stereotype bias in classifiers, there may be stereotype bias in widely-used pre-trained word embedding models [24]. Accessing the training data of these models to detect stereotype words is difficult, and measuring and reducing stereotype bias without the original training data is challenging. Addressing these challenges is essential for developing more inclusive NLP models that are free from stereotype bias.

Previous works have attempted to alleviate various forms of stereotype bias, with a particular focus on removing gender bias in language models [4, 24, 25]. In [25], the authors reduced gender bias through data augmentation using an occupation word set associated with gender bias. Similarly, [24] proposed reducing gender bias in the word embedding stage. [4] demonstrated bias amplification in language models and used posterior regularization to address it. Other forms of bias, including label bias [17], context-word bias [17], race bias [6], demographic bias [10], and implicit bias [9], have also received attention. However, these works typically rely on fixed sets of stereotype words, often derived from a different domain, which may not be effective in the target domain. While some works [17] have proposed selecting stereotype words, their selection strategy is based solely on the TextRank score [13], without considering other important metrics, such as word imbalance. Moreover, these works assume that all words contribute equally as sources of stereotype bias, ignoring the fact that the same word may contribute differently in different documents.

In this work, building on previous works, a causal graph is constructed to analyze how stereotype bias from texts and pre-trained models affects classification. Based on this causal graph, a novel framework is proposed for detecting and alleviating stereotype bias using a counterfactual method to address the three challenges mentioned earlier. First, word distribution statistics and word importance in prediction (i.e., SHAP values) are used to determine a dynamic stereotype word set. Subsequently, a fusion model is adopted to learn the relationship between semantics, stereotype words, and text categories. During the prediction stage, a counterfactual approach is used to alleviate the bias from stereotype words. In contrast to previous works, this study utilizes real-time word importance in document-level predictions and domain-level word distributions to identify stereotype words in different document domains, and focuses on semantically relevant words rather than context words [17]. In summary, the contributions of this work are three-fold.

  • We investigate the stereotype bias in text classification from a causal perspective, analyzing how stereotype words from both texts and pre-trained models influence classification results.

  • We propose a novel framework to detect and remove stereotype bias, which involves detecting stereotype words based on word importance and word distribution statistics, training a fusion model to learn the relationship between semantics, stereotype words, and text categories, and utilizing a counterfactual approach for unbiased prediction. To the best of our knowledge, this is the first work that systematically addresses the stereotype bias caused by semantic words without relying on a prior thesaurus.

  • We conduct extensive experiments to demonstrate the effectiveness of our framework in achieving unbiased classification, and we compare our results with state-of-the-art approaches for unbiased text classification [17].

Related Work. Word-level bias in language models has attracted growing interest from researchers. Apart from the aforementioned works on gender bias [24, 25], a recent work handles gender bias from an adaptation perspective and treats gender groups as different domains [3]. Beyond gender bias, other works focus on intended bias [22] and on the stereotype bias generated from words [3, 17, 25], word embeddings [24], and pre-trained models [14]. These works inspired us to model the causal relationship among texts, sources of stereotype bias, and predictions.

As for debiasing methods, the counterfactual approach is attracting increasing attention. The counterfactual approach utilizes a dummy value as the counterfactual and aims to remove the indirect effect of confounders on the treatment variables [16]. [17] proposed a counterfactual method to remove the bias from imbalanced labels and semantically-irrelevant words. [21] removed the bias in fake news classification, and [20] mitigated the bias in text understanding and hypothesis inference via counterfactual debiasing. As discussed in the aforementioned challenges, the training data of pre-trained models is generally unavailable or inaccessible. In this case, the counterfactual approach is well suited to mitigating the potential stereotype bias in pre-trained models and text classifiers.

2 Methodology

2.1 Problem Formulation

Let \(\textbf{D}\) and \(\textbf{Y}\) denote the text documents and text categories, respectively. Given a pre-trained word embedding model \(\textbf{h}\) and a classification model \(\textbf{g}\), the goal of text classification is to train \(\textbf{g}\) to maximize the classification accuracy of \((\textbf{g}\circ \textbf{h})(\textbf{D})\). In the ideal view of the training process, the semantics of a document are learned in a two-stage manner that first extracts the semantics and then performs the classification, so that the semantics are the main basis of the text classification [14]. However, when training from a pre-trained model, the text classification model will inherit any existing stereotype relationships of the pre-trained model.

To construct the causal relationships among these variables, we first analyse word-level stereotypes. The words in a document can be divided into three groups: the semantic-irrelevant words, which contribute neither to the semantics nor to the text category predictions; the normal words, which are related to the semantics but do not introduce stereotypes into the predictions; and the stereotype words, which affect the semantics and at the same time introduce stereotype bias into the predictions. In addition, the pre-trained word embedding model may also carry stereotype bias inherited from its pre-training dataset. Figure 1(a) demonstrates the causal relationships among these groups of words, the semantics, and the text category. Ideally, the pre-trained word embedding M should also contribute to the semantics X and act as a confounder of the causal path \(X\rightarrow Y\). However, the causal effect of M on X is hard to estimate, so we remove the path \(M\rightarrow X\) for easier implementation in the experiments. Considering all the sources of bias, the prediction results can be denoted as:

$$\begin{aligned} Y_{x(d,s),m,s} = Y(X = x(d,s),M = m,S = s) \end{aligned}$$
(1)

where Y is a prediction function based on the word embedding m, the normal words d, the stereotype words s, and the semantics x(d, s).

Unbiased prediction requires using the semantics as the only direct cause of the predictions. We therefore need to remove the causal effects of the other two causal variables: the stereotype bias from the pre-trained model \(\textbf{M}\) and the stereotype words \(\textbf{S}\), as shown in Fig. 1(b). The debiasing goal can then be denoted as:

$$\begin{aligned} Y_{x(d,s)} = Y(X = x(d,s)) \end{aligned}$$
(2)

where Y is decided only by the semantics x(d, s) of the text.

Based on the analysis and causal relationship in Fig. 1, our framework contains three stages: stereotype word set construction, fusion model training to learn the causal effect from the sources of bias, and unbiased prediction.

Fig. 1. Conventional text classification and debiased text classification. The words in a document are composed of three parts: \(\textbf{D}\) denotes the normal words, \(\textbf{S}\) denotes the potential stereotype words from the dataset, and \(\textbf{U}\) denotes the semantic-irrelevant words. \(\textbf{M}\) denotes the stereotypes in existing pre-trained models, \(\textbf{X}\) is the semantic embedding of the document, and \(\textbf{Y}\) is the prediction result. (a) shows the causal graph after introducing stereotype bias from texts and pre-trained models, (b) illustrates the goal of mitigating stereotype bias: removing the direct causal effect from \(\textbf{S}\) and \(\textbf{M}\) to \(\textbf{Y}\), and (c) is our proposed method of unbiased prediction via a counterfactual approach. In language models, the causal effects from the words \(\textbf{D}\) and \(\textbf{S}\) to the predictions \(\textbf{Y}\) require the mediator variable \(\textbf{X}\). Therefore, we focus on the causal effect in the path \(S\rightarrow Y \leftarrow X\) instead of the path \(S\rightarrow Y \leftarrow D\). Moreover, in this causal graph, we remove the path \(M\rightarrow X\) for easier implementation.

2.2 Stereotype Words Detection

To mitigate the stereotype bias from the training documents, the first stage of our framework is to select the words that may lead to stereotypes. As Fig. 1 shows, we assume that the stereotype words have a direct causal effect on the semantics and on the predictions at the same time. Therefore, we can focus on the words that contribute to the predictions and utilize the word importance on the predictions to detect stereotype words. [17] proposes a TextRank-based method to calculate the word importance in a document. However, TextRank only selects keywords and does not consider whether these words affect downstream tasks, whereas in this work we aim to measure word importance with respect to the downstream predictions. Therefore, we adopt post-training SHAP values, which provide the feature importance (i.e., word importance in this work) for the predictions [7, 12] and give the contribution of each word to the predictions of the same prediction model. Moreover, another characteristic used to select the stereotype words is the word distribution across the different classes.
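
As a concrete illustration of this step, the snippet below is a minimal sketch of obtaining per-word SHAP importances for a text classifier with the shap library and a Hugging Face pipeline. The model name and the example sentence are placeholders rather than the classifiers and data used in this work, and exact API details may differ across library versions.

```python
# Hedged sketch: per-word SHAP importances for a text classifier.
# The pipeline and model below are placeholders, not the classifiers used in this paper.
import shap
import transformers

classifier = transformers.pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for all classes (older versions: return_all_scores=True)
)

explainer = shap.Explainer(classifier)  # a text masker / Partition explainer is chosen automatically
shap_values = explainer(["The movie was painfully slow but beautifully shot."])

# shap_values.data holds the tokens and shap_values.values their per-class contributions;
# aggregating the absolute values per word yields the word-importance scores used for detection.
for token, contribution in zip(shap_values.data[0], shap_values.values[0]):
    print(token, contribution)
```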

Fig. 2. The proposed framework to mitigate stereotype bias in text classification. This framework contains three stages: stereotype word detection after the initial training; fusion model training to learn the causal relationships between the sources of stereotype bias and the prediction results; and unbiased prediction to alleviate stereotype bias. The blue and yellow models in the framework indicate the pre-trained and debiased classification models, respectively (Color figure online)

Specifically, as shown in stage 1 of Fig. 2, after the initial training stage, we calculate the SHAP values as the word importance for the words in the training data and select the set of words \(\textbf{D} + \textbf{S}\) that contribute to the predictions. Then, to select the potential stereotype words, we calculate the word distribution in each document class and rank the words via information entropy:

$$\begin{aligned} H(w) = -\sum _{c \in C}p(c|w)\log p(c|w) \end{aligned}$$
(3)

where w ranges over the semantic-relevant words and C is the set of text categories. A lower H(w) means a more imbalanced distribution across the text classes and a higher potential to introduce stereotypes into the predictions. We then treat the proportion of stereotype words as a parameter and select that percentage of words from the bottom of the H(w) ranking.
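
To make the selection step concrete, the sketch below (our own illustrative helpers, not the released implementation) computes H(w) from the per-class document counts of each candidate word and keeps the most imbalanced fraction as potential stereotype words.

```python
import math
from collections import Counter, defaultdict

def class_entropy(word, word_class_counts):
    """H(w) = -sum_c p(c|w) log p(c|w); lower values indicate a more imbalanced word."""
    counts = word_class_counts[word]
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def select_stereotype_words(docs, labels, candidate_words, proportion=0.05):
    """Return the `proportion` of candidate words with the lowest class entropy.

    `candidate_words` is assumed to be the set D + S of words with positive
    importance (e.g. SHAP value) for the predictions.
    """
    word_class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for w in set(doc.split()):
            if w in candidate_words:
                word_class_counts[w][label] += 1

    ranked = sorted(word_class_counts, key=lambda w: class_entropy(w, word_class_counts))
    k = max(1, int(len(ranked) * proportion))
    return set(ranked[:k])
```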

2.3 Fusion Model Training

After selecting the potential stereotype words, we build a fusion model to estimate the direct causal effects from the pre-trained model and the stereotype words, shown as \(S\rightarrow Y\) and \(M\rightarrow Y\) in Fig. 1. Inspired by [20, 21], we build two models, \(\hat{y}_{d,s}\) and \(\hat{y}_{s}\). The first model takes the original text as input and predicts the corresponding text category, capturing the causal relationships in the paths \(S\rightarrow X \leftarrow D\) and \(S\rightarrow Y\); it is trained to learn the causal effect from the semantic-relevant words to the semantics and then to the prediction results. To train this model, we use the cross-entropy loss function shown in

$$\begin{aligned} L_{x} = -\sum _{c}y_c \log (\hat{y}_{d,s,c}) \end{aligned}$$
(4)

where \(y_c\) and \(\hat{y}_{d,s,c}\) denote the ground-truth label and the predicted probability for text category c, respectively.

The other model, \(\hat{y}_{s}\), is used to estimate the causal effect in \(S\rightarrow Y\). For this model, we preserve the stereotype words \(\textbf{S}\) and the semantic-irrelevant words \(\textbf{U}\) and mask the normal words \(\textbf{D}\) in the model input. The output is still the corresponding text category prediction. We then optimize the second model with the following loss function:

$$\begin{aligned} L_{s} = -\sum _{c}y_c \log (\hat{y}_{s,c}) \end{aligned}$$
(5)

where \(\hat{y}_{s,c}\) denotes the prediction probability of the word-based classification model for text category c. Moreover, considering that the stereotype influence from the pre-trained model is intrinsic and does not affect the training process, we assume that the overall prediction is a linear combination of the stereotype influence \(\hat{y}_m\) and the fusion of \(\hat{y}_{x(d,s)}\) and \(\hat{y}_s\):

$$\begin{aligned} P(Y = y|x(d,s),m,s)&= y(x(d,s),m,s) \nonumber \\&= f(\hat{y}_{x(d,s)}, \hat{y}_s, \hat{y}_m)\nonumber \\&= f(\hat{y}_{x(d,s)}, \hat{y}_s) + \hat{y}_m \end{aligned}$$
(6)

As for the fusion model \(f(\hat{y}_{x(d,s)}, \hat{y}_s)\), we adopt the following fusion strategy to combine the two model predictions: \(f(\hat{y}_{x(d,s)}, \hat{y}_s) = \hat{y}_{x(d,s)} + \alpha \text {tanh}(\hat{y}_s)\), where \(\alpha \) is a hyperparameter and \(\text {tanh}\) is the tanh activation function. The corresponding loss function of this fusion model is:

$$\begin{aligned} L_{f} = -\sum _{c}y_c \log (f(\hat{y}_{x(d,s)}, \hat{y}_s)) \end{aligned}$$
(7)
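
A minimal PyTorch-style sketch of the fusion training step is given below. It assumes two already-defined classifiers that output logits for the full text (\(\hat{y}_{x(d,s)}\)) and for the masked text that keeps only the S and U words (\(\hat{y}_{s}\)); the function names, the use of logits with torch's cross-entropy, and the simple additive combination of the three losses are our own assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def fusion_logits(y_full, y_masked, alpha=0.1):
    """Fusion strategy of Eq. (6)/(7): f(y_x, y_s) = y_x + alpha * tanh(y_s)."""
    return y_full + alpha * torch.tanh(y_masked)

def fusion_training_step(full_model, masked_model, batch, optimizer, alpha=0.1):
    """One optimization step over L_x, L_s, and L_f (combined additively here as an assumption)."""
    texts, masked_texts, labels = batch          # masked_texts keep only the S and U words
    y_full = full_model(texts)                   # logits for \hat{y}_{x(d,s)}
    y_masked = masked_model(masked_texts)        # logits for \hat{y}_{s}

    loss_x = F.cross_entropy(y_full, labels)                                   # Eq. (4)
    loss_s = F.cross_entropy(y_masked, labels)                                 # Eq. (5)
    loss_f = F.cross_entropy(fusion_logits(y_full, y_masked, alpha), labels)   # Eq. (7)

    loss = loss_x + loss_s + loss_f
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```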

2.4 Unbiased Prediction

The third stage of our framework is to mitigate the stereotype bias in the text category predictions. As mentioned in Sect. 2.1, we aim to remove the direct causal effects from the set of stereotype words \(\textbf{S}\) and the pre-trained word embedding model \(\textbf{M}\). The total effect (TE) stands for all the direct and indirect causal effects of a causal variable on the outcome, which can be denoted as:

$$\begin{aligned} \text {TE} = P(Y = y|x(d,s), m, s) - P(Y = y|x(d^*,s^*), m^*, s^*) \end{aligned}$$
(8)

Then the direct causal effect of m and s can be represented by the natural direct effect (NDE):

$$\begin{aligned} \text {NDE} = P(Y = y|x(d^*,s^*), m, s) - P(Y = y|x(d^*,s^*), m^*, s^*) \end{aligned}$$
(9)

where \(d^*\), \(s^*\), and \(m^*\) represent the counterfactual values of \(\textbf{D}\), \(\textbf{S}\), and \(\textbf{M}\), respectively. Specifically, the counterfactual values \(d^*\) and \(s^*\) are obtained by masking based on the training dataset, and \(m^*\) can be set to any value since it does not influence the indirect effect shown in (10).

Finally, as shown in Fig. 1(b), we aim to cut all the direct causal effects from \(\textbf{M}\) and \(\textbf{S}\) to the text categories \(\textbf{Y}\). Therefore, we use the total indirect effect (TIE) to remove the stereotype bias from the pre-trained model and the stereotype words:

$$\begin{aligned} \text {TIE}&= \text {TE} - \text {NDE} \nonumber \\&= P(Y = y|x(d,s), m, s) - P(Y = y|x(d^*,s^*), m, s)\nonumber \\&= y(x(d,s), m, s) - \sigma y(x(d^*,s^*), m, s)\nonumber \\&\approx f(\hat{y}_{x(d,s)}, \hat{y}_s) - \sigma f(\hat{y}_{x(d^*,s^*)}, \hat{y}_s) \end{aligned}$$
(10)

where \(\sigma \) is a hyperparameter to control the influence of the stereotype bias on the prediction results.
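
At inference time, the debiased prediction only needs the two trained branches and a counterfactual (masked) version of the input. The sketch below follows Eq. (10) with our own naming; how the counterfactual input \(x(d^*, s^*)\) is constructed (masking the semantic-relevant words) is an assumption about the implementation.

```python
import torch

@torch.no_grad()
def unbiased_predict(full_model, masked_model, text, counterfactual_text,
                     alpha=0.1, sigma=0.5):
    """Debiased prediction via the total indirect effect (Eq. 10):
    TIE ≈ f(y_{x(d,s)}, y_s) - sigma * f(y_{x(d*,s*)}, y_s)."""
    y_s = masked_model(text)                            # stereotype-word branch \hat{y}_s
    y_factual = full_model(text)                        # \hat{y}_{x(d,s)}
    y_counterfactual = full_model(counterfactual_text)  # \hat{y}_{x(d*,s*)} on the masked input

    fused_factual = y_factual + alpha * torch.tanh(y_s)
    fused_counterfactual = y_counterfactual + alpha * torch.tanh(y_s)

    tie = fused_factual - sigma * fused_counterfactual
    return tie.argmax(dim=-1)                           # debiased class prediction
```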

3 Experiments

3.1 Settings

To validate the effectiveness of our debiasing framework, we conduct experiments on multiple text classification datasets using different classifiers and mitigate the stereotype bias via our proposed framework. We then compare our results with two state-of-the-art works on bias mitigation [17, 22]. In the experiments, we concentrate on the effectiveness of our proposed method in terms of classification results and word-level fairness. The code of the experiments is available.

Baseline. We choose three representative text classifiers as the baselines of our framework. The first is TextCNN [5], which uses a convolutional neural network (CNN) to extract textual features and takes word embeddings as input, so it can utilize pre-trained word embeddings. The second is TextRCNN [8], which uses bi-directional recurrent networks to capture contextual information and a CNN for further feature extraction. The last is RoBERTa [11], which uses dynamic masking and a larger pre-training corpus than BERT and thus reaches better generalization and robustness in text classification tasks. All three models take pre-trained word embeddings as inputs, which introduces the stereotype bias of the pre-trained model into downstream text classification tasks. To compare with SOTA debiasing works, we choose two methods: IPS-Weight [22] and CORSAIR [17]. IPS-Weight uses the inverse propensity score as instance weights to reduce intended bias, while CORSAIR removes the bias from context words and imbalanced classes by removing counterfactual predictions.

Dataset. We conduct experiments on nine text classification datasets. Six of them are the same as those used in [17]: HyperPartisan, Twitter, ARC, SCIERC, Economy, and Parties. In addition, we also use Amazon product review datasets [23]. We adopt the same pre-processing procedures on these datasets as [17].

Evaluation. We evaluate the framework from two perspectives: classification performance and word-level fairness. Considering the class imbalance of our datasets, we use Macro-F1 to measure the text classification performance. For word fairness, we adopt the evaluation framework in [19]: for each word in the dataset, we compare the prediction distribution of the documents containing this word with the uniform distribution and calculate the Jensen-Shannon divergence (JS). We use the average JS over all the words as the fairness metric.
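
For reference, the following sketch shows one way to compute this word-level fairness score; the helper names are ours, and the choice of squaring SciPy's Jensen-Shannon distance to obtain the divergence is an implementation assumption.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def word_fairness(words, docs, predictions, num_classes):
    """Average JS divergence between each word's prediction distribution and the
    uniform distribution over classes (lower is fairer)."""
    uniform = np.full(num_classes, 1.0 / num_classes)
    scores = []
    for w in words:
        preds = [p for doc, p in zip(docs, predictions) if w in doc.split()]
        if not preds:
            continue
        dist = np.bincount(preds, minlength=num_classes).astype(float)
        dist /= dist.sum()
        scores.append(jensenshannon(dist, uniform) ** 2)  # distance squared = divergence
    return float(np.mean(scores))
```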

Parameters. We use grid search to decide the specific parameter values in the experiments. We set the batch size to 32 for the training and test data, and use the Adam optimizer with a learning rate of 5e-4 for all three classification models during both the initial training and fusion training stages, each of which runs for 20 epochs. We set the proportion of stereotype words to 5% and the \(\alpha \) in the fusion training stage to 0.1. As for the parameter \(\sigma \) in the unbiased predictions, we search for the best \(\sigma \) from 0 to 2 with a stride of 0.05 based on the validation results. The experiments are conducted on three servers with NVIDIA RTX A5000 GPUs, and the reported results are averaged over three rounds of experiments with different seeds.
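
The \(\sigma \) search can be implemented as a simple validation loop; the sketch below reuses the hypothetical unbiased_predict helper from Sect. 2.4 and is only an illustration of the grid search described above.

```python
import numpy as np
from sklearn.metrics import f1_score

def search_sigma(full_model, masked_model, val_texts, val_cf_texts, val_labels, alpha=0.1):
    """Grid-search sigma in [0, 2] with stride 0.05 on validation Macro-F1."""
    best_sigma, best_f1 = 0.0, -1.0
    for sigma in np.arange(0.0, 2.0 + 1e-9, 0.05):
        preds = [unbiased_predict(full_model, masked_model, t, cf,
                                  alpha=alpha, sigma=sigma).item()
                 for t, cf in zip(val_texts, val_cf_texts)]
        f1 = f1_score(val_labels, preds, average="macro")
        if f1 > best_f1:
            best_sigma, best_f1 = sigma, f1
    return best_sigma, best_f1
```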

Table 1. Classification performance compared with the state-of-the-art methods (%)

3.2 Classification Performance

Table 1 shows the classification performance (Macro-F1 score) of our proposed method and the comparison with two SOTA methods. Higher results mean better classification performance. The BASELINE rows stand for the results without any debiasing method. The KEYWORD and IMBWORD rows represent the results of ablation studies: in KEYWORD, we regard all the words that have positive contributions to the predictions as stereotype words; in IMBWORD, we mark all the words with the most imbalanced class distributions (i.e., the lowest entropy H(w)) as stereotype words.

From the results, our proposed method achieves average improvements of 4.23%, 4.88%, and 4.82% over the baselines with TextCNN, TextRCNN, and RoBERTa, respectively, across the nine datasets. Compared with the two comparison methods, the improvements of our proposed method are much more significant and stable across different datasets. Compared with the two ablation studies, the proposed method, which considers both word importance and entropy, reaches better classification results on most of the datasets, and its results are only slightly lower than the ablation methods on the SCIERC and Parties datasets. Moreover, considering only word importance (KEYWORD) yields a higher Macro-F1 score than IMBWORD for all three baseline models, which implies that semantic words are one source of bias in text classification.

3.3 Stereotype Word Fairness

Table 2 shows the word-level fairness of our proposed method, the ablation studies, and the two comparison techniques. Considering that the stereotype words are not fixed across different classification models, we calculate the fairness over all the words in the texts instead of only the stereotype words in Table 2. Lower results mean better fairness in the prediction results.

Compared with the baselines, our proposed method achieves average improvements of 1.64, 0.28, and 0.73, respectively, across all the datasets. From the results, we find that our proposed method reaches lower (better) fairness scores with the TextCNN and RoBERTa models on most datasets. The improvements in fairness are not as significant as the improvements in F1 scores because we measure the fairness over the whole word set in the documents instead of the stereotype word set, and a larger stereotype word set more easily leads to a smaller word fairness score. Compared with the three methods CORSAIR, KEYWORD, and IMBWORD, our proposed method selects a smaller and more accurate stereotype word set for debiasing and reaches competitive results.

Table 2. Word-level fairness of proposed methods
Fig. 3. F1 and fairness results under different proportions of stereotype words

3.4 Proportion of Stereotype Words

In this section, we conduct further experiments on the influence of the stereotype word proportion in the document set. We choose HyperPartisan and try 20 different proportions from 5% to 100% in steps of 5%. The Macro-F1 scores and fairness results are shown in Fig. 3. For the TextRCNN model, a proportion of 15% reaches the highest F1 score of more than 0.78 and a rather low fairness score of around 17.00, while the best stereotype word proportion for the TextCNN model is around 50%, where the classification F1 is the highest (0.73) and the word fairness is about 17.50. In addition, for the RoBERTa model, the F1 score reaches 0.75 when we set the proportion to 5%, with a fairness score of around 17.17 under this proportion. The different best proportions show the effects of the pre-trained embedding models and classification models on the stereotype bias. Similar to HyperPartisan, in the other datasets we can also select stereotype word proportions that retain higher F1 scores and lower word fairness scores.

4 Conclusion

In this work, we follow previous works and focus on the potential word-level stereotype bias in text classification. We analyse the generation of the bias from a causal view and propose a novel framework for bias mitigation. Our framework includes stereotype word detection, fusion model training, and unbiased prediction. Different from previous works, our framework detects the words that have a direct contribution to the predictions and does not rely on an external thesaurus. The experiments show better and more stable performance on multiple datasets, and the ablation studies demonstrate the effectiveness of the two components of stereotype word detection. Moreover, we also explore the influence of the proportion of selected stereotype words. In future work, we will refine and weaken the assumptions of the proposed causal graph: we will include the causal path from the pre-trained model to the semantic variable and model the corresponding causal effect.