Abstract
Toxicity on the Internet is an acknowledged problem. It includes a wide range of actions from the use of obscene words to offenses and hate speech toward particular users or groups of people. However, there also exist other types of inappropriate messages which are usually not viewed as toxic as they do not contain swear words or explicit offenses. Such messages can contain covert toxicity or generalizations, incite harmful actions (crime, suicide, drug use), and provoke “heated” discussions. These messages are often related to particular sensitive topics, e.g. politics, sexual minorities, or social injustice. Such topics tend to yield toxic emotional reactions more often than other topics, e.g. cars or computing. At the same time, not all messages within “flammable” topics are inappropriate. This work focuses on automatically detecting inappropriate language in natural texts. This is crucial for monitoring user-generated content and developing dialogue systems and AI assistants. While many works focus on toxicity detection, we highlight the fact that texts can be harmful without being toxic or containing obscene language. Blind censorship based on keywords is a common approach to address these issues, but it limits a system’s functionality. This work proposes a safe and effective solution to serve broad user needs and develop necessary resources and tools. Thus, machinery for inappropriateness detection could be useful (i) for making communication on the Internet safer, more productive, and inclusive by flagging truly inappropriate content while not banning messages blindly by topic; (ii) for detection of inappropriate messages generated by automatic systems, e.g. neural chatbots, due to biases in training data; (iii) for debiasing training data for language models (e.g. BERT and GPT-2). Towards this end, in this work, we present two text collections labeled according to a binary notion of inappropriateness (124,597 samples) and a multinomial notion of sensitive topic (33,904 samples). Assuming that the notion of inappropriateness is common among people of the same culture, we base our approach on a human intuitive understanding of what is not acceptable and harmful. To devise an objective view of inappropriateness, we define it in a data-driven way through crowdsourcing. Namely, we run a large-scale annotation study asking workers if a given chatbot-generated utterance could harm the reputation of the company that created this chatbot. High values of inter-annotator agreement suggest that the notion of inappropriateness exists and can be uniformly understood by different people. To define the notion of a sensitive topic in an objective way we use guidelines suggested by specialists in the Legal and PR departments of a large company. We use the collected datasets to train inappropriateness and sensitive topic classifiers employing both classic and Transformer-based models.
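As an illustration of the modeling setup described above, the sketch below fine-tunes a Transformer encoder as a binary inappropriateness classifier. It is not the authors' released code: the checkpoint name, the file inappropriateness.csv, and its text/inappropriate columns are hypothetical placeholders standing in for the published dataset, and the Hugging Face transformers Trainer is used as a generic fine-tuning recipe.

```python
# Minimal sketch (not the authors' released code) of fine-tuning a Transformer
# encoder as a binary inappropriateness classifier. File name, column names,
# and checkpoint are illustrative assumptions.
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder for any BERT-like encoder


class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and 0/1 labels for the Trainer."""

    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(list(texts), truncation=True, padding=True, max_length=128)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(int(self.labels[idx]))
        return item


# Hypothetical CSV with "text" and "inappropriate" (0/1) columns.
df = pd.read_csv("inappropriateness.csv")
train_df, val_df = train_test_split(df, test_size=0.1, random_state=0)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="inappropriateness-clf",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
    ),
    train_dataset=TextDataset(train_df["text"], train_df["inappropriate"], tokenizer),
    eval_dataset=TextDataset(val_df["text"], val_df["inappropriate"], tokenizer),
)
trainer.train()
```

The same recipe would apply to the multinomial sensitive-topic classifier by raising num_labels to the number of topics.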
References
Babakov, N., Logacheva, V., Kozlova, O., Semenov, N., & Panchenko, A. (2021). Detecting inappropriate messages on sensitive topics that could harm a company’s reputation. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pp. 26–36, Kyiv. Association for Computational Linguistics.
Banko, M., MacKeen, B., & Ray, L. (2020). A unified taxonomy of harmful content. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 125–137, Online. Association for Computational Linguistics.
Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., & Sanguinetti, M. (2019). SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63, Minneapolis, Minnesota. Association for Computational Linguistics.
Bogoradnikova, D., Makhnytkina, O., Matveev, A., Zakharova, A., & Akulov, A. (2021). Multilingual sentiment analysis and toxicity detection for text messages in Russian. In 2021 29th Conference of Open Innovations Association (FRUCT), pp. 55–64.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Breitfeller, L., Ahn, E., Jurgens, D., & Tsvetkov, Y. (2019). Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1664–1674, Hong Kong. Association for Computational Linguistics.
Cecillon, N., Labatut, V., Dufour, R., & Linarès, G. (2020). WAC: A corpus of Wikipedia conversations for online abuse detection. In LREC.
Chung, Y.-L., Kuzmenko, E., Tekiroglu, S. S., & Guerini, M. (2019). CONAN-COunter NArratives through nichesourcing: A multilingual dataset of responses to fight online hate speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2819–2829, Florence. Association for Computational Linguistics.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR, arXiv:1911.02116.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online. Association for Computational Linguistics.
Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Proceedings of the Eleventh International AAAI Conference on Web and Social Media (ICWSM).
Dawid, A. P., & Skene, A. (1979). Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of The Royal Statistical Society Series C, 28, 20–28.
de Gibert, O., Perez, N., García-Pablos, A., & Cuadros, M. (2018). Hate speech dataset from a white supremacy forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pp. 11–20, Brussels, Belgium. Association for Computational Linguistics.
Dinan, E., Abercrombie, G., Stevie, B. A., Spruit, S., Hovy, D., Boureau, Y.-L., & Rieser, V. (2021). Anticipating safety issues in e2e conversational ai: Framework and tooling.
Dinan, E., Humeau, S., Chintagunta, B., & Weston, J. (2019). Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4537–4546, Hong Kong. Association for Computational Linguistics.
Dixon, L., Li, J., Sorensen, J., Thain, N., & Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pp. 67–73, New York. Association for Computing Machinery.
Drisko, J. W., & Maschi, T. (2016). Content analysis. Pocket Guide to Social Work Research Methods. New York: Oxford University Press.
Fersini, E., Nozza, D., & Rosso, P. (2018). Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In EVALITA@CLiC-it.
Gautam, A., Mathur, P., Gosangi, R., Mahata, D., Sawhney, R., & Shah, R. R. (2020). #MeTooMA: Multi-aspect annotations of tweets related to the MeToo movement. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, pp. 209–216.
Han, X., & Tsvetkov, Y. (2020). Fortifying toxic speech detectors against veiled toxicity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7732–7739, Online. Association for Computational Linguistics.
Hessel, J., & Lee, L. (2019). Something’s brewing! Early prediction of controversy-causing posts from discussion features. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1648–1659, Minneapolis, Minnesota. Association for Computational Linguistics.
Jigsaw. (2018). Toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. Accessed 01 March 2021.
Jigsaw. (2019). Jigsaw unintended bias in toxicity classification. https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification. Accessed 01 March 2021.
Jigsaw. (2020). Jigsaw multilingual toxic comment classification. https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification. Accessed 01 March 2021.
Joshi, R., Karnavat, R., Jirapure, K., & Joshi, R. (2021). Evaluation of deep learning models for hostility detection in Hindi text. CoRR, arXiv:2101.04144.
Kaggle. (2019). Russian language toxic comments. https://www.kaggle.com/blackmoon/russian-language-toxic-comments. Accessed 01 March 2021.
Kaggle. (2020). Toxic Russian comments. https://www.kaggle.com/alexandersemiletov/toxic-russian-comments. Accessed 01 March 2021.
Karan, M., & Šnajder, J. (2019). Preemptive toxic language detection in Wikipedia comments using thread-level context. In Proceedings of the Third Workshop on Abusive Language Online, pp. 129–134, Florence. Association for Computational Linguistics.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar. Association for Computational Linguistics.
Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills: Sage.
Kuratov, Y., & Arkhipov, M. (2019). Adaptation of deep bidirectional multilingual transformers for Russian language. CoRR, arXiv:1905.07213.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., & Jackel, L. (1990). Handwritten digit recognition with a back-propagation network. In D. Touretzky (Ed.), Advances in neural information processing systems. (Vol. 2). Burlington: Morgan-Kaufmann.
Lees, A., Borkan, D., Kivlichan, I., Nario, J., & Goyal, T. (2021). Capturing covertly toxic speech via crowdsourcing. In Proceedings of the First Workshop on Bridging Human-Computer Interaction and Natural Language Processing, pp 14–20, Online. Association for Computational Linguistics.
Mollas, I., Chrysopoulou, Z., Karlos, S., & Tsoumakas, G. (2021). ETHOS: An online hate speech detection dataset.
Nobata, C., Tetreault, J. R., Thomas, A., Mehdad, Y., & Chang, Y. (2016). Abusive language detection in online user content. In J. Bourdeau, J. Hendler, R. Nkambou, I. Horrocks, & B. Y. Zhao (Eds.), Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, April 11–15, 2016, pp. 145–153. ACM.
Ousidhoum, N., Lin, Z., Zhang, H., Song, Y., & Yeung, D.-Y. (2019). Multilingual and multi-aspect hate speech analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4675–4684, Hong Kong. Association for Computational Linguistics.
Ovadia, S. (2004). Ratings and rankings: Reconsidering the structure of values and their measurement. International Journal of Social Research Methodology, 7(5), 403–414.
Pandey, R., Purohit, H., Stabile, B., & Grant, A. (2018). Distributional semantics approach to detect intent in twitter conversations on sexual assaults. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 270–277.
Park, J. H., Shin, J, & Fung, P. (2018). Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2799–2804, Brussels, Belgium, Association for Computational Linguistics.
Passonneau, R. J. (2006). Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In LREC.
Pavlopoulos, J., Malakasiotis, P., & Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1125–1135, Copenhagen. Association for Computational Linguistics.
Pavlopoulos, J., Sorensen, J., Dixon, L., Thain, N., & Androutsopoulos, I. (2020). Toxicity detection: Does context really matter? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4296–4305, Online. Association for Computational Linguistics.
Qian, J., Bethke, A., Liu, Y., Belding, E., & Wang, W. Y. (2019). A benchmark dataset for learning to intervene in online hate speech.
Radford, A., Jeffrey, W., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Rosenthal, S., Atanasova, P., Karadzhov, G., Zampieri, M., & Nakov, P. (2021). SOLID: a large-scale semi-supervised dataset for offensive language identification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 915–928, Online. Association for Computational Linguistics.
Salminen, J., Sengün, S., Corporan, J., Jung, S., & Jansen, B. J. (2020). Topic-driven toxicity: Exploring the relationship between online toxicity and news topics. PLoS ONE, 15(2), e0228723.
Schrading, N., Alm, C. O., Ptucha, R., & Homan, C. (2015). An analysis of domestic abuse discourse on Reddit. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2577–2583, Lisbon. Association for Computational Linguistics.
Smetanin, S. (2020). Toxic comments detection in Russian. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020”.
Sun, C., Huang, L., & Qiu, X. (2019). Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 380–385, Minneapolis, Minnesota. Association for Computational Linguistics.
Vaidya, A., Mai, F., & Ning, Y. (2020). Empirical analysis of multi-task learning for reducing identity bias in toxic comment detection. Proceedings of the International AAAI Conference on Web and Social Media, 14(1), 683–693.
Waseem, Z., & Hovy, D. (2016). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pp. 88–93, San Diego. Association for Computational Linguistics.
Waseem, Z., Davidson, T., Warmsley, D. & Weber, I. (2017). Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pp. 78–84, Vancouver. Association for Computational Linguistics.
Wu, X., Lv, S., Zang, L., Han, J., & Hu, S. (2019). Conditional BERT contextual augmentation. In J. M. F. Rodrigues, P. J. S. Cardoso, J. Monteiro, R. Lam, V. V. Krzhizhanovskaya, M. H. Lees, J. J. Dongarra, & P. M. A. Sloot (Eds.), Computational science-ICCS 2019 (pp. 84–95). Cham: Springer International Publishing.
Xia, M., Field, A., & Tsvetkov, Y. (2020). Demoting racial bias in hate speech detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media, pp. 7–14, Online. Association for Computational Linguistics.
Xia, C., Zhang, C., Nguyen, H., Zhang, J., & Yu, P. S. (2020). CG-BERT: Conditional text generation with BERT for generalized few-shot intent detection. CoRR, arXiv:2004.01881.
Xu, J., Da, J., Margaret, L., Boureau, Y.-L., Weston, J., & Dinan, E. (2020). Recipes for safety in open-domain chatbots.
Yenala, H., Jhanwar, A., Chinnakotla, M. K., & Goyal, J. (2018). Deep learning for detecting inappropriate content in text. International Journal of Data Science and Analytics, 6(4), 273–286.
Yu, S., Su, J., & Luo, D. (2019). Improving BERT-based text classification with auxiliary sentence and domain knowledge. IEEE Access, 7, 176600–176612.
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1415–1420, Minneapolis, Minnesota. Association for Computational Linguistics.
Zhang, G., Bai, B., Zhang, J., Bai, K., Zhu, C., & Zhao, T. (2020). Demographics should not be the reason of toxicity: Mitigating discrimination in text classifications with instance weighting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4134–4145, Online. Association for Computational Linguistics.
Acknowledgements
This work was conducted under the framework of the joint Skoltech-MTS laboratory under the project concluded on 31.07.2022. We are grateful to MTS for permission to share the produced datasets and code. We thank Pavel Odintsev for creating two illustrations of our technology reproduced in this article. We thank Toloka for providing a crowdsourcing grant that partially covered the annotation needs of this research. The work was partially supported by the Analytical Center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 of 02.11.2021). This research was partially funded by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 860621, and by the Galician Ministry of Culture, Education, Professional Training, and University and the European Regional Development Fund (ERDF/FEDER program) under grants ED431C2022/19 and ED431G2019/04.
Ethics declarations
Code availability
This work has not been submitted to any other journal or conference before. A part of this work has already been published in the Workshop on Balto-Slavic Natural Language Processing (Babakov et al., 2021). The current work contains new results, descriptions of new methods, and new extended datasets; the differences from the previous work are listed in Sect. 1. This manuscript describes the full body of work on collecting the datasets of inappropriate messages and sensitive topics and on using them to train the classification models. The results presented in the manuscript can be replicated using the provided datasets, the code for training the models, and the pre-trained models, all of which are made available.
Human and animal rights
The data collection process involved human participants: workers hired via the Toloka crowdsourcing platform. They were informed of, and accepted, the condition that any data they produce can be used by employers in the public or private domain. The projects created within Toloka for data annotation fully comply with the rules of this service. Our research was motivated by the need to moderate content automatically generated by neural models, such as the GPT family of decoder-based Transformers, in order to avoid reputational risks (including PR risks) for companies deploying such models.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Nikolay Babakov conducted the main part of the work while at Skolkovo Institute of Science and Technology.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Babakov, N., Logacheva, V. & Panchenko, A. Beyond plain toxic: building datasets for detection of flammable topics and inappropriate statements. Lang Resources & Evaluation 58, 459–504 (2024). https://doi.org/10.1007/s10579-023-09682-z