Automatic Classification of Abusive Language and Personal Attacks in Various Forms of Online Communication
The sheer ease with which abusive and hateful utterances can be made online using today’s digital communication technologies (especially social media) – typically from the comfort of one’s home and without any immediate negative repercussions – is responsible for their significant increase and global ubiquity. Natural Language Processing technologies can help in addressing the negative effects of this development. In this contribution we evaluate a set of classification algorithms on two types of user-generated online content (tweets and Wikipedia Talk comments) in two languages (English and German). The different sets of data we work on were annotated for aspects such as racism, sexism, hate speech, aggression and personal attacks. While we acknowledge issues with inter-annotator agreement for classification tasks using these labels, this paper focuses on classifying the data according to the annotated characteristics using several text classification algorithms. For some classification tasks we are able to reach f-scores of up to 81.58.
Hateful conduct, abusive language and verbal aggression are by no means new phenomena. Comments and statements of this type seriously hamper constructive private discussion and public debate. The sheer ease with which hateful utterances can be made using today’s digital communication technologies – typically from the comfort of one’s home and without any immediate negative repercussions – is responsible for their significant increase and global ubiquity. In recent years, the topic has received an increasing amount of attention from multiple stakeholders. Among these are social scientists who want to analyse this phenomenon and the reasons for abusive online behaviour, and politicians who realise that major parts of public debates and social discourse are carried out online. In addition, we have seen that not only such online discussions but also the perception of concepts, politicians, elections and civil rights movements can be influenced using highly targeted social media marketing campaigns. We live in a time in which online media, including online news and online communication, have an unprecedented level of social, political and also economic relevance. This situation creates a plethora of challenges with regard to the key question of how best to address the importance and relevance of online media and online content with technological means while at the same time not putting in place a centralised infrastructure that can be misused for the purpose of censorship or surveillance. One challenge is to separate high quality content from offensive, hateful, abusive or massively biased content. While these tasks have mostly been in the realm of journalism, they are increasingly being transferred to the end user of online content, i.e., the analysis, curation and assessment of information is no longer carried out exclusively by professional news editors or journalists – the burden of fact checking is more and more left to the reader.
In the social sciences and humanities, research on the phenomena and characteristics of Computer-Mediated Communication (CMC) has a long tradition. Initially, scholars concentrated on different types of novel communication media such as electronic mail, Internet Relay Chat (IRC), Usenet newsgroups, and different types of hypertext systems and documents, especially personal home pages, guestbooks and, later, discussion fora (Runkehl et al. 1998; Crystal 2001; Storrer 2001; Döring 2002). Early on, researchers focused upon the obvious differences between these new forms of written digital communication and the established, traditional forms, especially when it comes to linguistic phenomena that can be observed on the text surface, such as smileys and emoticons, specific acronyms and technological aspects of communication. Many authors observed that the different forms of internet-mediated communication have a certain oral and spoken style, quality and conceptualisation to them, as if produced spontaneously in a casual conversation, while, at the same time, being realised in a written medium (Haase et al. 1997).
If we now fast forward to 2017, a completely different picture emerges. About 40–50% of the global population has access to the Internet, most of whom also use the World Wide Web and one or more of the big social networks. The Internet has become mainstream and acts like an amplifier, maybe also as an enabler, of social trends. We already mentioned some of the current challenges of this massive penetration of our lives through Internet-based forms of communication. The social, political and economic relevance of online media, online news and online communication could not be any more crucial. While early analyses and discussions of computer-mediated communication and discourse, e.g., Reid (1991), observed that their participants were involved in the “deconstruction of boundaries” and the “construction of social communities”, today the exact opposite seems to be the case: both offline and online we can observe the (disturbing) trend of increased nationalism and the exclusion of foreigners, immigrants and seemingly arbitrary minorities – boundaries are constructed, social communities deconstructed.
One last aspect is worth pointing out: up to now there has not really been any major need for automatic classification approaches of online content, with two notable exceptions. The first are online advertisements, either in the form of unsolicited spam email or in the form of online ads, either embedded in web documents or presented as pop-out windows. The second exception is sentiment analysis of social media data, driven by a clear use case: knowing what your customers or voters say or think about you as a company or politician. We are now slowly approaching a state of play, in which automatic means may be needed to classify online content or parts of online content into additional dimensions such as, for example, “hatespeech”, “abusive language”, maybe even “fake news” and “alternative facts”. While spam mail can be classified and categorised with a fairly high accuracy (and online ads taken care of with an ad blocker), sentiment analysis already poses more difficult challenges (such as irony, sarcasm and scope issues). And it remains to be seen if abusive language can be identified accurately using automatic means and if additional analysis dimensions have to be incorporated if automatic approaches are to be used in a real application scenario.
The research presented in this paper has been carried out under the umbrella of a two-year research and technology transfer project. We collaborate with four SME partners that all face the challenge of having to process, to analyse and to make sense of large amounts of digital information. The four companies cover four different use cases and sectors (Rehm and Sasaki 2015), including journalism. For these partners we develop a platform that provides several semantic and knowledge technologies. In this article, we focus upon the application of several classification algorithms to establish the feasibility of the detection and classification of abusive language. We do this by evaluating the classification algorithms on three publicly available data sets. While the definition of abusive language, and, consequently, inter-annotator agreement of relevant corpora are crucial and far from resolved issues in this area of research, we focus on classification using linguistic features. Our goal is to establish a solid baseline for these three, publicly available corpora. The remainder of this paper is divided into four sections. Section 2 discusses related work, most notably on inter-annotator agreement, an important prerequisite for accurate classification. Section 3 describes the analysed and classified data sets. Section 4 discusses the results and Sect. 5 provides a summary and ideas for future work.
2 Related Work
Today, when it comes to the characteristics of online media and communication, several challenges are being discussed over and over again. We believe that the Natural Language Processing (NLP) and Language Technology (LT) communities can provide at least parts of the adequate social and technical solutions for, among others, hatespeech, fake news (including orchestrated disinformation campaigns), politically biased journalism, trolling, cyber-bullying and abusive language.
Hateful or aggressive conduct online has received an increasing amount of attention in recent years. For an accurate classification, however, consensus is needed on what constitutes abusive language, hate speech and aggressive conduct, and what is still acceptable within the boundaries of free speech. While automated methods for detecting and classifying language use – such as spam vs. no spam – and sentiment – such as positive vs. negative – are typical application scenarios for NLP technologies, the question of what is acceptable and no longer acceptable within the boundaries of free speech puts this sensitive question and area of research into the intersection of different disciplines, including linguistics, sociology (Jones et al. 2013; Phillips 2015), psychology (Kowalski and Limber 2013; Dreißing et al. 2014), law (Marwick and Miller 2014; Banks 2010; Massaro 1991) and also common sense.
Many researchers approaching this topic acknowledge the difficulty in reaching a consensus. Ross et al. (2016) introduce a German corpus of hate speech on the European refugee crisis and report low inter-annotator agreement scores (Krippendorff’s \(\alpha \) between 0.18 and 0.29). Waseem (2016) investigates inter-annotator agreement comparing amateur annotations using CrowdFlower and expert annotations using precise instructions and reports a Cohen’s Kappa of 0.14. Van Hee et al. (2015) work on classification of cyberbullying using a Dutch corpus and report Kappa scores between 0.19 and 0.69. Kwok and Wang (2013) investigate racist tweets and report an overall inter-annotator agreement of only 33%. Nobata et al. (2016) report a relatively high agreement for binary classification of clean vs. abusive for Yahoo! comments (Kappa = 0.843), but this number drops significantly when different subcategories for the abusive comments are introduced (such as hate, derogatory language and profanity, with Kappa decreasing to 0.456).
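To make the agreement figures above more concrete, the following minimal sketch shows how Cohen's Kappa can be computed for two annotators. The function name `cohens_kappa` and the toy annotations are assumptions made for illustration only; they are not taken from any of the cited studies or corpora.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)

# Toy, made-up annotations (hypothetical; not from any corpus).
annotator_1 = ["hate", "none", "none", "hate", "none", "none"]
annotator_2 = ["hate", "none", "hate", "none", "none", "none"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on four of six items, yet Kappa is low because much of that agreement is expected by chance given how often both choose the majority label.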
Another complicating issue is the fact that abusive language is often extra-linguistic in nature. Whether a particular utterance is considered abusive often depends on other aspects, including context, (ethnicity of the) author, (ethnicity of the) targeted person or group, etc. (Nand et al. 2016; Waseem and Hovy 2016; Warner and Hirschberg 2012). An excellent overview of NLP-based approaches towards hate speech detection is provided by Schmidt and Wiegand (2017).
Table 1. The three data sets – key facts and examples
3 Data Sets
The experiments reported in this paper are conducted on three different data sets. The first one (ET, see Table 1) is provided by Waseem and Hovy (2016) and consists of English tweets. We scraped the tweets’ actual content; of the 16,907 tweet IDs provided on the authors’ GitHub page, we were able to retrieve 15,979 tweets (the smaller number most likely due to deleted tweets or time-outs during scraping). The tweets were classified into the classes none (10,939 instances), sexism (3,131 instances) or racism (1,909 instances). Despite missing out on 928 annotated tweets, the distribution over our version of the data set is the same as the one reported by Waseem and Hovy (2016), with respectively 68%, 20%, 12% of tweets being annotated as none, sexist, racist. For this and the other two data sets, Table 1 provides some examples.
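The figures in the paragraph above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are chosen for illustration.

```python
# Class counts as reported in the text for the retrieved English tweets.
counts = {"none": 10939, "sexism": 3131, "racism": 1909}
provided_ids = 16907  # tweet IDs published by Waseem and Hovy (2016)

retrieved = sum(counts.values())
missing = provided_ids - retrieved
# Share of each class in the retrieved set, rounded to whole percent.
shares = {label: round(100 * n / retrieved) for label, n in counts.items()}

print(retrieved, missing, shares)
```

The three class counts sum to the 15,979 retrieved tweets, the difference to the published IDs is 928, and the rounded shares are 68%, 20% and 12%.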
The second data set (GT, see Table 1) is provided by Ross et al. (2016) and consists of German tweets. With only 469 tweets, this data set is considerably smaller. They were annotated by two expert annotators who indicated a tweet to be either hateful or not hateful. In addition, the second annotator also scored the tweet on a scale of 1 (not offensive at all) to 6 (very offensive). The distribution of hateful vs. non-hateful tweets for annotator 1 was 110–359 and for annotator 2 it was 98–371.
Table 2. Results of our classification experiments
The data set of Ross et al. (2016) is significantly smaller (GT). The annotations were done by two expert annotators. We have trained the set of classifiers twice, using these annotations (binary, expert 1 and binary, expert 2), and have made no attempts to resolve a final or definitive label, exactly because of the problems with inter-annotator agreement. While the results for the best scoring algorithm in the case of binary classification are still reasonable, performance drops dramatically when using multi-class classification (six classes). The figures for the GT data set are of limited reliability due to its small size. Ross et al. (2016) do not report classification results and focus on inter-annotator agreement. We include the results in Table 2, but due to large score deviations for individual runs and for reasons of brevity, do not discuss the results for this data set.
For our subset of the WT data set (Wulczyn et al. 2016), we see a similar pattern in the sense that binary classification scores are reasonable (81.58 and 80.60 for Bayes) but performance drops significantly when using multi-class classification (seven classes).
Wulczyn et al. (2016) do not mention any f-scores, but they do refer to experiments using their best performing personal attack classifier on comments made in 2015 (excluding “administrative comments and comments generated by bots”). The different setup in this study makes it hard to compare, as they are not performing binary classification directly, but assign a continuous score between 0 and 1. As the authors note, “even though the thresholded model-scores give good estimates of the rate of attacks over a random sample of comments, it is not given that they also give accurate estimates when partitioning comments into different groups”. Using this method, however, the authors do report a precision of 0.63 and a recall of 0.63, when using a threshold value of 0.425.
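The thresholding setup described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the toy scores and the choice of `>=` for the threshold comparison are hypothetical and do not reproduce the model or data of Wulczyn et al. (2016).

```python
def precision_recall_at(scores, gold, threshold):
    """Precision and recall when scores >= threshold count as attacks."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up model scores and gold labels, for illustration only.
scores = [0.91, 0.60, 0.43, 0.40, 0.10, 0.05]
gold = [True, False, True, True, False, False]
p, r = precision_recall_at(scores, gold, threshold=0.425)
print(p, r, f1(p, r))
```

Because the harmonic mean of two equal values is that value, a precision and recall of 0.63 each correspond to an F1 of 0.63 for the attack class. This is plain arithmetic rather than a figure reported by the authors, and it is not directly comparable to the f-scores in our experiments because of the different setup.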
In an attempt to get a better overview of which classification algorithms work best for which type of data, we found that the preferred classification algorithm is highly data-specific. For example, Ranawana and Palade (2006) provide an overview of multi-class classification algorithms and identify the most distinct features in order to combine several algorithms for one classification problem. In their experiments, they work on a data set of human DNA from the Berkeley Drosophila Genome Project website. Andreopoulos et al. (2009) describe a variety of clustering algorithms applied to the biomedical domain. Caruana and Niculescu-Mizil (2006) present an empirical study of several learning algorithms and their features applied to 11 data sets. It remains difficult to predict the best-performing algorithm even when certain data characteristics (like size of the data set, average text length, variation, number of classes, etc.) are known. The main goal of the work presented in this paper is to provide a baseline performance indication and give an overview of different classification algorithms applied to the data sets included. We only use a simple BOW approach, and the job of feature engineering is left as an important next step towards classification of abusive language.
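To illustrate what a simple BOW baseline looks like, the following sketch implements a multinomial Naive Bayes classifier over word counts with add-one smoothing. It is not the Mallet implementation used in the experiments; the class name `BowNaiveBayes`, the whitespace tokeniser and the toy training texts are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenisation; a stand-in for real preprocessing."""
    return text.lower().split()

class BowNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy examples; not drawn from the ET, GT or WT corpora.
train_texts = [
    "thanks for the helpful edit",
    "please discuss this on the talk page",
    "you are a stupid idiot",
    "shut up you idiot",
]
train_labels = ["ok", "ok", "attack", "attack"]
model = BowNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("you idiot")
print(prediction)
```

Word order and context are discarded entirely in this representation, which is why more context-aware features are a natural next step.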
To gain more insight into the features used by the best scoring algorithm for the different data sets, we extracted the ten most informative features and report scores averaged over ten runs. The x-axis contains the top 10 unique words for all existing classes. The y-axis represents the information gain per feature. Features for the first data set are shown in Fig. 1 for the three existing classes. The only overlap between the “racism” and “sexism” features (in the top 10) is islam, which was apparently also a relatively frequent term in the tweets classified as “sexist” in the training set. The considerable overlap between the “none” class and the other two is likely to reflect the fact that not all tweets containing these words were annotated as either “racist” or “sexist”. As also noted by Warner and Hirschberg (2012), classification of abusive language can be conceptualised as a word disambiguation task: sometimes the usage of a word is abusive, sometimes it is not. The features that are less straightforward as words represent the name of a cooking show (mkr), the established abbreviation of “retweet” (rt), the Twitter handle of a user (mt8_9) and (probably) the result of Mallet tokenising the m in I’m as a separate word.
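Information gain for a single word can be sketched as the reduction in class entropy obtained from knowing whether the word occurs in a document. The code below is one common formulation, assuming binary word presence; it is not claimed to match how the toolkit used in the experiments computes the score, and the toy documents are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, word):
    """Reduction in class entropy from knowing whether `word` occurs."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = ((len(with_word) / n) * entropy(with_word)
                   + (len(without) / n) * entropy(without))
    return entropy(labels) - conditional

# Invented toy documents as token sets; not drawn from the corpora.
docs = [{"you", "idiot"}, {"stupid", "idiot"},
        {"thanks", "you"}, {"please", "discuss"}]
labels = ["attack", "attack", "ok", "ok"]
vocabulary = {w for d in docs for w in d}
ranked = sorted(vocabulary,
                key=lambda w: information_gain(docs, labels, w),
                reverse=True)
print(ranked[0])
```

In the toy data, a word that occurs equally often in both classes (here "you") carries no information, which mirrors the observation that words shared between the “none” class and the abusive classes need disambiguation.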
The ten most informative features for the classification task of aggression on a scale of −3 (very aggressive) to 3 (very friendly) in the WT data set for the best performing algorithm (Bayes) are shown in Fig. 2. Remarkably, the top 10 most informative features (words) for the seven classes are represented by only 18 unique words. The words in this set associated with friendliness (“welcome”, “please”, “great” and “happy”, for example) are only present in class 3. On the other end of the spectrum, class −3 only contains one word often associated with friendliness (“thanks”). Overall, there is a large degree of overlap between the classes, suggesting that the lower-ranked features also play an important role in classification. Upon manual investigation of the lower-ranked features, we found that the neutral class (0) seemed to function mostly as background noise, canceling out both extremes, as is the case for the classifier for the Waseem and Hovy (2016) data set. The negative digit classes (−1, −2 and −3) contain a large vocabulary of curse words, whereas the positive digit classes (1, 2 and 3) contain more communicative, constructive and cooperative terms like “ask”, “questions”, “discussion”, etc.
We apply a range of classification algorithms to three data sets that differ in language (usage), size and domain/genre: a corpus of English tweets annotated for racist and sexist tweets (ET); a corpus of German tweets annotated for hate speech (GT); and a corpus of English Wikipedia user comments which, unlike tweets, have less strict length restrictions (WT). While many studies on this topic focus on inter-annotator agreement (Sect. 2), we establish a baseline for classification based on these three corpora and data sets. We describe the differences between the performance of different classification algorithms and the features used by the respective best performing algorithm. Although consensus on what needs to be detected, which is typically reflected by inter-annotator agreement, is important to construct relevant training corpora, our results indicate that automatic classification can provide reasonable results and does not have to be a bottleneck in attempts to automatically detect abusive language. Nevertheless, further research is needed to reach an agreement on definitions of abusive language, hate speech, hateful conduct, cyber-bullying and other phenomena of online communication in order to arrive at accurate and meaningful classification approaches. Additionally, the simple feature set (BOW) used in the experiments can and should be complemented with more semantically and context-aware components such as sentiment values, dependency parsing (to provide insight on scope of abusive elements), and other more sophisticated NLP techniques. Finally, we would like to emphasise that the extra-linguistic nature of abusive behaviour and the fact that, apart from language usage, accurate detection depends on the author, target audience, communicative intent and other context specifics, is not addressed in this paper and, thus, left as an important part of future work in this area.
While the documentation states a range from −2 to 2, we actually found a range of −3 to 3 in the annotations (https://meta.wikimedia.org/wiki/Research:Detox/Data_Release).
Except for the numbers marked with an asterisk; these are the result of three-fold cross-validation due to the large amount of time needed for training and execution.
The set we used is inspired by the script available at https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb.
Due to the small corpus size of the GT data set, we refrain from showing the ten most informative features here.
The authors wish to thank the anonymous reviewers for their helpful feedback. The project “Digitale Kuratierungstechnologien (DKT)” is supported by the German Federal Ministry of Education and Research (BMBF), “Unternehmen Region”, instrument “Wachstumskern-Potenzial” (No. 03WKP45). More information at http://www.digitale-kuratierung.de.
- Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168. ACM (2006)
- Döring, N.: Personal home pages on the web: a review of research. J. Comput.-Mediat. Commun. 7(3) (2002). http://www.ascusc.org/jcmc/
- Kwok, I., Wang, Y.: Locate the hate: detecting tweets against blacks. In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2013, pp. 1621–1622. AAAI Press (2013). http://dl.acm.org/citation.cfm?id=2891460.2891697
- Marwick, A.E., Miller, R.W.: Online harassment, defamation, and hateful speech: a primer of the legal landscape. Fordham Center on Law and Information Policy Report, June 2014
- Massaro, T.M.: Equality and freedom of expression: the hate speech dilemma. William Mary Law Rev. 32(211) (1991)
- McCallum, A.K.: MALLET: a machine learning for language toolkit (2002). http://www.cs.umass.edu/~mccallum/mallet
- Nand, P., Perera, R., Kasture, A.: “How bullying is this message?”: a psychometric thermometer for bullying. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 695–706. The COLING 2016 Organizing Committee, December 2016. http://aclweb.org/anthology/C16-1067
- Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Republic and Canton of Geneva, Switzerland, pp. 145–153. International World Wide Web Conferences Steering Committee (2016). https://doi.org/10.1145/2872427.2883062, ISBN 978-1-4503-4143-1
- Phillips, W.: This Is Why We Can’t Have Nice Things: Mapping the Relationship Between Online Trolling and Mainstream Culture. The MIT Press, Cambridge (2015)
- Rehm, G., Sasaki, F.: Digitale Kuratierungstechnologien - Verfahren für die effiziente Verarbeitung, Erstellung und Verteilung qualitativ hochwertiger Medieninhalte. In: Proceedings of the 2015 International Conference of the German Society for Computational Linguistics and Language Technology, GSCL 2015, pp. 138–139 (2015)
- Reid, E.M.: Electropolis: communication and community on internet relay chat, Honours thesis, University of Melbourne, Department of History (1991). http://www.aluluei.com/electropolis.htm
- Ross, B., Rist, M., Carbonell, G., Cabrera, B., Kurowsky, N., Wojatzki, M.: Measuring the reliability of hate speech annotations: the case of the European refugee crisis. In: Beißwenger, M., Wojatzki, M., Zesch, T. (eds.) Proceedings of NLP4CMC III: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication, Bochumer Linguistische Arbeitsberichte, Bochum, vol. 17, pp. 6–9, September 2016
- Runkehl, J., Schlobinski, P., Siever, T.: Sprache und Kommunikation im Internet - Überblick und Analysen. Westdeutscher Verlag, Opladen (1998)
- Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural language processing. In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Valencia, Spain, pp. 1–10. Association for Computational Linguistics, April 2017. http://www.aclweb.org/anthology/W17-1101
- Storrer, A.: Getippte Gespräche oder dialogische Texte? Zur kommunikationstheoretischen Einordnung der Chat-Kommunikation. In: Lehr, A., Kammerer, M., Konerding, K.-P., Storrer, A., Thimm, C., Wolski, W. (eds.) Sprache im Alltag. Beiträge zu neuen Perspektiven der Linguistik, pp. 439–466. de Gruyter, Berlin (2001). Herbert Ernst Wiegand zum 65. Geburtstag gewidmet
- Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., De Pauw, G., Daelemans, W., Hoste, V.: Detection and fine-grained classification of cyberbullying events. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, pp. 672–680. INCOMA Ltd., Shoumen (2015). http://aclweb.org/anthology/R15-1086
- Warner, W., Hirschberg, J.: Detecting hate speech on the world wide web. In: Proceedings of the Second Workshop on Language in Social Media, LSM 2012, Stroudsburg, PA, USA, pp. 19–26. Association for Computational Linguistics (2012). http://dl.acm.org/citation.cfm?id=2390374.2390377
- Waseem, Z.: Are you a racist or am i seeing things? annotator influence on hate speech detection on Twitter. In: Proceedings of the First Workshop on NLP and Computational Social Science, Austin, Texas, pp. 138–142. Association for Computational Linguistics, November 2016. http://aclweb.org/anthology/W16-5618
- Waseem, Z., Hovy, D.: Hateful symbols or hateful people? predictive features for hate speech detection on Twitter. In: Proceedings of the NAACL Student Research Workshop, San Diego, California, pp. 88–93. Association for Computational Linguistics, June 2016. http://www.aclweb.org/anthology/N16-2013
- Wulczyn, E., Thain, N., Dixon, L.: Ex machina: personal attacks seen at scale. CoRR, abs/1610.08914 (2016). http://arxiv.org/abs/1610.08914
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.