In recent years, the automatic detection of online Hate Speech (HS), and offensive language more generally, has become an active research topic in machine learning (Davidson et al. 2017; Schmidt and Wiegand 2017a, b; Fortuna and Nunes 2018). This has been prompted by increasing anxieties about the prevalence of HS on social media, and the psychological and societal harms that offensive messages can cause (Gelber and McNamara 2016; Judge and Nel 2018). In many respects, the core concerns underlying these developments are ancient ones. Harmful language (of various kinds) has been deemed problematical since at least the Old Testament, and philosophers as diverse as Aristotle, John Locke, Spinoza, Voltaire, Charles de Secondat, Baron de Montesquieu, Karl Friedrich Bahrdt, and, perhaps most famously, John Stuart Mill, have all reflected at length upon the implications of restricting free speech (for example, Aristotle 1984 [1782]; see also Britt 2010). While the specific concept of HS had already emerged by the 1940s, the social tensions in Western democracies created by increasing multi- and interculturalism from the mid twentieth century onwards led to the gradual introduction of HS-related legislation (Brown 2015, p. 182). The post-9/11 preoccupation with anti-terrorism initiatives brought a new urgency to such considerations, and recent authoritative monographs such as Ishani Maitra and Mary Kate McGowan’s Speech and Harm: Controversies over Free Speech (Maitra and McGowan 2012), Jeremy Waldron’s The Harm in Hate Speech (Waldron 2012), Alex Brown’s Hate Speech Law: A Philosophical Examination (Brown 2015), and Eric Heinze’s Hate Speech and Democratic Citizenship (Heinze 2016) have explored a wide range of practical and theoretical concerns about how HS could and should be managed in liberal democracies. Some theorists, like Ronald Dworkin, have argued robustly that ‘the coercive powers of the state’ should be resisted even when manifest in the form of HS legislation, since free speech is a fundamental condition of legitimate government (Dworkin 2006, p. 131). Others, like Mari Matsuda or Judith Butler, believe that by not issuing legislation against HS, injurious language is, consequently, protected by the state. Specifically, Matsuda speaks of the ‘victim [of HS] becom[ing] a stateless person’ (Matsuda 1993, p. 25), while Butler argues that ‘the state produces hate speech’ in the sense that.

[…] the category cannot exist without the state’s ratification, and this power of the state’s judicial language to establish and maintain the domain of what will be publicly speakable suggests that the state plays much more than a limiting function in such decisions; in fact, the state actively produces the domain of publicly acceptable speech, demarcating the line between the domains of the speakable and the unspeakable, and retaining the power to make and sustain that consequential line of demarcation. (Butler 1997, pp. 76–77; emphasis in original)

In the Western world, most nations now impose penalties for some forms of expression deemed hateful because of their content, and such approaches thereby institutionalise value pluralism: the relevant legislative bodies restrict the freedoms of certain citizens so that the interests and well-being of others can be safeguarded (Galston 1999).Footnote 1 Self-evidently, these are areas where political philosophy and ethics become inextricably intertwined.

Over the last decade, though, the rapid rise of social media has created new forms of swift and efficient communication in which HS can be expressed almost instantaneously online, and often anonymously. Recognising the non-trivial problems this creates, social media providers and video-sharing platforms such as YouTube, Facebook, and Twitter have developed internal policies for HS regulation, and they also signed a Code of Conduct agreement with the European Commission (2019). At present, such decisions are taken at the corporate level, rather than the state level, which means that the companies concerned essentially regulate themselves. While there have recently been recommendations for state-level regulators, the infrastructures proposed tend to be very high-level. For instance, the UK government has outlined enforcement powers that a regulator could use against companies that failed to fulfil their duty of care (HM Government 2019, pp. 41–52). Once again, though, such methods are inherently reactive, and this is a problem because it means that the harm has already been inflicted. In practice, the task of handling online HS involves large numbers of moderators (c 15k in the case of Facebook) checking the content of posts identified by users as being offensive. The slow and laborious nature of this inherently responsive system has motivated the conviction that automated systems are required to detect such material. And the need to define HS more precisely for these purposes (e.g., so that training data can be annotated accurately) has further emphasised the ambiguities inherent in the very phrase ‘Hate Speech’. The kinds of utterances so classified vary considerably internationally, largely because HS laws take a variety of different forms depending upon the legal definitions adopted in different countries. It is also widely-recognised that the legislative focus on specific protected characteristics (e.g., race, gender, religion) has the undesirable consequence of excluding other vulnerable groups. For instance, since being an immigrant does not involve any of the protected characteristics recognised by UK law, threatening language directed towards immigrants cannot (strictly) be classified as HS, though it may still constitute a crime. This is why Facebook’s public-facing ‘Community Standards’ document has recently been updated to indicate that the company’s approach to HS overtly provides ‘some protections for immigration status’ (Facebook 2019a). Nonetheless, dissatisfactions with the prevailing arbitrariness and relativism of the many different definitions of HS, as well as concerns about its emphasis on the specific emotion of ‘hatred’, have prompted some researchers to propose alternative categories such as ‘Extreme Speech’ or ‘Dangerous Speech’ (Hare and Weinstein 2009; Benesch 2019). Nonetheless, the phrase ‘Hate Speech’ remains the most prominent alternative, and therefore it will be used (with reservations) in the ensuing discussion. We will interpret the phrase broadly, essentially in the same manner as Paula Fortuna and Sérgio Nunes:

Hate speech is language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, descent, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humour is used. (Fortuna and Nunes 2018, p. 5)

The remaining sections of this article briefly summarise the current conventions for the regulation of online HS, before considering some of the ways in which the automatic detection of HS could be used to safeguard citizens within liberal democracies in a more ethical manner. While previous research in this area has focused primarily on the core task of developing automated methods for detecting HS, this article probes instead the way in which such technologies could be used as part of a larger infrastructure that moderates the content of social media posts in a way that does not excessively compromise freedom of expression. In particular, the method of quarantining is recommended as a particularly effective way of avoiding the problematical extremes of entirely unregulated free speech or coercively authoritarian censorship.

The regulation of online hate speech

As mentioned above, in response to growing public concerns about HS, most social media platforms have adopted self-imposed definitions, guidelines, and policies for dealing with this particular kind of offensive language. Continued criticism of these procedures, however, suggests that such an approach is far from ideal, therefore the current procedures adopted by Facebook, Twitter, and YouTube will be briefly summarised here as illustrative case-studies.

Figure 1 gives an overview of the way in which Facebook (FB) deals with HS if a user (S; the Sender) posts something (P) that another user (R; the Recipient) finds offensive.

Fig. 1
figure 1

(adapted from Allan 2017)

Facebook’s current HS regulation procedure

P will therefore be manually reviewed and evaluated for potential hate-inciting harmful content, if reported by a user, otherwise, it will remain visible on the website.Footnote 2 As a basis for deciding whether or not a post qualifies as HS, Facebook uses the following definition:

We define hate speech as a direct attack on people based on what we call protected characteristics—race, ethnicity, national origin, religious affiliation, sexual orientation, caste, sex, gender, gender identity and serious disease or disability. We also provide some protections for immigration status. We define “attack” as violent or dehumanising speech, statements of inferiority, or calls for exclusion or segregation. (Facebook 2019b)

Additional factors are the context of the comment, cultural norms (e.g., language, country), the genre/style of the comment (e.g., humour, satire), or if a post was reproduced as a means of criticism and opposition (Allan 2017). Yet the moderation process is subject to constant revision and modification. On 28th March 2019, Facebook announced that it would block and remove white supremacist content. The decision followed heavy criticism after a live-stream of a terrorist attack in Christchurch, New Zealand, had been made available on the social media platform. New Zealand’s Prime Minister Jacinda Ardern reacted saying that social media sites were ‘the publisher, not just the postman’ of extremist content online (BBC News 2019). Facebook’s decision to take a proactive stance against the spreading of nationalist and extremist material has important consequences for HS regulation on social media more generally. Nevertheless, the company’s methods for handling HS continue to be primarily reactive rather than proactive.

Twitter has also been criticised repeatedly for its HS-related procedures. While its policy on ‘hateful conduct’ tells its users they are not permitted to ‘promote violence against or directly attack or threaten other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or serious disease’, critics have pointed out that this guideline is both inconsistent and ineffective, and does not protect groups from harassment (Twitter 2019; Matsakis 2018). One source of inconsistency is that Twitter, like Facebook, still relies primarily on manual and human-led HS detection and assessment. In a recent attempt to offer its users even greater protection from online harassment and abuse, the company announced a new policy prohibiting ‘dehumanising speech’, which it defines as:

Language that treats others as less than human. Dehumanization can occur when others are denied of human qualities (animalistic dehumanization) or when others are denied of their human nature (mechanistic dehumanization). Examples can include comparing groups to animals and viruses (animalistic), or reducing groups to a tool for some other purpose (mechanistic). (Gadde and Harvey 2018)

The policy further states that it applies to ‘[a]ny group of people that can be distinguished by their shared characteristics such as their race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, serious disease, occupation, political beliefs, location, or social practices’ (Gadde and Harvey 2018).

Further, Twitter has recently added an intermittent feature that blocks potentially sensitive image-based material (similar to Instagram’s sensitive content filter), and which sends the intended recipient a warning. The recipient can then either choose to view the content, or decide to block it (see Fig. 2). In addition, users can also flag their own tweets as potentially containing sensitive material (see Fig. 3). While this feature currently functions in a somewhat arbitrary manner, the underlying approach clearly has the potential to be deployed against HS, as discussed in “The automatic detection and classification of hate speech” section. It is worth mentioning that the social news and discussion platform Reddit implemented a ‘quarantine function’ in 2015, which allows for content and entire communities to be put in quarantine. Content will only be released again after successful appeal (Reddit 2019), and, once again, this procedure involves careful evaluation by a human administrator.

Fig. 2
figure 2


Warning of potentially sensitive material on Twitter.

Fig. 3
figure 3


Default settings for the display of potentially sensitive content on Twitter.

Finally, since its emergence in 2005, YouTube has developed from being a merely video-sharing site to being an influential source of news, information, and entertainment for users worldwide. In recent years, it has provided a powerful platform for those who have sought to spread conspiracy theories (e.g. anti-vaccine), extremist views, and misinformation (see, e.g., Uscinski et al. 2018; Ottoni et al. 2018). Like Facebook and Twitter, YouTube relies on the reporting of dangerous or abusive content by already-offended users, and all such reports are subjected to qualitative review. Its ‘Hate Speech Policy’ (Google 2019) currently lists specific protected characteristics (11 in total, including ‘veteran status’), and a reporting tool is provided that can be used to raise concerns about videos, comments, or even whole channels that promote HS.Footnote 3

The automatic detection and classification of hate speech

Given the brisk summaries in the previous section, it should be obvious why, in the last few years, the automatic detection of HS has become an active research priority. The various systems developed so far frequently adopt a binary classification framework: given a social media post P (e.g., a tweet), the system should classify P either as constituting HS or as not constituting HS. Consequently, Precision, Recall, and Accuracy are regularly used as metrics for determining system performance. For instance, Warner and Hirshberg (2012) developed a system that classified statements as being anti-Semitic or not, while Nobata et al. (2016) and Gao et al. (2017) used the categories ‘abusive’ or ‘clean’ instead. Since the precise nature of the task and the datasets used often varies from publication to publication, numerous classification strategies have been deployed (see Fortuna and Nunes 2018 for an overview) but comparing them in a meaningful manner is not always easy. While early systems tended to rely on basic word filters and simple syntactic structures to identify offensive language, more recent systems have sought to incorporate extra-linguistic knowledge-based features, as well as contextual information, to achieve more accurate detection and better classification rates. Even sociolinguistic features concerning the user’s background, posting history, and online characteristics have been included in the development and training of classifiers (e.g., Dadvar et al. 2013; Schmidt and Wiegand 2017a). To consider just a few examples in greater detail, Davidson et al. (2017) used logistic regression with L1 regularization to reduce the dimensionality of the data, and they favoured a ternary classification framework in which each tweet was identified as constituting offensive language or not, with all offensive tweets subsequently being classified as constituting HS or not. Their dataset of 25 k tweets had been manually labelled (via CrowdFlower), and using Accuracy as a scoring metric, they found that 91% of the offensive tweets were being correctly identified, but only 60% of the HS (i.e., only slightly better than random chance). By contrast, Qian et al. (2018) used a Conditional Variational Autoencoder (CVAE) to distinguish among 40 hate groups with 13 different hate group ideologies, using a dataset of 3.5 million tweets. Consequently, each tweet was associated with a specific hate category label (e.g., ‘ACT for America’) and a specific hate speech label (e.g., white nationalist, anti-immigration). This fine-grained approach enables more specific sub-classifications of HS posts, but it depends on there being enough data associated with each sub-type. In recent years, the application of Convolutional Neural Networks (CNNs) has produced higher Precision results in many HS-related tasks (e.g., Gambäck and Sidkar 2017).

Nonetheless, despite the many technical advances, there remain persistent difficulties concerning the annotation of HS-related training data. As noted above, different researchers focus on different tasks (e.g., anti-Semitic language [i.e., a specific subset of HS], abusive language [i.e., a superset of HS]). Therefore, they use different datasets, and it is often very difficult to compare and contrast the performance of the various systems. This lack of commonly-accepted training and test datasets has significantly hampered system development. There are also problems when it comes to labelling the data. The task of classifying millions of offensive tweets is usually crowd sourced (for practical reasons), yet it is hard to guarantee quality control using that method. The subjectivity of annotators remains problematical, and it arises from diverging perceptions of what constitutes HS. This divergence is a factor even when particular definitions of HS are specified, since the perceived tone and style of a given social media post (e.g., humorous, satirical) can vary greatly. Further, there has been a growing interest in the manner in which users respond to HS. The various strategies deployed are often grouped together as instances of ‘counter speech’. Crucially, users have been observed to use counter speech of different kinds, including pointing out and correcting misinformation, misrepresentations, and contradictions, warning of consequences, denouncing hateful speech, debunking via humour or sarcasm, deploying a notably positive tone, or using hostile language (Mathew et al. 2018). The important role of counter speech in countering HS is only starting to be understood, yet it clearly influences the spread of online hatred and misinformation. Clearly, more research concerning this topic (and specifically large-scale studies of datasets and the development of improved classification models) is needed.

The preceding paragraphs have offered a succinct overview of some of the recent developments in the task of detecting and classifying HS automatically. As mentioned earlier, though, the ex post facto identification of HS does not undo the harm that such material has already caused when posted online. While effective counter speech can certainly contribute to decreasing that harm (Butler 1997, p. 14), it would be far better to intercept potentially offensive posts at an earlier stage of the process, ideally before they have been read by the intended recipient. With this in mind, the following section will outline a framework for the automated quarantining of HS which extends an automated approach that has been used for several decades to decrease the damage caused by malware.

Quarantining hate speech

As summarised in “The regulation of online hate speech” section, social media o rganisations such as Facebook, Twitter, and YouTube currently use teams of moderators to determine whether potentially harmful posts should be removed or not. The current systems rely on already-offended users complaining about offensive messages, and the content of these is then assessed by teams of people who determine whether or not they should be deleted. These approaches will be referred to here as Too Little Too Late (TL2) methods, since they come into effect only after the intended harm has already been inflicted, both on the direct recipient of the message, and on any indirect recipients (including the thousands of human moderators who have to encounter hundreds of examples of disturbing material every day, see Newton 2019; Simon and Bowman 2019). Consequently, TL2 harm-reduction strategies are problematical, especially if we accept that language-mediated online harm is a serious as other sub-types (e.g., physical, financial). Also, in the influential theory of Information Ethics that Luciano Floridi (Floridi 2013, Chap. 4) has elaborated over the last few decades, there is a perceived need for an ethical framework that is primarily patient-oriented rather than agent-oriented. In other words, the moral impact of a given action is at least as important as the decision process the relevant agent followed when electing to take that action. Viewed from this perspective, reactive TL2 approaches are undesirable, since they do not prevent harm being caused. Inevitably, though, any proposed regulation designed to delimit harm raises familiar long-standing tensions between libertarian tendencies (e.g., freedom of expression) and more restrictive authoritarian ideologies/practices (e.g., censorship). These viewpoints have been prominent, for instance, in the important recent debates about HS legislation involving the legal theorists Dworkin (2009); Waldron (2012, 2017) and Weinstein (2017). The various disagreements have centred on topics such as whether HS bans necessarily undermine democratic legitimacy by depriving certain citizens of a voice in the political process, and diminishing their opportunity to speak without fear of criminal sanction. Contrasting views about such matters become vividly apparent in relation to online HS if the only available options are (i) to leave already-posted offensive material in situ, or (ii) to remove it entirely.

Crucially, though, these are not the only two options. Many cultures have well-established orthographic conventions for decreasing the impact of offensive written material, to differing extents. For instance, in English-speaking cultures it is often possible to replace letters in potentially derogatory words (e.g., ethnic slurs) with other symbols to render them more opaque, or else to strike out the problematical lexical items entirely. These are the sorts of content moderation methods implemented by wordfilter software in online chatrooms and forums (Roberts 2017). Here are some examplesFootnote 4:

figure a

Examples (2) and (3) still enable the blatantly racist sentiment of (1) to be conveyed, but they offer different degrees and styles of textual censorship that (arguably) soften the impact of the offensive language. Option (4) is more extreme, since it entirely conceals the specific group targeted by the HS, and it merely indicates that something offensive had been written. These methods may be effective at the lexical level, but more sophisticated handling is required when HS is expressed in phrases or sentences that do not merely contain offensive wordsFootnote 5:

  1. (5)

    Yo if my son comes home & try’s 2 play with my daughters doll house I’m going 2 break it over his head & say n my voice ‘stop that’s gay’.

Simple key-word spotting would not be sufficient to identify the homophobic content of (5). The term ‘gay’ is not inherently offensive since it is regularly used in non-pejorative ways such as ‘Gay Pride’ (unlike, for instance, the word ‘faggot’ when deployed as a homophobic slur). The import of the above tweet is only apparent at a clausal level, but even then the task of determining the referent of the demonstrative ‘that’ requires sentential-level comprehension. If the strategies in (2) above were to be deployed in this case, then the tweet could be modified as follows—but no current wordfiltering software could cope easily with such complex syntactic structures:

  1. (6)

    Yo if my son comes home & try’s 2 play with my daughters doll house I’m going 2 break it over his head & say n my voice ‘stop that’s gay.

Examples like this highlight the subtleties involved in identifying and HS-related content and implementing forms of textual censorship (e.g., ellipses, strikethroughs) that ameliorate the impact of the problematical content. Given the complexity of the task, it is important to consider an alternative form of (potentially temporary) censorship—namely, quarantining. This approach has been commonplace in cyber security applications since the late 1980s, especially as a form of protection against malware.Footnote 6 For instance, Exchange Online Protection (EOP) is a spam and malware filter available as part of the Exchange Online email security service owned by Microsoft (Kjierland and Baumgartner 2018). It can be set to assess whether email is spam via EOP’s own Spam Confidence Level ruleset and the detection scores assigned by the relevant email server. Any mail message that is ranked at a value of (say) five or above from either of these checks is sent to a central quarantine area, where it is retained for 15 days before being deleted. This is just one example of how quarantining is regularly deployed to protect users against software specifically designed and intended to cause particular forms of online harm (e.g., data loss, data theft, server failure).

Via analogy, HS can be viewed as another form of intentional online harm, and therefore it too can be handled by means of quarantining. The analogy is less tenuous than it may sound initially, since online HS is sometimes generated by software (e.g., by Twitter bots) (Fig. 4). Therefore, when considered in relation to the cyber security cases, the quarantining of online HS can be viewed as an extension of existing frameworks for anti-malware protection (Daniel et al. 2019). When applied to the HS problem, quarantining is situated between the two ethical extremes of entirely permitting or entirely prohibiting the posting of certain messages. It enables the recipients of those messages (or other appropriate moderators) to decide (i) whether they wish to read the messages or not, and (ii) if they decide to read them, whether they wish to allow them to be posted or not. If this approach were adopted for a HS posting involving S (the Sender), P (the Posted message), and R (the Recipient), then the main steps in the process would be as follows:

Fig. 4
figure 4

The quarantining process for HS

It is helpful to consider a concrete example. The American skier, Gus Kenworthy, came out as gay in October 2015 (Fig. 5). His YouTube channel was subsequently bombarded with homophobic slurs. In 2018 he posted a video on his YouTube channel with the caption ‘Feeling Like a Champion’, and it received many responses including the following:

Fig. 5
figure 5


Example of homophobic HS directed at American skier, Gus Kenworthy, on YouTube.

As discussed in “The regulation of online hate speech” section, there are currently very few immediate constraints on what anonymous users can post (Fig. 6). Consequently, the damage caused by HS of this kind is potentially instantaneous: it appears as soon as S sends it, and R can potentially read it immediately. There is no mechanism by which R can decide whether or not to read the post. There is no buffer zone. As Charles R. Lawrence emphasised as long ago as Lawrence (1990), ‘[t]he experience of being called “nigger,” “spic,” “Jap,” or “kike” is like receiving a slap in the face. The injury is instantaneous’ (p. 452).

Fig. 6
figure 6

Homophobic HS shown in Fig. 5 quarantined and provided with a graphic indicating degree of severity of the post

By contrast, if quarantining were deployed in these cases, and if a given P were automatically identified as constituting HS, then R would receive an alert such as the following:

R could then decide whether or not to read the post after seeing who has written it (e.g., ‘White Dragon’) and after being informed that it has been specifically flagged up as potentially constituting homophobic HS. R could also receive an indication of the degree of severity of the post by the value specified on the Hate O’Meter graphic. This value can easily be generated from the confidence scores (continuous values in the interval [0,1]) produced by the automated HS detection system. ‘0’ means that the post is not harmful in any way, ‘1’ means that the post is extremely harmful/offensive.

As the above summary indicates, quarantining processes function in the intermediary ethical space between more extreme libertarian and authoritarian approaches. During quarantining, potentially offensive social media posts are neither entirely permitted nor entirely prohibited. Instead, they are held in limbo for a finite period until they have been appropriately assessed by the relevant recipients or moderators. A framework of this kind offers a better balance between positive and negative liberty (to use Isiah Berlin’s well-worn distinction) than current conventions (Berlin 1969). While senders are still free to write whatever they wish (positive liberty), the recipients have the opportunity to decide which kinds of messages they wish to receive (negative liberty). The framework also offers various options concerning the degree to which this safeguarding is applied. The following implementation methods are just three of the numerous possibilities, and, as before, each scenario involve S posting the message P of which R is the direct recipient:

  • The Bipartite Method: S and R (and no one else) receive an HS alert; if they both consent to the message appearing, then P becomes visible for all other indirect recipients to read.

  • The Multipartite Method: even if S and R consent to post P, all other users (the indirect recipients) receive the alert when they first encounter P, and they can only access the contents if they give their consent.

  • The Elective Method: all users can specify the degree of online HS protection they desire. For instance, in a settings file they can specify that they want to be safeguarded from, say, racist HS, and/or homophobic HS, and/or sexist HS, and so on—or simply from all kinds of HS. Consequently, users will receive alerts for any P that falls into one of the HS subtypes from which they have chosen to be protected.

Clearly, these options involve different degrees of ‘friction’, where, in IT-related discourse, ‘friction’ denotes any process that prevents users/customers accessing as rapidly as possible the goods or services they require. For instance, having to select answers from pop-up windows, having to fill in sign-up forms, failing to find relevant product specification information, and encountering long load times—these are all examples of online friction. Such experiences may annoy users, and, in extreme cases, cause them to change to other service providers (Facebook 2019c). Consequently, for many AICT developers, zero-friction systems are self-evidently an ideal. However, it should be clear from the preceding discussion that there are numerous situations in which some friction is highly desirable. Although social media platforms have generally tried to facilitate low-friction interactions (e.g., making it as quick and easy as possible to upload and/or share photos, audio files, documents, messages), there are situations in which this can be problematical. Adopting a high-level perspective, the Bipartite Method has the least total friction since only two users encounter the HS-related alert, while The Multipartite Method has the greatest total friction since all users encounter the alert. Crucially, in the case of The Elective Method, the degree of personal friction is chosen by each user individually. In essence they become ‘voluntary consumers’ of it, and the users freely opt for a degree of friction (Sumner et al. 2011, p. 18). This does not prevent them subsequently asking the service provider to remove the HS content (as is currently the case), but it does give them more control over whether or not they access that potentially harmful content in the first place. In all of these cases, the friction itself could function as a deterrent, since it is more tiresome for S to post HS if doing so constantly triggers an alert that S subsequently has to process. The same procedure could also apply to the retweeting and automatic forwarding of potentially harmful messages, since the content of those messages could also be detected automatically and an alert prompted.

These matters are of some importance because, inevitably, there are so many different scenarios in which quarantining methods could be deployed. Before considering some of these in more detail, we will distinguish formally between a Direct Recipient (DR) of a post (i.e., a person to whom the message is purposefully sent) and an Indirect Recipient (IR) (i.e., a person who might come across the post on a social media feed even though he or she was not a DR). In some situations, there is only one S and one DR (e.g., a 1-to-1 Facebook Messenger post), but in other situations there can be multiple DRs and IRs. Given this, consider the two following cases:

  • Case 1: S is a neo-Nazi. S seeks to post anti-Semitic HS on his own public social media feed (i.e., S = DR). S receives an alert and consents to the message appearing; therefore, the message is posted and can be viewed by IRs who happen to read S’s feed.

  • Case 2: S and DR are both neo-Nazis. S seeks to post anti-Semitic HS on DR’s public social media feed. S and DR both receive an alert and consent to the post; therefore, the message is posted and can be viewed by IRs who happen to read DR’s feed.

In these cases, the S and the DR are both wish to propagate HS, therefore they always give their consent when the intended post triggers an alert. The concerns potentially arise when the IRs are considered. Even though an IR voluntarily chooses to view someone else’s social media feed, that person may still be offended by a certain post encountered there. This is why Twitter has introduced a warning concerning ‘sensitive content’ (see Fig. 7). If a user’s feed is already known to potentially include such material (even though it is not entirely clear how ‘sensitive’ should be defined), the entire page may be blocked initially, and it will only be released if specifically requested by the viewer.

Fig. 7
figure 7


Twitter warning and temporary blocking of user’s feed (names and images redacted by authors.

How a quarantining system handles these cases would depend on the implementation method adopted. If either the Multipartite or Elective Methods were adopted, then IRs who had chosen to be fully protected from HS would receive an alert, and could then decide whether or not to view the posted message. In practice, the system could be implemented in a similar manner as Parental Guidance Locks on internet streaming/catch-up services such as BBC iPlayer (2019). In other words, parents could set up a social media account for their child and specify appropriate HS protection levels. If an alert were triggered, then the parents could enter a password to access and assess the quarantined post, to determine whether or not it should be posted on their child’s social media feed.

While the emphasis so far in this discussion has fallen exclusively on HS, it is important to recognise that extremist groups often deploy a wide range of linguistic strategies when seeking to attract and recruit followers who may be sympathetic to their ideologies—and HS is only a part of this. Indeed, more subtle and complex techniques may not display explicit HS properties at all, the use of positive, euphemistic, and/or more abstract rhetoric may appeal to potential adherents just as effectively as HS. For instance, the terrorist group ISIS frequently used triumphant terminology like ‘brothers rise up’ and ‘claim victory’ in their recruitment strategies on social media (see, e.g., Awan 2017), while the controversial British far-right activist Tommy Robinson has repeatedly claimed he is attacking the ‘fascist ideology’ of Islam rather than Muslims specifically (Union Magazine 2015). Clearly, the recruitment functions of such discourses need to be examined attentively, and their diverse and multifaceted nature goes far beyond the specific task of HS detection for the purposes of quarantining.


This article has explored the problem of online text-based HS, and the ethical implications of the various strategies for dealing with this problematical phenomenon have been discussed. The current TL2 regulatory frameworks were described, and some of the problems resulting from this kind of reactive self-regulation were outlined. Crucially, it was suggested that they are undesirably ineffectual, especially when viewed from a patient-oriented ethical perspective. State-of-the-art methods for the automatic detection and classification of HS were then summarised, before the main emphasis shifted to the way in which these technologies might eventually be used when their performance has improved. In particular, quarantining has been explored as a viable approach that strikes an appropriate balance between libertarian and authoritarian tendencies. In this framework, HS is treated like a form of malware, and while the senders of the HS are not censored in a crude unilateral matter, the recipients of the HS are given the agency to determine how they wish to handle the HS they have received. This approach potentially preserves freedom of expression, but the harm caused by HS is still controlled in a safe fashion by those most directly affected.

The need to consider the various available strategies for handling online HS has never been more urgent. The UK government has recently stated explicitly that.

By designing safer and more secure online products and services, the tech sector can equip all companies and users with better tools to tackle online harms. We want the UK to be a world-leader in the development of online safety technology and to ensure companies of all sizes have access to, and adopt, innovative solutions to improve the safety of their users. (HM Government 2019, p. 77)

Given this, the importance of technological infrastructures that can facilitate the development of safer and more ethical online products should be all too apparent. And handling online HS convincingly and effectively is simply one part of a complex whole.

It is crucial to recognise, though, that the various methods considered in this article are all entirely text based. That is, they rely on sequences of (written) words being analysed in ways that enable the detection and classification of HS to be accomplished automatically. Recognising this, it is important to acknowledge that online communication is frequently and increasingly multimodal, rather than monomodal. Multimodality Studies has emerged over the last 20 years or so, largely due to the pioneering work of theorists such as Gunther Kress, Carey Jewitt, Jeff Bezemer, and Theo van Leeuven. Therefore, it is now widely accepted that different modes (e.g., image, writing, gesture, music) have different semiotic affordances (e.g., the materials the image is made from, the syntax of written language), and that these are orchestrated by meaning-makers to form multimodal ensembles. For instance, an online advert might contain moving images, written text, background music, a spoken voice-over, and so on—and these all combine to convey the meaning of the advert.Footnote 7 One inevitable consequence of multimodal approaches to meaning making is that language is necessarily displaced from a position of centrality in the analytical framework. It loses its privileged status as the primary agent of meaning making, and becomes merely one of many possible ways of creating and communicating semantic content. This obviously creates problems for automated HS-detection systems that are exclusively text based. For example, consider the following (notorious) meme, which was posted on Facebook in 2013 (see Fig. 8). It was not removed, despite requests from users, even though images of women breastfeeding were removed (Bates 2013).

Fig. 8
figure 8


Example of multimodal HS content.

This blatantly misogynistic meme combines the modes of image and writing, and the offensive nature of the whole arises from the juxtaposition of the parts. Taken in isolation, the text is not necessarily problematical: it is not inherently sexist in-and-of itself, and, in certain contexts, it could presumably constitute benignly humorous advice about sexual health and family planning. However, when presented with this particular image, the meaning of the text changes, and the violently misogynistic, connotations become apparent. Nonetheless, the multimodal character of the whole means that no current text-based HS-detection systems would classify the meme as being an instance of HS. Despite the conspicuous nature of this problem, research programmes focused on multimodal approaches to HS detection and classification have only just started to emerge (Hosseinmardi et al. 2015; Zhong et al. 2016).Footnote 8 Clearly, there is much that remains to be accomplished.