Taboo language across the globe: A multi-lab study

Sulpizio, Simone; Günther, Fritz; Badan, Linda; Basclain, Benjamin; Brysbaert, Marc; Chan, Yuen Lai; Ciaccio, Laura Anna; Dudschig, Carolin; Duñabeitia, Jon Andoni; Fasoli, Fabio; Ferrand, Ludovic; Filipović Đurđević, Dušica; Guerra, Ernesto; Hollis, Geoff; Job, Remo; Jornkokgoud, Khanitin; Kahraman, Hasibe; Kgolo-Lotshwao, Naledi; Kinoshita, Sachiko; Kos, Julija; Lee, Leslie; Lee, Nala H.; Mackenzie, Ian Grant; Manojlović, Milica; Manouilidou, Christina; Martinic, Mirko; del Carmen Méndez, Maria; Mišić, Ksenija; Chiangmai, Natinee Na; Nikolaev, Alexandre; Oganyan, Marina; Rusconi, Patrice; Samo, Giuseppe; Tse, Chi-shing; Westbury, Chris; Wongupparaj, Peera; Yap, Melvin J.; Marelli, Marco

doi:10.3758/s13428-024-02376-6

Original Manuscript
Open access
Published: 09 May 2024

Volume 56, pages 3794–3813, (2024)

Behavior Research Methods

Abstract

The use of taboo words represents one of the most common and arguably universal linguistic behaviors, fulfilling a wide range of psychological and social functions. However, in the scientific literature, taboo language is poorly characterized, and how it is realized in different languages and populations remains largely unexplored. Here we provide a database of taboo words, collected from different linguistic communities (Study 1, N = 1046), along with their speaker-centered semantic characterization (Study 2, N = 455 for each of six rating dimensions), covering 13 languages and 17 countries from all five permanently inhabited continents. Our results show that, in all languages, taboo words are mainly characterized by extremely low valence and high arousal, and very low written frequency. However, a significant amount of cross-country variability in words’ tabooness and offensiveness proves the importance of community-specific sociocultural knowledge in the study of taboo language.

Everyday communication is full of socially inappropriate words that are considered linguistic taboo. We are taught not to use them in conversation, even though we produce taboo words from the very moment we start speaking (Jay & Jay, 2013), and keep doing it throughout our lives. We also produce them while sleeping (Arnulf et al., 2017) or when acquired language disorders severely impair any other word production (Van Lancker & Cummings, 1999). As adults, 0.5% of the words we produce (i.e., ~80 words per day; Mehl et al., 2006) and 1% of the words we write on Twitter are taboo words (Wang et al., 2014). We use taboo words even though doing so is socially inappropriate, forbidden, and (in some countries) even legally punished. We do so because taboo language is an extremely powerful linguistic tool that fulfills a wider range of psychological and social functions than any other word category. Swearing allows us to induce emotional reactions (Sheidlower, 2009), insult others (Croom, 2011), increase the vividness of what is said (Azzaro, 2018), intensify emotional communication (Jay & Janschewitz, 2007), reinforce message effectiveness (Cavazza & Guidetti, 2014), increase the perceived credibility of the speaker (Rassin & Heijden, 2005), regulate emotions and reduce pain (Stephens & Umland, 2011), promote group bonding and reinforce group identity (Daly et al., 2004; Montagu, 2001), and elicit humor (Blake, 2018). Moreover, unlike all other words, taboo words (and in particular swear words) are used almost only with a connotative function (i.e., they do not refer to their literal meaning; Finkelstein, 2018; Jay & Janschewitz, 2008).

We do not all swear the same. Frequency of swearing is associated with personality traits (e.g., high scores of agreeableness and conscientiousness, as measured by the Big Five personality test, are associated with low frequency of swearing; Mehl et al., 2006), social factors (e.g., group identity; Daly et al., 2004), gender (men swear more frequently in public and use more offensive words than women; Jay, 2009), and idiosyncratic pragmatic factors, such as the conversational topic, the setting of the conversation (i.e., public/private, formal/informal), or the speaker–listener relationship (Jay & Janschewitz, 2008; Johnson & Lewis, 2010).

Studies investigating how taboo words are processed indicate peculiar properties of this category. Taboo words are remembered better than other words (MacKay et al., 2004), capture people’s attention (Carretié et al., 2008; MacKay et al., 2004), exert a detrimental effect on word recognition (e.g., Sulpizio et al., 2019) and speech production tasks (e.g., White et al., 2017), require a higher level of cognitive control (Dhooge & Hartsuiker, 2011; Scaltritti et al., 2021), increase the arousal level of the sympathetic nervous system (Harris et al., 2003; McGinnies, 1949), and persist in severe acquired language disorders hindering any other linguistic production (Van Lancker & Cummings, 1999).

Despite its wide use and relevance in fulfilling multiple social and psychological functions, we know very little about what taboo language is and what constitutes it across different populations, languages, and cultures. All our empirical knowledge on taboo language comes from a relatively small set of studies, almost entirely conducted in English and with limited cultural diversity. This poses two main theoretical problems. First, taboo language is highly conditioned by sociocultural factors, so what constitutes taboo can only be determined within a specific sociocultural environment. Hence, the currently available evidence offers an extremely restricted picture of the phenomenon. Second, because of this, even the composition, and thus the definition, of the taboo taxonomy is blurred. There is no agreement on the types and the number of categories characterizing taboo words (Jay, 2009; Stapleton, 2010). Finally, related to this last issue, it is still unclear what makes a word taboo. In terms of semantic properties, emotional aspects have been suggested to play a central role (Hansen et al., 2017; Jay & Jay, 2015). However, emotionality might not be enough to precisely characterize taboo words, which would otherwise be indistinguishable from other emotional words. Other properties that are typically considered are offensiveness (i.e., how a person perceives a word as inappropriate) and tabooness (i.e., how a person believes the society considers that word inappropriate; Jay, 1992). Nonetheless, while the use of the latter property makes the definition tautological, the former seems not to be a necessary property of swearing. For example, the English words sex or vagina are generally not offensive but are taboo in some social circumstances (in the data presented below, these are the words with the largest discrepancy between tabooness and offensiveness). The specific lexico-semantic characterization of taboo words is still to be determined, as it is still unknown whether and to what extent taboo words can be differentiated from non-taboo words on the basis of their lexical and semantic properties.

The present study aims at providing a first step towards filling these gaps, by collecting and characterizing taboo words in 17 different countries and 13 different languages (including some typically overlooked ones), covering all five permanently inhabited continents. In addition to offering a window into taboo language around the world, our study offers the unique chance to tease apart cross-linguistic from cross-cultural differences by analyzing the behavior of participants that speak country-based varieties of the same language (e.g., English in Canada and in Singapore). Importantly, taboo words in our study are defined in a strictly bottom-up manner based on speakers’ productions (Study 1). This allows us to establish what each community actually considers taboo without introducing any bias due to the researchers’ idiosyncrasies and normative definitions, and to identify commonalities and differences across languages and countries. In Study 2, we systematically collect intuitions about several semantic measures for each of the produced words to determine the combination of semantic features that best characterize the taboo dimension, and to evaluate their consistency across languages and cultures. Taken together, our results achieve two important goals: Theoretically, they contribute to a better general definition and understanding of taboo words and swearing across languages and cultures. Methodologically, they form a very rich database to study taboo language both per se and in relation to its several social and psychological functions.

Study 1: Identifying taboo words

Methods

Participants

We collected data in 18 labs from 17 countries (Australia, Belgium, Botswana, Canada, China [two labs, one in Beijing and one in Hong Kong], Chile, France, Germany, Finland, Italy, Serbia, Singapore, Slovenia, Spain, Thailand, United Kingdom, United States of America [US]), covering all five permanently inhabited continents and 13 different languages (Cantonese, Dutch, English, Finnish, French, German, Italian, Mandarin, Serbian, Setswana, Slovenian, Spanish, Thai), with some of these (i.e., English and Spanish) spoken in multiple countries. These languages are spoken as a native language by more than 2 billion people in the world (i.e., ~25% of the global population, data from Wikipedia).

The total number of participants was 1046 (see Supplementary Table 1 for details), with each lab collecting data from at least 40 participants (40 to 167). Only native speakers of the language in question who lived in the country in question and who were not suffering from language-related and/or learning disabilities were included. Supplementary Table 1 reports participants’ details per lab as well as information concerning the ethics approvals obtained by each lab involved in the project.

Procedure

In each of the labs, a local coordinator managed all aspects of the study. The coordinator was a native speaker of the language in question living in the culture in which data collection occurred, or was flanked by another researcher who was a native speaker of the language in question and was living in the culture in which data collection occurred.

Participants were asked to freely write down all the taboo words they could think of. Both single-word and multi-word expressions were accepted, and examples were provided for both cases. There was neither time pressure nor any time restriction to complete the task. In the instructions, we specified that participants were free to write whatever came to their minds and encouraged them to avoid self-censorship. Instructions (in English) are reported in Fig. 1.

The instructions were provided to all the labs, which were asked to translate them into the local language and then back-translate them into English (translation and back-translation were not required for labs collecting data in English). Translation and back-translation were carried out by different persons, and the back-translation was compared to the original version as a sanity check. Details about the data collection modality for each lab are reported in Supplementary Table 1.

For each sample, all participants’ productions were combined. In each lab, a researcher went through the list and (a) checked all the productions and corrected for possible minor errors (e.g., typos); (b) for non-English languages, provided an English translation of each word; and (c) on the basis of their intuition as a native speaker knowledgeable about the respective culture, classified each word (using a simplified taxonomy based on Jay, 2009) as belonging to one of the following categories, for which definitions were provided to researchers: insult; slur; sexual; scatological referents and disgusting objects; profanities/blasphemies. When appropriate, researchers were invited to classify the same production in more than one category. When words could not be classified within the existing categories, researchers were allowed to create new categories. All annotated data are available at https://osf.io/ecr32.
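
As a rough illustration of the pooling and annotation step described above, the following Python sketch combines the free productions of one sample and attaches a translation and one or more categories to each distinct word. The input lists, the `Annotation` record, the normalization rule, and the example items are hypothetical and are not taken from the published dataset. Counting each participant at most once per word is likewise an assumption, not a documented detail of the procedure.

```python
from collections import Counter
from dataclasses import dataclass, field

# Category labels follow the simplified taxonomy listed in the text.
CATEGORIES = {
    "insult",
    "slur",
    "sexual",
    "scatological referents and disgusting objects",
    "profanities/blasphemies",
}


@dataclass
class Annotation:
    """Researcher-provided annotation for one distinct word (hypothetical layout)."""
    translation: str
    categories: set = field(default_factory=set)


def pool_productions(productions_by_participant):
    """Count how many participants in a sample produced each word.

    Assumption: a participant counts once per word, even if they wrote it twice.
    """
    counts = Counter()
    for words in productions_by_participant:
        # Minimal cleanup standing in for the manual correction of minor errors.
        normalized = {w.strip().lower() for w in words if w.strip()}
        counts.update(normalized)
    return counts


# Hypothetical productions from three participants of one Italian-speaking sample.
sample = [
    ["Merda", "stronzo"],
    ["merda ", "cazzo", "merda"],
    ["stronzo", "merda"],
]
counts = pool_productions(sample)

# Hypothetical annotations; a word may belong to more than one category.
annotations = {
    "merda": Annotation("shit", {"scatological referents and disgusting objects"}),
    "stronzo": Annotation("asshole", {"insult"}),
    "cazzo": Annotation("dick", {"sexual"}),
}

for word, n_participants in counts.most_common():
    note = annotations[word]
    print(word, n_participants, note.translation, sorted(note.categories))
```

Keeping the categories as a set per word reflects the fact that researchers could assign the same production to more than one category.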

Note that since no single speaker knew all the languages involved, we cannot guarantee that the very same classification and translation criteria were applied for all languages. Therefore, the information collected in (b) and (c) only provides a pointer to the word meaning, so that readers who do not speak the language have an opportunity to understand all the items in the dataset. However, we emphasize that this information should only be considered as a general reference and treated very carefully for any form of quantitative analysis.

Statistical considerations

In the linear mixed-effects model (LMM) analyses reported here, the 18 different samples served as our basic unit of observation; therefore, all LMMs reported here contain random intercepts for the samples in addition to the fixed effects specified in the individual analyses. We estimated the LMMs in R (R Core Team, 2022) using the packages lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017).
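
As a reading aid, a random-intercept model of the kind described above can be written in the following generic form. The notation is illustrative and not the authors' own: $y_{ij}$ is the outcome for observation $i$ in sample $j$, $\mathbf{x}_{ij}$ collects the fixed-effect predictors of a given analysis, and $u_j$ is the intercept specific to sample $j$.

```latex
y_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2),
\qquad j = 1, \dots, 18
```

In lme4 formula syntax, the random intercept corresponds to a term of the form `(1 | sample)`, where `sample` is a hypothetical name for the column identifying the 18 samples.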

Results

In Study 1, participants from 17 countries (see Fig. 2) were asked to freely generate any taboo words they could think of. The total number of words produced varies greatly between samples (see Fig. 3).

In a qualitative exploratory analysis, to assess the cross-language variability in our data, we manually inspected the 10 most frequently produced words in each sample and categorized them by means of their English translations (treating words with near-synonym translations as the same word). As can be seen in Fig. 4, there is a certain degree of consensus across samples: Some words are found among the most frequent words in many if not most languages. Variations of cunt (especially when also considering mother’s cunt) were seen in all samples, and those of bitch in almost all samples. Six additional items (dick, faggot, nigger, fuck, shit, and ass) were produced by about half of the samples. Also, with only a few exceptions, most samples produce around one third to a half of the 17 items (6–10 items), again suggesting some overlap in participants’ intuitions across languages and cultures. Note that almost all of these words are produced by some participants in every sample, but not frequently enough to appear among the ten most frequent words.
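
The overlap analysis described above can be sketched as follows: for each sample, take the most frequently produced words, map them to their English translations so that near-synonyms collapse into one item, and count in how many samples each item appears. This is a minimal sketch under stated assumptions. The function names, the toy counts, and the translation tables are hypothetical, and the manual judgment about which translations count as near-synonyms is replaced here by a simple lookup.

```python
from collections import Counter


def top_items(counts, translations, n=10):
    """Return the English glosses of the n most frequently produced words.

    Words sharing a gloss are merged before ranking; words without an entry
    in `translations` are treated as already being English.
    """
    merged = Counter()
    for word, freq in counts.items():
        merged[translations.get(word, word)] += freq
    return {gloss for gloss, _ in merged.most_common(n)}


def overlap_across_samples(samples, n=10):
    """Count, for each gloss, in how many samples it is among the top n."""
    presence = Counter()
    for counts, translations in samples.values():
        presence.update(top_items(counts, translations, n))
    return presence


# Hypothetical toy data: (production counts, word -> English gloss) per sample.
samples = {
    "sample_en": ({"shit": 30, "fuck": 25, "ass": 5}, {}),
    "sample_it": (
        {"merda": 20, "cazzo": 18, "stronzo": 4},
        {"merda": "shit", "cazzo": "dick", "stronzo": "asshole"},
    ),
    "sample_de": (
        {"scheiße": 22, "scheisse": 6, "arsch": 10},
        {"scheiße": "shit", "scheisse": "shit", "arsch": "ass"},
    ),
}

# n=2 keeps the toy example small; the analysis in the text uses the top 10.
presence = overlap_across_samples(samples, n=2)
print(presence.most_common())
```

An item with a count equal to the number of samples is among the most frequent words everywhere, which is the pattern the text reports for variations of one item across all 18 samples.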