Introduction

The Sapir–Whorf or linguistic relativity hypothesis—henceforth, simply “relativity” (Whorf, 1956)—takes various forms, but at its heart it contends that the idiosyncrasies of the languages we speak influence the way we think about the world. In its strongest incarnation—linguistic determinism—thought is constrained by language, but few if any contemporary scholars take this view (Athanasopoulos, 2009). At the opposite extreme is the “universalist” position, in which thought is said to be independent of language (e.g., Pinker, 1994; see Lucy, 2016, for a historical overview). Although there is no agreement as to where between these two opposing views the truth is situated, the broad consensus is that neither extreme is correct (Gleitman & Papafragou, 2013). In the middle are a variety of standpoints, some of which are more limiting of the role of language than others. For example, thinking for language (i.e., thinking about speaking) might allow language to bias our attention toward the more linguistically describable aspects of what we perceive (Slobin, 1996), but this does not mean that language alters the underlying conceptual structures. Another standpoint is that language can influence our judgments about what we perceive, but not perception itself, which is cognitively impenetrable (see Firestone & Scholl, 2016, for debate; see also Pylyshyn, 1984). For many, our perceptions are perhaps best described as modulated or biased by language (Athanasopoulos, 2006; Dolscheid, Shayan, Majid, & Casasanto, 2013; Gilbert, Regier, Kay, & Ivry, 2006, 2008; Lupyan, 2012). Overall, investigating the ways in which language does and does not relate to thought now appears to be the prevailing approach (Lucy, 2016; Lupyan, 2012; Thierry, 2016).

Evidence for language influencing thought and judgments about perceptions comes from various domains. For example, performance on color discrimination/matching tasks has been shown to be more efficient when the stimuli have different labels in a participant’s language relative to when they are subsumed under one term (e.g., Roberson, Pak, & Hanley, 2008; Winawer et al., 2007). Differences between languages have also been demonstrated to affect how people think about object relations (Park & Ziegler, 2014) and objects themselves (Imai & Gentner, 1997) as well as broader, more abstract concepts such as quantity (Athanasopoulos, 2006), time (Boroditsky, 2001; Casasanto et al., 2004), and motion (Athanasopoulos & Bylund, 2013). Some of these studies have generated vigorous debate, perhaps most notably in the area of color perception (e.g., A. Brown, Lindsey, & Guckes, 2011; Franklin, Clifford, Williamson, & Davies, 2005; Witzel & Gegenfurtner, 2013). For some, aspects of grammar might provide a better tool by which relativity can be explored than vision-focused research; this is because grammar is unaffected by sensory input, obviating the need for language to “breach” psychophysical barriers. Additionally, for the relativity hypothesis to hold, it should not be limited to category labels but extend to features of syntax too (Sato & Athanasopoulos, 2018).

One aspect of grammar that has received attention in recent years is grammatical gender (Bassetti & Nicoladis, 2016; Bender, Beller, & Klauer, 2018). Unlike English, which has a semantic (or conceptual) gender system whereby the gender of a noun is dictated by the biological sex of its referent, many languages have a formal system by which all nouns are assigned to a grammatical gender category whether their referents have biological sex or not. For example, the English word “bed” has no gender and is referred to with the pronoun “it,” but its Italian equivalent (il letto) takes the masculine gender. The consequence of a formal grammatical gender system is obligatory conformity or agreement with the syntactic rules of that class. This may involve marking for gender any definite or indefinite articles (“the,” “a/an”), plural markings, and case markings, as well as other forms of agreement. Different languages also assign different genders to the same objects; for example, in contrast to Italian, “bed” takes feminine grammatical gender in Spanish (la cama). Such differences are not at all exceptional; grammatical gender assignment in a given language is largely arbitrary (Corbett, 1991; Foundalis, 2002), and “escapes logic” (Boutonnet, Athanasopoulos, & Thierry, 2012). Indeed, as English and other languages with purely semantic gender show, a formal grammatical gender system is also unnecessary when it comes to the communicative function of a language.

The crucial exception to arbitrariness in grammatical gender assignment concerns words referring to entities, usually human, with actual biological sex. “Woman” is feminine and “man” masculine across German, French, Italian, and Spanish, and it is presumably from these associations (see Footnote 1) that grammatical “gender” receives its name. Occasionally, the correlation between grammatical gender and biological sex is imperfect; witness German, which has a third, neuter gender category in addition to masculine and feminine, and which assigns this gender to the word for “girl” (das Mädchen). Nevertheless, even in German the vast majority of words referring to humans that have biological sex show a strong pattern of masculine with males and feminine with females. In this sense, languages with a formal gender system can be viewed as having a largely semantic gender system for human agents at least.

It is the combination of the arbitrariness of grammatical gender assignment and the link with biological sex that has attracted researchers to grammatical gender when examining relativity (e.g., Bassetti, 2007; Sato & Athanasopoulos, 2018). The question asked in such studies is whether grammatical gender assignment “rubs off” on the conceptual representations of inanimate objects that have no biological sex such that, for example, Italian speakers conceptualize beds as somehow more masculine than do speakers of a language with no formal grammatical gender system, or speakers of a gendered language that assigns “bed” a different gender. Although historically seen as a quirk of grammar, grammatical gender is therefore regarded as being in a particularly strong position to inform long-standing philosophical, linguistic, and psychological debates around the universality or otherwise of human thought (Phillips & Boroditsky, 2003).

A variety of tasks have been employed to understand whether grammatical gender does influence concepts. The most common has been the voice choice task (16 different publications using this task were discovered in this review), in which participants are asked to assign a male or female voice to objects, with many finding that the sex of the voice and the grammatical gender of the target are indeed broadly consistent (e.g., Kurinski, Jambor, & Sera, 2016; Lambelet, 2016; Ramos & Roberson, 2011; Sera, Berge, & del Castillo Pintado, 1994; Sera et al., 2002). Similar results have been found when asking participants to assign a human name or a sex to an object (“sex assignment” tasks; e.g., Belacchi & Cubelli, 2012; Flaherty, 2001), and when asking participants to rate on a scale the similarity (“similarity task”) between pictures of male and female humans and objects (Phillips & Boroditsky, 2003). In object–name memory association tasks, participants are instructed to remember male and female names that substitute object names, such that “chair” might now be “Patricia,” and the results have sometimes shown that the ability to recall the human name is enhanced if it is congruent with the grammatical gender of the object in question (Boroditsky & Schmidt, 2000).

The tasks described above all involve judgments made without time pressure or measurement, but there is also some, albeit limited, evidence in favor of grammatical gender influencing concepts from speeded-response tasks. Converging evidence from different measurement types is important to generate the clearest picture possible of any effect. For example, in the extrinsic affective Simon task (EAST), participants respond to words using two keys; in one condition, each key is mapped to a color, and in another each key is mapped to a sex (male or female). Bender, Beller, and Klauer (2016b) found that German speakers were faster to respond to the color of a word if that word was a gendered, personifiable noun (e.g., death, beauty) and the response key was also mapped to the sex implied by the target’s grammatical gender. However, they also found that the effect was weak or nonexistent for inanimate objects, and only held for personifiable nouns that had connotations of sex that were themselves congruent with the noun’s grammatical gender (e.g., “war” is masculine in connotation and gender, but “Spring” is feminine in connotation but masculine in gender).

The role of gender or sex in the tasks so far described has been clear, because participants are making judgments and responses with biological sex playing an obvious role in this process. Some researchers have preferred tasks in which the salience of sex and gender can be reduced or made more orthogonal to the participant’s conscious experience. For example, Konishi (1993) employed a property judgment paradigm, finding that German speakers rated masculine-gendered objects such as “key,” “table,” and “beach,” as more “potent” (a trait associated with masculinity) than concepts with feminine gender; Spanish speakers, for whom the same targets took the feminine gender, rated the same items as less potent. This type of task makes participants think about concepts without having to relate them to gender or sex directly; indeed, neither term need come up at all in the instructions for such experiments.

Other studies, often using the same or similar designs as those described above, have instead reported potentially important absences of evidence. For example, in a properties judgment task, Landor (2014) asked approximately 600 speakers of gendered languages to generate adjectives to describe inanimate objects, and then asked a different group of participants to assign a male or female voice to them. Cross-referencing these results back to the original stimuli, no evidence of property judgments consistent with grammatical gender was found. Absences of evidence have also been reported in sex assignment tasks (Nicoladis & Foursha-Stevenson, 2012), similarity tasks (Degani, 2007), the EAST (Bender et al., 2018), and priming-type tasks (e.g., Degani, 2007; Samuel, Roehr-Brackin, & Roberson, 2016). These findings are important because they might delineate constraints on the relativity hypothesis, an outcome that, as we discussed earlier, would be consistent with the majority of theoretical standpoints in the contemporary literature.

Further clarifications of such constraints might come from results that do not conform to a binary “support or no-support” distinction. For example, a property judgment task by Semenuks, Phillips, Dalca, Kim, and Boroditsky (2017) found that grammatical gender influenced property judgments, but only in an analysis that left out participants’ first-choice adjectives, focusing on second- and third-choice adjectives alone. The authors suggest that the absence of an effect from the first adjective choice might explain the tendency for more negative findings from speeded-response tasks, but an alternative explanation is that participants fell back on grammatical gender as a strategy after they had made their initial and potentially most faithful descriptions (cf. Bender, Beller, & Klauer, 2016a).

Other studies have shown support for relativity only for limited subsets of results rather than for the full range of predictions. For example, in the study by Konishi (1993) already described, German and Spanish speakers both showed evidence of perceiving objects with masculine gender as more “potent” than objects with feminine gender, with “potent” rated as a masculine trait; however, participants also rated “negative” as a masculine trait, but this property did not transfer to objects. Indeed, even the link with potency involved very small differences between masculine and feminine concepts. Similarly, in a sex assignment task, Nicoladis, Da Costa, and Foursha-Stevenson (2016) found that Russian–English bilingual preschoolers assigned male sex to masculine gender objects more frequently than they did to neuter gender objects, but not more frequently than they did to feminine gender objects; and in a voice choice task, decisions have been found to be consistent with masculine gender but not with feminine gender (Bassetti, 2007).

As we mentioned earlier, the prevailing standpoint is that neither a strong view of relativity nor a universalist view provides an accurate account of the relationship between language and thought, and as a result the absences of evidence and the mixed results hint at precisely the systematic constraints most researchers seek to reveal. It is our view that a systematic review of the empirical literature would be well-placed to begin the process of drawing conclusions concerning the strength, extent, and limitations of any such relationship.

Before describing the results of the review, we outline the six potential constraints or “parameters” that we have extracted from discussions in the literature. Assessment of the influence of these parameters, together with the effects of the large variety of tasks used in this field, runs through our review. Finally, we consider whether the literature supports the view that the effects of grammatical gender are primarily about language, or whether the effects attributed to relativity might instead be explained by statistical co-occurrences and associations that are carried by language, but are not linguistic themselves. Such questions have the potential to go to the heart of contemporary thought about the role of language in cognition (Lupyan, 2012; Thierry, 2016).

The salience of gender/sex in the task

In terms of methodology, many studies that have shown effects of grammatical gender have tended to come from explicit judgments when gender and/or sex is a salient context in the experiment (e.g., Phillips & Boroditsky, 2003; Saalbach, Imai, & Schalk, 2012; Sera et al., 1994; Sera et al., 2002). Indeed, the most common task in the field, voice choice, asks participants to assign a biological sex to an object. Given the nature of the question (i.e., there is no objectively correct answer), the participant might seek some rationale for the choices rather than assign at random. Under such conditions, the chances of grammatical gender being consciously recruited are increased, potentially undermining a conceptual change account. A number of scholars have raised this issue (e.g., Bender, Beller, & Klauer, 2011; Bender et al., 2018; Kousta, Vinson, & Vigliocco, 2008; Pavlidou & Alvanoudi, 2013; Ramos & Roberson, 2011), and Phillips and Boroditsky (2003) have suggested that it is difficult to resolve empirically. For instance, in Study 2 of Sera et al. (1994), reference to “masculine” and “feminine” was removed from the instructions from the previous experiment, but participants were nevertheless still required to assign male or female voices. Even when the role of gender/sex is less prominent in a task, it is often still high. For instance, the EAST (Bender et al., 2016a, 2016b) maps all responses onto the same keys, meaning that even when participants respond according to color, they do so using a key that has also been assigned to a biological sex. This is a necessary facet of a design that relies on mapping two concepts to the same response in order to test for any effect of the overlap. 
Any strategic use of grammatical gender, or simply its recruitment via strong associations with biological sex in a task, might undermine the case for a conceptual-change account of the results. It is therefore clearly important to understand the extent to which the evidence in favor of relativity might rely on such possibilities, which we assess by comparing tasks in which the salience of gender/sex is high with tasks in which it is low.

The salience of language in the task

Another issue concerns the salience of language in the task and whether the design instead indexes language processes, as in “thinking for speaking” (Slobin, 1996). Indeed, it was an early requirement of language-and-thought research that effects of language should be evident on nonlinguistic tasks (R. Brown & Lenneberg, 1954). As with the salience of gender/sex, it is important to understand the extent to which the evidence for relativity might rest upon the recruitment of grammatical gender through language processing rather than any underlying conceptual change. We therefore classified research as being either high or low in terms of the salience of language in the task.

Testing participants in their gendered language

For some, testing that occurs in participants’ gendered language limits inferences to the effects of grammatical gender on concepts within that language, rather than on the concepts themselves (Boroditsky & Schmidt, 2000; Slobin, 1996); if one’s language creates a worldview in which “bed” is masculine, “bed” should be conceptualized as masculine not only when acting in the context of that particular language, but also when acting in a nongendered language, such as English. By definition, this argument suggests that research is best conducted on bilinguals. In favor of such a possibility, testing participants in a second, ungendered language context has also revealed influences of a first-language gender system (Boroditsky & Schmidt, 2000; Phillips & Boroditsky, 2003; Semenuks et al., 2017). To test this particular hypothesis, we classified experiments in terms of whether participants performed in their gendered or nongendered language.

Two-gender versus three-gender languages

Another issue concerns the precise nature of the grammatical gender system under investigation. For example, in Romance languages like Spanish, Italian, and French, most nouns that refer to humans carry grammatical gender that is consistent with the target’s biological sex. In German, however, the correlation is weaker, in part owing to its third, neuter gender, but also because German articles do not always differentiate between genders as a result of the German case system. This can result, for example, in even animates being labeled with a grammatical gender incongruent with their biological sex. The issue of two- versus three-gendered languages therefore concerns whether an influence might be strongest in speakers of languages with two gender classes that form a particularly “tight fit” with semantic gender (see Saalbach et al., 2012; Sera et al., 1994; Vigliocco, Vinson, Paganelli, & Dworzynski, 2005). Results with German speakers on voice choice tasks have indicated that grammatical gender does not influence decisions in the way that, for example, French or Spanish grammatical gender does (Sera et al., 2002). If this pattern were borne out in a broader review, it might suggest a statistical, correlative relationship between biological sex and grammatical gender in relativity (see Footnote 2). For Lucy (2016), the structural consequences of grammatical gender (case markings, adjective agreement, etc.) are too often overlooked and might explain some inconsistencies in results. We classified experiments in terms of whether participants spoke a two- or three-gendered language (see Footnote 3).

Effects with animate and inanimate targets

Another parameter concerns whether grammatical gender might influence the conceptualizations of animate but not inanimate targets (e.g., Vigliocco et al., 2005). For example, participants who speak a gendered language (German) showed a greater willingness to erroneously endorse sex-specific statements about animals if those statements were consistent with the animals’ grammatical gender than did speakers of an ungendered language (Japanese), but the same was not true of inanimate targets (Imai, Schalk, Saalbach, & Okada, 2014). In a series of priming experiments, Bender et al. (2011) asked German speakers to decide the gender of a target word after they had seen either (1) a definite article denoting gender (der for masculine and die for feminine), (2) the words Mann (“man”) and Frau (“woman”), (3) the symbols for male and female, or (4) pictograms of a man or woman. They found that the congruent linguistic article primes sped up judgments relative to incongruent trials, for both animate and inanimate targets. However, of the other primes, only the Mann/Frau pictograms had an effect, and only on animate targets. The researchers concluded that the grammatical gender of objects does not appear to “seep” into the semantic content of inanimate nouns. A broadly similar pattern was found in the properties judgment task by Semenuks et al. (2017). Results like these have led some scholars to suggest that grammatical gender is only relevant to conceptualization when sex is a relevant property of the target (Ramos & Roberson, 2011; Vigliocco et al., 2005). If this is true, then relativity is subject to an important constraint, namely that grammatical gender only interacts with targets that have biological sex in the first place. We classified experiments in terms of whether participants responded to animate or inanimate targets.

Stronger effects in adults than children

Finally, studies with adults have been thought to provide more supporting evidence for relativity than studies with children, and particularly very young children. For example, only six out of 18 Spanish-speaking 3- to 5-year-olds freely sorted pictures of inanimate objects and male and female people into groups defined by grammatical gender (Martinez & Shatz, 1996), and only very limited effects of grammatical gender on object conceptualization, or indeed no effect at all, have been found in other studies with young children (Bassetti, 2007; Nicoladis & Foursha-Stevenson, 2012; Sera et al., 2002). Weaker effects at younger ages are consistent with the possibility that it is experience with grammatical gender that leads to biological sex connotations with objects, but also with the possibility that it is metalinguistic knowledge of grammatical gender acquired through formal instruction, rather than grammatical gender assignment itself, that might explain some positive results. We classified studies in terms of whether the participants were children (< 18 years old) or adults.

An existential question for relativity: What is “gendered” about grammatical gender?

A crucial question that underpins everything in this review concerns the nature of grammatical gender itself. It has been pointed out that the “gender” in grammatical gender is not intrinsic to language but is itself an arbitrary, human-made label (Bender et al., 2018). At some point in their lives, speakers of gendered languages usually learn that the names for the categories are “masculine” and “feminine,” but would they ever choose those names without formal instruction? As was highlighted by Foundalis (2002), “masculine” and “feminine” are in fact poor predictors of the majority of nouns in their class, and even the relationship between the meanings of the words gender and sex is stronger for speakers of nongendered languages, such as English; speakers of gendered languages would not usually use the translation equivalent for “gender” in grammatical gender to refer to biological sex at all.

Grammatical gender has usually been perceived as a particularly useful tool to study relativity because its assignment patterns are so arbitrary and have no psychological reality outside of language itself, but the same point could be leveled, with more detrimental consequences for the research paradigm, at the titles of the categories themselves, which are metalinguistic labels (they describe something about language) detached from the use of the grammar itself. Grammatical gender therefore appears to suffer from an identity problem that other tools used to investigate relativity do not. To illustrate, we might as a thought experiment replace the terms “masculine” and “feminine” with “plant” and “nonplant,” “sky” and “earth,” “Group 1” and “Group 2,” “x” and “y,” or any number of arbitrary labels, without interfering with the performative use of a gendered language itself. We might predict that if we told a cohort of Spanish speakers to go about their day “believing” that the titles of the classes had changed to “x” and “y,” they could simply continue to refer to la mesa (“the table,” feminine) and el libro (“the book,” masculine) with no real-world consequences. If, however, we asked the same cohort to randomly shuffle their color-word mappings for a day, such that they might need to use the Spanish word for “blue” where they would normally use the Spanish word for “green,” or to swap their spatial prepositions, such that “over” might become “under,” we would be asking them to violate the rules of language use, and we would expect there to be errors.

If the labels “masculine” and “feminine” are in a sense historical accidents, this has consequences for the use of these categories in the relativity paradigm, because effects that have previously been attributed to conceptual change might in fact be the result of simple statistical co-occurrences and associations between two groups of labels: the entirely arbitrary, human-made labels of metalinguistic grammatical classes, on the one hand, and the similarly arbitrary “gender” assignment to noun labels, on the other. This would be the equivalent, for example, of finding that English speakers conceive of the digits 3, 5, 7, and 9 as somehow “weirder” than 2, 4, 6, and 8, because the former are arbitrarily labeled as “odd.” Although language would be the vehicle of any statistical associations, the outcome becomes trivial in the context of classic views of relativity as engaging conceptual change.

What patterns would a statistical co-occurrence account predict? We would expect five out of the six parameters described to influence results. First, speakers of a two-gendered language should show stronger effects than speakers of three-gendered languages, because the relationship between human biological sex and grammatical gender is reinforced through greater repetition and a stronger gender/sex correlation in human animates, at least when compared to the three-gendered language most commonly used in the field (German). We would additionally predict that performing in a gendered language would give rise to such associations in a way that performing in a nongendered language like English might not. We would predict that the more salient the roles of both gender/sex and language in the experiment, the greater the opportunity for associations between biological sex and language to be engaged. Finally, we would expect the presence of animate targets to elicit biological sex information more than inanimate targets. A sixth possibility, albeit a theoretically more tenuous one, is that adults would have more strongly reinforced associations than children, owing to their greater quantitative experience of language. Interestingly, all of these parameter settings have already been cited in the literature as conducive to finding effects of grammatical gender, although these suggestions have not yet been supported by a systematic review. Assuming that the strong version of relativity is wrong (as we do), it then becomes important to decide whether the effects of grammatical gender are statistical and associative or an effect of language on the fabric of conceptual representation itself.

Review

The remit of the review

This review includes (1) empirical research with (2) human participants and (3) real languages and words (rather than languages and words invented for the purposes of experimentation), (4) that is either published or unpublished and (5) is reported in English (see Footnote 4). Studies were also required to test the influence of grammatical gender on at least some nonhuman targets (excluding, therefore, studies on grammatical gender and gender stereotyping of men and women).

Formal searches were conducted encompassing the years 1990 to 2018 using the search terms “grammatical” and “gender” together, once with “Whorf” and once with “relativity,” in Web of Science (all) and Google Scholar (first five pages). Additionally, we searched the EThOS PhD thesis bank using the terms “grammatical” and “gender” for the broadest possible range of results and the NDLTD thesis bank using “grammatical gender,” once with “Whorf” and once with “relativity.” All studies from a recent special issue of the International Journal of Bilingualism (volume 20, issue 1) were included where relevant. To ensure maximal representation of relevant data and to minimize the potential for skewed results owing to potential publication biases (e.g., De Bruin, Treccani, & Della Sala, 2015), we also emailed the corresponding author of every study revealed in the first stage of the review to request any unpublished or in-press results. Finally, a call for data was also issued on the ResearchGate website as part of a project linked to the present review. A further 13 pieces of empirical research not turned up by these searches were added, either because they were known to the authors or were offered/recommended to the authors during this contact phase. The full list of items included and excluded from the review can be found in the supplemental online materials (SOM1).

Classifications by task

Owing to the heterogeneity of methodologies in the field, we opted to sort the individual experiments into eight task types: voice choice, properties judgments, EAST, sex assignment, priming, similarity judgment, association, and object–name memory association. These eight task types were chosen because they were different enough in methodology to be considered in their own right. To illustrate, we felt that the closest pair was voice choice/sex assignment, since both require explicit biological sex judgments to be made about targets. However, voice choice tasks require the participant to imagine an object speaking, which might recruit thinking about language in a way that assigning biological sex alone might not (see Footnote 5).

We classified every experiment according to the six parameters previously described as being potentially important to the outcomes of research. We adopted the following approach to these classifications. Regarding language content, we asked whether language goes into the task (e.g., the stimuli are words), or comes out of the task (e.g., choosing between words such as “male” or “female,” thinking of adjectives in property judgment tasks). Where neither occurs, or where any role of language is judged to be highly orthogonal, that study is classified as low language content. The same approach was taken to gender/sex content (for a low classification, gender or sex information should not go into the task or be part of the response or the process leading to the response). The other parameters (age, language, number of gender categories, and target type) were readily classifiable at face value. Full details of item classifications can be found in SOM2.
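The classification scheme can be made concrete with a minimal sketch. The Python below is an illustration only: the class name, the field names, and the helper functions are hypothetical and are not taken from SOM2, and the example record describes an invented task rather than any study in the review. The under-18 cutoff for children follows the definition given earlier.

```python
from dataclasses import dataclass


def salience(goes_in: bool, comes_out: bool, orthogonal: bool = False) -> str:
    """Return "high" if the content enters or leaves the task, else "low".

    `orthogonal` flags cases in which any role of the content is judged to be
    highly orthogonal to the task; these are classified as "low".
    """
    if orthogonal:
        return "low"
    return "high" if (goes_in or comes_out) else "low"


def classify_age(age_in_years: float) -> str:
    """Children are participants under 18 years old; everyone else is an adult."""
    return "child" if age_in_years < 18 else "adult"


@dataclass(frozen=True)
class ExperimentClassification:
    """One data point classified on the six parameters (hypothetical names)."""
    gender_sex_salience: str           # "high" or "low"
    language_salience: str             # "high" or "low"
    tested_in_gendered_language: bool
    n_gender_categories: int           # 2 or 3
    animate_targets: bool
    age_group: str                     # "child" or "adult"


# Invented example: word stimuli go in, and a male/female choice comes out.
example = ExperimentClassification(
    gender_sex_salience=salience(goes_in=False, comes_out=True),
    language_salience=salience(goes_in=True, comes_out=False),
    tested_in_gendered_language=True,
    n_gender_categories=2,
    animate_targets=False,
    age_group=classify_age(25),
)
```

Treating each parameter as a separate field keeps one record per data point, which matters because the same group of participants can contribute more than one data point.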

Classifications of results

Given that the results of tasks are sometimes ambiguous in their support or otherwise of relativity, each individual study was classified in terms of one of three outcomes: support, mixed support, or no support. Studies classified as offering support showed an effect of grammatical gender on conceptualizations consistent with the hypothesis employed. A study was classified as offering no support if it yielded no evidence for relativity that could be classified even minimally as mixed support. Studies were classified as offering mixed support if they showed partial confirmation of an influence, such as an influence of one gender but not another (Bassetti, 2007), marginally significant effects (e.g., Semenuks et al., 2017), an influence in accuracy but not in response times (e.g., Bender et al., 2016a, 2016b), or an influence shown on one criterion but not another (e.g., Konishi, 1993). SOM2 lists these classifications for each experiment.

Adjustments for sample size

Given that the sample sizes varied widely from study to study and condition to condition (from 7 to 924), we report results taking sample size into account. This has the obvious benefit of weighting the pattern of results in favor of those studies that are most likely to be highest-powered and less susceptible to spurious effects. Given that the same participants sometimes performed multiple conditions or analyses, we also allowed multiple data points in the review from the same participants. For example, this review allowed for separate data points for inanimate and animate targets from the same group. In such cases, separate classifications are necessary to ascertain the effect of different parameters, such that the evidence from animate targets might be classified as offering support, but the results from inanimate targets be classified as no support. Full details can be found in SOM2.

This review adopts a methodological approach somewhere between a vote-count system (owing to the classifications of support, mixed support, or no support) and a statistical meta-analysis (owing to sample-size adjustments) (cf. Samuel, Roehr-Brackin, Pak, & Kim, 2018). Given the heterogeneity of their research designs, languages tested, and so on, occasionally only very small clusters of studies could be considered to be using the same methods, and often these were studies that came from the same labs and sampled the same type of linguistic population. We therefore considered this approach the better way to provide an overview of the multiple ways in which the field has investigated the research question, and how different designs may culminate in different outcomes.
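The difference between a plain vote count and the sample-size-adjusted count described above can be illustrated with a minimal sketch. The rows below are invented for illustration and are not data from the review; the function names `weighted_rates` and `unweighted_rates` are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical lines of data: (classification, number of samples).
# These values are invented for illustration, not taken from the review.
rows = [
    ("support", 30),
    ("mixed", 20),
    ("none", 50),
    ("support", 100),
]

def weighted_rates(rows):
    """Share of all samples falling under each classification, in percent."""
    totals = defaultdict(int)
    for label, n in rows:
        totals[label] += n
    grand_total = sum(totals.values())
    return {label: 100 * n / grand_total for label, n in totals.items()}

def unweighted_rates(rows):
    """Plain vote count: share of lines of data per classification, in percent."""
    counts = defaultdict(int)
    for label, _ in rows:
        counts[label] += 1
    return {label: 100 * c / len(rows) for label, c in counts.items()}

print(weighted_rates(rows))    # larger studies carry more weight
print(unweighted_rates(rows))  # every line of data counts equally
```

In this toy case the single large "support" line pulls the weighted rate above the unweighted one, which is the kind of shift the sample-size adjustment is meant to capture.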

What is not in the review

We excluded studies that looked not for relationships between targets and biological sex, but rather for relationships between objects of the same grammatical gender versus objects of different grammatical gender (e.g., Almutrafi, 2015, Exp. 2; Bobb & Mani, 2013; Boutonnet et al., 2012; Cubelli, Paolieri, Lotto, & Job, 2011; Kousta et al., 2008; Yorkston & De Mello, 2005). For example, it has been demonstrated that semantic category judgments about nouns belonging to the same grammatical gender are processed more quickly than nouns from different grammatical genders (Cubelli et al., 2011), and that grammatical gender information is processed in semantic similarity tasks even when it is irrelevant and undetectable by behavioral measures (Boutonnet et al., 2012). Although such studies offer support for the processing of grammatical gender information when it is apparently task-irrelevant (a useful prerequisite for tasks in this review), and even show that objects of the same grammatical gender are perceived to be more similar than objects that are not, this is not the same as demonstrating that objects are conceptualized as more masculine or feminine as a function of their gender assignment. Such results might be explained in terms of an effect of membership in the same grammatical category, independently of biological sex information (cf. Cook, 2016). Because this review is entirely concerned with the relationship between grammatical gender and biological sex, such studies, though clearly interesting in their own right, were omitted.

There have also been studies assessing how and when grammatical gender is processed in language production and comprehension, including in bilinguals for whom the grammatical gender for the same object might be opposite in their two languages (Costa, Kovacic, Fedorenko, & Caramazza, 2003; Costa, Kovacic, Franck, & Caramazza, 2003). Again, such studies are not designed to look at whether object conceptualization has the potential to be influenced by the biological sex connotation of their grammatical gender assignment, and they were therefore excluded.

The review also does not include one study that does pertain to grammatical gender and biological sex, but in which a meaningful understanding of the number of participants is not possible. This was the study by Segel and Boroditsky (2011), in which depictions of personifications and allegories in thousands of works of art were classified retrospectively in terms of their gender congruency. This was notionally classified as support.

Finally, the remit of the review excluded studies that did not involve human participants but investigated the question of relativity through connectionist models (e.g., Dilkina, McClelland, & Boroditsky, 2007; Sera et al., 2002) or through training in artificial “languages” (Eberhard, Heilman, & Scheutz, 2005; Phillips & Boroditsky, 2003, Exps. 4 and 5; Sera et al., 2002, Exp. 4) or nonsense words (Konishi, 1994; Vuksanovic, Bjekic, & Radivojevic, 2015). This is because we felt that we should limit our scope to behavior more clearly grounded in the experience of real people with real grammatical gender categories.

Results

Overall, the initial search revealed 99 individual pieces of research, with a further 13 added that were known to the authors but were not revealed by the formal search. After removing those items that did not provide empirical data, the review included 43 individual pieces of research, one of which was unpublished (Nicoladis, 2019), and three of which were doctoral or master’s theses (Almutrafi, 2015; Degani, 2007; Landor, 2014). The remaining pieces of research were published journal articles, conference proceedings (always of the Cognitive Science Society), or book chapters. After subdividing this research by task type and condition, these pieces of research resulted in 158 lines of data (split by differences in conditions within experiments), which together surveyed 5,895 participants in total.

As we described earlier, we then calculated the number of “samples,” which was 7,334. This number differs from the number of participants because it allows for the possibility that the same participant might have performed in multiple conditions. It is for this reason that the number of samples can be higher (but not lower) than the number of participants. We present all our results in the context of samples, rather than participants, in order to capture these important within-experiment differences.
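The distinction between participants and samples can be made concrete with a short sketch. It assumes, hypothetically, that each line of data records which participant group it came from; the group identifiers, conditions, and counts are invented and do not correspond to any study in the review.

```python
# Hypothetical lines of data: (participant_group_id, condition, n).
# One group may contribute several lines (e.g., animate and inanimate targets).
lines = [
    ("group_a", "animate", 20),
    ("group_a", "inanimate", 20),
    ("group_b", "inanimate", 15),
]

def count_samples(lines):
    """Every line of data contributes its full n, even if people repeat."""
    return sum(n for _, _, n in lines)

def count_participants(lines):
    """Each participant group is counted once, however many lines it feeds."""
    group_sizes = {}
    for group, _, n in lines:
        # Assumption: use the largest n seen for a group, in case a
        # condition lost some participants.
        group_sizes[group] = max(group_sizes.get(group, 0), n)
    return sum(group_sizes.values())

print(count_samples(lines), count_participants(lines))
```

Because a group is never counted more often as participants than as samples, the sample count in this scheme can equal or exceed the participant count but cannot fall below it.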

Overall results

Across the review as a whole, the results from 32% of all samples were classified as offering support for relativity, 24% were classified as offering mixed support, and 43% as offering no support. With the exception of one particularly large study (Montefinese, Ambrosini, & Roivainen, 2019, N = 924 and N = 105, total N = 1,029), there was no evidence that the results were driven by only a small cluster of highly powered studies (see Fig. 1). If this outlying study were removed, support would be at 38%, mixed support at 28%, and no support at 34%.

Fig. 1 Number of samples (vertical axis) for each research item or line of the review (horizontal axis). The mean and median samples are displayed in the text box. The outlier is Montefinese et al. (2019) with a property judgment task (N = 924 Italian speakers).

In what follows, we first describe the results by task type. We then describe results as a function of task parameters. A full at-a-glance view of all the results by task type and parameter can be found in Figs. 2 and 3.

Fig. 2 Classifications of support, mixed support, and no support by task type. The total number of samples is shown on the y-axis.

Fig. 3 Classifications of support, mixed support, and no support according to the task parameters. The total number of samples is shown on the y-axis.

Results by task type

Properties judgment

Full results for the properties judgment task are displayed in Table 1. Properties judgments made up 37% of all samples in the review, making this the most commonly performed task in the literature (i.e., it comes with the highest number of samples, rather than the highest number of uses in the literature). Only 3% of samples were classified as offering support (Flaherty, 2001; Imai et al., 2014; Saalbach et al., 2012). A further 23% were classified as providing mixed support. The reasons were as follows: results that were more consistent with grammatical gender in one group than in another (which spoke a different language), but apparently not more so than chance (Haertlé, 2017); results limited to one property but not another, despite evidence that both were linked to biological sex (Konishi, 1993); evidence to suggest an effect of the grammatical gender of a language the participants did not speak, with no direct comparison of this effect with the language they did speak (Sedlmeier, Tipandjan, & Jänchen, 2016); and effects limited to second- and third-choice, but not first-choice, adjectives (Semenuks et al., 2017). The remaining 75% of samples offered cases of no support at all (Flaherty, 2001; Imai et al., 2014; Landor, 2014; Mickan, Schiefke, & Stefanowitsch, 2014; Montefinese et al., 2019; Semenuks et al., 2017). It should be noted that the study by Montefinese et al. represents an extreme outlier. The results of this study were classified as no support. However, even if the results of this study were removed, the rate of support for properties judgment tasks would only rise to 4%. All properties judgment tasks were intrinsically high in language content, and all have so far been conducted with adult participants. Given the almost floor-level overall rate of support, an examination of the effect of different factor parameters was not conducted.

Table 1 Numbers of samples according to parameter setting and results classification, for the properties task

Voice choice

The full results for the voice choice task are displayed in Table 2. The voice choice paradigm, although the most common experimental task for researchers, is the second most commonly performed by participants in the field, accounting for 28% of all samples. Of these, 64% were classified as offering support (Almutrafi, 2015; Athanasopoulos & Boutonnet, 2016; Beller et al., 2015; Bender et al., 2016a; Haertlé, 2017; Kurinski et al., 2016; Lambelet, 2016; Ramos & Roberson, 2011; Sera et al., 1994; Sera et al., 2002; Vernich, 2017; Vernich, Argus, & Kamandulytė-Merfeldienė, 2017). An additional 22% were classified as providing mixed support; these included results consistent with the hypothesis but limited to one of two genders (Bassetti, 2007), effects for limited subsets of targets (Beller et al., 2015; Bender et al., 2016a), and statistically marginal results (Bender et al., 2018). Mixed results also included cases in which the data suggested that voice choices were not more consistent with grammatical gender than chance levels (Forbes, Poulin-Dubois, Rivero, & Sera, 2008; Sera et al., 2002), and effects limited to native speakers but not learners (including advanced learners; see Footnote 6) of the same language (Kurinski & Sera, 2011). A minority of samples offered no support at all (12%) (Bassetti, 2007; Bender et al., 2018; Forbes et al., 2008; Sera et al., 1994; Sera et al., 2002).

Table 2 Numbers of samples according to parameter setting and results classification, for the voice choice task

All voice choice tasks were classified as having a high gender content and high language content. A greater rate of support for relativity was found when participants spoke a language with two genders (69%) instead of three (24%). There was also more support from adult samples (69%) than from children (39%). To illustrate these comparisons, the majority of no-support cases came from participants who were 5- to 6-year-olds (Sera et al., 1994; Sera et al., 2002), 9-year-olds (Bassetti, 2007), or adult speakers of German, a three-gendered language (Bender et al., 2018). In contrast, there was no evidence that voice choices were more consistent with grammatical gender when applied to animate (38%) than to inanimate objects (68%), and a lower rate of support was found when assigning voices in one’s gendered language (56%) relative to one’s ungendered language (73%). Some of these results might be skewed by the relative dearth of child samples, samples who performed the task in a low gendered-language context, with a three-gendered language background, or with animate targets.

Sex assignment

Full results for the sex assignment task are displayed in Table 3. Sex assignment tasks are almost identical to voice choice tasks, in that they are explicit assignments of male or female sex to animate and inanimate objects. Largely thanks to the study by Belacchi and Cubelli (2012), representing 412 samples (46% of all samples with this paradigm), sex assignment makes up 12% of the samples in the review.

Table 3 Numbers of samples according to parameter setting and results classification, for the sex assignment task

Overall, 66% of samples were classified as offering support (Belacchi & Cubelli, 2012; Flaherty, 2001; Pavlidou & Alvanoudi, 2018; Sera et al., 1994), 26% as mixed support (Bender et al., 2016b; Nicoladis, 2019; Nicoladis et al., 2016; Nicoladis & Foursha-Stevenson, 2012; Pavlidou & Alvanoudi, 2013), and 8% as no support (Flaherty, 2001; Nicoladis & Foursha-Stevenson, 2012).

Sex assignment tasks are intrinsically classified as high in gender/sex content. Although they need not also have a high language content, they were judged high for every study in this review. For animate targets, the rate of support was 100%, higher than for inanimate targets (33%). The rate of support was slightly higher in children (69%) than in adults (62%). The rate of support was 75% when participants performed in their gendered language, but zero in an ungendered language. Finally, the rate of support from two-gendered languages was high (83%), but from three-gendered languages it was low (23%). Again, some of these comparisons (with the probable exception of age) might be skewed by imbalances in the number of samples that were classified as performing under each parameter setting.

EAST

The full results for the EAST are displayed in Table 4. The EAST is a unique case in this review, because it has only been used by one core group of researchers (Bender et al., 2016a, 2016b, 2018), and only ever with adult speakers of German, which is a three-gender language. It comprises 9% of the samples. There is a good case that it might constitute a priming task, but given its singularity we felt it was best considered as a task category in its own right. Overall, only 11% of the samples were classified as offering support (Bender et al., 2016a, 2018), 56% were classified as offering mixed support (Bender et al., 2016a, 2016b, 2018), and 33% as offering no support (Bender et al., 2016a, 2018).

Table 4 Numbers of samples according to parameter setting and results classification, for the EAST

Variation in the results by parameter should be interpreted in the context of the low overall rate of support. The EAST uses a two-key response method, with one key clearly mapped to “male” and one to “female,” so gender content is always high; since the stimuli are always words, language content is also high. In fact, the only possible parameter comparison that can be made with the EAST concerns inanimate and animate targets, which showed similar levels of support (10% vs. 11%, respectively).

Priming

The full results for priming tasks are displayed in Table 5. There are only five different pieces of research with priming experiments in the literature, comprising 8% of the samples in the review. Care must be taken when attempting to interpret parameter patterns from only a handful of studies, where settings can be entirely confounded with individual articles. Overall, priming offered a 34% support rate (Bender et al., 2011; Sato & Athanasopoulos, 2018) and a 66% no-support rate (Bender et al., 2011; Degani, 2007; Mickan et al., 2014; Samuel et al., 2016). There were no cases of mixed classifications. Given the small overall numbers of support (198 samples in total), coming from only two articles, comparisons were unlikely to reveal any reliable patterns.

Table 5 Numbers of samples according to parameter setting and results classification, for the priming tasks

Similarity judgment

The full results for similarity tasks are displayed in Table 6. Similarity judgment tasks comprised only 3% of all samples. The pattern of results, indicating 44% support (Phillips & Boroditsky, 2003), 45% mixed support (Sedlmeier et al., 2016), and 11% no support (Degani, 2007), is entirely confounded with the three individual pieces of research to use the task. The small number of samples with this task makes it difficult to draw meaningful conclusions as to what might lead to the differences in results.

Table 6 Numbers of samples according to parameter setting and results classification, for the similarity tasks

Association

The full results for association tasks are displayed in Table 7. Only two pieces of research have employed an association paradigm (Bender et al., 2018; Martinez & Shatz, 1996), which together comprise only 2% of the samples in the review. All the studies with this paradigm were classified as offering no support. Gender content and language content were always high, and participants were always tested on inanimate targets and in a gendered-language context.

Table 7 Numbers of samples according to parameter setting and results classification, for the association tasks

Object–name memory association

The full results for object–name association tasks are displayed in Table 8. Comprising just under 2% of samples in the review, object–name memory association paradigms form the smallest task-type classification in this review. The task has been used in three separate pieces of research. Of the samples, 36% came under support (Boroditsky & Schmidt, 2000), 31% mixed support (Kaushanskaya & Smith, 2016), and 33% no support (Pavlidou & Alvanoudi, 2013). Note that these differences are split entirely by publication. All these studies were classified as having a high gender content and high language content. All were conducted with adult participants, and all included inanimate targets.

Table 8 Numbers of samples according to parameter setting and results classification, for the object–name association tasks

Results by task parameters

The distribution of samples displayed in Fig. 3 points to a number of imbalances in the literature to date. The samples in this review were typically involved in experiments with a high language content (98%). The samples were usually adults (87%), performing in their gendered language (83%), with inanimate targets (76%), and with high sex/gender salience in the task (62%). Samples also usually spoke a language with two gender categories (57%). In other words, the average experiment incorporated five out of the six parameter settings that are usually considered most conducive to results in support of relativity (inanimate targets being the exception).

Changes in the rate of support as a function of task parameters are displayed in Table 9. In the following sections we describe comparisons where it was possible to isolate one category from another. A study that puts speakers of two-gendered and three-gendered languages into the same group, for example, was excluded entirely rather than added to both categories.
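The exclusion rule just described (dropping a line of data when its sample cannot be assigned to a single setting of the parameter) might be sketched as follows. The field names `genders`, `label`, and `n`, the function `support_rate_by_setting`, and the rows themselves are assumptions made for illustration, not the coding scheme used in SOM2.

```python
# Hypothetical lines of data. "genders" holds the set of gender-system sizes
# represented in the sample; a set with two entries marks a mixed group.
rows = [
    {"genders": {2}, "label": "support", "n": 40},
    {"genders": {3}, "label": "none", "n": 30},
    {"genders": {2, 3}, "label": "support", "n": 25},  # mixed group
    {"genders": {2}, "label": "none", "n": 10},
]

def support_rate_by_setting(rows, parameter):
    """Sample-weighted support rate (%) per setting of one parameter.

    Rows whose samples span more than one setting are skipped entirely,
    rather than being added to every setting they touch.
    """
    totals, supported = {}, {}
    for row in rows:
        settings = row[parameter]
        if len(settings) != 1:
            continue  # cannot isolate one category from another
        (setting,) = settings
        totals[setting] = totals.get(setting, 0) + row["n"]
        if row["label"] == "support":
            supported[setting] = supported.get(setting, 0) + row["n"]
    return {s: 100 * supported.get(s, 0) / totals[s] for s in totals}

print(support_rate_by_setting(rows, "genders"))
```

Skipping mixed rows keeps each category clean at the cost of discarding some samples, which is one reason the per-parameter totals can be smaller than the review-wide total.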

Table 9 Summary of shifts in the rate of support as a function of task parameters

Gender/sex content

Consistent with previous views of the literature, as well as with a statistical association account of grammatical gender effects, studies with a high gender/sex content showed a higher rate of support (51%) than did studies with low gender/sex content (2%). The voice choice, sex assignment, EAST, association, and object–name association task types were always classified as high in gender/sex. The samples classified as low came almost entirely from the 2,689 (96%) who performed property judgment tasks, which came with only a 3% support rate. However, only 98 samples from this paradigm were classified as high, rendering any more detailed comparisons unreliable.

Language content

Almost all of the research was classified as high in language content (98%). The almost complete absence of research classified as low in language content makes any attempt to draw conclusions about this parameter liable to mislead. Only the priming and similarity studies by Sato and Athanasopoulos (2018) and Phillips and Boroditsky (2003), respectively—both of which were classified as support—were classified as low in language content. We return to this issue in our Discussion.

Gendered versus ungendered language

A slightly lower rate of support was found for studies performed in a gendered language (29%) than in an ungendered language (32%). This is not consistent with a statistical association account. When we compare performance in gendered and ungendered languages at the within-task level, we see that support is higher in a gendered language context than in an ungendered language context in the sex assignment task (75% vs. 0%) and the properties task, albeit in the latter case with very low rates (3% vs. 0%). Support is slightly higher in the ungendered language for voice choice (73% vs. 56%) and priming tasks (44% vs. 31%). It is also higher for similarity tasks (100% vs. 0%) and object–name tasks (54% vs. 0%). Although 83% of all samples performed in a gendered language context, meaning that comparisons were based on imbalanced sample sizes, the pattern of results within tasks suggests that there is no clear support for the hypothesis that grammatical gender is more likely to influence thought when participants perform in a gendered language.

Two-gender versus three-gender languages

The review showed higher rates of support from studies with two-gender languages (43%) than with three-gender languages (16%). This outcome is consistent with a statistical association account. Broken down by task type, support from two-gendered languages over three-gendered languages came from the sex assignment tasks (83% vs. 23%), voice choice tasks (69% vs. 24%), similarity tasks (52% vs. 27%), priming tasks (46% vs. 32%), and object–name association tasks (42% vs. 30%). Only properties tasks (3% vs. 5%) reversed this pattern, albeit with negligible support rates in each category.

Animate versus inanimate targets

Consistent with a statistical association account, studies with animate targets showed a higher rate of support (50%) than did studies with inanimate targets (27%). Broken down by task, this pattern was true of sex assignment tasks (100% vs. 33%), priming tasks (78% vs. 8%), and properties tasks (10% vs. 0%). The results from the EAST were almost matched (11% vs. 10%). The reverse pattern was found for voice choice tasks (38% vs. 68%). Note that the great bulk of the positive results from inanimate targets comes from the voice choice task (90% of samples).

Adults versus children

In apparent contrast with some views expressed in the literature, research with children (55%) revealed a higher rate of support than research with adults (29%). However, this result is strongly weighted by task type. Overall, 87% of the samples came from adult participants, and the great bulk of the data from children came from voice choice and sex assignment tasks (92%). Given that the tasks that children performed provided the highest rates of support, and those performed almost exclusively by adults provided the lowest (e.g., properties judgments), it is difficult to know whether this outcome is the result of an age-related difference or a task-related difference. Looking at age-related performance at the level of the individual task, there is some evidence that adults show a greater influence of grammatical gender than children do; we see that the rate of support is 30 percentage points higher in adults than in children in voice choice tasks (69% vs. 39%), although it is 7 percentage points lower in adults in sex assignment tasks (62% vs. 69%).

Other patterns

A large share of all the data in the review (40%) came from the voice choice and sex assignment tasks, the two paradigms that make the clearest demands on participants to consider targets in terms of biological sex. Since the potential for the strategic use of grammatical gender under such circumstances has been one of the most frequent issues brought up in the literature, we compared the rate of support from the review as a whole with the results with these two paradigms included or excluded. Overall support across the review drops from 32% to only 11% in their absence, and no support rises from 43% to 64%. In other words, when all the data are included, approximately one in three samples in the review provides support; when the data from voice choice and sex assignment are excluded, this rate drops to roughly one in ten.
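A comparison of this kind, with selected paradigms included or excluded, amounts to recomputing the sample-weighted support rate on a subset of the data. The sketch below uses invented rows and a hypothetical `support_rate` helper; it shows the mechanics only and does not reproduce the figures reported above.

```python
# Hypothetical lines of data: (task_type, classification, n).
rows = [
    ("voice choice", "support", 60),
    ("sex assignment", "support", 20),
    ("properties judgment", "none", 90),
    ("priming", "support", 10),
    ("priming", "none", 20),
]

def support_rate(rows, excluded_tasks=()):
    """Sample-weighted support rate (%) after dropping the given task types."""
    kept = [(label, n) for task, label, n in rows if task not in excluded_tasks]
    total = sum(n for _, n in kept)
    if total == 0:
        raise ValueError("no samples left after exclusion")
    supported = sum(n for label, n in kept if label == "support")
    return 100 * supported / total

overall = support_rate(rows)
without_explicit = support_rate(rows, ("voice choice", "sex assignment"))
print(round(overall, 1), round(without_explicit, 1))
```

The same helper could be pointed at a single outlying study instead of a task type, provided each row also carried a study identifier.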

Discussion

At its broadest, the review shows that the evidence for an influence of grammatical gender on conceptualizations is highly task- and context-dependent. We found that the voice choice and sex assignment tasks formed the backbone of support for relativity; when they were removed, the support rate dropped to only 11%. With them included, about a third (32%) of the data were classified as support, relative to a no-support rate of 43% and a mixed-support rate of 24%. If we consider the possibility that publication biases mean that fewer null results make it to publication, it may be that even this support rate is an overestimate.

The review provides support for a number of important constraints on the relativity hypothesis. For example, the rate of support is higher when the gender content of a task is high than when it is low, suggesting any influence might be at least partly contingent on the opportunity to strategically call upon grammatical gender. Results are also more likely to be classified as support when participants are processing animate rather than inanimate targets, which also suggests that any influence of language might be partly contingent on the immediacy of the overlap between grammatical gender and biological sex. This finding argues against a singular, uniform effect of gender category on all its members. Finally, results were more likely to offer support when the gendered language has two gender categories rather than three, a finding that is inconsistent with a straightforward account of grammatical gender classification per se influencing the conceptualizations of objects. The review initially appeared to reveal one misconception concerning age: there was actually a higher rate of support from samples of children than adults. Upon closer inspection, this outcome was closely bound to the fact that children performed those tasks that most consistently produced positive results for relativity, namely voice choice and sex assignment. However, not all the predicted biases were supported. We found no evidence that support was more common when participants performed a task in a gendered-language context, which is what is predicted by thinking-for-speaking accounts. The only parameter that could not be meaningfully assessed at all concerned the salience of language in the task, an issue that we return to later in this Discussion.

What these parameter-related comparisons reveal is that much of the positive evidence for relativity comes from tasks and conditions that are particularly susceptible to alternative, strategy-based explanations. Overall, we therefore take the results of this review as imposing quite powerful constraints on the relativity hypothesis as seen through the lens of grammatical gender. Nevertheless, this conclusion itself comes with a caveat; we feel the review also points to a significant weakness in much of the relevant research’s ability to speak to the issue of relativity with clarity. This means that future research could either cement this rather negative conclusion, or overturn it through stronger designs that are less susceptible to confounds. As a result, we suggest that our review provides an interim rather than final pattern of results.

We focus our discussion on those areas that we feel need addressing in future work, and make suggestions as to some ways for experiments to deal with them. First, we describe why we feel much of the data speak only weakly to the question of relativity.

How do people solve the tasks?

Some have held that for relativity to be supported there should be a reasonable expectation that participants did not engage grammatical gender information strategically. This is most clearly an option in tasks in which judgments are about sex and are explicit, such as voice choice and sex assignment, but possibly also for similarity, association, and object–name memory associations. It is therefore important to distinguish between two means of arriving at decisions in many of the tasks in this review: one as a result of conceptual change, as usually hypothesized in relativity accounts, and another through metalinguistic knowledge. Metalinguistic knowledge refers to the influence of the knowledge of a formal property of language, such as grammatical gender, on judgments about objects. Although both processes are interesting, it is relativity that the research in this review was designed to investigate. The problem is that the voice choice and sex assignment tasks upon which the bulk of the support for relativity apparently rests cannot tell the two apart.

It might be argued that researchers can simply ask participants how they performed the task, and therefore be in a position to rule out a metalinguistic strategy as a result. Interestingly, the cases in which participants have been asked how they came to their decisions have thrown up mixed results (e.g., Almutrafi, 2015; Kurinski & Sera, 2011; Sato & Athanasopoulos, 2018). Particularly convincing evidence for a conscious metalinguistic strategy account of voice choice performance comes from a study in which 25 out of 30 participants later admitted to using grammatical gender to guide their responses (Almutrafi, 2015). However, although it has often been assumed to be so, there is also no a priori reason that the use of metalinguistic strategy need be a conscious process at all. Regardless of how it might occur, the use of metalinguistic knowledge undermines a reading of results as the outcome of conceptual change.

Some researchers have pointed out that participants rarely if ever respond in a manner that is 100% consistent with biological sex. This seems to argue against a metalinguistic strategy (conscious or otherwise). However, we cannot know whether participants had more than one strategy, some more universal (such as masculine for artifacts, feminine for natural kinds: Mullen, 1990), and some more idiosyncratic or personal (see for example participants’ justifications for their choices in Kurinski & Sera, 2011), with different strategies being brought to bear at different times throughout the task. The absence of a consistent, 100% effect of grammatical gender on task performance therefore does not preclude the possibility that metalinguistic strategies might account for some of the effects that were found.

Is there evidence to support one or the other account from the results of the review? Here again we run into the logical problem that we cannot tell processes apart. For example, the finding that more support came from speakers of two-gendered languages than from speakers of three-gendered languages might be seen as weakening metalinguistic strategy accounts, because there is no reason to believe Spanish speakers should prefer such strategies over German speakers, for example. However, the availability or attractiveness of metalinguistic strategies might, on the other hand, be enhanced where there is a neater one-to-one mapping of grammatical gender category and biological sex.

Another potential criticism of a metalinguistic rather than a relativity account is that in taking the former view, one subscribes to an intrinsically negative, prejudicial default that effectively renders relativity empirically impossible to support. Essentially, giving a metalinguistic alternative explanation equal weight might set the bar for actual conceptual change accounts impossibly high. However, the scope exists to tighten experimental design to guard against alternative, “killjoy” explanations, although this review suggests that, for the most part, when these measures are in place, the likelihood of finding a positive result declines.

In our view, the most convincing evidence of the potential for metalinguistic strategy accounts comes from the results of property judgment tasks. Since this task type does not incorporate a sex/gender prompt, it keeps such information at arm’s length. The very low rate of only 3% support from this task type might therefore reflect the absence of such strategies in performance.

Overall, the voice choice and sex assignment paradigms are the most susceptible to alternative explanations, and further research using these tasks without serious modification is unlikely to reveal more about relativity. Since these tasks are the most common in the field, and similar objections can be raised against other task types such as similarity, object–name memory associations, and associations, the pool of information from which we can make the most meaningful inferences about the research question is likely to be small. It is for this reason that we feel the case for grammatical gender influencing concepts is currently difficult to weigh up and awaits future research (see also the Practical suggestions for future research section below).

Language on language or language on concepts?

The process of judging whether a task incorporates language is a difficult one. For example, in a voice choice or sex assignment task participants are sometimes only required to produce a single word: “male,” “female,” “boy,” and so forth. They might only need to circle a letter M or F on a sheet of paper. Does this constitute a language process that might inadvertently recruit grammatical gender itself? What of the role of the instructions of the task, which are linguistic, in formulating a linguistic process to arrive at a response? For almost all the tasks in the field, even language processing of the more conspicuous kind is unavoidable, such as when making judgments based on linguistic stimuli in the EAST, or thinking of adjectives to describe pictures in properties judgments. For some, linguistic relativity research using solely behavioral measures (response times and related patterns) is always susceptible to linguistic processes (e.g., Gleitman & Papafragou, 2013), and it is for this reason that some now advocate primarily neurophysiological approaches (Thierry, 2016).

The argument that language needs to be controlled in relativity research is usually attributed to the thinking-for-speaking account (Slobin, 1996), which was originally based on the idea that languages require speakers to attend to certain aspects of a scene, such as temporal and spatial details, depending on what information their language requires (see also Slobin, 2003). Slobin later also conceived of “thinking for comprehending,” in which the languages we speak also influence the way that we think about what we comprehend (Slobin, 2003). Such a view could mean that tasks that present participants with words will also be subject to the restrictions of thinking for speaking, as might tasks that use words about gender or sex in their instructions. This would likely encompass almost all the tasks in this review.

In the tasks in this review, participants did not need to produce the actual nouns for the target items themselves in their response. There are some data from the review that we can bring to bear on this question, though they are not conclusive. The thinking-for-speaking theory predicts that effects of language on thought might not extend to performance outside of the language in which such effects are sourced. Translated into grammatical gender research, effects should therefore be stronger when performing in a gendered than in a nongendered language context. This was not the case, though only by a very subtle margin (29% to 32%). However, given that 83% of the samples performed in a gendered language context, it is also possible that further data from research employing a nongendered context might lead to a change in this outcome, either in favor of or against thinking for speaking.

It is difficult to know where to draw the line between high and low language content. We classified all but two articles (Phillips & Boroditsky, 2003; Sato & Athanasopoulos, 2018), which involved only 133 samples in all, as being high in language content. It could therefore be argued that almost the entirety of the research in the review could have been testing for an influence of language on language. We ourselves do not make this claim; this is almost certainly too strong a conclusion to draw, given the heterogeneity of task designs. In its broadest sense, the philosophical debate around the involvement of language in behavior is beyond the remit of this review. More practically, however, we believe it difficult to argue that the tasks described in this review vary enough in their language content to allow for meaningful comparisons along this dimension.

Practical suggestions for future research

We divide our suggestions for future research into two sections, in order of importance. First, we point out that the results of the review support the possibility that there might be a fundamental flaw in the use of grammatical gender as a tool to speak to the question of relativity at all. Second, we make the case that if grammatical gender can provide an insight, then future tasks would benefit from an overhaul in order to better control for alternative explanations.

Returning to the question of what is “gendered” about grammatical gender?

The results of the review make the case that effects of grammatical gender are for the most part predicted by parameter settings that would be consistent with a statistical association account, at least as well as by a conceptual-change view of relativity. This is because most settings that promote the association between biological sex and grammatical gender enhance the probability that effects will be found.

That relativity is scaffolded by associations between language and thought, rather than language as thought, is a view that is partially consistent with contemporary thought in the field, such as the label-feedback hypothesis (Lupyan, 2012) and its offshoot the structural feedback hypothesis (Sato & Athanasopoulos, 2018). These theories contend that labels or grammatical information direct attention to associated features, which in turn feed back down to lower-level processes in a feedback loop. These effects can be upregulated or downregulated by the salience of the relevant linguistic information in the task. The results of the review, as well as results from studies in which participants are briefly trained in invented or real languages and come to behave in line with those languages (e.g., Boroditsky, 2001; Casasanto, 2010; Phillips & Boroditsky, 2003), suggest that this is indeed the case. However, a statistical association account departs from these feedback accounts in that they allow for some degree of change at the conceptual level, whereas a statistical association account does not. A statistical association account would predict that if an Italian speaker is processing the target “bed” in the context of gender/sex, for example, then the concept of masculinity might receive activation by an association rather than by any lasting conceptual rub-off. This would be similar, for example, to the statistical association between the concepts of sunshine and ice cream; we would be unlikely to conceive of sunshine as being similar to ice cream. Put simply, a statistical association account need not require conceptual change, especially long-lived change, to occur at all, and would therefore be incompatible with the spirit of relativity in any of its theoretical incarnations.

As we stated in our introduction, this review is not in a position to make such a distinction between relativity and its alternatives, in part because more data are required, but also because our review did not find enough unambiguous support for relativity to discriminate between the possibilities. For example, it is difficult to assess positive results from voice choice and sex assignment in the light of the hierarchical taxonomy of relativity accounts by Wolff and Holmes (2011), which ranges from the strongest form of relativity (“thought is language”) through to subtler effects, such as “language as spotlight,” when a nonrelativistic account is at least as likely an explanation for the results from such tasks. Instead, our review is in a position to weigh up the size of the problem in relating much grammatical gender research to relativity at all, in any of its forms. Of the five principal parameter settings that a statistical association account would predict, one (language content) was impossible to draw meaningful inferences from; three (target type, number of gender categories, and salience of sex/gender) resulted in higher overall rates of support; and only one (gendered language context) was equivocal. It is perhaps important to note that this last parameter did not run powerfully in the opposite direction to what a statistical association account would predict; there was only a −3% support rate difference, far smaller than the next smallest difference of +23%, which was instead in favor of the account. The sixth parameter, age, was also equivocal, but was less important for the account in any case. We therefore interpret these results as framing and underlining the case for a statistical association account, by which arbitrary labels for grammatical classes interact with arbitrary assignments of nouns to those classes, under conditions that facilitate their association. This is not to imply that we prefer such an account, or to rule out the possibility that multiple factors might have simultaneous and additive effects. It does imply, however, that grammatical gender is presently a foggier lens through which to inspect the case for relativity than the domains of categorical perception, space, or time, to name a few.

As a first step, it would be useful to establish whether “masculine” and “feminine” are psychologically privileged “attractor” concepts that impose their status on other objects in their grammatical class, or whether these metalinguistic labels are themselves arbitrary. This is important because it would help researchers to understand whether the idea of a relationship between grammatical “gender” and biological sex has any psychological reality. If it does, then it becomes more likely that the members of a class are in some sense imbued with this conceptual relationship, and the case for conceptual-change accounts would be enhanced.

There is already a study that suggests a method by which to test this. In one experiment, not included in this review because it involved invented languages, native English speakers were taught “Gumbuzi,” an artificial language with two artificial grammatical “gender” groups, labeled “soupative” and “oosative” (Phillips & Boroditsky, 2003). Participants were taught ten items in each group, six of which were inanimate objects, while the remaining four were humans who were either all female or all male. After learning which items were assigned to which category, participants rated the similarity of human–object pairs both within and across the two groups. The results showed that pairs from the same group were rated as being more similar than pairs from different groups, leading the authors to conclude that there can be a causative (i.e., learned) relationship between grammatical gender and people’s conceptualizations of objects.

Since 40% of all the items in a group were humans of the same biological sex, the groups were strongly biased toward sex/gender, regardless of the labels “soupative” and “oosative.” It is also likely that the participants were aware of such things as grammatical gender categories through formal second language instruction in schools, knowledge of which they could have applied to the task. Additionally, the labels “oosative” and “soupative” fail to actually describe any of the items within each group; real grammatical gender categories are at least partially correlated with the biological sex of their members. Nevertheless, this study provides a template for a future study that might teach one group of participants that groups are called masculine and feminine, and another group that the groups are named after another and equally represented natural kind. If the arbitrary labeling of the classifiers themselves drives performance, then the results of similarity ratings should pattern in line with other labels at least as much as they would with masculine and feminine. This would suggest that any influence of grammatical gender is a human-made one that is independent of linguistic structure and lacking in psychological reality, undermining the notion of a conceptual relationship between grammatical gender and biological sex, and in turn favoring a statistical association relationship.

Dulling Occam’s razor

If grammatical gender is not merely a cultural label, then we follow Ramos and Roberson (2011) and others who suggest that studies be conducted that aim to restrict both gender and language to as oblique a role as possible. Property judgment tasks do the former very well, but language remains fundamental, and in any case the evidence from these tasks is to date overwhelmingly negative. Instead, the priming tasks by Sato and Athanasopoulos (2018) would seem strong candidates for future investigations. In their first experiment, French–English bilinguals were found to be slower to indicate whether two objects were associated with a male or female face when the grammatical gender of those objects was incongruent with the biological sex of the person. In a second experiment, participants matched one of two trait words (e.g., “charming,” “realistic”) to a now genderless face after being primed with a pair of objects, such as a tie and a spade. The results again pointed to an influence of the grammatical gender of the objects on French–English bilinguals’ choices. These studies, while not eliminating biological sex and language altogether, keep some distance between these and participants’ actual responses, because associations come from the task-irrelevant grammatical genders of objects that participants were presented with earlier. Future work might find a way of making this distance greater still, and include direct statistical comparisons between speakers of a gendered language and speakers of a nongendered language in order to establish more clearly that any effects are attributable to grammatical gender specifically (see Footnote 7).

Limitations

This review represents, to the best of our knowledge, the first systematic attempt to assess the literature in a quantitative manner. The heterogeneity of methods and their uneven representation in the literature presented us with a difficult decision: to group together research of different types, or to provide a finer-grained picture. We took the view that it was better in a first review to provide a nuanced picture that takes into account differences between, for example, instructing participants to assign a voice to an object or a sex to an object. This does have the drawback of making it harder to make reliable inferences based on less well-used tasks, in particular object–name memory associations, association tasks, and similarity tasks. On the other hand, it makes it easier for later work to be incorporated into future reviews.

A more difficult and subjective issue concerns the interpretation of results classified as offering mixed support. The argument for the inclusion of this category is to our minds quite compelling. If we take, for example, the finding that training in Spanish improves the rate at which voice choices are consistent with grammatical gender, but nevertheless fails to raise this rate above chance (Kurinski & Sera, 2011), neither an entirely cautious approach (i.e., this result finds no support for relativity) nor an endorsement (voice choices are consistent with grammatical gender) naturally follows, and the need for a third category becomes clear. To present as objective a view as possible, we have for the most part focused on the rate of support in the first instance, the rate of no support second, and mixed support only where it is necessary, such as in conditions in which there are few data on either side of this middle category.

Conclusion

In conclusion, our review showed that support for an influence of grammatical gender on concepts is strongly task- and context-dependent. Support also comes for the most part from tasks that are susceptible to clear alternative explanations. Perhaps most importantly, it needs to be empirically established that grammatical gender itself is not a cultural label but a concept with psychological reality before any influence can be reasonably attributed to truly linguistic processes.