Introduction

Grammatical gender is an “inherent property of nouns which controls morphologically marked agreement relations between different syntactic elements” (Bußmann, 2002, p. 247, transl.). Thus, “at least one other part of speech (determiner, adjective, pronoun) carries corresponding morphological features” (Bußmann, 2002, p. 247, transl.). The existence of grammatical gender as well as its function and formal marking and the number of gender specifications vary between languages (e.g., Corbett, 2014).

German differentiates between three gender specifications—masculine, feminine, and neuter. In nominative singular, the definite determiners der, die, and das and the anaphoric personal pronouns er, sie, and es are associated with these gender specifications. Table 1 displays the inflection paradigm of the German definite determiners and anaphoric personal pronouns.

Table 1 Inflection paradigm of the German definite determiners and anaphoric personal pronouns

As can be deduced from Table 1, morphological markers within the inflection paradigm differ as a function of gender specification. This may apply to the nouns themselves as well as to dependent parts of speech.

Representation of Grammatical Gender in Psycholinguistic Models

According to psycholinguistic models such as the discrete two-step speech production model of Levelt and colleagues (e.g., Jescheniak & Levelt, 1994; Levelt, 1989, 1999, 2001), gender information in the mental lexicon is stored at a modality-independent lemma level. There, each gender specification is represented by one central gender node, which is connected to all nouns of this gender specification (e.g., Jescheniak & Levelt, 1994; cf. Fig. 1a). The number of gender nodes equals the number of gender types in a given language; for German, three gender nodes are therefore assumed.

Fig. 1 Gender representation within the discrete two-step speech production model: (a) with categorial gender representation (e.g., Jescheniak & Levelt, 1994), (b) with decomposed gender representation (e.g., Penke et al., 2004; Opitz et al., 2013), and (c) with decomposed and underspecified gender representation (Opitz & Pechmann, 2016; model adapted from Jescheniak & Levelt, 1994, p. 826 and Opitz & Pechmann, 2016, p. 236)

As the classical discrete two-step model implies that gender representations are equivalent to the three gender specifications, the speed of gender access and processing should not differ between masculine, feminine, and neuter words, provided that other semantic, lexical, and morphosyntactic features are held comparable across these specifications.

There are, however, a number of studies suggesting that nouns within the mental lexicon are connected not to abstract gender nodes but to feature nodes that represent gender in a decomposed way, as illustrated in Fig. 1b.

This assumption is based on theoretical linguistic frameworks like Distributed Morphology (e.g., Halle & Marantz, 1994) or Minimalist Morphology (e.g., Wunderlich, 1996), which propose that morphosyntactic specifications are composed of abstract binary features which have either a positive (marked) or a negative (unmarked) value. In such accounts, German gender specifications are supposed to be realised via the features masculine [+ / − m] and feminine [+ / − f]. Specifically, masculine and feminine gender specifications are marked in a complementary way, while neuter forms are unmarked, i.e., masculine = [+ m, − f], feminine = [− m, + f], and neuter = [− m, − f] (e.g., Bierwisch, 1967, p. 248). Other morphosyntactic categories, too, can be decomposed in this way. Thus, grammatical number can be specified as singular = [− pl] and plural = [+ pl] (e.g., Bierwisch, 1967, p. 248) and case as nominative = [− obj(ect), − obl(igatory)], accusative = [+ obj, − obl], dative = [+ obj, + obl], and genitive = [− obj, + obl] (e.g., Opitz et al., 2013, p. 235).
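The decomposition just described can be written down as a small data structure. The following Python sketch is only an illustration: the dictionary layout and the helper names `positive_features` and `describe` are assumptions made here, while the feature values themselves are the ones cited in the text.

```python
# Binary feature bundles as cited in the text (True = "+", False = "−").
GENDER = {
    "masculine": {"m": True, "f": False},
    "feminine": {"m": False, "f": True},
    "neuter": {"m": False, "f": False},
}
NUMBER = {"singular": {"pl": False}, "plural": {"pl": True}}
CASE = {
    "nominative": {"obj": False, "obl": False},
    "accusative": {"obj": True, "obl": False},
    "dative": {"obj": True, "obl": True},
    "genitive": {"obj": False, "obl": True},
}


def positive_features(bundle):
    """Return the names of the marked (positively valued) features."""
    return sorted(name for name, value in bundle.items() if value)


def describe(gender, number, case):
    """Merge the three bundles into one fully specified feature set."""
    return {**GENDER[gender], **NUMBER[number], **CASE[case]}


# Neuter is unmarked for both gender features; a masculine dative singular
# form carries three marked features.
print(positive_features(GENDER["neuter"]))
print(positive_features(describe("masculine", "singular", "dative")))
```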

According to morphological underspecification accounts, grammatical elements may not necessarily be fully specified for morphological properties, but may lack features, even though every incidence of this element in a syntactic context is specified for these features (Lehmann, n.d.). This might result in an underspecified feature representation within the mental lexicon as has been suggested by Opitz and Pechmann (2016; cf. Figure 1c).

Empirical Evidence for Decomposed Gender Representation in the Mental Lexicon

First evidence for decomposed gender representation in the mental lexicon was put forward by Clahsen et al. (2001), who conducted a visual lexical decision experiment involving inflected adjectives. They found that processing of adjectives with a very specific affix (-m with two positive features: [+ obj, + obl]) resulted in longer reaction times (RTs) than processing of adjectives with a less specific affix (-s with no positive feature). Furthermore, they conducted a cross-modal priming experiment with auditory primes and visual stimuli. There, priming effects were smaller for specific affixes whose morphosyntactic features were primed incompletely compared to less specific affixes whose morphosyntactic features were fully primed. However, as Opitz et al. (2013) emphasise, it cannot be ruled out that the results were confounded by phonological features. As they argue, the affix -e, for example, was fully primed by the affix -(e)s, but this was not the case in the opposite condition (stimulus: -(e)s, prime: -e).

Janssen and Penke (2002) analysed errors of person and number agreement between subjects and verbs in sentence completion and elicitation tasks with German agrammatic patients. They found, among other things, that in most cases marked features were replaced by unmarked features.

Penke et al. (2004) compared reading times for correct and incorrect sentences in a sentence-matching test. Sentences included prepositional adjective-noun phrases (Experiment I) and determiner-noun phrases (Experiment II) with matching or non-matching inflectional markers. Typically, in such a test, incorrect sentences induce longer RTs than correct sentences. This grammaticality effect, however, emerged only in sentences where positive features of an inflected form were missing or negatively specified in the syntactic context. It did not emerge when negative or missing features of an inflected form co-occurred with positively specified features in the syntactic context. Penke and colleagues interpret these results as an indication of a relevant distinction between the two principles of compatibility and specificity. That is, an underspecified morphosyntactic element is chosen for a given context when it is (a) compatible with this context and (b) the most specific of all elements fulfilling precondition (a) (e.g., Opitz et al., 2013). Violations of compatibility are supposed to be more serious than violations of specificity and, thus, may result in different processing costs. Furthermore, Penke and colleagues argue that positive features are part of the representation of morphologically complex words or affixes, while negative features are applied on the basis of the paradigmatic context. Therefore, only positive features can disagree with the syntactic context and, thus, prolong RTs. The results, however, are to be interpreted with caution since in some sentences the mismatch of the inflectional markers became apparent as early as the combination of preposition and adjective, while in others it first emerged with the combination of adjective and noun. This might explain some RT differences irrespective of the occurrence of positively marked features in the context.

Opitz et al. (2013) asked German subjects to rate the grammaticality of visually presented prepositional accusative adjective-noun phrases. Each noun was combined with each adjective three times, containing a masculine, a feminine, or a neuter accusative marker, respectively. Overall, phrases with masculine nouns were more error-prone than phrases with feminine or neuter nouns. Additionally, event-related potentials (ERPs) were analysed at the time of noun presentation. In all incorrect conditions, a P600 occurred, i.e., a positive deflection at 600–900 ms after presentation of the critical stimulus. The P600 is associated with syntactic processing difficulties or the need for reanalysis (e.g., Frisch et al., 2002; Gouvea et al., 2009). Within the experiment, it indicated different processing of correct and incorrect phrases. Additionally, a left anterior negativity (LAN) was observed 300–550 ms after presentation of the noun. It is associated with the identification of a morphosyntactic error. In phrases with masculine and feminine nouns, there were no LAN differences between the two incorrect conditions. Furthermore, in phrases with masculine nouns, there was no difference between incorrect and correct conditions. However, in phrases with neuter nouns, the LAN was larger in combinations with masculine adjectives (corresponding to a violation of compatibility) than in combinations with feminine adjectives (corresponding to a violation of specificity). The authors interpret their results in terms of a feature-based account assuming maximal underspecification of features and a generally increased processing effort for masculine nouns (for a suggested explanation see the next paragraph).

The study of Opitz et al. (2013) was complemented by a series of experiments by Opitz and Pechmann (2016). Experiment I was a replication of the experiment described above, but this time RTs served as the dependent variable. In Experiment IIa, participants decided whether visually presented nouns were masculine or feminine. In Experiment IIb, participants decided whether visually presented nouns conformed to the gender specification of a given block of words. In Experiment III, participants decided whether a visually presented word was a noun or not; verbs and adjectives served as fillers. Overall, masculine nouns induced more errors and longer RTs than feminine nouns across all experiments, while neuter nouns took an intermediate position. According to Opitz and Pechmann (2016), these results reflect differential processing efforts for representatives of the different gender specifications resulting from a different number of connections to gender feature nodes. Processing effort would be directly related to the number of gender features to be activated and retrieved. Opitz and Pechmann state that their experimental results speak in favour of feminine nouns being the least specified and masculine nouns the most specified. Thus, feminine would be the default gender specification, with feminine nouns connected to no feature node; neuter nouns would be connected to one gender feature ([− f]), and masculine nouns to two features ([− f] and [+ m]; see Fig. 1c).

All in all, Opitz and Pechmann question not only a categorial gender representation in favour of feature-based representations but also the restriction of underspecified gender representations to the domain of inflectional markers. Instead, they suggest that underspecification “is more broadly used in the mental lexicon and extends to the feature specification of nouns” (Opitz & Pechmann, 2016, p. 235). While it is conceivable that gender representation in the mental lexicon is based on features, the claim that underspecification of gender extends to nouns can, in our view, be called into question for empirical as well as theoretical reasons:

First, Opitz and Pechmann’s (2016) assumption of underspecified nouns relies heavily on the observation of longer RTs for masculine compared to feminine nouns in a series of experiments. However, this observation may be an artefact of confounding variables of the specific stimuli chosen. Even though nouns were controlled for frequency and length and phrases for plausibility and familiarity, the stimulus words differed critically regarding several formal characteristics. Particularly, 33 of 60 feminine nouns had a schwa-ending strongly associated with feminine gender (e.g., Wegener, 1995). No such clear formal cues existed for the masculine or the neuter nouns. In addition, due to their productivity and frequency, feminine morphological gender indicators were potentially easier to recognise within the experiments than masculine and neuter morphological markers of the nouns. Furthermore, 51 of 60 feminine nouns, but only 29 of 60 masculine and 28 of 60 neuter nouns had their stress on the first syllable which is the prototypical stress pattern of German nouns and, therefore, might also be associated with shorter phonological processing times (e.g., Sulpizio et al., 2015). Taken together, confounding variables such as cue validity or word stress—or others—might account for the experimental results (i.e., the masculine disadvantage) regarded as critical by Opitz and Pechmann (2016) without drawing on underspecified feature-based gender representations of nouns in the mental lexicon.

Second, on theoretical grounds, the assumption of underspecified gender features in both the context and the dependent words seems to undermine the whole idea of decomposition and underspecification. Underspecification accounts for the fact that the same morphological markers appear in different syntactic contexts. For example, the German determiner dem is only specified for [− f] and can, thus, be applied to masculine or neuter contexts, which are thought to be characterised as [+ m, − f] and [− m, − f], respectively. With underspecification of both the dependent word and the noun, not only the dependent word but also the noun can be combined with any other element whose specification does not contradict its own specification. In consequence, the mapping direction seems to become somewhat arbitrary. Thus, if a noun like Klima (neut.) “climate” was only specified as [− f], it should be possible to combine this noun not only with other neuter [− f] but also with masculine [+ m, − f] dependent words, neither of which disagrees with the [− f] specification of Klima. This is clearly not the case. For example, the incongruent combination der (masc.) Klima (neut.) would be identified as erroneous by a contextual specification including a fully specified noun (*der (masc.) [+ m, − f] Klima (neut.) [− m, − f]) but would be accepted under the assumption of underspecification of both determiner and noun (*der (masc.) [+ m, − f] Klima (neut.) [− f]).
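The argument can be traced with a minimal Python sketch. The function `compatible` and the dictionary encoding are assumptions introduced here for illustration; the feature bundles are the ones given in the text.

```python
def compatible(dependent, noun):
    """A dependent word is compatible with a noun if no feature that both
    specify carries conflicting values; a feature the noun leaves
    unspecified can never clash."""
    return all(noun[feat] == value
               for feat, value in dependent.items()
               if feat in noun)


DER = {"m": True, "f": False}          # der (masc.) [+ m, − f]
KLIMA_FULL = {"m": False, "f": False}  # Klima (neut.), fully specified
KLIMA_UNDER = {"f": False}             # Klima (neut.), underspecified as [− f]

print(compatible(DER, KLIMA_FULL))   # False: [+ m] clashes with [− m]
print(compatible(DER, KLIMA_UNDER))  # True: nothing contradicts [− f]
```

With the fully specified noun the ungrammatical combination is rejected, whereas with the underspecified noun the same check lets it through, which is the problem raised above.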

Thus, empirical as well as theoretical considerations cast doubt on the assumption of underspecified feature-based gender representations of nouns. Therefore, this assumption will not be considered further. Opitz and Pechmann’s (2016) assumption that the total number of features is relevant for processing costs, however, seems plausible and is taken up in the hypotheses.

The Present Study

In summary, there are two fundamentally different types of approaches to grammatical gender representation in the mental lexicon. On the one hand, predominant psycholinguistic models postulate categorial gender nodes. On the other hand, recent experimental results suggest that feature decomposition as discussed in theoretical linguistics is also the basis of mental representations and processing of morphological properties, including grammatical gender. Within such decomposition approaches, full specification of features as well as different kinds of underspecification are being discussed. Also, different suggestions have been made regarding the kind of feature processing which leads to specific response patterns in RTs, error rates, or electrophysiological potentials (e.g., relevance of all features involved vs. relevance of positive features of the dependent word only vs. relevance of feature compatibility and specificity).

The present study aimed at testing hypotheses resulting from these different suggestions. To this end, two experiments on gender agreement in visually presented combinations of definite determiners and nouns (Experiment I) as well as anaphoric personal pronouns and nouns (Experiment II) were conducted with German-speaking participants. Every noun was combined with each of the three possible definite determiners or anaphoric personal pronouns in nominative case. Error rates (ERs) and reaction times (RTs) were compared for the incorrect combinations of the nouns with their two incongruent determiners or pronouns (agreement violations). Comparisons of ERs and RTs in the (congruent) agreement conditions across the three gender specifications were not considered in detail. This was to account for the fact that nouns differ not only regarding lexical-semantic characteristics like abstractness, frequency, or length, but also regarding gender-specific factors like availability and reliability of semantic, morphological, or phonological gender indicators (e.g., Köpcke, 1982; see also the General Discussion). It seems virtually impossible to fully match nouns across gender specifications for all possible confounding variables.

Based on the explanatory approaches described above, the following hypotheses regarding the comparison of the incongruent conditions could be deduced:

Categorial Representation of Grammatical Gender (Classical Discrete Two-Step Model, e.g., Jescheniak & Levelt, 1994)

With the assumption of categorial gender representation, ERs and RTs should be similar across the two possible agreement violation conditions for each gender specification (cf. Table 2). However, differences could potentially arise from (a) differences in the frequency of the three gender specifications or of the word forms of determiners / pronouns, (b) formal differences between the determiners / pronouns (e.g., different number of phonemes or graphemes, different degrees of phonemic / graphemic similarity), or (c) differential compatibility of formal or semantic gender markers of the noun and the incongruent determiner / pronoun. For example, 64% of the German one-syllable words are masculine, 22% neuter, and 14% feminine. In the light of this distribution, the presentation of a neuter one-syllable word like Haus “house” might result in a faster rejection of die / sie compared to der / er simply because die / sie is less frequently associated with a one-syllable noun than der / er. Furthermore, even in a nominative context, rejection of feminine nouns with das might be faster than rejection of feminine nouns with der, as der can also appear in feminine genitive singular contexts, while das is not represented in any cell of the feminine paradigm (cf. Table 1). The personal pronouns er and es, however, are unique to masculine and neuter nominative singular contexts, respectively, and should therefore produce comparable RTs for combinations with feminine nouns, based on their validity (but ignoring their frequency of use).

Table 2 Examples for stimulus combinations (left panel) and predictions for comparison of the two agreement violation conditions per gender specification (right panel) based on the assumption of categorial gender nodes

Feature-Based Representations of Grammatical Gender

Regarding feature-based representations of grammatical gender, different explanation attempts have been made regarding (a) full specification or underspecification, and (b) relevant processing aspects like number of features involved or violation of compatibility vs. specificity. Even (c) the question may be raised of whether the morphosyntactic context is specified serially from left to right (i.e., in chronological order) or by the noun as the syntactic head of a combination. Combinations of these different dimensions result in a considerable number of different possible predictions regarding expected RT patterns in the violation conditions, which cannot be presented here in detail. Based on previous proposals, hypotheses can be derived as follows (cf. Table 3):

Table 3 Predictions for comparison of the two agreement violation conditions per gender specification based on the assumptions put forward by Penke et al. (2004), Opitz et al. (2013), and Opitz and Pechmann (2016)

Thus, Penke et al. (2004) postulate underspecified feature representations for the dependent words with masc [+ m], fem [+ f], and neut [], while nouns are fully specified regarding gender (masc [+ m, − f], fem [− m, + f], neut [− m, − f]). Grammaticality effects are only observed when a positive feature of the dependent word is missing or negatively specified in the context. Thus, in a task which involves the detection of morphosyntactic violations, die / sie and der / er should induce fewer errors and shorter RTs than das / es in combinations with masculine and feminine nouns, respectively (e.g., die [+ f] Mantel [+ m, − f] < das [] Mantel [+ m, − f], der [+ m] Party [− m, + f] < das [] Party [− m, + f]). Similar RTs are to be expected for der / er and die / sie in combinations with neuter nouns (e.g., der [+ m] Klima [− m, − f] = die [+ f] Klima [− m, − f]).
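A small Python sketch may help to trace how these predictions follow from the feature specifications. It is an illustrative reading of the account, not code from any of the cited studies; the names `DEPENDENT`, `NOUN`, and `clashes` are assumptions introduced here.

```python
# Dependent words carry positive features only (Penke et al., 2004, as
# summarised in the text): masc [+ m], fem [+ f], neut [].
DEPENDENT = {"der": {"m"}, "die": {"f"}, "das": set()}

# Nouns are fully specified (True = "+", False = "−").
NOUN = {
    "Mantel": {"m": True, "f": False},   # masculine
    "Party": {"m": False, "f": True},    # feminine
    "Klima": {"m": False, "f": False},   # neuter
}


def clashes(determiner, noun):
    """True if some positive feature of the determiner is not positively
    specified on the noun, i.e. the case for which a grammaticality
    effect is expected."""
    return any(not NOUN[noun][feat] for feat in DEPENDENT[determiner])


for noun in NOUN:
    for det in DEPENDENT:
        print(det, noun, "clash" if clashes(det, noun) else "no clash")
```

Under this reading, das never clashes because it carries no positive feature, which is why violations with das / es are predicted to be harder to detect than violations with die / sie or der / er.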

According to Opitz et al. (2013, p. 246 and 254), differences in processing costs result from the degree of feature agreement violation between partly underspecified dependent words (masc [+ m, − f], fem [], and neut [− f]) and fully specified nouns (masc [+ m, − f], fem [− m, + f], neut [− m, − f]). Violations of compatibility (e.g., [+ m, − f] − [− m, + f]) are easier to detect and, thus, result in fewer errors and shorter RTs than violations of specificity (e.g., [− f] − [+ m, − f]). Therefore, similar RTs are to be expected for die / sie and das / es in combinations with masculine nouns as well as for der / er and das / es in combinations with feminine nouns (e.g., die [] Mantel [+ m, − f] = das [− f] Mantel [+ m, − f] → in both cases violation of specificity, der [+ m, − f] Party [− m, + f] = das [− f] Party [− m, + f] → in both cases violation of compatibility). In combinations with neuter nouns, der / er should induce fewer errors and shorter RTs than die / sie (e.g., der [+ m, − f] Klima [− m, − f] → violation of compatibility < die [] Klima [− m, − f] → violation of specificity).
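The classification into compatibility and specificity violations can be sketched in Python as follows. This is a hedged illustration of the two principles as summarised in this section; the function `violation_type` and its tie-breaking by number of specified features are assumptions made here, not the authors' implementation.

```python
# Partly underspecified dependent words (Opitz et al., 2013, as summarised
# in the text): masc [+ m, − f], fem [], neut [− f].
DEPENDENT = {
    "der": {"m": True, "f": False},
    "die": {},
    "das": {"f": False},
}

# Fully specified nouns (True = "+", False = "−").
NOUN = {
    "Mantel": {"m": True, "f": False},   # masculine
    "Party": {"m": False, "f": True},    # feminine
    "Klima": {"m": False, "f": False},   # neuter
}


def is_compatible(spec, context):
    """No feature of the dependent word contradicts the noun."""
    return all(context[feat] == value for feat, value in spec.items())


def violation_type(determiner, noun):
    """Classify a determiner-noun pair as 'compatibility' violation,
    'specificity' violation, or 'none' (congruent)."""
    spec, context = DEPENDENT[determiner], NOUN[noun]
    if not is_compatible(spec, context):
        return "compatibility"
    # Compatible, but is there a more specific compatible competitor?
    rivals = [d for d, other in DEPENDENT.items()
              if is_compatible(other, context) and len(other) > len(spec)]
    return "specificity" if rivals else "none"


for noun in NOUN:
    print(noun, {det: violation_type(det, noun) for det in DEPENDENT})
```

The output reproduces the pattern described above: both violations are specificity violations for Mantel, both are compatibility violations for Party, and Klima yields one of each.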

Finally, Opitz and Pechmann (2016) argue that the absolute number of features involved in processing critically affects RTs. Furthermore, they postulate underspecified feature representations not only for the dependent word but also for the noun. While we do not agree with this latter assumption for the reasons discussed above, the other assumptions seem plausible. Thus, combinations of masculine nouns with die / sie would induce fewer errors and shorter RTs than combinations of masculine nouns with das / es (e.g., die [] Mantel [+ m, − f] < das [− f] Mantel [+ m, − f]). In combinations with feminine as well as neuter nouns, der / er would result in more errors and longer RTs compared to das / es and die / sie, respectively (e.g., der [+ m, − f] Party [− m, + f] > das [− f] Party [− m, + f], der [+ m, − f] Klima [− m, − f] > die [] Klima [− m, − f]).
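The feature-count reasoning can be illustrated with a short Python sketch. It assumes, as argued in this section, fully specified nouns combined with underspecified dependent words; the name `feature_load` is hypothetical and simply sums the number of specified features in a pair.

```python
# Underspecified dependent words and fully specified nouns, as in the text.
DEPENDENT = {
    "der": {"m": True, "f": False},   # [+ m, − f]
    "die": {},                        # []
    "das": {"f": False},              # [− f]
}
NOUN = {
    "Mantel": {"m": True, "f": False},   # masculine
    "Party": {"m": False, "f": True},    # feminine
    "Klima": {"m": False, "f": False},   # neuter
}


def feature_load(determiner, noun):
    """Total number of features specified in the pair (assumed here to
    be the quantity that drives processing cost)."""
    return len(DEPENDENT[determiner]) + len(NOUN[noun])


# Violation pairs from the text: a smaller load predicts fewer errors
# and shorter RTs.
print(feature_load("die", "Mantel"), feature_load("das", "Mantel"))  # 2 3
print(feature_load("der", "Party"), feature_load("das", "Party"))    # 4 3
print(feature_load("der", "Klima"), feature_load("die", "Klima"))    # 4 2
```

Because the nouns are all fully specified, the ordering within each pair is determined by the dependent word alone.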

Yet, even if none of these accounts may be entirely correct, principally, differences in RTs in the violation conditions speak in favour of some kind of decomposition of gender representation in the mental lexicon.

Experiments

Experiment I: Gender Agreement Decision for Determiner Noun Phrases

In order to test the predictions presented above, in a first experiment, ERs and RTs were measured for decisions on gender agreement between definite determiners and nouns.

Materials and Procedure

One hundred and twenty morphologically simple German nouns served as stimuli for this experiment, 40 for each gender specification (masculine, feminine, neuter). Word length measured in number of syllables and graphemes as well as type frequency and lemma frequency according to dlex (Footnote 1) were matched across gender specifications. Target nouns are listed in Appendix Table 11. As Experiment I was embedded in an experiment on compound processing, 60 compound nouns served as fillers. During the experiment, each noun appeared three times in randomised order, once with each of the three definite determiners (e.g., das Kleid “dress”, neut.; *die Kleid; *der Kleid).

Stimuli were displayed visually in the centre of a computer screen using the DMDX software (http://www.u.arizona.edu/~kforster/dmdx/dmdx.htm; cf. Forster & Forster, 2003). First, a fixation cross appeared for 500 ms. It was followed by the determiner-noun phrase whose parts (determiner and noun) were presented simultaneously, horizontally aligned in left-to-right order (i.e. in the default order of German noun phrases). Participants were instructed to decide as fast and accurately as possible on gender agreement of determiner and noun by pressing the corresponding button (YES or NO). YES answers were assigned to the participant’s dominant hand. Stimulus presentation was terminated by the participant’s response or automatically after 3000 ms. Subsequently, a new trial started automatically.

The experimental testing was preceded by 18 practice trials in order to familiarise participants with the task. Afterwards, all 540 nominal phrases (target nouns and fillers) were presented, with pauses at an interval of 60 trials. Overall, the experiment took 30–40 min. Correctness of the answers and RTs were recorded with DMDX.
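The trial count follows directly from the design: (120 targets + 60 fillers) × 3 determiners = 540 phrases. The Python sketch below shows one way such a list could be assembled; it is an assumption-laden illustration (simple unconstrained shuffling, placeholder noun labels, a hypothetical `build_trials` helper) and does not reproduce the actual DMDX item file or any ordering constraints that may have been used.

```python
import random

DETERMINERS = ("der", "die", "das")


def build_trials(nouns, seed=0):
    """Pair every noun with every determiner and shuffle the result.
    The fixed seed only makes this illustration reproducible."""
    trials = [(det, noun) for noun in nouns for det in DETERMINERS]
    random.Random(seed).shuffle(trials)
    return trials


N_TARGETS, N_FILLERS = 120, 60
# Placeholder labels; the real items are listed in the appendix.
placeholder_nouns = [f"noun{i}" for i in range(N_TARGETS + N_FILLERS)]

trials = build_trials(placeholder_nouns)
print(len(trials))                 # 540
print(len(trials) // 60, "blocks of 60 trials")
```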

Participants

Thirty native speakers of German took part in the experiment. With one exception, all of them were students at the University of Erfurt. 23 of them were female, seven male. Mean age was 22.6 years (range: 18–41). None of the participants was diagnosed with dyslexia. Four of them were left-handed. Participants were paid for their participation.

Results

Incorrect responses, responses lasting longer than 3000 ms, and responses exceeding 2.5 standard deviations of a participant’s individual RT mean (calculated separately for YES and NO answers) were counted as errors. Across the 360 experimental target stimuli, no participant exceeded an ER of 11% (mean ER: 6.2%, range: 3.3–10.6%).
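The error criterion can be sketched in Python for one participant's data. Several details are assumptions made for this illustration and are not stated in the text: the outlier criterion is applied in both directions, the sample standard deviation is used, trials are grouped by the expected answer, timeouts are coded as `None`, and the function name `flag_errors` is hypothetical.

```python
from statistics import mean, stdev

TIMEOUT_MS = 3000
SD_CUTOFF = 2.5


def flag_errors(trials):
    """trials: list of dicts with keys 'expected' ('YES' or 'NO'),
    'correct' (bool), and 'rt' (ms, or None for a timeout).
    Returns one bool per trial: True if the trial counts as an error."""
    limits = {}
    for answer in ("YES", "NO"):
        rts = [t["rt"] for t in trials
               if t["expected"] == answer
               and t["rt"] is not None and t["rt"] <= TIMEOUT_MS]
        if len(rts) >= 2:  # a standard deviation needs two values
            m, sd = mean(rts), stdev(rts)
            limits[answer] = (m - SD_CUTOFF * sd, m + SD_CUTOFF * sd)

    flags = []
    for t in trials:
        timed_out = t["rt"] is None or t["rt"] > TIMEOUT_MS
        low, high = limits.get(t["expected"], (float("-inf"), float("inf")))
        outlier = (not timed_out) and not (low <= t["rt"] <= high)
        flags.append((not t["correct"]) or timed_out or outlier)
    return flags
```

A participant's ER is then the proportion of flagged trials among the experimental target trials.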

Error Rates

Results of the ER analyses are summarised in Fig. 2. Most relevant with respect to the different hypotheses are comparisons between the two agreement violation conditions per gender specification.

Fig. 2 Mean error rates (%) in Experiment I (determiner noun agreement decision)

The ERs in the agreement violation conditions were compared with paired t-tests across participants and items.
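For readers unfamiliar with the notation, t1 refers to the analysis by participants (each participant contributes one mean per condition) and t2 to the analysis by items (each noun contributes one mean per condition). The following Python sketch computes the paired t statistic from such condition means; the function name `paired_t` and the example values are made up for illustration and are not data from this study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(cond_a, cond_b):
    """Paired t statistic and degrees of freedom for two equally long
    lists of condition means (one pair per participant or per item)."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical error rates (%) of four participants in two violation
# conditions; purely illustrative values.
er_das = [12.0, 10.0, 8.0, 14.0]
er_die = [5.0, 6.0, 4.0, 7.0]
t, df = paired_t(er_das, er_die)
print(f"t({df}) = {t:.3f}")
```

With 30 participants the by-participant test has 29 degrees of freedom, and with 40 items per gender the by-item test has 39, matching the values reported below. Obtaining p-values would additionally require the t distribution, for example from a statistics library.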

Agreement violation conditions with masculine nouns. Comparison of ERs in masculine nouns revealed a significant difference with more errors on combinations with neuter das (mean_masc_das = 11.2%, SD = 14.2) compared to combinations with feminine die (mean_masc_die = 5.2%, SD = 8.9; t1(29) = 4.075, p < 0.001; t2(38) = 3.077, p = 0.004).

Agreement violation conditions with feminine nouns. Comparison of ERs in feminine nouns revealed no significant difference between combinations with neuter das (mean_fem_das = 3.2%, SD = 8.1) and combinations with masculine der (mean_fem_der = 3.7%, SD = 6.8; t1(29) = − 0.361, p = 0.721; t2(39) = − 0.606, p = 0.548).

Agreement violation conditions with neuter nouns. Comparison of ERs in neuter nouns revealed a significant difference with more errors on combinations with masculine der (mean_neut_der = 9.4%, SD = 10.7) compared to combinations with feminine die (mean_neut_die = 3.6%, SD = 6.7; t1(29) = 5.723, p < 0.001; t2(39) = 4.394, p < 0.001).

Reaction Times

Results of the RT analyses are summarised in Fig. 3. Responses classified as errors (see above) were excluded from these analyses.

Fig. 3 Mean reaction times (ms) in Experiment I (determiner noun congruence decision)

The RTs in the agreement violation conditions were compared with paired t-tests across participants and items.

Agreement violation conditions for masculine nouns. Comparison of RTs in masculine nouns revealed a significant difference with longer RTs for combinations with neuter das (mean_masc_das = 1,105.3 ms, SD = 113.9) compared to combinations with feminine die (mean_masc_die = 1,063.9 ms, SD = 97.8; t1(29) = 3.581, p = 0.001; t2(38) = 3.208, p = 0.003).

Agreement violation conditions for feminine nouns. Comparison of RTs in feminine nouns revealed a significant difference with longer RTs for combinations with neuter das (mean_fem_das = 1,080.8 ms, SD = 96.5) compared to combinations with masculine der (mean_fem_der = 1,059.8 ms, SD = 100.6; t1(29) = 2.195, p = 0.036; t2(39) = 2.406, p = 0.021).

Agreement violation conditions for neuter nouns. Comparison of RTs in neuter nouns revealed no significant difference between combinations with masculine der (mean_neut_der = 1,096.7 ms, SD = 118.5) and combinations with feminine die (mean_neut_die = 1,084.4 ms, SD = 125.7; t1(29) = 0.938, p = 0.356; t2(39) = 1.379, p = 0.176).

Interim Discussion

In Experiment I, participants had to decide on gender agreement in visually presented determiner noun phrases. ERs and RTs were compared for the two agreement violation conditions of each target gender specification. For masculine target nouns, incongruent combinations with das resulted in higher ERs and longer RTs than incongruent combinations with die. For feminine target nouns, incongruent combinations with das resulted in longer RTs, but not in higher ERs, than incongruent combinations with der. For neuter target nouns, incongruent combinations with der caused higher ERs but no RT differences compared to incongruent combinations with die. Differences in the difficulty of detecting violations with one versus the other wrong determiner are not predicted by categorial accounts; their explanation would thus require additional assumptions (e.g., based on frequency of use). While such differences are, in principle, predicted by feature-based accounts, the specific patterns observed did not fully agree with any of the predictions deduced from the previous studies.

However, it cannot be excluded that RTs in the present experiment were influenced by inherent characteristics of the determiners. For example, der [deːɐ̯] and das [das] consist of three phonemes each while die [diː] consists of only two. Furthermore, frequencies of the definite determiners differ. According to dlex (www.dlexdb.de), type frequency of das (absolute number of occurrences: 677,120) is lower than type frequency of der (3,026,098) and die (2,510,938).

For this reason, a second experiment was conducted, using the personal anaphoric pronouns er [eːɐ̯], sie [ziː], and es [ɛs] which all consist of two phonemes and are more similar regarding their type frequencies according to dlex (er: 604,723, sie: 607,179, es: 548,281).

Experiment II: Gender Agreement Decision for Anaphoric Personal Pronouns and Nouns

Experiment II was a conceptual replication of Experiment I, differing in (a) the function words used (anaphoric personal pronouns instead of definite determiners), (b) the kind of stimulus presentation, and (c) the participants.

Materials and Procedure

The procedure was analogous to Experiment I. This time, however, the anaphoric personal pronouns er (masc.), sie (fem.), and es (neut.) were used as function words instead of definite determiners. Furthermore, pronoun and noun were not presented simultaneously in left-to-right order. Instead, the pronoun was presented in the centre of the screen. After 600 ms, the noun appeared just below the pronoun. This was to account for the fact that pronouns and nouns do not form immediate constituents of a single phrase in natural speech. Rather, they are connected by a paradigmatic relation. Presenting the pronoun first aimed at building up the expectation of a particular gender specification that could then be compared to the gender specification of the noun presented afterwards (instead of the other way round). Again, the participants had to decide on the agreement of gender specifications of pronoun and noun within a given pair.

As no compound filler stimuli had to be inserted in this experiment, the number of experimental stimuli could be increased with no extra effort for the participants. Thus, 180 morphologically simple nouns were used, 60 for each gender specification. One hundred and eighteen of them were taken from Experiment I (cf. Appendix Table 12 for a list of the stimuli). Each noun was presented with each of the three pronouns, thus calling for one YES answer (agreement) and two NO answers (agreement violations) per noun.

Testing was preceded by six practice trials in order to familiarise the participants with the task. The experiment consisted of 540 trials with pauses at an interval of 60 trials. Altogether, the experiment lasted approximately 30 min. RTs and correctness of the answers were recorded with DMDX.
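The trial structure described above can be sketched in a few lines of Python. This is only an illustration of the design arithmetic (180 nouns crossed with three pronouns), not the DMDX script used in the study; the function name build_trials and the placeholder noun labels are assumptions introduced here.

```python
from itertools import product

GENDERS = ("masc", "fem", "neut")
PRONOUNS = {"er": "masc", "sie": "fem", "es": "neut"}

def build_trials(nouns):
    """Cross every (noun, gender) pair with every pronoun.

    Each noun yields one agreement trial (expected answer YES)
    and two agreement-violation trials (expected answer NO).
    """
    trials = []
    for (noun, noun_gender), (pronoun, pronoun_gender) in product(nouns, PRONOUNS.items()):
        trials.append({
            "pronoun": pronoun,
            "noun": noun,
            "noun_gender": noun_gender,
            "expected": "YES" if noun_gender == pronoun_gender else "NO",
        })
    return trials

# Placeholder labels, 60 per gender as in the design; these are NOT the real stimuli.
nouns = [(f"{gender}_noun_{i}", gender) for gender in GENDERS for i in range(60)]
trials = build_trials(nouns)
```

With 60 nouns per gender, this yields 540 trials, of which 180 require a YES answer and 360 a NO answer.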

Participants

Thirty-seven participants took part in Experiment II. None of them had taken part in Experiment I. Data of six participants had to be excluded from analysis due to bilingualism (n = 1), a technical error during data registration (n = 1), and ERs exceeding 10% (n = 4; cf. Footnote 2). Of the remaining 31 participants, 27 were female and four male. Their mean age was 22.6 years (range: 19–30 years). All participants included were native speakers of German. Two were left-handed. None of them was diagnosed with dyslexia. They were paid for their participation.

Results

Error Rates

Results of the ER analyses are summarised in Fig. 4. Agreement violation conditions are most relevant with respect to the different hypotheses.

Fig. 4 Mean error rates (%) in Experiment II (pronoun-noun agreement decision)

The ERs in the agreement violation conditions were compared with paired t-tests across participants and items.
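For readers less familiar with the t1/t2 notation, the following minimal Python sketch shows the paired t statistic underlying such comparisons. It assumes that each pair holds the two agreement-violation conditions for the same participant (t1, hence df = 31 − 1 = 30) or for the same noun (t2, hence df = 60 − 1 = 59). The function name paired_t is introduced here for illustration; the sketch does not compute p-values and is not the software used for the reported analyses.

```python
import math
from statistics import mean, stdev

def paired_t(cond_a, cond_b):
    """Return (t, df) for a paired t-test on two equally long sequences.

    cond_a[i] and cond_b[i] must belong to the same unit
    (the same participant for t1, the same item for t2).
    """
    if len(cond_a) != len(cond_b):
        raise ValueError("both conditions need one value per unit")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    # Mean difference divided by its standard error
    # (raises ZeroDivisionError if all differences are identical).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

Running the function once on per-participant means and once on per-item means gives the two statistics reported for each comparison.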

Agreement violation conditions for masculine nouns. Comparison of ERs in masculine nouns revealed a significant difference with more errors on combinations with neuter es (mean_masc_es = 7.7%, SD = 9.1) compared to combinations with feminine sie (mean_masc_sie = 2.8%, SD = 4.4; t1(30) = 6.830, p < 0.001; t2(59) = 4.648, p < 0.001).

Agreement violation conditions for feminine nouns. Comparison of ERs in feminine nouns revealed a significant difference with more errors on combinations with neuter es (mean_fem_es = 5.3%, SD = 6.3) compared to combinations with masculine er (mean_fem_er = 3.2%, SD = 3.73; t1(30) = 3.449, p = 0.002; t2(59) = 3.162, p = 0.002).

Agreement violation conditions for neuter nouns. Comparison of ERs in neuter nouns revealed a significant difference with more errors on combinations with masculine er (mean_neut_er = 10.8%, SD = 13.0) compared to combinations with feminine sie (mean_neut_sie = 5.4%, SD = 5.2; t1(30) = 4.511, p < 0.001; t2(59) = 3.647, p = 0.001).

Reaction Times

Results of the RT analyses are summarised in Fig. 5. Erroneous responses were excluded from these analyses. Again, agreement violation conditions are most relevant with respect to the different hypotheses.

Fig. 5 Mean reaction times (ms) in Experiment II (pronoun-noun agreement decision)

The RTs in the agreement violation conditions were compared with paired t-tests across participants and items.

Agreement violation conditions for masculine nouns. Comparison of RTs in masculine nouns revealed a significant difference with longer RTs for combinations with neuter es (mean_masc_es = 1,211.4 ms, SD = 108.5) compared to combinations with feminine sie (mean_masc_sie = 1,128.5 ms, SD = 94.2; t1(30) = 6.534, p < 0.001; t2(59) = 8.043, p < 0.001).

Agreement violation conditions for feminine nouns. Comparison of RTs in feminine nouns revealed a significant difference with longer RTs for combinations with neuter es (mean_fem_es = 1,170.9 ms, SD = 79.9) compared to combinations with masculine er (mean_fem_er = 1,132.2 ms, SD = 95.8; t1(30) = 3.570, p = 0.001; t2(59) = 3.509, p = 0.001).

Agreement violation conditions for neuter nouns. Comparison of RTs in neuter nouns revealed a significant difference with longer RTs for combinations with masculine er (mean_neut_er = 1,209.2 ms, SD = 118.4) compared to combinations with feminine sie (mean_neut_sie = 1,169.8 ms, SD = 99.4; t1(30) = 2.682, p = 0.012; t2(59) = 3.674, p = 0.001).

Interim Discussion

Along the lines of Experiment I, Experiment II compared errors and RTs for gender agreement decisions on visually presented combinations of anaphoric personal pronouns and nouns. With only slight deviations, the results of Experiment II matched those of Experiment I and, again, differed from the results of former studies. This was despite the fact that the anaphoric personal pronouns are more similar regarding phoneme number and frequency than the definite determiners.

Still, differences a) between individual processing times for the different determiners and pronouns and b) in acceptability of incorrect combinations of determiners or pronouns and nouns as described above might account for the diverging results. For this reason, regression analyses were conducted on the RTs of Experiments I and II including data from two control experiments.

Regression Analyses on Results of Experiments I and II Including Data from Two Control Experiments

Before conducting regression analyses, two control experiments were run.

Control Experiment I: Lexical Decision for Definite Determiners and Personal Pronouns

As explicated above, determiners and pronouns differ regarding lexical features like grapheme and phoneme number, graphemic and phonological similarity, word frequency, frequency of appearance within inflectional paradigms, number of associated nouns and many others. Some of the relevant factors may even still be unknown. As it is impossible to control for all these factors, a lexical decision experiment was run in order to collect data on the processing of the determiners and pronouns used in Experiments I and II and, thus, obtain a measure of ‘processing costs’ reflecting the combined effect of the relevant variables.

Materials

The definite determiners der, die, and das as well as the anaphoric personal pronouns er, sie, and es served as target stimuli in this control experiment. Nine more German function words served as fillers and 15 non-lexical two- or three-grapheme combinations served as nonwords (stimuli are listed in Appendix Table 13).

Procedure

Stimuli were displayed visually with the DMDX software in the centre of a computer screen. First, a fixation cross appeared for 600 ms. It was followed by the target item. Participants were instructed to decide as fast and accurately as possible about the lexical status of the target item by pressing the corresponding button (YES: word or NO: nonword). YES answers were assigned to the participant’s dominant hand. Every item was presented ten times, order of presentation being randomised. The testing was preceded by eight practice trials. Afterwards, all 300 experimental trials were run, with pauses at intervals of 100 trials. Overall, the control experiment took about ten minutes. Correctness of the answers and RTs were recorded with the DMDX software.

Participants

Thirty-five subjects participated in this control experiment. All of them had also participated in Experiment II. Thirty participants were female and five male. The mean age was 22.5 years (range: 19–30 years). All participants were native speakers of German. None of them was diagnosed with dyslexia. Two were left-handed. Participants were paid for their participation.

Results

Incorrect responses, responses lasting longer than 3,000 ms, and responses exceeding 2.5 standard deviations of a participant’s individual mean (calculated separately for YES and NO answers) were counted as errors. Across the 300 experimental stimuli, no participant exceeded an ER of 7% (mean ER: 3.8%, range: 2.0–7.0%). Descriptive statistical data for target determiners and pronouns are summarised in Table 4. Lexical decision times gathered in Control Experiment I were later used for the correction of RTs in regression analyses on the results of the main experiments.
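The error criterion can be illustrated with the following Python sketch for a single participant. Several details are assumptions made here because the text does not specify them: the mean and standard deviation are computed only over trials that are correct and do not exceed 3,000 ms, deviations in either direction count as outliers, and the field names (answer, correct, rt) and the function name flag_errors are hypothetical.

```python
from statistics import mean, stdev

RT_CEILING_MS = 3000   # responses slower than this count as errors
SD_CUTOFF = 2.5        # allowed deviation from the participant's mean, in SDs

def flag_errors(trials):
    """Flag the trials of ONE participant that count as errors.

    trials: list of dicts with 'answer' ('YES' or 'NO', the required
    response), 'correct' (bool), and 'rt' (milliseconds).
    Returns a list of booleans, True where the trial counts as an error.
    """
    # Step 1: incorrect responses and responses above the ceiling.
    flags = [(not t["correct"]) or t["rt"] > RT_CEILING_MS for t in trials]
    # Step 2: outliers, computed separately for YES and NO answers
    # over the trials that survived step 1 (an assumption, see above).
    for answer in ("YES", "NO"):
        idx = [i for i, t in enumerate(trials)
               if t["answer"] == answer and not flags[i]]
        if len(idx) < 2:
            continue  # a standard deviation needs at least two values
        rts = [trials[i]["rt"] for i in idx]
        m, sd = mean(rts), stdev(rts)
        for i in idx:
            if abs(trials[i]["rt"] - m) > SD_CUTOFF * sd:
                flags[i] = True
    return flags
```

A participant's ER is then the proportion of flagged trials among all experimental trials.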

Table 4 Mean lexical decision times (ms) for target definite determiners and anaphoric personal pronouns in Control Experiment I

Control Experiment II: Semantic and Formal Ratings of Noun Gender

In German, nouns differ regarding their semantic and formal characteristics, which influence the probability of a particular gender assignment. For example, Mann “man” is masculine due to the biological sex of the referent, 90.4% of the nouns ending in schwa are feminine (Wegener, 1995, p. 76), and all nouns with the suffix -chen are neuter. In some cases, semantic and formal characteristics of a word diverge (e.g., das Mädchen (neut) “the girl”: the formal cue determines gender irrespective of the referent’s biological sex). As semantic and formal characteristics might result in different perceived degrees of gender (dis)agreement between a given determiner or pronoun and a given noun, a control experiment was run collecting data to explore the presence and metalinguistic awareness of such characteristics.

Materials

The same nouns were used as in Experiment II (thus also including all 118 stimuli used in Experiment I).

Procedure

Written target words were presented in randomised order one below the other in a column of a table. On the right, there were three empty columns. They were labelled as masculine–feminine–neuter, and coloured light blue, red, and green, respectively. Participants were instructed to rate on a scale of 1–10 as to how “masculine”, “feminine”, and “neuter” they perceived each noun. There was a semantic condition, in which participants were asked to concentrate on the meaning of the words, and there was a formal condition, in which they were asked to concentrate on the words’ form or sound when making their decision. Each noun was to be provided with three values (one for each gender specification). Participants were instructed that the individual values for each gender specification were independent from each other, i.e., they did not have to sum up to ten.

Semantic and formal ratings had to be carried out by the same participants but separately from each other. Half of the participants did the semantic ratings first, the other half started with the formal ratings. Word order was different in both conditions. Each rating run was preceded by six practice items. Overall, the investigation took 30–60 min.

Participants

Thirty-four participants took part in Control Experiment II. All of them had also participated in Experiment II. One participant had to be excluded from analysis because the form had not been filled in completely. Of the remaining 33 participants, 29 were female and four male. Their mean age was 22.3 years (range: 19–30 years).

Results

Altogether, two answers in the formal condition were missing. Thus, 17,820 semantic and 17,818 formal values were collected. Descriptive statistical data for the ratings of formal and semantic gender features of masculine, feminine, and neuter nouns are summarised in Table 5.

Table 5 Mean values of semantic and formal gender ratings of masculine, feminine, and neuter nouns (scale: 0–10) in Control Experiment II

Based on the mean values across all participants, a semantic and a formal quotient were calculated for each noun by dividing the mean rating for the noun's correct gender specification by the sum of the mean ratings for the two other gender specifications. For example, the semantic values for Hibiskus (masc) “hibiscus” were masculine = 2.00, feminine = 4.76, neuter = 6.12. This resulted in a semantic quotient for Hibiskus of 2.00 / (4.76 + 6.12) = 0.18.
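The quotient calculation can be written as a short Python function. The function name gender_quotient and the dictionary layout are illustrative choices made here; the rating values are those given for Hibiskus in the text.

```python
def gender_quotient(mean_ratings, correct_gender):
    """Mean rating of the correct gender divided by the summed
    mean ratings of the two other genders.

    mean_ratings: dict mapping 'masc', 'fem', 'neut' to mean ratings.
    """
    others = sum(value for gender, value in mean_ratings.items()
                 if gender != correct_gender)
    return mean_ratings[correct_gender] / others

# Semantic ratings for Hibiskus (masculine), as reported in the text.
hibiskus = {"masc": 2.00, "fem": 4.76, "neut": 6.12}
print(round(gender_quotient(hibiskus, "masc"), 2))  # 0.18
```

By construction, a quotient above 0.5 means that the correct gender was rated higher than the average of the two incorrect genders. A value as low as 0.18 therefore indicates that the semantic cues for this noun point away from its actual gender.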

Semantic and formal quotients are supposed to indicate how strongly semantic and formal cues to gender are coded in a word, potentially resulting in an easier or more difficult decision on gender agreement. As can be seen in Table 5, the (correct) target gender usually yielded the highest ratings, both semantically and formally, but did not approach ceiling. In semantic ratings, feminine nouns had the highest feminine rating of the three noun types, but their own highest rating was neuter. In masculine nouns, masculine and neuter semantic ratings had similar values. This might be due to the fact that many words have no clear association with natural sex but are just ‘neuter’. Apart from that, alternative (incorrect) genders yielded lower ratings that were still substantially above the bottom of the scale. Semantic and formal quotients were later included as predictor variables in regression analyses on the results of the main experiments.

Regression Analyses

Taking into consideration the results of the two control experiments, linear mixed effects regression analyses were conducted on RTs of correct trials in Experiments I and II (including the congruent as well as the incongruent determiner-noun phrases and pronoun-noun phrases) using the lme4 package (Bates et al., 2015) in R (R Core Team, 2018). Lexical decision times from Control Experiment I were subtracted from RTs in Experiments I and II in order to account for the fact that determiners and pronouns differ regarding a set of factors which influence processing but cannot be fully controlled for. Corrected RTs are thus thought to reflect the time needed for processing of gender specification (and other higher-level representations) devoid of word processing costs. They served as the dependent variable.
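The correction step can be sketched in Python as follows. The sketch assumes one mean lexical decision time per function word; whether the subtraction was carried out on overall or on participant-specific means is not stated in the text. The function name correct_rts, the field names, and the numeric values in the example call are placeholders, not the values from Table 4.

```python
def correct_rts(trials, lexical_decision_ms):
    """Subtract the lexical decision time of a trial's function word
    from its RT; trials with incorrect responses are dropped.

    trials: list of dicts with 'function_word', 'rt' (ms), 'correct'.
    lexical_decision_ms: dict mapping function word -> mean lexical
    decision time in ms (assumed: one value per word).
    """
    corrected = []
    for trial in trials:
        if not trial["correct"]:
            continue  # only correct trials enter the regression
        baseline = lexical_decision_ms[trial["function_word"]]
        corrected.append({**trial, "rt_corrected": trial["rt"] - baseline})
    return corrected

# Placeholder values for illustration only.
example = correct_rts(
    [{"function_word": "er", "rt": 1200.0, "correct": True},
     {"function_word": "es", "rt": 1300.0, "correct": False}],
    {"er": 520.0, "sie": 530.0, "es": 540.0},
)
```

The resulting rt_corrected values correspond to the dependent variable described above.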

The most relevant potential predictor variables were Gender type of the noun (masculine / feminine / neuter) and Determiner (der / die / das) or Pronoun (er / sie / es), respectively. Other potential predictor variables included as fixed factors were Word length (number of graphemes), Word frequency (log10 lemma frequency according to dlex), Semantic quotient and Formal quotient (both derived from Control Experiment II), Repetition (first, second, or third presentation of a given noun within the experiment), Position (consecutive number of a given item within the experiment), and the interaction of Frequency and Repetition. Participant was included as a random factor.

Overall Analysis on Determiner-Noun Phrases in Experiment I

Results of the overall analysis on determiner-noun phrases (Experiment I) are summarised in Table 6. All predictor variables except for Semantic quotient influenced RTs. Specifically, RTs were faster for more frequent and shorter nouns and nouns with higher formal quotients. They decreased with increasing number of repetitions of a given noun and later positions within the experiment. Moreover, the influence of word frequency decreased with increasing number of repetitions of a given noun within the experiment.

Table 6 Overall regression analysis of corrected RTs in Experiment I

Overall, feminine nouns were processed faster than masculine nouns, and masculine nouns faster than neuter nouns. Combinations with der were processed faster than combinations with die, and combinations with die faster than combinations with das.

Analysis of Agreement-Violation Conditions in Experiment I

Additionally, pairwise comparisons among levels of factors were conducted with the emmeans-function in R (Lenth, 2022; cf. Table 7). They yielded faster RTs for congruent compared to incongruent combinations for each gender. Crucially, in the agreement-violation conditions, combinations with die and der were processed faster than combinations with das with masculine and feminine nouns, respectively. There was no significant difference in the RTs for combinations of die and der with neuter nouns.

Table 7 Pairwise comparisons of corrected RTs in Experiment I

Overall Analysis on Pronoun-Noun Phrases in Experiment II

Results of the overall analysis on pronoun-noun phrases (Experiment II) are summarised in Table 8. All predictor variables significantly influenced RTs. Specifically, RTs were faster for more frequent and shorter nouns and nouns with higher formal and semantic quotients. They decreased with increasing number of repetitions of a given noun and later positions within the experiment. Moreover, the influence of word frequency decreased with increasing number of repetitions of a given noun within the experiment.

Table 8 Overall regression analysis of corrected RTs in Experiment II

Again, overall feminine nouns were processed faster than masculine nouns, and masculine nouns faster than neuter nouns. Combinations with er were processed faster than combinations with sie, and combinations with sie faster than combinations with es.

Analysis of Agreement-Violation Conditions in Experiment II

As in Experiment I, pairwise comparisons (cf. Table 9) yielded faster RTs for congruent compared to incongruent combinations for each gender. Crucially, in the incongruent conditions, combinations with sie and er were processed faster than combinations with es with masculine and feminine nouns, respectively. There was no significant difference in the RTs for combinations of sie and er with neuter nouns.

Table 9 Pairwise comparisons of corrected RTs in Experiment II

Summary of Results

Overall, as in the studies of Opitz and colleagues, combinations with feminine nouns produced the shortest reaction times. However, the observation of increased processing effort for masculine nouns (cf. Opitz & Pechmann, 2016; Opitz et al., 2013) was not confirmed. Congruent conditions were processed faster than incongruent conditions in both experiments. Incongruent combinations with der and er were processed faster than incongruent combinations with die and sie; incongruent combinations with das and es resulted in the longest RTs.

As explicated above, we are cautious in interpreting the comparison of phrases containing nouns of different gender types as it seems virtually impossible to perfectly parallel these different nouns. Furthermore, the comparison of congruent vs. incongruent trials might reflect different processes involved in one but not the other (e.g., some kind of memorized visual picture which is recognized in the correct but not in the incorrect condition, e.g., Deutsch & Bentin, 2001). Therefore, the discussion focusses on the comparisons of the agreement violation conditions within each gender type, thus avoiding (a) the comparison of nouns of different gender types, and (b) possible different processing strategies related to the processing of congruent vs. incongruent phrases.

Results of the pairwise comparisons of RTs in the agreement violation conditions and of the t-tests conducted before are summarised in Table 10.

Table 10 Summary of results of RTs in the agreement violation conditions in Experiments I and II

General Discussion

The present study aimed at testing hypotheses on the representation of grammatical gender within the mental lexicon. While prevailing psycholinguistic models of language processing assume categorial gender representation with one separate node for each gender specification (e.g., the traditional discrete two step model), more recently some authors have argued that mental representation and processing of grammatical gender parallels accounts of decomposition and underspecification put forward in theoretical linguistics. That is, gender specification in the mental lexicon may be based on feature representations instead of categorial gender nodes.

Against this background, two experiments were conducted with German speakers who had to decide on gender (dis-)agreement for visually presented combinations of I) definite determiners and nouns and II) anaphoric personal pronouns and nouns in an implicit nominative singular experimental setting. Each noun was combined with each determiner or pronoun, resulting in one agreement condition and two agreement violation conditions per noun.

Overall analyses showed fastest RTs for feminine nouns but no processing disadvantage for masculine nouns (or masculine determiners / pronouns) as predicted by Opitz et al. (2013) and Opitz and Pechmann (2016). Congruent trials were processed faster than incongruent trials. However, as nouns of the three gender types were not fully parallelised (and cannot be perfectly parallelised, after all) and correct vs. incorrect trials might evoke different cognitive processes, we focussed on the comparison of ERs and RTs in the agreement violation conditions separately per gender specification. Thus, the same nouns were compared within the same condition (gender disagreement) but with different incongruent determiners and pronouns, respectively. Overall, violations with neuter das / es yielded more processing effort than combinations with die / sie or der / er, while no general difference was found for combinations with der / er compared to die / sie.

Categorial Versus Feature-Based Mental Representation of Gender

Differences between the two agreement violation conditions per gender specification as found in the present experiments clearly pose a challenge to models assuming categorial gender representation because three (in German) equivalent gender nodes should result in similar RTs or ERs irrespective of the type of agreement violation. Two objections can be raised, however.

First, language specific frequency differences between the gender specifications or the corresponding determiners and pronouns might account for (part of) the results. In fact, according to different counts of type and token frequency (e.g., Baayen et al., 2003; Wegera, 1997; Hoberg, 1999), neuter words are less frequent than masculine and feminine words in German. In consequence, as explicated above, neuter das is less frequent than der and die. However, a purely frequency-based explanation of our results is contradicted by the fact that the neuter pronoun es is not less frequent than er and sie. Nonetheless, es produced longer RTs in agreement violation conditions compared to er and sie in Experiment II. It is, thus, argued that frequency differences alone do not account for the observed behavioural differences between gender specifications.

Second, nouns differ regarding semantic and formal characteristics, which may serve as cues to their gender and might, in consequence, make one of the competing wrong gender specifications more probable than the other. For example, for German one-syllable nouns, Köpcke (1982) has described 24 phonological regularities for gender specification. Eleven of these rules only exclude one gender specification, but do not allow differentiating between the other two. According to Köpcke’s regularities, mostly, no differentiation between masculine and neuter is possible. This may be associated with more formal similarity between masculine and neuter nouns as compared to feminine nouns. The present study is the first one to meet such objections by taking individual semantic and formal quotients into consideration which are supposed to capture noun-inherent semantic and formal characteristics associated with one or the other gender specification. Indeed, these quotients did significantly influence RTs, but still left additional RT differences between different gender violation conditions. Thus, the results speak in favour of representations of grammatical gender in the mental lexicon that are more complex than categorial gender representation.

Specific Accounts for Feature-Based Gender Representation and Processing

So far, three specific suggestions have been made regarding feature-based representation and processing of gender information.

According to Penke et al. (2004), dependent words are underspecified regarding gender with only positive features being part of the representation and neuter being the unmarked gender (masc [+ m], fem [+ f], neut []). Processing costs are supposed to result from the deviation of positive features of the dependent words that are missing in the context.

Opitz and colleagues (Opitz & Pechmann, 2016; Opitz et al., 2013) suggest a different kind of underspecification of the dependent words with feminine instead of neuter being the default gender (masc [+ m, − f], fem [], neut [− f]). While Opitz et al. (2013) interpret processing costs in terms of the differentiation of violation of compatibility vs. specificity, according to Opitz and Pechmann (2016) processing effort is directly related to the number of gender features involved.

These different suggestions lead to different predictions regarding error rates and reaction times in the experiments in this study (cf. Table 3). The results obtained contradict the predictions resulting from the accounts put forward by Opitz and colleagues. No evidence was found for the kind of underspecification they suggested. Instead, the results are consistent with the account of Penke et al. (2004).

On the basis of the present study, we consider the specific accounts of Opitz and colleagues as improbable. It does not follow, however, that the account of Penke and colleagues is the only possible explanation. As has been explicated in the introduction, their own study is not without limitations. Furthermore, feature-based processing comprises several variables like the type of underspecification, the specific kind of computation of processing costs, but also the question of context setting. A multitude of combinations of different specifications of these variables are possible, and the combination suggested by Penke et al. (2004) might only be one of those consistent with the results of our study. Furthermore, it has to be noted that our experimental setting represents a considerable simplification compared to natural language contexts, as only two-word phrases were presented, and a nominative singular context was implicitly induced. In natural language contexts, case and number add to the number and kind of features involved, and the linguistic contexts might be more or less explicit regarding their specification. Additionally, the situation is expected to be different in language systems other than German, which, for example, might contain only two or more than three gender types. So, the experimental paradigm used in the present study can only be a first step towards a comprehensive picture of gender representations in the mental lexicon.

Conclusions

Altogether, the results of the experiments presented here call into question the assumption of categorial gender representation in the mental lexicon. Instead, they support the notion of feature-based mental representation and gender agreement processing. Future experiments will have to either corroborate the specific account of Penke et al. (2004) or bring another specific account into the discussion, and they should broaden the experimental setting in order to include further aspects of natural language processing.