Why do He and She Disagree: The Role of Binary Morphological Features in Grammatical Gender Agreement in German

In many languages, grammatical gender is an inherent property of nouns and, as such, forms a basis for agreement relations between nouns and their dependent elements (e.g., adjectives, determiners). Mental gender representation is traditionally assumed to be categorial, with categorial gender nodes corresponding to the given gender specifications in a certain language (e.g., [masculine], [feminine], [neuter] in German). In alternative models, inspired by accounts put forward in theoretical linguistics, it has been argued that mental gender representations consist of sets of binary features which might be fully specified (e.g., masc [+ m, − f], fem [− m, + f], neut [− m, − f]) or underspecified (e.g., masc [+ m], fem [+ f], neut [] or masc [+ m, − f], fem [], neut [− f]). We have conducted two experiments to test these controversial accounts. Native speakers of German were asked to decide on the (un-)grammaticality of gender agreement of visually presented combinations of I) definite determiners and nouns, and II) anaphoric personal pronouns and nouns in an implicit nominative singular setting. Overall, agreement violations with neuter das / es increased processing costs compared to violations with die / sie or der / er for masculine or feminine target nouns, respectively. The observed pattern poses a challenge for models involving categorial gender representation. Rather, it is consistent with feature-based representations of grammatical gender in the mental lexicon.

According to morphological underspecification accounts, grammatical elements may not necessarily be fully specified for morphological properties, but may lack features, even though every incidence of this element in a syntactic context is specified for these features (Lehmann, n.d.). This might result in an underspecified feature representation within the mental lexicon as has been suggested by Opitz and Pechmann (2016;cf. Figure 1c).

Empirical Evidence for Decomposed Gender Representation in the Mental Lexicon
First evidence for decomposed gender representation in the mental lexicon was put forward by Clahsen et al. (2001) who conducted a visual lexical decision experiment involving inflected adjectives. They found that processing of adjectives with a very specific affix (− m with two positive features: [+ obj, + obl]) resulted in longer reaction times (RTs) compared to processing of adjectives with a less specific affix (− s with no positive feature). Furthermore, they conducted a cross-modal priming experiment with auditory primes and visual stimuli. There, priming effects were smaller for specific affixes whose morphosyntactic features were primed incompletely compared to less specific affixes whose morphosyntactic features were fully primed. However, as Opitz et al. (2013) emphasise, it cannot be ruled out that the results were confounded by phonological features. As they argue, e.g., the affix -e was fully primed by the affix − (e)s, but this was not the case in the opposite condition (stimulus: − (e)s, prime: − e). Janssen and Penke (2002) analysed errors of person and number agreement between subjects and verbs in sentence completion and elicitation tasks with German agrammatic patients. They found, amongst others, that in most cases, marked features were replaced by unmarked features. Fig. 1 Gender representation within the discrete two step speech production model a with categorial gender representation (e.g., Jescheniak & Levelt, 1994), b with decomposed gender representation (e.g., Penke et al., 2004;Opitz et al., 2013), and c with decomposed and underspecified gender representation (Opitz & Pechmann, 2016; model adapted from Jescheniak &Levelt, 1994, p. 826 andPechmann, 2016, p. 236) Penke et al. (2004) compared reading times for correct and incorrect sentences in a sentence-matching test. Sentences included prepositional adjective (Experiment I) and determiner (Experiment II) noun phrases with matching or non-matching inflectional markers. Typically, in such a test, incorrect sentences induce longer RTs than correct sentences. This grammaticality effect, however, emerged only in sentences where positive features of an inflected form were missing or negatively specified in the syntactic context. It did not emerge when negative or missing features of an inflected form co-occurred with positively specified features in the syntactic context. Penke and colleagues interpret these results as indication of a relevant distinction between the two principles of compatibility and specificity. That is, an underspecified morphosyntactic element is chosen for a given context when it is (a) compatible with this context and (b) the most specific of all elements fulfilling precondition (a) (e.g., Opitz et al., 2013). Violations of compatibility are supposed to be more serious than violations of specificity and, thus, may result in different processing costs. Furthermore, Penke and colleagues argue that positive features are part of the representation of morphologically complex words or affixes while negative features are applied on the basis of the paradigmatic context. Therefore, only positive features can disagree with the syntactic context and, thus, decelerate RTs. Results, however, are to be interpreted with caution since in some sentences the mismatch of the inflectional markers became apparent already with the combination of preposition and adjective while in others it turned up first with the combination of adjective and noun. This might explain some RT-differences irrespective of the occurrence of positively marked features in the context. Opitz et al. (2013) asked German subjects to rate the grammaticality of visually presented prepositional accusative adjective-noun phrases. Each noun was combined with each adjective three times, containing a masculine, a feminine, or a neuter accusative marker, respectively. Overall, phrases with masculine nouns were more error-prone than phrases with feminine or neuter nouns. Additionally, event related potentials (ERPs) were analysed at the time of noun presentation. In all incorrect conditions, a P600 occurred, i.e., a positive deflection at 600-900 ms after presentation of the critical stimulus. The P600 is associated with syntactic processing difficulties or the need of a reanalysis (e.g., Frisch et al., 2002;Gouvea et al., 2009). Within the experiment, it indicated different processing of correct and incorrect phrases. Additionally, a left anterior negativity (LAN) was observed 300-550 ms after presentation of the noun. It is associated with the identification of a morphosyntactic error. In phrases with masculine and feminine nouns, there were no LAN-differences between the two incorrect conditions. Furthermore, in phrases with masculine nouns, there was no difference between incorrect and correct conditions. However, in phrases with neuter nouns, the LAN was larger in combinations with masculine adjectives-corresponding to a violation of compatibility-compared to feminine adjectivescorresponding to a violation of specificity. The authors interpret their results in terms of a feature-based account assuming maximal underspecification of features and a generally increased processing effort for masculine nouns (for a suggested explanation see the next paragraph).
The study of Opitz et al. (2013) was complemented by an experimental series of Opitz and Pechmann (2016). Experiment I was a replication of the experiment described above, but this time RTs served as dependent variable. In Experiment IIa, participants decided whether visually presented nouns were masculine or feminine. In Experiment IIb, participants decided whether visually presented nouns conformed to the gender specification of a given block of words. In Experiment III, participants decided whether a visually presented word was a noun or not; verbs and adjectives served as fillers. Overall, masculine nouns induced more errors and longer RTs compared to feminine nouns across all experiments, 1 3 while neuter nouns were taking a middle position. According to Opitz and Pechmann (2016), these results reflect differential processing efforts for representatives of the different gender specifications resulting from a different number of connections to gender feature nodes. Processing effort would be directly related to the number of gender features to be activated and retrieved. Opitz and Pechmann state that their experimental results speak in favour of least specified feminine nouns and most specified masculine nouns. Thus, feminine nouns would be the default gender specification with no connection to any feature node; neuter nouns would be connected to one gender feature ([− f]), and masculine nouns to two features ([− f] and [+ m]; see Fig. 1c).
All in all, Opitz and Pechmann question not only a categorial gender representation in favour of feature-based representations but also the restriction of underspecified gender representations to the domain of inflectional markers. Instead, they suggest that underspecification "is more broadly used in the mental lexicon and extends to the feature specification of nouns" (Opitz & Pechmann, 2016, p. 235). While it is conceivable that gender representation in the mental lexicon is based on features, the claim that underspecification of gender extends to nouns can, in our view, be called into question for empirical as well as theoretical reasons: First, Opitz and Pechmann's (2016) assumption of underspecified nouns relies heavily on the observation of longer RTs for masculine compared to feminine nouns in a series of experiments. However, this observation may be an artefact of confounding variables of the specific stimuli chosen. Even though nouns were controlled for frequency and length and phrases for plausibility and familiarity, the stimulus words differed critically regarding several formal characteristics. Particularly, 33 of 60 feminine nouns had a schwa-ending strongly associated with feminine gender (e.g., Wegener, 1995). No such clear formal cues existed for the masculine or the neuter nouns. In addition, due to their productivity and frequency, feminine morphological gender indicators were potentially easier to recognise within the experiments than masculine and neuter morphological markers of the nouns. Furthermore, 51 of 60 feminine nouns, but only 29 of 60 masculine and 28 of 60 neuter nouns had their stress on the first syllable which is the prototypical stress pattern of German nouns and, therefore, might also be associated with shorter phonological processing times (e.g., Sulpizio et al., 2015). Taken together, confounding variables such as cue validity or word stress-or others-might account for the experimental results (i.e., the masculine disadvantage) regarded as critical by Opitz and Pechmann (2016) without drawing on underspecified feature-based gender representations of nouns in the mental lexicon.
Second, for theoretical reasons it seems as if the assumption of underspecification of gender features in both the context and the dependent words undermines the whole idea of decomposition and underspecification. Underspecification accommodates for the fact that the same morphological markers appear in different syntactic contexts. For example, the German determiner dem is only specified for [− f] and can, thus, be applied to masculine or neuter contexts which are thought to be characterised as [+ m, − f] and [− m, − f], respectively. With underspecification of both the dependent word and the noun, not only the dependent word but also the noun can be combined with any other dependent word whose specification does not contradict its own specification. In consequence, the mapping direction seems to become somewhat arbitrary. Thus, if a noun like Klima neut "climate" was only specified as [− f], it should be possible to combine this noun not only with other neuter [− f] but also with masculine [+ m, − f] depending words neither of which disagrees with the [− f] specification of Klima. This is clearly not the case. For example, the incongruent combination der masc Klima neut would be identified as erroneous by a contextual specification including a fully specified noun (*der masc [+ m, Thus, empirical as well as theoretical considerations result in scepticism towards the assumption of underspecified feature-based gender representations of nouns. Therefore, this assumption will not be considered anymore hereinafter. Opitz and Pechmann's (2016) assumption, that the total number of features is relevant for processing costs, however, seems plausible and is taken up in the hypotheses.

The Present Study
In summary, there are two fundamentally different types of approaches to grammatical gender representation in the mental lexicon. On the one hand, predominant psycholinguistic models postulate categorial gender nodes. On the other hand, recent experimental results suggest that feature decomposition as discussed in theoretical linguistics is also the basis of mental representations and processing of morphological properties, including grammatical gender. Within such decomposition approaches, full specification of features as well as different kinds of underspecification are being discussed. Also, different suggestions have been made regarding the kind of feature processing which leads to specific response patterns in RTs, ERs, or electrophysiological potentials (e.g., relevance of all features involved vs. relevance of positive features of the dependent word only vs. relevance of feature compatibility and specificity).
The present study aimed at testing hypotheses resulting from these different suggestions. To this end, two experiments on gender agreement in visually presented combinations of definite determiners and nouns (Experiment I) as well as anaphoric personal pronouns and nouns (Experiment II) were conducted with German speaking participants. Every noun was combined with each of the three possible definite determiners or personal anaphoric pronouns in nominative case. Error rates (ERs) and reaction times (RTs) were compared for the incorrect combinations of the nouns with their two incongruent determiners or pronouns (agreement violations). Comparisons of ERs and RTs in the (congruent) agreement conditions across the three gender specifications were not considered in detail. This was to accommodate for the fact that different nouns differ not only regarding lexicalsemantic characteristics like abstractness, frequency, or length, but also regarding genderspecific factors like availability and reliability of semantic, morphological, or phonological gender indicators (e.g., Köpcke, 1982; see also the General Discussion). It seems virtually impossible to fully parallel nouns across gender specifications for all possible confounding variables.
Based on the explanatory approaches described above, the following hypotheses regarding the comparison of the incongruent conditions could be deduced: Categorial Representation of Grammatical Gender (Classical Discrete Two-Step Model, e.g., Jescheniak & Levelt, 1994) With the assumption of categorial gender representation, ERs and RTs should be similar across the two possible agreement violation conditions for each gender specification (cf. Table 2). However, differences could potentially arise from (a) differences of frequency of the three gender specifications or word forms of determiners / pronouns, (b) formal differences between the determiners / pronouns (e.g., different number of phonemes or graphemes, different degrees of phonemic / graphemic similarity), or (c) differential compatibility of formal or semantic gender markers of the noun and the incongruent determiner / pronoun. Thus, for example, 64% of the German one-syllable words are masculine, 22% neuter, and 14% feminine. In the light of this distribution, the presentation of a neuter one-syllable word like Haus "house" might result in a faster rejection of die / sie compared to der / er simply because die / sie is less frequently associated with a one-syllable noun than der / er. Furthermore, even in a nominative context, rejection of feminine nouns with das might be faster than rejection of feminine nouns with der as der can also appear in feminine genitive singular contexts, while das is not represented in any cell of the feminine paradigm (cf. Table 1). The personal pronouns es and er, however, are unique to masculine and neuter nominative singular contexts, respectively, and should therefore produce comparable RTs for combinations with feminine nouns, based on their validity (but ignoring their frequency of use).

Feature-Based Representations of Grammatical Gender
Regarding feature-based representations of grammatical gender, different explanation attempts have been made regarding (a) full specification or underspecification, and (b) relevant processing aspects like number of features involved or violation of compatibility vs. specificity. Even (c) the question may be raised of whether the morphosyntactic context is specified serially from left to right (i.e., in chronological order) or by the noun as the syntactic head of a combination. Combinations of these different dimensions result in a considerable number of different possible predictions regarding expected RT patterns in the violation conditions, which cannot be presented here in detail. Based on previous proposals, hypotheses can be derived as follows (cf. . Grammaticality effects are only observed when a positive feature of the dependent word is missing or negatively specified in the context. In a task which involves the detection of morphosyntactic violations, thus, in combinations with masculine and feminine nouns, die / sie and der / er respectively should induce less errors and shorter RTs than das / es (e.g., die Similar RTs are to be expected for der / er and die / sie in combinations with neuter nouns (e.g.,der [ According to Opitz et al. (2013, p. 246 and 254), differences in processing costs result from the degree of feature agreement violation between partly underspecified Masculine der / er Mantel,"coat" *die / *sie Mantel *das / *es Mantel die = das / sie = es Feminine *der / *er Party,"party " die / sie Party *das / *es Party der = das / er = es Neuter *der / *er Klima,"climate " *die / *sie Klima das / es Klima der = die / er = sie Table 3 Predictions for comparison of the two agreement violation conditions per gender specification based on the assumptions put forward by Penke et al. (2004) Finally, Opitz and Pechmann (2016) argue that the absolute number of features involved in processing critically affects RTs. Furthermore, they postulate underspecified feature representations not only for the dependent word but also for the noun. While we do not agree with this latter assumption for the reasons discussed above, the other assumptions seem plausible. Thus, combinations of masculine nouns with die / sie would induce less errors and faster RTs than combinations of masculine nouns with das / es (e.g., In combinations with feminine as well as neuter nouns, der/er would result in more errors and longer RTs compared to das / es and die / sie, Yet, even if none of these accounts may be entirely correct, principally, differences in RTs in the violation conditions speak in favour of some kind of decomposition of gender representation in the mental lexicon.

Experiment I: Gender Agreement Decision for Determiner Noun Phrases
In order to test the predictions presented above, in a first experiment, ERs and RTs were measured for decisions on gender agreement between definite determiners and nouns.

Materials and Procedure
One hundred and twenty morphologically simple German nouns served as stimuli for this experiment, 40 for each gender specification (masculine, feminine, neuter). Word length measured in number of syllables and graphemes as well as type frequency and lemma frequency according to dlex 1 were matched across gender specifications. Target nouns are listed in Appendix Table 11. As Experiment I was embedded in an experiment on compound processing, 60 compound nouns served as fillers. During the experiment, each noun appeared three times-once with each of the three definite determiners (e.g., das Kleid "dress", neut.-*die Kleid-*der Kleid)-in randomised order.
Stimuli were displayed visually in the centre of a computer screen using the DMDX software (http:// www.u. arizo na. edu/ ~kfors ter/ dmdx/ dmdx. htm; cf. Forster & Forster, 2003). First, a fixation cross appeared for 500 ms. It was followed by the determinernoun phrase whose parts (determiner and noun) were presented simultaneously, horizontally aligned in left-to-right order (i.e. in the default order of German noun phrases). Participants were instructed to decide as fast and accurately as possible on gender agreement of determiner and noun by pressing the corresponding button (YES or NO). YES answers were assigned to the participant's dominant hand. Stimulus presentation was terminated by the participant's response or automatically after 3000 ms. Subsequently, a new trial started automatically.
The experimental testing was preceded by 18 practice trials in order to familiarise participants with the task. Afterwards, all 540 nominal phrases (target nouns and fillers) were presented, with pauses at an interval of 60 trials. Overall, the experiment took 30-40 min. Correctness of the answers and RTs were recorded with DMDX.

Participants
Thirty native speakers of German took part in the experiment. With one exception, all of them were students at the University of Erfurt. 23 of them were female, seven male. Mean age was 22.6 years (range: 18-41). None of the participants was diagnosed with dyslexia. Four of them were left-handed. Participants were paid for their participation.

Results
Incorrect responses, responses lasting longer than 3000 ms, and responses exceeding 2.5 standard deviations of a participant's individual RT mean (calculated separately for YES and NO answers) were counted as errors. Across the 360 experimental target stimuli, no participant exceeded an ER of 11% (mean ER: 6.2%, range: 3.3-10.6%). The ERs in the agreement violation conditions were compared with paired t-tests across participants and items.

Reaction Times
Results of the RT analyses are summarised in Fig. 3. Responses classified as errors (see above) were excluded from these analyses.
The RTs in the agreement violation conditions were compared with paired t-tests across participants and items.

Interim Discussion
In Experiment I, participants had to decide on gender agreement in visually presented determiner noun phrases. ERs and RTs were compared for the two agreement violation conditions of each target gender specification. It turned out that incongruent combinations with das resulted in higher ERs and RTs compared to incongruent combinations with die (and der). For neuter target nouns, incongruent combinations with der caused higher ERs but no RT differences compared to incongruent combinations with die. Different difficulties of detecting violations with one versus the other wrong determiner are unpredicted by categorial accounts, thus their explanation would need additional assumptions (e.g., based on frequency of use). While different difficulties of detecting violations with different determiners are, in principle, predicted by feature based accounts, the specific patterns observed did not fully agree with any of the predictions deduced from the previous studies.
However, it cannot be excluded that RTs in the present experiment were influenced by inherent characteristics of the determiners. For example, der [deːɐ̯ ] and das [das] consist of three phonemes each while die [diː] consists of only two. Furthermore, frequencies of the definite determiners differ. According to dlex (www. dlexdb. de), type frequency of das (absolute number of occurrences: 677,120) is lower than type frequency of der (3,026,098) and die (2,510,938).
For this reason, a second experiment was conducted, using the personal anaphoric pronouns er [eːɐ̯ ], sie [ziː], and es [ɛs] which all consist of two phonemes and are more similar regarding their type frequencies according to dlex (er: 604,723, sie: 607,179, es: 548,281).

Experiment II: Gender Agreement Decision for Anaphoric Personal Pronouns and Nouns
Experiment II was a conceptual replication of Experiment I, differing in a) the function words used (personal anaphoric pronouns instead of definite determiners), b) the kind of stimulus presentation, and c) the participants.

Materials and Procedure
The procedure was analogous to Experiment I. This time, however, the anaphoric personal pronouns er masc , sie fem , and es neut were used as function words instead of definite determiners. Furthermore, presentation was not simultaneously in a left-to-right order. Instead, the pronoun was presented in the centre of the screen. After 600 ms, the noun appeared just below the pronoun. This was to accommodate for the fact that pronouns and nouns do not form immediate constituents of a single phrase in natural speech. Rather, they are connected by a paradigmatic relation. Presenting the pronoun first aimed at building up the expectation of a particular gender specification that could then be compared to the gender specification of the noun presented afterwards (instead of the other way round). Again, the participants had to decide on the agreement of gender specifications of pronoun and noun within a given pair.
As no compound filler stimuli had to be inserted in this experiment, the number of experimental stimuli could be increased with no extra effort for the participants. Thus, 180 morphologically simple nouns were used, 60 for each gender specification. One hundred and eighteen of them were taken from Experiment I (cf. Appendix Table 12 for a list of the stimuli). Each noun was presented with each of the three pronouns, thus calling for one YES answer (agreement) and two NO answers (agreement violations) per noun.
Testing was preceded by six practice trials in order to familiarise the participants with the task. The experiment consisted of 540 trials with pauses at an interval of 60 trials. Altogether, the experiment lasted approximately 30 min. RTs and correctness of the answers were recorded with DMDX.

Participants
Thirty-seven participants took part in Experiment II. None of them had taken part in Experiment I. Data of six participants had to be excluded from analysis due to bilingualism (n = 1), a technical error during data registration (n = 1), and ERs exceeding 10% (n = 4). 2 Of the remaining 31 participants, 27 were female and four male. Their mean age was 22.6 years (range: 19-30 years). All participants included were native speakers of German. Two were left-handed. None of them was diagnosed with dyslexia. They were paid for their participation.

Results
Error Rates Results of the ER analyses are summarised in Fig. 4. Agreement violation conditions are most relevant with respect to the different hypotheses.
The ERs in the agreement violation conditions were compared with paired t-tests across participants and items. Mean error rates (%) +/-standard error

Reaction Times
Results of the RT analyses are summarised in Fig. 5. Erroneous responses were excluded from these analyses. Again, agreement violation conditions are most relevant with respect to the different hypotheses.
The RTs in the agreement violation conditions were compared with paired t-tests across participants and items.

Interim Discussion
On the lines of Experiment I, in Experiment II errors and RTs were compared for gender agreement decisions on visually presented combinations of anaphoric personal pronouns and nouns. With only slight deviations, results of Experiment II equalled those of Experiment I and, again, differed from results of former studies. This was despite the fact that the anaphoric personal pronouns are more similar regarding phoneme number and frequency than the definite determiners.
Still, differences a) between individual processing times for the different determiners and pronouns and b) in acceptability of incorrect combinations of determiners or pronouns and nouns as described above might account for the diverging results. For this reason, regression analyses were conducted on the RTs of Experiments I and II including data from two control experiments.

Regression Analyses on Results of Experiments I and II Including Data from Two Control Experiments
Before conducting regression analyses, two control experiments were run.

Control Experiment I: Lexical Decision for Definite Determiners and Personal Pronouns
As explicated above, determiners and pronouns differ regarding lexical features like grapheme and phoneme number, graphemic and phonological similarity, word frequency, frequency of appearance within inflectional paradigms, number of associated nouns and many others. Some of the relevant factors may even be still unknown. As it is impossible to control for all these factors, a lexical decision experiment was run in order to collect data on the processing of the determiners and pronouns used in Experiments I and II and, thus, obtain a measure of 'processing costs', reflecting the combined effect of relevant variables.

Materials
The definite determiners der, die, and das as well as the anaphorical pronouns er, sie, and es served as target stimuli in this control experiment. Nine more German function words served as fillers and 15 non-lexical two-or three-grapheme combinations served as nonwords (stimuli are listed in Appendix Table 13).
Procedure Stimuli were displayed visually with the DMDX software in the centre of a computer screen. First, a fixation cross appeared for 600 ms. It was followed by the target item. Participants were instructed to decide as fast and accurately as possible about the lexical status of the target item by pressing the corresponding button (YES: word or NO: nonword). YES answers were assigned to the participant's dominant hand. Every item was presented ten times, order of presentation being randomised. The testing was preceded by eight practice trials. Afterwards, all 300 experimental trials were run, with pauses at intervals of 100 trials. Overall, the control experiment took about ten minutes. Correctness of the answers and RTs were recorded with the DMDX software.
Participants Thirty-five subjects participated in this control experiment. All of them had also participated in Experiment II. Thirty participants were female and five male. The mean age was 22.5 years (range: 19-30 years). All participants were native speakers of German. None of them was diagnosed with dyslexia. Two were left-handed. Participants were paid for their participation.
Results Incorrect responses, responses lasting longer than 3,000 ms, and responses exceeding 2.5 standard deviations of a participant's individual mean (calculated separately for YES and NO answers) were counted as errors. Across the 300 experimental stimuli, no participant exceeded an ER of 7% (mean ER: 3.8%, range: 2.0-7.0%). Descriptive statistical data for target determiners and pronouns are summarised in Table 4. Lexical decision times gathered in Control Experiment I were later used for the correction of RTs in regression analyses on the results of the main experiments.

Control Experiment II: Semantic and Formal Ratings of Noun Gender
In German, nouns differ regarding their semantic and formal characteristics which influence the probability of a particular gender assignment. For example, Mann "man" is masculine due to biological sex of the referent. 90.4% of the nouns ending with schwa are feminine (Wegener, 1995, p. 76). All nouns with the suffix -chen are neuter. In some cases, there is a discrepancy of semantic and formal characteristics of a word (e.g., das Mädchenneut "the girl"-the formal cue determines gender irrespective of the referent's biological sex). As semantic and formal characteristics might result in different perceived degrees of gender (dis)agreement between a given determiner or pronoun and a given noun, a control experiment was run collecting data to explore the presence and metalinguistic awareness of such characteristics.

Materials
The same nouns were used as in Experiment II (thus also including all 118 stimuli used in Experiment I).
Procedure Written target words were presented in randomised order one below the other in a column of a table. On the right, there were three empty columns. They were labelled as masculine-feminine-neuter, and coloured light blue, red, and green, respectively. Participants were instructed to rate on a scale of 1-10 as to how "masculine", "feminine", and "neuter" they perceived each noun. There was a semantic condition, in which participants were asked to concentrate on the meaning of the words, and there was a formal condition, in which they were asked to concentrate on the words' form or sound when making their decision. Each noun was to be provided with three values (one for each gender specification). Participants were instructed that the individual values for each gender specification were independent from each other, i.e., they did not have to sum up to ten. Semantic and formal ratings had to be carried out by the same participants but separately from each other. Half of the participants did the semantic ratings first, the other half started with the formal ratings. Word order was different in both conditions. Each rating run was preceded by six practice items. Overall, the investigation took 30-60 min.
Participants Thirty-four participants took part in Control Experiment II. All of them had also participated in Experiment II. One subject had to be excluded from analysis due to incomplete completion of the form. Of the remaining 33 participants, 29 were female and four male. Their mean age was 22.3 years (range: 19-30 years).

Results
Altogether, two answers in the formal condition were missing. Thus, 17,820 semantic and 17,818 formal values were collected. Descriptive statistical data for the ratings of formal and semantic gender features of masculine, feminine, and neuter nouns are summarised in Table 5.
Based on the mean values of all participants, a semantic and a formal quotient were calculated for each noun, respectively, dividing the mean gender value corresponding to the correct gender specification of a given noun by the sum of the two gender values not corresponding to its gender specification. Thus, for example, the semantic values for Hibiskus masc "hibiscus" were masculine = 2.00, feminine = 4.76, neuter = 6.12. This resulted in a semantic quotient for Hibiskus of 2.00: (4.76 + 6.12) = 0.18.
Semantic and formal quotients are supposed to indicate how much semantic and formal cues to gender are coded in a word, potentially resulting in an easier or more difficult decision on gender agreement. As can be seen in Table 5, the (correct) target gender usually yielded the highest ratings, both semantically and formally, but did not approach ceiling. In semantic ratings, feminine nouns had the highest feminine rating of the three noun types, but their own highest rating was neuter. In masculine nouns, masculine and neuter semantic ratings had similar values. This might be due to the fact that many words have no clear association with natural sex but are just 'neuter'. Apart from that, alternative (incorrect) genders yielded lower ratings still substantially above bottom. Semantic and formal quotients were later included as predictor variables in regression analyses on the results of the main experiments.

Regression Analyses
Taking into consideration the results of the two control experiments, linear mixed effects regression analyses were conducted on RTs of correct trials in Experiments I and II (including the congruent as well as the incongruent determiner-noun phrases and pronoun-noun phrases) using the lme4 package (Bates et al., 2015) in R (R Core Team, 2018). Lexical decision times from Control Experiment I were subtracted from RTs in Experiments I and II in order to accommodate for the fact that determiners and pronouns differ regarding a set of factors which influence processing but cannot be fully controlled for. Corrected RTs, thus, are thought to reflect the time needed for processing of gender specification (and other higher-level representations) devoid of word processing costs. They served as dependent variable. Most relevant potential predictor variables were Gender type of the noun (masculine / feminine / neuter) and Determiner (der / die / das) or Pronoun (er / sie / es), respectively. Other potential predictor variables included as fixed factors comprised Word length (number of graphemes), Word frequency (log10 lemma frequency according to dlex), Semantic quotient and Formal quotient (both yielded from Control Experiment II), Repetition (first, second, or third presentation of a given noun within the experiment), Position (consecutive number of a given item within the experiment), and Interaction of Frequency and Repetition. Participant was included as random factor.

Overall Analysis on Determiner-Noun Phrases in Experiment I Results of the over-
all analysis on determiner-noun phrases (Experiment I) are summarised in Table 6. All predictor variables except for Semantic quotient influenced RTs. Specifically, RTs were faster for more frequent and shorter nouns and nouns with higher formal quotients. They decreased with increasing number of repetitions of a given noun and later posi- tions within the experiment. Moreover, the influence of word frequency decreased with increasing number of repetitions of a given noun within the experiment. Overall, feminine nouns were processed faster than masculine nouns, and masculine nouns faster than neuter nouns. Combinations with der were processed faster than combinations with die, and combinations with die faster than combinations with das.

Analysis of Agreement-Violation Conditions in Experiment I Additionally, pairwise
comparisons among levels of factors were conducted with the emmeans-function in R (Lenth, 2022; cf. Table 7). They yielded faster RTs for congruent compared to incongruent combinations for each gender. Crucially, in the agreement-violation conditions, combinations with die and der were processed faster than combinations with das with masculine and feminine nouns, respectively. There was no significant difference in the RTs for combinations of die and der with neuter nouns. Overall Analysis on Pronoun-Noun Phrases in Experiment II Results of the overall analysis on pronoun-noun phrases (Experiment II) are summarised in Table 8. All predictor variables significantly influenced RTs. Specifically, RTs were faster for more frequent and shorter nouns and nouns with higher formal and semantic quotients. They decreased with increasing number of repetitions of a given noun and later positions within the experiment. Moreover, the influence of word frequency decreased with increasing number of repetitions of a given noun within the experiment. Again, overall feminine nouns were processed faster than masculine nouns, and masculine nouns faster than neuter nouns. Combinations with er were processed faster than combinations with sie, and combinations with sie faster than combinations with es.

Analysis of Agreement-Violation Conditions in Experiment II
As in Experiment I, pairwise comparisons (cf. Table 9) yielded faster RTs for congruent compared to incongruent combinations for each gender. Crucially, in the incongruent conditions, combinations with sie and er were processed faster than combinations with es with masculine and feminine nouns, respectively. There was no significant difference in the RTs for combinations of sie and er with neuter nouns.

Summary of Results
Overall, as in the studies of Opitz and colleagues, combinations with feminine nouns produced lowest reaction times. However, the observation of increased processing effort for masculine nouns (cf. Opitz & Pechmann, 2016;Opitz et al., 2013) was not confirmed. Congruent conditions were processed faster than incongruent conditions in both experiments. Incongruent combinations with der and er were processed faster than incongruent combinations with die and sie; incongruent combinations with das and es resulted in longest RTs.
As explicated above, we are cautious in interpreting the comparison of phrases containing nouns of different gender types as it seems virtually impossible to perfectly parallel these different nouns. Furthermore, the comparison of congruent vs. incongruent trials might reflect different processes involved in one but not the other (e.g., some kind of memorized visual picture which is recognized in the correct but not in the incorrect condition, e.g., Deutsch & Bentin, 2001). Therefore, the discussion focusses on the comparisons of the agreement violation conditions within each gender type, thus avoiding (a) the comparison of nouns of different gender types, and (b) possible different processing strategies related to the processing of congruent vs. incongruent phrases. Results of the pairwise comparisons of RTs in the agreement violation conditions and of the t-tests conducted before are summarised in Table 10.

General Discussion
The present study aimed at testing hypotheses on the representation of grammatical gender within the mental lexicon. While prevailing psycholinguistic models of language processing assume categorial gender representation with one separate node for each gender specification (e.g., the traditional discrete two step model), more recently some authors have argued that mental representation and processing of grammatical gender parallels accounts of decomposition and underspecification put forward in theoretical linguistics. That is, gender specification in the mental lexicon may be based on feature representations instead of categorial gender nodes.
Against this background, two experiments were conducted with German speakers who had to decide on gender (dis-)agreement for visually presented combinations of I) definite determiners and nouns and II) anaphoric personal pronouns and nouns in an implicit nominative singular experimental setting. Each noun was combined with each determiner or pronoun, resulting in one agreement condition and two agreement violation conditions per noun.
Overall analyses showed fastest RTs for feminine nouns but no processing disadvantage for masculine nouns (or masculine determiners / pronouns) as predicted by Opitz et al. (2013) and Opitz and Pechmann (2016). Congruent trials were processed faster than incongruent trials. However, as nouns of the three gender types were not fully parallelised (and cannot be perfectly parallelised, after all) and correct vs. incorrect trials might evoke different cognitive processes, we focussed on the comparison of ERs and RTs in the agreement violation conditions separately per gender specification. Thus, the same nouns were compared within the same condition (gender disagreement) but with different incongruent determiners and pronouns, respectively. Overall, violations with neuter das / es yielded more processing effort than combinations with die / T-tests on ERs die < das** der = das der > die** T-tests on RTs die < das** der < das* der = die Regression analyses on corrected RTs die < das* der < das** der = die Experiment II: pronoun-noun gender agreement T-tests on ERs sie < es** er < es** er > sie** T-tests on RTs sie < es** er < es** er > sie* Regression analyses on corrected RTs sie < es** er < es** er = sie sie or der / er, while no general difference was found for combinations with der / er compared to die / sie.

Categorial Versus Feature-Based Mental Representation of Gender
Differences between the two agreement violation conditions per gender specification as found in the present experiments clearly pose a challenge to models assuming categorial gender representation because three (in German) equivalent gender nodes should result in similar RTs or ERs irrespective of the type of agreement violation. Two objections can be raised, however. First, language specific frequency differences between the gender specifications or the corresponding determiners and pronouns might account for (part of) the results. In fact, according to different counts of type and token frequency (e.g., Baayen et al., 2003;Wegera, 1997;Hoberg, 1999), neuter words are less frequent than masculine and feminine words in German. In consequence, as explicated above, neuter das is less frequent than der and die. However, a purely frequency-based explanation of our results is contradicted by the fact that the neuter pronoun es is not less frequent than er and sie. Nonetheless, es produced longer RTs in agreement violation conditions compared to er and sie in Experiment II. It is, thus, argued that frequency differences alone do not account for the observed behavioural differences between gender specifications.
Second, nouns differ regarding semantic and formal characteristics, which may serve as cues to their gender and might, in consequence, make one of the competing wrong gender specifications more probable than the other. For example, for German one-syllable nouns, Köpcke (1982) has described 24 phonological regularities for gender specification. Eleven of these rules only exclude one gender specification, but do not allow differentiating between the other two. According to Köpcke's regularities, mostly, no differentiation between masculine and neuter is possible. This may be associated with more formal similarity between masculine and neuter nouns as compared to feminine nouns. The present study is the first one to meet such objections by taking individual semantic and formal quotients into consideration which are supposed to capture noun-inherent semantic and formal characteristics associated with one or the other gender specification. Indeed, these quotients did significantly influence RTs, but still left additional RT differences between different gender violation conditions. Thus, the results speak in favour of representations of grammatical gender in the mental lexicon that are more complex than categorial gender representation.

Specific Accounts for Feature-Based Gender Representation and Processing
So far, three specific suggestions have been made regarding feature-based representation and processing of gender information.
According to Penke et al. (2004) Opitz et al. (2013) interpret processing costs in terms of the differentiation of violation of compatibility vs. specificity, according to Opitz and Pechmann (2016) processing effort is directly related to the number of gender features involved. These different suggestions lead to different predictions regarding error rates and reactions times in the experiments in this study (cf. Table 3). The results obtained contradict the predictions resulting from the accounts put forward by Opitz and colleagues. No evidence was found for the kind of underspecification they suggested. Instead, the results are consistent with the account of Penke et al. (2004).
On the basis of the present study, we consider the specific accounts of Opitz and colleagues as improbable. It does not follow, however, that the account of Penke and colleagues is the only possible explanation. As has been explicated in the introduction, their own study is not without limitations. Furthermore, feature-based processing comprises several variables like the type of underspecification, the specific kind of computation of processing costs, but also the question of context setting. A multitude of combinations of different specifications of these variables are possible, and the combination suggested by Penke et al. (2004) might only be one of those consistent with the results of our study. Furthermore, it has to be noted that our experimental setting represents a considerable simplification compared to natural language contexts, as only two-word phrases were presented, and a nominative singular context was implicitly induced. In natural language contexts, case and number add to the number and kind of features involved, and the linguistic contexts might be more or less explicit regarding their specification. Additionally, the situation is expected to be different in language systems other than German, which, for example, might contain only two or more than three gender types. So, the experimental paradigm used in the present study can only be a first step towards a comprehensive picture of gender representations in the mental lexicon.

Conclusions
Altogether, the results of the experiments presented here call into question the assumption of categorial gender representation in the mental lexicon. Instead, they support the notion of feature-based mental representation and gender agreement processing. Future experiments will have to further support the specific explanation account of Penke et al. (2004) or to bring up another specific account into discussion and broaden the experimental setting in order to include further aspects of natural language processing.