Understanding the cognitive processes involved in visual and auditory word recognition remains a major challenge in cognitive psychology and psycholinguistics. In recent years, researchers have focused on lexical variables affecting the speed and accuracy with which words are processed (e.g., Baayen, Feldman, & Schreuder, 2006; Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Brysbaert, Warriner, & Kuperman, 2014; Ernestus & Cutler, 2015; Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012; New, Ferrand, Pallier, & Brysbaert, 2006; Yap & Balota, 2009). In the present study, we focus on ways in which word processing times converge or diverge in the lexical decision task across two modalities: visual and auditory. Visual and auditory word recognition have rarely been studied together. Indeed, to our knowledge, research on spoken-word recognition and reading has developed independently, with little attempt to draw parallels between the two.Footnote 1 The main goal of the present study was to apply the megastudy approach to compare processing times for a very large number of words presented in visual and auditory lexical decision. In this study, we focus (1) on lexical decision (rather than naming), and (2) on very large databases (with more than 10,000 words tested) (see Table 1).

Table 1 List of word recognition megastudies using the lexical decision task

The megastudy approach

Megastudies are studies in which word processing times are gathered for a large number of words (typically between 10,000 and 40,000 words; see Balota, Yap, Hutchison, & Cortese, 2012; Keuleers & Balota, 2015, for reviews). The findings are typically explored with multiple regression analyses. In the visual modality using the lexical decision task (see Table 1 for a review), the first megastudy was published by Balota et al. (2007).Footnote 2 In the so-called English Lexicon Project (ELP), lexical decision and naming data were collected for 40,481 words from several hundreds of participants (each responding to 3,400 stimuli in lexical decision). Ferrand et al. (2010) published a similarly collected set of lexical decision times for 38,840 French words (the French Lexicon Project; FLP) from 975 participants (each responding to 2,000 stimuli). Yap, Rickard Liow, Jalil, and Faizal (2010) collected lexical decision and naming data for 9,592 Malay words (the Malay Lexicon Project; MLP) from 44 participants (each responding to 1,020 stimuli in lexical decision). Keuleers, Diependaele, and Brysbaert (2010) published a set of lexical decision times for 14,000 Dutch words (the Dutch Lexicon Project; DLP) with a new approach, since 39 participants responded to all stimuli (see also Brysbaert, Stevens, Mandera, & Keuleers, 2016, for the Dutch Lexicon Project 2; DLP2). A similar approach was adopted for the British Lexicon Project (Keuleers, Lacey, Rastle, & Brysbaert, 2012), in which data on 28,000 English words were collected by testing 78 participants, each of them responding to one of the two lists of 14,000 words in lexical decision. Dufau et al. (2011) collected lexical decision times for thousands of words in seven languages (English, Basque, Catalan, Dutch, French, Malay, and Spanish) by using smartphone technology (each participant responding to 50, 100, or 140 stimuli). Finally, Tse et al. (2016) collected lexical decision data for 25,286 Chinese words (the Chinese Lexicon Project; CLP) from 594 participants (each responding to 2,810 stimuli).

In the auditory modality, megastudies are scarce, possibly because generating auditory stimuli entails a great deal more effort than generating visual stimuli. To our knowledge, only one published megastudy has used the auditory lexical decision task, BALDEY (“biggest auditory lexical decision experiment yet”). In this megastudy, Ernestus and Cutler (2015) collected lexical decision times on 2,780 Dutch words from 20 participants (see Table 1).

As was pointed out by Balota et al. (2007; Balota et al., 2012) and Brysbaert et al. (2016; see their Table 1, p. 442), megastudies have a number of important advantages over the more commonly used factorial designs. First, megastudies use multiple regression techniques applied to the data obtained with very large numbers of words, and one is no longer limited to analyzing a small number of words since all words (polysyllabic, polymorphemic, inflected forms such as feminine, plural, and verbal forms) are included in the analysis. This avoids the potential biases that may occur when one selects specific items (see, e.g., Forster, 2000). Second, variables (such as word frequency for instance) can be tested in a continuous way along their entire range. Third, megastudies identify the unique predictive power of a large set of targeted variables. Fourth, megastudies have proven valuable for comparing competing metrics of word frequency (e.g., Brysbaert & New, 2009; Ferrand et al., 2010), evaluating the impact of novel psycholinguistic variables (e.g., Bonin, Méot, Ferrand, & Bugaïska, 2015; Brysbaert et al., 2016; Juhasz & Yap, 2013; Yarkoni, Balota, & Yap, 2008), and exploring potential nonlinear functional relationships between lexical variables and word recognition performance (e.g., Baayen et al., 2006; New et al., 2006).

Because of these advantages, an increasing number of large-scale psycholinguistic studies have recently endorsed the megastudy approach using regression designs (see Table 1; see also Keuleers & Balota, 2015, and Brysbaert et al., 2016, for reviews).

Factors influencing visual and auditory lexical decision performance

As Brysbaert et al. (2016, p. 442) put it, “One of the main goals of megastudies is to examine which variables affect word recognition and what their relative importance is.” A survey of the megastudy literature reveals that word frequency is the most important variable in visual lexical decision (accounting for at least 40% unique variance in ELP: Balota et al., 2007; Brysbaert et al., 2016; 34% in DLP1: Keuleers et al., 2010; 33% in DLP2: Brysbaert et al., 2016; and 38% in FLP: Ferrand et al., 2010).Footnote 3 A second important variable accounting for a decent part of variance in lexical decision times is word length in letters (accounting for 16% of variance in ELP; and 7% in the DLP1 and the FLP).Footnote 4 Lexical decision latencies are generally longer for long words; but New, Ferrand, Pallier, and Brysbaert (2006) reexamined the length effect within the ELP and they found a quadratic effect: the effect of number of letters was facilitatory for words of three to five letters, null for words of five to eight letters, and inhibitory for words of eight to 13 letters. This quadratic effect was replicated in French (Ferrand et al., 2010). A third important variable that matters is the orthographic similarity to other words (OLD 20; Yarkoni et al., 2008). Lexical decisions are faster to words that are more orthographically similar to other words (accounting for 20% of variance in ELP, 5% in DLP and 6.5% in FLP; again, see note 4).

Other lexical variables have also been identified even if they only account for a small (but often significant) percentage of variance (around 1%–2% of additional variance). This is the case of (1) semantic variables (such as imageability, sensory experience ratings, concreteness, number of semantic features, and number of associates, to name a few; e.g., Balota et al., 2004; Bonin et al., 2015; Ferrand et al., 2011; Juhasz & Yap, 2013; Juhasz, Yap, Dicke, Taylor, & Gullick, 2011); (2) morphological variables (such as morphological family size, plural word forms, etc.; e.g., Baayen et al., 2006; Gimenes, Brysbaert, & New, 2016); and (3) affective variables (such as valence and arousal; e.g., Kuperman, Estes, Brysbaert, & Warriner, 2014).

In auditory lexical decision, the only megastudy available (Ernestus & Cutler, 2015) reported that word duration was an important variable and that it was a better predictor of response time than different measures of the uniqueness point (i.e., the first moment at which a word differs from all other words).Footnote 5 This result suggests that listeners tended to wait until they had heard the last phoneme in the word before making their decision. Word frequency was also an important variable, with the Dutch subtitle-based word frequency (Keuleers, Brysbaert, & New, 2010) better accounting for the auditory lexical decision data than other frequency measures (such as CELEX, for instance; Baayen, Piepenbrock, & Gulikers, 1995).

Standard research using the auditory lexical decision task in factorial experiments has mainly focused on the effects of lexical variables such as word frequency and word-form similarity. In a review of the auditory lexical decision task published twenty years ago, Goldinger (1996) mentioned one variable that surely influenced lexical decision performance (word frequency), one that probably affected it (neighborhood density) and one that should be controlled (total stimulus duration). So far, researchers have reported clear frequency effects (e.g., Cleland, Gaskell, Quinlan, & Tamminen, 2006; Connine, Mullennix, Shernoff, & Yelen, 1990; Luce & Pisoni, 1998; Petrova, Gaskell, & Ferrand, 2011; Taft & Hambly, 1986), as well as effects of phonological neighbors that are inhibitory in nature (e.g., Goh, Suárez, Yap, & Tan, 2009; Goldinger, Luce, & Pisoni, 1989; Luce & Pisoni, 1998; Vitevitch & Luce, 1999; Ziegler, Muneaux, & Grainger, 2003), and effects of cohorts (e.g., Marslen-Wilson, 1990; Taft & Hambly, 1986) or uniqueness point (e.g., Goodman & Huttenlocher, 1988).

More recently, Goh, Yap, Lau, Ng, and Tan (2016) collected auditory lexical decision (and semantic categorization) data for 514 nouns from 80 participants. The goal of their study was to determine the unique contribution of semantic richness variables (such as concreteness, number of semantic features, and emotional valence), above and beyond the contribution of lexical variables (such as word duration, frequency, structural properties, and number of morphemes) to auditory word recognition. Their results showed that the lexical variables collectively accounted for 44.5% of the variance in lexical decision RT. There were significant positive relationships between RT, on the one hand, and word duration and number of morphemes, on the other hand, such that words that had longer duration and more morphemes were associated with slower RTs. Semantic richness variables (concreteness, valence, and number of semantic features) collectively accounted for an additional 3% of unique variance in lexical decision RT, with faster responses for spoken words that were concrete, emotionally valenced, and with a high number of semantic features, suggesting that words with richer semantic representations are recognized faster.

The present study

The main goal of the present study was to investigate and compare visual and auditory lexical decision times using the megastudy approach. The data collected in the MEGALEX project can be used to perform different types of analyses. As an example, here we outline the different central questions that we had specifically in mind when we designed MEGALEX: Which word frequency measure best predicts response times in the visual and auditory lexical decision tasks? What is the relative importance of the different lexical variables in both modalities? A specific question is when, during stimulus presentation, participants in the auditory lexical decision task decide whether a stimulus is a word? A more general question concerns the similarities and differences of the different variables involved in the visual and auditory lexical decision task. Finally, one methodological goal was to compare the power of the design used in MEGALEX (in which each participants responded to half of the full set of stimuli) relative to the design used in the French Lexicon Project (in which participants responded to only 2,000 of about 40,000 stimuli).

Experiment

Method

Participants

A total of 197 right-handed participants finished the experiment (96 in the visual lexical decision task [LDT] and 101 in the auditory LDT). Five other participants started the visual or auditory experiment but did not finish it. Participants were students or employees coming from three different universities: University Blaise Pascal (Clermont-Ferrand), University Pierre-Mendès France (Grenoble) and Aix-Marseille University (Marseille). Their age ranged between 17 and 52 years (mean = 25.76, SD = 5.71). All were native speakers of French, and had normal or corrected-to-normal vision. None of them had hearing problems. Each participant participated in either the visual LDT or the auditory LDT. Participants were informed that successful completion of the (visual or auditory) experiment would take about 20 h, for which they would receive a payment of €300. They were also informed that they would be excluded if their accuracy dropped below 80% in three successive blocks (as in Keuleers et al., 2012). They were also informed that if they dropped out the experiment or if their accuracy fell consistently below the 80% benchmark, they would be paid €5 per hour completed.

Ethical statement

This study was approved by both the Clermont-Ferrand Sud-Est VI Statutory Ethics Committee (Comité de Protection des Personnes (CPP) Sud-Est 6, France; Authorization #AU 1058) and the French National Agency for Medicines and Health Products Safety (Agence Nationale de Sécurité du Médicament et des Produits de Santé (ANSM), France; Authorization #2013-A00591-44 and #B-130783-81) according to the articles of law L. 1121-1-2 and R 1121-3.

Apparatus

The experimental software (OpenSesame; Mathôt, Schreij, & Theeuwes, 2012) and testing apparatus were identical at each of the three testing sites (Clermont-Ferrand, Grenoble, and Marseille). All the participants were tested on the same platform.

For the visual LDT, the stimuli were presented on a 22-in. Dell LCD monitor (P2212H) with a refresh rate of 60 MHz and a resolution of 1,920 × 1,080 pixels, placed at a distance of about 60 cm from the participants. The monitor was controlled by a PC (Dell Precision T3600). The stimuli were presented in lowercase in Inconsolata (12-point font size), and they appeared on the screen as dark characters on a grey background. The participants responded on a Logitech Gamepad F310, which is used for superfast computer games and does not have time delays with keyboards (see, e.g., Shimizu, 2002).

For the auditory LDT, the apparatus was identical to the one used in the visual LDT, except that the stimuli were played binaurally over headphones (Sennheiser HD 25-1 II) through a high-quality audio soundcard (Asus Xonar DX PCIe). Timing calibration was performed using the Blackbox Toolkit, version 2 (www.blackboxtoolkit.com/).

Word and pseudoword stimuli

The stimuli consisted of a list of 28,466 words. The words were taken from the French Lexicon Project (Ferrand et al., 2010; see Table 2 for descriptive statistics on the words and their behavioral measures). For each word, a matched pseudoword was generated using the Lexique toolbox (available from www.lexique.org/toolbox/toolbox.pub/index.php?page=trigramme). In particular, we used the tool that creates pseudowords based on randomization of bigrams or trigrams coming from real words. The advantage of this generator is that it creates pseudowords having characteristics very close to the category of words chosen. For instance when one enters all the words having six letters, two syllables, being a noun, the toolbox will generate pseudowords having six letters, two syllables and looking like possible nouns. One can therefore generate words resembling plausible French nouns or plausible French inflected verbs, and so forth. For instance, for three letter words, we entered all our words having three letters and we generated many three letter pseudowords. Then we screened the pseudowords in order to exclude real words and in order to end up with as many three letter pseudowords as we had three letter words.

Table 2 Descriptive statistics of the words and behavioral measures (RT, zRT, and accuracy) involved in item-level regression analyses used in the visual and auditory lexical decision tasks

Visual stimuli

In each session, the participants received 260 words and 260 pseudowords. Each session contained two blocks of 130 words and 130 pseudowords, with a short break between the two blocks. Each session lasted about 20 min. Overall, each participant received 55 sessions (or 109 blocks), corresponding to about 20 h of testing.

Auditory stimuli

They were identical to those used in the visual lexical decision task, except that homophonous forms were suppressed, giving a total amount of 17,876 words and 17,876 pseudowords. The auditory stimuli were synthetized using Apple’s Text-to-Speech (TTS) accessibility feature (TTS’s “say” command; Apple, Cupertino, CA, USA).Footnote 6 For instance, shell scripts containing “say” commands synthetized all the words and pseudowords of the visual experiment at regular speed in a female French voice (called “Audrey”) and produced 22100-Hz stereo wav files (each text stimulus had its own associated wav file). Pairwise auditory signal comparison was then carried on to detect identical wav files (hence homophonous forms). Such forms were then removed from stimulus lists. Finally, leading and trailing noninformative zero values in wav files were removed. In each session, the participants received 178 words and 178 pseudowords. Each session contained two blocks of 89 words and 89 pseudowords, with a short break between the two blocks. Each session lasted about 20 min. Overall, each participant received 50 sessions (or 100 blocks), corresponding to about 20 h of testing.

Creation of lists for participants

Both in the visual and in the auditory experiments, each participant was presented with exactly half of the experimental items. This was accomplished in the following way. For each successive pair of participants, we shuffled the words and pseudowords lists separately, then split each of them in half. The items in the first halves were attributed to the first participant and the items in the second half were attributed to the second participant. The words and pseudowords were then mixed and their order shuffled again for each participant. Finally, the obtained lists were split in blocs of 260 stimuli in the visual modality or 178 stimuli in the auditory modality to create individual experimental runs.

Procedure

Each participant started with a one-hour startup session at one of the three universities and was assigned to the visual or the auditory LDT. During that session, participants received information about the experiment and signed the approval forms. Participants then completed practice session of 40 trials (20 words and 20 pseudowords), allowing us to demonstrate the main features of the experiment. This was immediately followed by a real session of 20 min (Session 1, consisting of two blocks). After this first session, participants were free to enter the different labs during office hours and to go through the experiment at their own pace, using a booking system to reserve time slots. After reservation, participants could sit at any of the seven computers specifically devoted to the study. After entering their registration code in the experiment system, they were presented with their next block of trials. After each completed block, participants could choose whether to continue or to stop the session. The only advice given to the participants was to limit their participation to 2 h per half day. The fastest participant finished the experiment in 15 days’ time, the slowest took 58 days (on average, participants took 32 days, SD = 15.40).

Visual LDT

There were ten practice trials before each experimental block. The participants saw words and pseudowords presented on the center of a screen and they had to indicate as rapidly and as accurately as possible whether the presented letter string was a French word or a pseudoword. The participants responded using response buttons on a Logitech Gamepad F310. They answered “yes” by pressing the button corresponding to the index finger of their right hand and “no” by pressing the button corresponding to the index finger of their left hand.

The sequence of events was as follows (for one trial): (1) a fixation point (“.”) appeared in the center of the screen for 250 ms; (2) a blank screen appeared for 100 ms; (3) a stimulus (word or pseudoword) was presented in the center of the screen; (4) the participant made the response as fast and as accurately as possible; and (5) the stimulus was erased from the screen. The stimulus remained on the screen until a manual response was detected or for 2 s maximum if no response was made. At the end of each trial, there was a 1,000-ms intertrial interval with a blank, gray screen. A feedback was provided during each session, one at the end of each block (with mean correct reaction times and percentage of errors for words).

Auditory LDT

The procedure was similar to the one used in the visual modality. Ten practice trials were presented before each experimental block. The sequence of events was as follows (for one trial): (1) a fixation point (“.”) appeared in the center of the screen for 250 ms, announcing the auditory stimulus; (2) a stimulus (word or pseudoword) was then played binaurally over headphones; and (3) the participant made the response as fast and as accurately as possible. The response was triggered as soon as possible, even if the auditory stimulus was still playing. There was a 2-s timeout if the participant had not given any response. At the end of each trial, a blank, gray screen was shown for a 1,000-ms intertrial interval. Feedback—mean correct reaction times and percentage of errors—was provided at the end of each block.

Results

Trimming procedures

Visual lexical decision

First, 3,939 trials with technical problems were eliminated, which left us with 2,596,095 trials (1, 311,832 words).

As in Ferrand et al. (2010), all response latencies shorter than 300 ms or longer than 2,000 ms were then identified as outliers (resulting in the elimination of 8,658 trials [4,383 featuring words]—i.e., 0.33% of the data). Nearly all participants had a mean accuracy higher than 80%. The data for three participants who did not fulfill these criteria were dropped (73,272 trials, 37,071 words). Eight blocks with accuracies lower than 75% were also eliminated (1,981 trials, 1,009 words). The resulting 2,512,184 trials (1,269,369 words) are freely available on our website (https://sedufau.shinyapps.io/megalex/) and are searchable on www.lexique.org inside the Open Lexique search engine.

For the regression analyses on reaction times (RTs) and accuracy presented below, 2,690 words (119,684 trials) with accuracies below 67% were removed (as in Balota et al., 2004; Ferrand et al., 2010; Ferrand et al., 2011). Moreover, for the remaining correct RTs, the means and SDs were computed within Blocks × Participants, and all RTs more than three SDs above or below this mean were considered as outliers as well. This resulted in the rejection of 1.7% (19,229) of the RTs on correct trials. The mean RTs for the trimmed correct word trials, as well as the mean accuracy rates, are presented in Table 2.

Auditory lexical decision

We used the same procedure for the RTs for correct responses as applied in the visual modality. The elimination of 16,878 trials with technical problems led first to a set of 1,669,585 trials (845,078 words). Response latencies shorter than 600 ms or longer than 2,500 ms were then identified as outliers (resulting in the elimination of 1,084 trials [557 words]—i.e., 0.06% of the data). These threshold values are slightly different from the ones used for the visual modality. These were chosen after inspection of the cumulative frequency of the distribution of RTs for correct responses: 600 ms corresponds to the value for which the slope began to change significantly and for which eliminations seemed not to concern particular items. Because the decrease of the slope was very monotonic at the right of the distribution, the upper bound was fixed at 2,500 ms. One participant (16,116 trials, 8,168 words) with a global accuracy below 80% and 17 blocks (2, 738 trials, 1,396 words) with accuracies lower than 75% were then dropped. The resulting 1,649,647 trials (834,957 words) are freely available on our website.

For the regression analyses, 2,034 words (94,197 trials) with accuracies below 67% were then removed (as in Balota et al., 2004; Ferrand et al., 2010; Ferrand et al., 2011). Moreover, for the remaining correct RTs, all RTs more than three SDs above or below the within Blocks × Participants mean were considered outliers as well. This resulted in the rejection of 1.3% (9, 580) of the RTs on the correct trials. The mean RTs for the trimmed correct word trials, as well as the mean accuracy rates, are presented in Table 2.

The trimming procedure left us with 25,776 words (out of 28,466) in the visual modality and 15,842 words (out of 17,876) in the auditory modality. In both modalities, raw RTs were transformed into standardized z scores per participant and per block to remove effects due to block differences and variability/speed between participants (as recommended by Faust, Balota, Spieler, & Ferraro, 1999; see also Ferrand et al., 2010, for such a procedure). By-items RT and zRT means and accuracies for the independent variables used in the regression analyses are freely available on our website (https://sedufau.shinyapps.io/megalex/) and are searchable at www.lexique.org, inside the Open Lexique search engine. A general description of the online materials can be found in the Appendix.

Reaction times and accuracy in MEGALEX

Figure 1 displays the effects of practice on RTs and accuracies over the total duration of the experiment in both modalities (i.e., 109 blocks in the visual modality and 100 blocks in the auditory modality).

Fig. 1
figure 1

Practice effects in MEGALEX for accuracy (left panels) and reaction times (right panels) for the visual modality (upper panels) and the auditory modality (lower panels). pW: pseudowords; W: words. RTs are given only for correct responses

In the visual modality, the practice effect is about 100 ms different in RTs for words and 120 ms for pseudowords. This is similar to the effect observed in the BLP (Keuleers et al., 2012), which was also 100 ms. In the auditory modality, the practice effect is also around 100-ms difference in RT for words and 140 ms for pseudowords (see Fig. 1).

Reliability of the MEGALEX measures

As in previous megastudies (Ferrand et al., 2010; Keuleers, Diependaele, & Brysbaert, 2010; Tse et al., 2016), we determined the reliability of accuracy and RTs in order to evaluate how useful they might be. For each word, participants were split in two groups each containing half of the participants having seen that word. RT means, zRT means, and accuracies were calculated for each half. Correlations were then calculated between item scores obtained in the first and second halves. Spearman–Brown corrections are also reported as rcorr [2r/(1 + r)]. In this way, the correlation between (1) accuracy calculated on the first half of the participants who saw a given word and (2) accuracy calculated on the remaining participants who saw the same word was .66 (rcorr = .8) in the visual modality and .68 (rcorr = .81) in the auditory modality. For RTs in the visual modality, r = .70 (rcorr = .82) and for zRT, r = .82 (rcorr = .9). For the RTs in the auditory modality, r = .82 (rcorr = .9) and for zRT, r = .89 (rcorr = .94). The fact that the reliability of zRT is higher than the reliability of the raw RTs confirms that taking away differences in overall RT and variability between participants and blocks removes noise from the data and does not artificially reduce the variability of the items.

Predictors of the MEGALEX lexical decision times

We considered the following predictors for the by-items analyses:

Word frequency

We used the following corpora to measure word frequency: books (New, Pallier, Brysbaert, & Ferrand, 2004; New, Pallier, Ferrand, & Matos, 2001); subtitles (New, Brysbaert, Véronis, & Pallier, 2007); and blogs, Twitter and newspapers (Gimenes & New, 2016).

Word length (in letters)

The number of letters in the word.

Word length (in phonemes)

The number of phonemes in the word.

OLD20

The orthographic similarity to other words (Yarkoni et al., 2008), calculated on the basis of the Lexique vocabulary (Ferrand et al., 2010).

PLD20

The phonological similarity to other words (Yarkoni et al., 2008), calculated on the basis of the Lexique vocabulary (Ferrand et al., 2010).

Orthographic uniqueness point

The point in the letter string, proceeding from the first letter to the last, at which only one orthographic candidate remains in the cohort of possible words (calculated on the basis of Lexique vocabulary: New et al., 2004).

Phonological uniqueness point

The point in the phoneme string at which only one phonological candidate remains in the cohort of possible words (calculated on the basis of Lexique vocabulary: New et al., 2004).

Stimulus duration

The total duration of the spoken item, from the onset of the word to the end of the word, which corresponded to the duration of the edited sound files.

Which word frequency measure best predicts response times in the visual and auditory lexical decision task?

As Tse et al. (2016, p. 7) put it, “Word frequency has consistently been shown to be one of the strongest predictors of visual word recognition in various languages. Given that this is a central lexical variable to manipulate and control, it is important to determine which word frequency count accounts for the largest proportion of variance in lexical decision performance.”

To find out which word frequency measure best predicts response times in the visual and auditory lexical decision task, we have considered five frequency measures that are publicly accessible for the present megastudy.

First, we used two measures of word frequencies from Lexique (New et al., 2004; New et al., 2001): (1) the book-based frequency (Frantext), and (2) the subtitle-based frequency (SUBTLEX-FR). The subtitle-based frequencies were first computed in French by New, Brysbaert, Véronis, and Pallier (2007). They found that the subtitle-based frequencies were a better predictor of lexical decision times than the book-based frequencies (see also Brysbaert & New, 2009, for a similar result in English). Ferrand et al. (2010) replicated this finding in the French Lexicon Project, the subtitle-based frequencies explaining 5%–6% more of the variance in the percent errors and reaction times than did the book-based frequencies. Furthermore, Ferrand et al. (2010; see also Brysbaert & New, 2009) showed that the predictive power of the frequencies further increased when the averages of the subtitle and book frequencies were used. We investigated whether this is also true for auditory lexical decision.

Second, we used three new measures of word frequencies based on (1) Twitter, (2) blog posts, and (3) newspapers (taken from Worldlex by Gimenes & New, 2016). On the basis of lexical decision times for 35,658 words taken from the French Lexicon Project (Ferrand et al., 2010), Gimenes and New (2016) showed that the three new frequencies averaged (Twitter + blogs + newspapers) explained slightly more variance (48.46%) than the two classic frequencies—book-based and subtitle-based averaged—(47.56%). The same result was obtained for the English Lexicon Project (Balota et al., 2007) and the Dutch Lexicon Project (Keuleers, Diependaele, & Brysbaert, 2010).

The frequency measures were log-transformed (see Baayen et al., 2006). Also, because Balota et al. (2004; see also Ferrand et al., 2011) reported a nonlinear relationship between log frequency and lexical decision times, we report regression analyses both for log(frequency) and log(frequency) + log2(frequency).

Table 3 shows the percentages of variance in reaction times and accuracy explained by the different frequency measures for all words in each modality (visual and auditory), and also for common words in both the modalities. Several of the results are interesting. First, in line with previous findings (Ferrand et al., 2010; New et al., 2007), the subtitle frequency measure outperforms the book frequency measure in the visual modality. Note, however, that this was not the case in the auditory modality, for which both frequencies were roughly similar (with a slight advantage for the book frequency); this goes against Ernestus and Cutler’s (2015) finding that the subtitle frequency measure outperformed the book frequency measure, as in the visual modality.Footnote 7 Second, the proportion of variance explained by frequency is relatively low in the auditory as compared to the visual modality (for common words, 11.3% vs. 45.1%); this is much lower than the estimates previously reported for visual lexical decision performance.Footnote 8 Third, as in the French Lexicon Project (Ferrand et al., 2010), more variance is explained when the averages of subtitle and book frequencies are used than when these two frequencies are taken separately; this is the case in both modalities, suggesting that a combination of spoken and written frequencies may be the way forward. Fourth, this frequency measure combining subtitle and book frequencies shows results roughly similar to those from the three new frequency measures combining Twitter, blogs, and newspapers when the square of the log frequency was not added (similar to what Gimenes & New, 2016, found; see Table 3). However, when the square of the log frequency was added, the frequency measure combining subtitle and book frequencies outperformed the three new frequency measures combining Twitter, blogs, and newspapers; this was the case in both modalities. Fifth, these analyses confirm that less noise is present in the zRT variable than in the raw RT variable.

Table 3 Percentages of variance explained in the MEGALEX data by five frequency measures available in Lexique (based on subtitles and on books) and Worldlex (based on Twitter, blogs, and newspapers) and their combination

Given that the combination of subtitle and book frequencies is unequivocally the best measure, for both visual and auditory word processing (see Table 3), we will use this measure in the remainder of this article.

Comparison of MEGALEX with the French Lexicon Project

A comparison of the visual lexical decision times of MEGALEX and the FLP (25,122 words in common) reveals a clear superiority of MEGALEX over the FLP (in terms of the percentage of variance accounted for). For word frequency (based on a combination of the book and the subtitle frequencies), 8.6% more variance is explained by MEGALEX (40% in MEGALEX vs. 31.4% in FLP; see Table 4). Note, however, that the shapes of the frequency effects are very similar in both megastudies (see Fig. 2).Footnote 9 There is also a slight advantage of MEGALEX over FLP for word length (in number of letters) and orthographic similarity (see Table 5).

Table 4 Percentages of variance explained in MEGALEX and the French Lexicon Project (FLP) for 25,122 common words and for two frequency measures available in Lexique (based on subtitles and on books) and their combination
Fig. 2
figure 2

Shapes of the word frequency effect in MEGALEX (solid line) and the French Lexicon Project (dotted line) for the 25,122 words in common. Lowess smoothing was obtained using the lowess function in R, which computes locally weighted polynomial regression between dependent variables and independent variables (see Cleveland, 1981)

Table 5 Percentages of variance explained in MEGALEX and the French Lexicon Project (FLP) for 25,122 common words and for frequency, word length (in letters) and orthographic similarity (OLD20)

For word length (in letters), Ferrand et al. (2010; see also New et al., 2006, for English) found that this variable had a quadratic effect on visual lexical decision times: The effect of number of letters was facilitatory for words of three to five letters, null for words of five to eight letters, and inhibitory for words of eight to 13 letters. We replicated this quadratic length effect in MEGALEX (see Fig. 3), even if the range was more limited in the present study (three to nine letters vs. three to 13 letters in FLP).

Fig. 3
figure 3

Shapes of the word length effect in MEGALEX (black dots) and the French Lexicon Project (gray dots) for the 25,122 words in common. As in Ferrand et al. (2010), error bars are given for residuals of the quadratic regression, with RT as the dependent variable and the sum of book and subtitle frequencies (in logarithms) as an independent variable

For orthographic similarity, Ferrand et al. (2010) found that similar words (low OLD 20) were responded to more quickly than dissimilar words (high OLD 20). We replicated this orthographic similarity effect in MEGALEX (see Fig. 4).

Fig. 4
figure 4

Shapes of the orthographic similarity effect in MEGALEX (black dots) and the French Lexicon Project (gray dots) for the 25,122 words in common

A comparison of MEGALEX and FLP further confirms the superiority of a design in which the participants process half of the stimuli (MEGALEX) relative to a design in which participants only see a subset of the stimuli (FLP; see also Ferrand et al., 2015, for the same results based on preliminary data collected in MEGALEX). Indeed, more variance was accounted for (nearly 10% more) in the lexical decision task of MEGALEX than in FLP (see Table 5). This is in line with findings from the Dutch and the British Lexicon Projects, indicating that there is less noise in megastudies with a complete crossover of participants and stimuli (Keuleers, Diependaele, & Brysbaert, 2010; Keuleers et al., 2012).

Correlations between the variables

Tables 6 and 7 show the intercorrelations between the various predictor variables and RTs from MEGALEX (for both the visual and auditory modalities). The analysis was limited to the words that were recognized by at least 67% of the participants (N = 25,776 in the visual modality and N = 15,842 in the auditory modality).

Table 6 Correlations between variables in the MEGALEX data set for the visual modality
Table 7 Correlations between variables in the MEGALEX data set for the auditory modality

In the visual modality (see Table 6), word frequency had the highest correlations with RT. In the auditory modality (see Table 7), stimulus duration had the highest correlation with RT.

Regression analyses in the visual and auditory modalities

To further examine the impact of the predictor variables for the two modalities, we conducted multiple regression analyses at the item level for each modality separately (25,776 words in the visual modality and 15,842 words in the auditory modality). Regressions have the advantage that nonlinear relations can be examined (by making use of cubic splines). A forward approach was used in order to choose the nonlinear terms: (1) at each step, such terms were included for the predictor variable for which the inclusion of splines lead to the greatest variance increase in comparison with a model in which no such terms were included, and (2) the stopping rule used to choose the number of knots was the value for which adding more knots did not lead to an increase of R 2 above .001. The maximum numbers of knots were limited to 4 for length (N-letters and N-phonemes), OLD20 (PLD20), and acoustic duration, and 7 for frequency.Footnote 10 Table 8 shows the results.

Table 8 Results of the forward procedure used to choose knots for the predictor variables

For both modalities and both dependent variables (see Tables 8 and 9), all included variables were significant at p < .001. For RTs, the effects of frequency and word similarity (OLD20 and PLD20) were nonlinear in both modalities, with relatively comparable patterns (see Fig. 5).

Table 9 Regression analyses with all words in each modality (25,776 in the visual modality and 15,842 in the auditory modality)
Fig. 5
figure 5

Shapes of the frequency effects, word length effects (number of letters and number of phonemes), and word similarity effects (orthographic and phonological) in each modality (visual and auditory) for zRTs. x-axis: standardized variable; y-axis: predicted within-task standardized zRT. 95% confidence intervals are indicated by gray shading

As is shown in Fig. 5, for both modalities, word frequency effects (frequent words were responded to faster) were manifested in the significant negative relationship between RTs and frequency. Figure 6 also shows the shapes of the word frequency effects for both modalities in MEGALEX and for the visual modality in the French Lexicon Project (Ferrand et al., 2010) for the 14,868 words in common. A weak negative effect of word length appeared in the visual modality (number of letters), whereas this effect (number of phonemes) was more pronounced in the auditory modality. For both modalities, the word similarity effect had a negative effect at the left of the scale and then inverted near the mean: similar words (low OLD20 or PLD20) were responded to more quickly than dissimilar words (high OLD20 or PLD20). Finally, acoustic duration showed the most important effect, linear and positive, in the auditory modality: words whose tokens had longer durations took longer to recognize.

Fig. 6
figure 6

Shapes of word frequency effects in the visual (black line) and auditory (dashed line) modalities in MEGALEX, as well as in the French Lexicon Project (dotted line), for the 14,868 words in common. The RT scales being different in MEGALEX (both modalities) and the French Lexicon Project, we centered RTs (RTs – Mean RTs)

Table 9 shows the results of regression analyses including all words in each modality.

For the visual modality, frequency, orthographic similarity, and word length (in letters) explained 43.51% of the variance in RTs and 24.32% in accuracy. For the auditory modality, frequency, phonological similarity, word length (in phonemes), and acoustic duration explained 52.12% of the variance in RTs and 14.26% in accuracy.

Comparing effects across modalities and their interactions

Table 10 shows the intercorrelations between the various predictor variables and RTs from MEGALEX for both the visual and auditory modalities. The analysis was limited to the 14,868 words common in both modalities. Table 10 shows in particular that the correlation between visual RTs and auditory RTs is limited (r = .37), suggesting differences between visual and auditory lexical decision times. These differences between modalities might suggest that people always wait until a perceptual decision can be reached, but that they are constrained differently across modalities.Footnote 11

Table 10 Bivariate and multiple correlations between variables in the MEGALEX data set for the visual and the auditory modalities (14,868 words in common)

A comparison of the visual and auditory lexical decision times of MEGALEX (14,868 words in common) reveals that whereas the effect of word frequency is the most important factor in the visual modality (accounting for at least 45% of unique variance; see Table 11), stimulus duration is the main variable in the auditory modality (accounting for at least 46% of the variance; see Table 12).

Table 11 Percentages of variance explained for the visual lexical decision task (14,868 words in common with the auditory lexical decision task) and for frequency, word length (in letters) and orthographic similarity (OLD20)
Table 12 Percentages of variance explained for the auditory lexical decision task (14,868 words in common with the auditory lexical decision task) and for stimulus duration, frequency, word length (in phonemes) and phonological similarity (PLD20)

To further examine the impact of the predictor variables across the two modalities, we conducted multiple regression analyses at the item-level for the 14,868 words common in both modalities. In a first step, using the same forward procedure as above, two preliminary analyses were done within tasks in order to choose the numbers of splines to use in each. Both analyses included all variables under study, that is to say frequency subtitles, the numbers of letters and phonemes, the orthographic and phonologic Levenshtein distances and acoustic duration. In a second step, a global analysis was conducted on scores from both tasks and the inclusion of nonlinear terms retained in the first step (when the numbers of knots differed between the tasks in the first step, the highest one was chosen). In addition, interactions between the two modalities and the predictor variables were entered in the regression model in order to compare the effects of the predictor variables between the two modalities. Tables 13 and 14 show the results in terms of interaction effects and simple effects for both zRTs and accuracy.

Table 13 Across modalities comparisons for 14,868 common words in the visual and auditory lexical decision (zRTs)
Table 14 Across modalities comparisons for 14,868 common words in the visual and auditory lexical decision (accuracy)

For RTs, only the interaction between modalities and number of letters was not significant (all the other interactions between modalities and predictor variables were significant; see Table 13 and Fig. 7). For the visual modality, frequency, orthographic similarity, phonological similarity, word length (in letters and in phonemes), and acoustic duration explained 47.80% of variance in RTs and 26.52% in accuracy. For the auditory modality, frequency, orthographic similarity, phonological similarity, word length (in letters and in phonemes), and acoustic duration explained 52% of variance in RTs and 12.30% in accuracy.

Fig. 7
figure 7

Shapes of the word frequency effects, word length effects (number of letters and number of phonemes), word similarity effects (orthographic and phonological), and acoustic duration, for both modalities (V: visual and A: auditory) for zRTs. x-axis: standardized variable; y-axis: predicted within-task standardized zRT

Figure 7 shows the interaction effects for RTs between modalities and predictor variables. The effect of number of letters did not interact with modalities: It was significant and positive in both modalities. Given the high number of observations, however, it appeared quite weak. All the other effects interacted with modalities. The effect of number of phonemes was also significant in both modalities, with a negative sign in the left part of the scale, becoming then slightly positive. This pattern was more pronounced in the auditory modality. Phonological similarity and acoustic duration effects were only observed in the auditory modality. The orthographic similarity effect had a pattern comparable to the phonological similarity effect, but it was more pronounced in the visual modality; a weak negative effect—nearly linear—was, however, also observed in the auditory modality. Finally, classical frequency effects were observed in both modalities, with more important decreases at the left of the scale. The effect was more pronounced, however, in the visual modality.

These results were generally in agreement with those obtained previously using all the words available in both modalities. The only exception concerned the effect of number of letters in the visual modality, which was negative in the preceding analysis and positive in this one. This effect was, however, rather weak in both analyses.

A similar analysis was conducted with the inclusion of orthographic uniqueness point (OUP) and phonological uniqueness point (PUP) values. For the visual modality, frequency, orthographic similarity, phonological similarity, word length (in letters and in phonemes), and uniqueness point (orthographic and phonological) explained 47.95% of the variance in RTs and 26.84% in accuracy. For the auditory modality, frequency, orthographic similarity, phonological similarity, word length (in letters and in phonemes), acoustic duration and orthographic and phonological uniqueness point explained 52.50% of the variance in RTs and 13.26% in accuracy

Some studies have reported that words with an early OUP were processed more quickly than words with a late OUP in the visual modality (Kwantes & Mewhort, 1999; Lindell, Nicholls, & Castles, 2003), an effect similar to the effect obtained in auditory tasks (e.g., Radeau & Morais, 1990; Radeau, Mousty, & Bertelson, 1989). This has been taken to suggest that readers process the letters of words sequentially in a left-to-right order (but see Lamberts, 2005, for another interpretation). However, other studies found the opposite pattern, words with an early OUP were processed more slowly than words with a late OUP (Izura, Wright, & Fouquet, 2014; Miller, Juhasz, & Rayner, 2006; Radeau, Morais, Mousty, Saerens, & Bertelson, 1992), suggesting that word processing is likely to occur in a parallel manner. Here, we found that words with an early OUP were processed more slowly than words with a late OUP in the visual modality (see Fig. 8). These results are consistent with those of Izura et al. (2014), Miller et al. (2006), and Radeau et al. (1992). The same OUP effect was obtained in the auditory modality. To our knowledge, this is the first time that an OUP effect has been reported in this modality.

Fig. 8
figure 8

Shapes of the orthographic uniqueness point and phonological uniqueness point effects for both modalities (V: visual and A: auditory) for zRTs. The analysis conducted on the 14,868 words present in both modalities. x-axis: standardized variable; y-axis: predicted within-task standardized zRT

In the auditory modality, it has also been reported that words with an early PUP were processed more quickly than words with a late PUP (Radeau & Morais, 1990; Radeau, Morais, Mousty, & Bertelson, 2000; Radeau, Mousty, & Bertelson, 1989). In particular, Radeau et al. (2000) found a PUP effect only in the slow rate condition, not in the fast one (this last condition being closer to natural speech). They also reported that the PUP effects were smaller with synthetic than with natural speech. Here we found that words with an early PUP were processed more quickly than words with a late PUP (see Fig. 8), replicating Radeau et al. (2000). However, concerning the visual modality, words with an early PUP were processed more slowly than words with a late PUP, an effect similar to that for the OUP.

General discussion

Summary of findings

In general, the findings of the visual lexical decision times of MEGALEX are consistent with the results of previous megastudies conducted in English (Balota et al., 2007; Brysbaert et al., 2016; Cortese & Khanna, 2007; Keuleers et al., 2012; Yap & Balota, 2009), Dutch (Brysbaert et al., 2016; Keuleers, Diependaele, & Brysbaert, 2010), and French (Ferrand et al., 2010). Without question, word frequency is the most important predictor in visual lexical decision (accounting for at least 45% unique variance in our study; 40% in ELP: Balota et al., 2007; Brysbaert et al., 2016; 34% in DLP1: Keuleers et al., 2010; 33% in DLP2: Brysbaert et al., 2016; and 38% in FLP: Ferrand et al., 2010). In line with what Ferrand et al. (2010) reported for French, the extra contribution of word length was significant but rather limited (0.8%). The extra contribution of orthographic similarity was larger (around 1.5%). Furthermore, the present megastudy confirms the superiority of a design in which the participants process half of the stimuli (MEGALEX) relative to a design in which participants only see a subset of the stimuli (FLP). This is in line with findings from the Dutch and the British Lexicon Projects, indicating that there is less noise in megastudies with a complete crossover of participants and stimuli (Keuleers, Diependaele, & Brysbaert, 2010; Keuleers et al., 2012).

Concerning the auditory modality, stimulus duration is the most important variable (accounting for 46% of the variance in our study), which is consistent with the results of a previous relatively large-scale study conducted in Dutch (Ernestus & Cutler, 2015; see also Goh et al., 2016). The extra contribution of frequency is around 4% and the extra contribution of number of phonemes is 1.5%. Phonological similarity and phonological uniqueness point also influenced auditory lexical decision times (less than 1%). However, we showed that words that are phonologically similar to many other words (i.e., with low PLD20 values) were responded to faster than phonologically dissimilar words (i.e., with high PLD20 values). This latter result departs from the general finding that, in the auditory modality, words with more similar sounding (or closer phonological neighbors) are usually recognized more slowly than more distinct word-forms (e.g., Goh et al., 2009; Goh et al., 2016; Suárez, Tan, Yap, & Goh, 2011; Ziegler et al., 2003), but it converges with the general finding that, in the visual modality, words that are orthographically similar to many other words are responded to faster than orthographically dissimilar words (e.g., Brysbaert et al., 2016; Ferrand et al., 2010; Keuleers, Diependaele, & Brysbaert, 2010; Keuleers et al., 2012). Concerning phonological uniqueness point, our results are consistent with the general finding that words with an early phonological uniqueness point are responded to faster than words with a late phonological uniqueness point (e.g., Radeau & Morais, 1990; Radeau et al., 2000; Radeau et al., 1989).

Similarities and differences across modalities

A comparison of the effects of lexical variables across modalities revealed interesting differences between auditory and visual lexical decision times, as was predicted by Goldinger (1996; see also Segui, 1994, for a discussion). A first difference across modalities is that word frequency is the most important factor in the visual modality, whereas it is word duration in the auditory modality. A second difference is that the proportion of variance explained by frequency is relatively low in the auditory (11.3%) as compared to the visual (45.1%) modality. A third difference is that listeners tended to wait until they had heard the last phoneme in the word before making their decision (as shown by a strong effect of stimulus duration), whereas readers can make their decision before the word has been identified (as shown by a strong effect of OLD20). Indeed, in the visual modality, as suggested by Ernestus and Cutler (2015, p. 1471), “Participants can see at a glance whether a presented form is long or short, but in the auditory modality, listeners must wait for silence to tell them that the presented form has ended.” In the visual modality, as suggested by Brysbaert et al. (2016, p. 443), “Lexical decisions are partly based on the overall activation in the orthographic lexicon, such that when this overall activation exceeds a certain level, a word response is initiated before the target word itself has been identified. This overall activation is higher when a target word activates many similar word forms than when it has a unique letter sequence” (see Grainger & Jacobs, 1996; see also Dufau, Grainger, & Ziegler, 2012; Grainger, 2017). A fourth difference is that readers process words with an early orthographic uniqueness point more slowly than words with a late orthographic uniqueness point (see also Izura et al., 2014; Miller et al., 2006; Radeau et al., 1992), whereas listeners process words with an early phonological uniqueness point more quickly than words with a late phonological uniqueness point (see also Radeau et al., 2000; Radeau et al., 1989). These results suggest that processing of visually presented words is likely to occur in a parallel manner, whereas processing of auditory presented words is likely to occur in a sequential manner. A fifth difference is that more variance is explained in the auditory modality (max = 53%) than in the visual modality (max = 48%).

Turning to those effects that revealed a similar pattern in the two modalities, a first converging pattern concerns word similarity: Words orthographically similar to many other words (i.e., with low OLD20 values) were responded to more quickly than more orthographically distinct words (i.e., with high OLD20 values) in the visual modality, and words phonologically similar to many other words (i.e., with low PLD20 values) were responded to faster than more phonologically distinct words (i.e., with high PLD20 values) in the auditory modality (see Figs. 5 and 7). A second converging pattern concerns the shape of the word frequency effects in both modalities. Even if a lower proportion of variance is explained by frequency for the auditory than for the visual modality, the shapes of the word frequency effects are very similar (see Fig. 6).

Future directions

Although this study has examined the influence of a certain number of measures on visual and auditory word recognition performance, a number of questions remain open. For instance, it will be useful to test other potentially interesting variables (when these variables will be available for French), such as age of acquisition (AoA; see, e.g., Bonin, Barry, Méot, & Chalard, 2004; Cortese & Khanna, 2007; Ferrand et al., 2008), word prevalence (Wp; see, e.g., Keuleers, Stevens, Mandera, & Brysbaert, 2015), sensory experience rating (SER; see, e.g., Bonin et al., 2015), semantic variables (such as imageability, concreteness, or number of semantic features, to name a few; see, e.g., Goh et al., 2016; Yap, Pexman, Wellsby, Hargreaves, & Huff, 2012), morphological variables (see, e.g., Baayen et al., 2006; Gimenes et al., 2016), and so forth. Indeed, the maximal percentages of variance explained (48% in the visual modality and 53% in the auditory modality) suggest that other variables are still to be discovered.

Conclusion

In this article, we have described the results of a project in which lexical decision data were collected for tens of thousand French words presented in the visual and auditory modalities.

The results of the present megastudy provide an empirical “goldmine” for researchers interested in understanding the core mechanisms of word recognition, independent of modality, as well as for providing a better specification of modality-specific processing. This is clearly a crucial step toward developing a general psycholinguistic theory of word comprehension.