The megastudy, in which performance measures are collected for a large number of stimuli, has recently become a popular method in cognitive psychology. Perhaps the quintessential example of the megastudy is the English Lexicon Project (ELP; Balota et al., 2007), in which reading-aloud and lexical-decision reaction time (RT) and accuracy measures were collected for over 40,000 English words. According to the Web of Science, the ELP has been cited 722 times in the literature (1,327, according to Google Scholar). Furthermore, the ELP has motivated lexicon projects in other languages, including British English (Keuleers, Lacey, Rastle, & Brysbaert, 2012), Chinese (Sze, Rickard Liow, & Yap, 2014), French (Ferrand et al., 2010), and more. In addition, the technique has been extended to auditory lexical decision (Ernestus & Cutler, 2015), semantic priming (Hutchison et al., 2013), and recognition memory (Cortese, McCarty, & Schock, 2015), to name just a few areas.

The megastudy has become a popular method due to its advantages over the traditional factorial design method (see, e.g., Balota et al., 2004). For example, due to the large number of items examined and the fact that continuous variables are not dichotomized, the megastudy offers considerable power to detect relationships among the variables used to test important theoretical and empirical issues. In addition, whereas factorial designs often compare performance only in extreme conditions, the megastudy allows performance to be assessed across a much wider range of variable values. Also, through multiple-regression analyses conducted on the item means, the megastudy affords the opportunity to examine the relative influences of a wide array of factors. As Cutler (1981) pointed out many years ago, in a factorial design it is difficult to control for all factors related to performance while varying only one or two independent variables (for more thorough discussions of these issues, see Balota et al., 2004; Balota, Yap, Hutchison, & Cortese, 2012).

In a typical megastudy, participants respond to thousands of stimuli, sometimes over two long sessions. Thus, a reasonable concern is that the responses are not as reliable as they would be in briefer testing environments. For example, participants might become overly fatigued and/or practiced, and this may lead to noisy data. Fortunately, the evidence indicates that the data obtained from megastudies are very reliable. Balota et al. (2004) tested the reliability of megastudies by examining participants’ reading-aloud and lexical-decision performance on a large set of monosyllabic words and comparing the resulting item means with the reading-aloud and lexical-decision item means for those same words from the ELP. It is important to note that, whereas Balota et al.’s (2004) participants responded only to monosyllabic words, the ELP participants responded to many mono- and multisyllabic words, so the list contexts differed considerably between the studies. Despite this difference, the patterns of results obtained across the two studies were highly similar. In addition, Keuleers, Diependaele, and Brysbaert (2010) compared performance between the first and last sessions of a lexical-decision megastudy and found little difference in the patterns of results, despite the fact that participants tended to slow down at the end of the study. More recently, in a megastudy of reading aloud conducted by Cortese, Hacker, Schock, and Santo (2015), participants read aloud 585 critical words, each with a different orthographic rime, at either the beginning or the end of a 2,614-word experiment. Cortese and colleagues found that the RTs for the critical items were longer at the end than at the beginning of the experiment, but there was very little change in the effects of the specific variables (e.g., frequency, spelling-to-sound consistency, length, age of acquisition [AoA], etc.) as a function of the location of the critical set. Finally, we note that, via the intraclass correlation coefficient, Courrieu and colleagues (e.g., Courrieu, Brand-D’Abrescia, Peereman, Spieler, & Rey, 2011) demonstrated that the data obtained from the ELP are reliable and highly reproducible. Essentially, this method involves repeatedly drawing two random samples of N participants and computing the correlation between the item means obtained from the two samples.
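For readers who want a concrete sense of this reproducibility check, the following is a minimal sketch of the resampling logic described above, assuming a participants × items matrix of RTs; the function and variable names are ours for illustration and are not taken from Courrieu et al. (2011).

```python
import numpy as np

def split_sample_reliability(rt_matrix, n_per_sample, n_iterations=1000, seed=0):
    """Repeatedly draw two disjoint random samples of participants and
    correlate the item means computed from each sample.

    rt_matrix: array of shape (n_participants, n_items) holding RTs.
    n_per_sample: number of participants drawn into each of the two samples.
    Returns the average correlation across iterations.
    """
    rng = np.random.default_rng(seed)
    n_participants = rt_matrix.shape[0]
    correlations = []
    for _ in range(n_iterations):
        order = rng.permutation(n_participants)
        means_a = rt_matrix[order[:n_per_sample]].mean(axis=0)
        means_b = rt_matrix[order[n_per_sample:2 * n_per_sample]].mean(axis=0)
        correlations.append(np.corrcoef(means_a, means_b)[0, 1])
    return float(np.mean(correlations))
```

A high average correlation indicates that item means computed from independent groups of participants are highly reproducible.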

However, one related issue that has not been examined within the megastudy context is whether or not participants apply a response criterion that restricts the range of RTs when items are presented randomly. In seminal factorial experiments, Lupker, Brown, and Colombo (1997) demonstrated that the effect on RTs of a variable that is related to item difficulty (e.g., word frequency) was smaller when the presentation order was random than when the items were blocked by difficulty. For example, in their Experiment 3, participants read aloud high-frequency (e.g., post) and low-frequency (e.g., pint) exception words. In one condition the presentation order was random, and in the other condition the words were blocked by frequency. Lupker and colleagues found that the frequency effect was much larger in the blocked than in the random condition. To explain these results, they proposed a response deadline hypothesis, wherein participants establish a deadline to respond that is based on the average difficulty of the list items. When easy (e.g., high-frequency) and difficult (e.g., low-frequency) items are mixed within a list, the average difficulty is somewhere in the middle. According to the hypothesis, in this context, participants will speed up for the difficult items and slow down for the easy items, relative to their RTs when these items are presented in different blocks. In fact, Lupker et al. found that participants were 25 ms faster to respond to high-frequency words in the blocked than in the random condition, and 24 ms slower to respond to low-frequency words in the blocked than in the random condition. In other words, the effect of frequency was almost 50 ms larger when items were blocked according to difficulty.

This outcome has potential implications for megastudies of reading aloud in which multiple-regression analyses are used to uncover the influences that certain variables have on performance (e.g., Balota et al., 2004). Perhaps typical megastudies, in which random presentation is employed, underestimate the influences of different factors on RTs. If participants are utilizing a response criterion such as the one hypothesized by Lupker and colleagues, then they are speeding up for the more difficult items and slowing down for the easier items in order to meet the perceived average difficulty of the items. The net effect of this process is to restrict the range of RTs, thus reducing statistical power. Another possibility is that the effect of the response criterion only occurs for items at the extreme levels of difficulty (i.e., very easy and very difficult items). In this case, one would not expect the predictive power of standard variables to be affected much over the thousands of items tested. It is important to determine the extent to which a response criterion affects reading-aloud megastudy performance because, if the range of RTs is restricted in the standard setting, there will be less power to detect the influences of variables on performance, and the relationships that exist may be weakened.

In addition, theoretical models of word processing have been evaluated on their ability to account for item-level variance in RTs extracted from megastudies (see, e.g., Perry, Ziegler, & Zorzi, 2010). If the range of RTs is being restricted due to the random presentation of items, then the ability of theoretical models to account for the RT variance is also restricted. Furthermore, if response criterion changes were demonstrated in megastudies, then the models should be developed further to account for those changes.

The following experiment was designed to test for the influence of a response deadline in a megastudy of reading aloud. In the experiment, we presented 25 blocks of 100 trials. In the control condition, the words were ordered randomly, and in the experimental condition, the words were blocked by difficulty and the order of the blocks was randomized. To vary item difficulty, we took 2,500 words from a recent megastudy by Cortese et al. (2015) and rank-ordered them by their mean RT in that study. This resulted in a blocking pattern in which one block contained the 100 items with the fastest RTs, another block contained the 100 items with the next fastest RTs, and so on. We predicted that, among the item means, the range and standard deviation would be larger in the blocked than in the random condition. We also predicted that when we isolated the extreme conditions (i.e., the easiest vs. the most difficult items), we would find faster RTs for the easiest items in the blocked than in the random condition, and slower RTs for the most difficult items in the blocked than in the random condition. In addition, we assessed frequency, length, feedforward onset consistency, feedforward rime consistency, orthographic neighborhood size, AoA, and imageability as predictor variables. We chose to limit the set of variables to these because we had predictor values for each of these variables for all 2,500 words, and we wanted to avoid diffusing the effects of difficulty across a larger set of variables. We predicted that the predictor variables in our statistical model would generally show stronger relationships with RTs in the blocked than in the random condition, because the range of RTs would be greater in the blocked than in the random condition. Finally, assuming that the results were consistent with these predictions, we planned to correlate RTs from the CDP++ model (Perry et al., 2010) with human RTs in each of our conditions. This particular model was selected on the basis of its recent success in accounting for item-level variance in RTs (see Perry et al., 2010). We predicted that the model would account for more variance in the blocked than in the random condition, due to the extended range of RTs associated with the predictor variables in that condition. Furthermore, we assumed that the CDP++ model would show similarly stronger correlations between model and human RTs in the blocked than in the random condition, to the extent that it was sensitive to these predictor variables.

Method

Participants

A total of 64 undergraduates (48 females, 16 males) participated in the experiment. Forty of the participants were students from the University of Nebraska at Omaha, and 20 were students from Creighton University. Three of the participants did not provide age or education level information, and six did not report their ethnicity. For the remaining participants, the mean age was 21.4 years (range = 18–36), and the mean education level was 2.48 (where 1 = freshman and 4 = senior). Forty-two participants identified themselves as White or Caucasian, five as Black or African American, four as Hispanic or Latino/a, four as Asian, two as Pacific Islanders, and one as Egyptian.

Materials

We presented 2,500 monosyllabic words in the experiment. The stimulus characteristics for these words can be found in Table 1.

Table 1 Stimulus characteristics for the words used in the experiment

In the blocked condition, 25 lists of 100 items were created based on the mean item RTs from Cortese et al. (2015). Those item means were generated from 60 participants. We rank-ordered the items from the fastest to the slowest RT and created the 25 lists of 100 words based on this ordering. For example, the “easiest” list consisted of the 100 items with the fastest RTs, the next “easiest” list consisted of the 100 items with the next fastest RTs, and so on. The stimulus characteristics for each of the 25 lists can be found in Table 2.
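As a concrete illustration of this blocking procedure, here is a minimal sketch that rank-orders items by a prior study's mean RTs and slices them into consecutive blocks; the function and argument names are ours, not part of the original materials.

```python
import numpy as np

def make_difficulty_blocks(words, mean_rts, block_size=100):
    """Rank-order words by mean RT (fastest first) and slice the ordered
    list into consecutive blocks of `block_size` items.

    words: sequence of word strings.
    mean_rts: item-mean RTs from a prior study, aligned with `words`.
    Returns a list of blocks, each containing `block_size` words.
    """
    order = np.argsort(mean_rts)                 # fastest (easiest) items first
    ordered_words = [words[i] for i in order]
    return [ordered_words[i:i + block_size]
            for i in range(0, len(ordered_words), block_size)]

# With 2,500 words and their item-mean RTs, this yields 25 blocks of 100 words,
# from the "easiest" (fastest) block to the most difficult (slowest) block.
```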

Table 2 Stimulus characteristics (with standard deviations in parentheses) for the words used in the experiment as a function of difficulty level, with Level 1 being the least difficult or fastest in Cortese et al. (2015)

For the random condition, items were again presented in lists of 100 items, but the membership of items within a list was created randomly, with no regard to the word characteristics or the RTs produced for the words in Cortese et al. (2015). Thirty-six other words were used as practice stimuli. For the blocked condition, we created three practice lists of 12 words each. These words were drawn from Cortese et al. (2015) and varied in their average RT. Of these 36 words, the 12 with the fastest RTs were placed in one list, those with the next fastest RTs were placed in a second list, and those words with the slowest RTs were placed in a third list. The same practice stimuli were used in the random condition, but the three blocks of 12 words were all drawn randomly from a single list of 36 items.

Predictors

Word length

Word length refers to the number of letters a word contains.

Subtitle frequency

This measure refers to the frequency with which a word appears in the written text of film subtitles (Brysbaert & New, 2009). For our analyses, we used the log of the word frequency per million words, taken from Brysbaert and New (2009).

Feedforward consistency

Feedforward consistency refers to orthographic-to-phonological consistency (see Balota et al., 2004). We used the values from Kessler et al. (2008), which were based on the Zeno frequency norms (Zeno, Ivens, Millard, & Duvvuri, 1995). We analyzed both feedforward rime consistency and feedforward onset consistency (the orthographic-to-phonological consistency of the initial letter or letter cluster preceding the vowel). For words in our experiment that did not have a consistency value in the Kessler et al. (2008) database, we calculated their values using the Zeno et al. frequency norms.
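To make the consistency measure concrete, the sketch below shows one common token-weighted formulation of feedforward rime consistency (the frequency of friends divided by the frequency of friends plus enemies). It is offered purely as an illustration; the exact computation used by Kessler et al. (2008), and in our supplementary calculations, may differ in its details, and all names here are ours.

```python
from collections import defaultdict

def rime_consistency(words, orth_rimes, phon_rimes, frequencies):
    """Token-weighted feedforward rime consistency: for each word, the summed
    frequency of words sharing both its orthographic rime and its rime
    pronunciation (friends, including the word itself), divided by the summed
    frequency of all words sharing its orthographic rime (friends + enemies).

    The four arguments are parallel sequences over the lexicon.
    """
    rime_totals = defaultdict(float)    # orthographic rime -> total frequency
    friend_totals = defaultdict(float)  # (orth rime, phon rime) -> total frequency
    for o, p, f in zip(orth_rimes, phon_rimes, frequencies):
        rime_totals[o] += f
        friend_totals[(o, p)] += f
    return {w: friend_totals[(o, p)] / rime_totals[o]
            for w, o, p in zip(words, orth_rimes, phon_rimes)}
```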

Orthographic neighborhood size (i.e., Coltheart’s N: Coltheart, Davelaar, Jonasson, & Besner, 1977)

N refers to the number of words that can be constructed from a target word by changing a single letter while maintaining the identity and position of the word’s other letters. For example, ball has the neighbors call, bill, bale, and so forth.
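A minimal sketch of how N can be computed from a lexicon (substitution neighbors only, following the Coltheart et al., 1977, definition); the function and lexicon names are ours.

```python
def coltheart_n(word, lexicon):
    """Count orthographic neighbors: words of the same length that differ from
    `word` by exactly one letter substitution (Coltheart et al., 1977)."""
    count = 0
    for candidate in lexicon:
        if len(candidate) == len(word) and candidate != word:
            if sum(a != b for a, b in zip(word, candidate)) == 1:
                count += 1
    return count

# Example: coltheart_n("ball", lexicon) would count call, bill, bale, and so on.
```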

Age of acquisition (AoA)

AoA represents an estimate of the age at which a word was first learned. In our analyses, we used the Cortese and Khanna (2008) norms.

Imageability

Imageability refers to the ease or difficulty one has in generating a mental image of the referent of a word. In our analyses, we employed the values obtained by Cortese and Fugett (2004).

Equipment

The stimuli appeared on a 19-in. computer monitor that was controlled by a microcomputer running the E-Prime Professional software (Schneider, Eschman, & Zuccolotto, 2002).

Procedure

Participants read aloud 25 blocks of 100 words in a single session that lasted about 2 h. In the random condition, the assignment of words to each block was random, and in the blocked condition, words appeared as a predetermined set based on our blocking procedure. In this condition, the blocks were ordered randomly, and the words within a block were presented in random order, but the words within each block were always the same 100 words. Prior to the experiment, participants were instructed to name each word as quickly and accurately as possible. After the instructions were read, participants performed three blocks of 12 practice trials. In the blocked condition, the blocks varied by difficulty. The easiest block occurred first, and the most difficult block occurred third. In the random condition, the order of words was random. On each trial, a fixation mark (+) appeared in the center of the screen for 300 ms and was followed by a blank screen for 200 ms. The to-be-named word followed and remained in view until the voice key had registered the onset of the acoustic signal. Accuracy was then coded by a researcher as correct, incorrect, or noise. A 700-ms blank screen separated trials.

Results

The data from two participants were discarded because their error rates exceeded 20%. Before the major analyses, the RT data were screened for outliers and voice key errors. For each participant, an initial screening removed voice key errors and RTs less than 300 ms or greater than 1,250 ms (1.5% in the blocked condition, and 1.2% in the random condition). Then, RTs beyond three standard deviations from the participant’s mean were eliminated (1.5% in the blocked condition, and 1.6% in the random condition). In all, this screening eliminated 2.9% of the RT data. Finally, the RT analyses were conducted on correct responses only.
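The following is a minimal sketch of this two-stage screening, assuming a long-format trial table; the column names are ours for illustration and are not those of our actual data files.

```python
import pandas as pd

def screen_rts(trials, low=300, high=1250, sd_cutoff=3.0):
    """Two-stage RT screening: (1) drop voice key errors and RTs outside
    [low, high] ms; (2) drop RTs more than `sd_cutoff` standard deviations
    from each participant's own mean.

    trials: DataFrame with columns 'participant', 'rt', and boolean
    'voice_key_error' (hypothetical column names).
    """
    kept = trials[(~trials["voice_key_error"]) &
                  (trials["rt"] >= low) & (trials["rt"] <= high)].copy()
    stats = kept.groupby("participant")["rt"].agg(["mean", "std"])
    kept = kept.join(stats, on="participant")
    kept = kept[(kept["rt"] - kept["mean"]).abs() <= sd_cutoff * kept["std"]]
    return kept.drop(columns=["mean", "std"])
```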

Table 3 provides the mean RTs, the ranges and standard deviations of the RTs, and the mean accuracies for items, as a function of stimulus presentation condition.

Table 3 Mean reaction times (RTs) in milliseconds, range, standard deviation, and accuracy percentages as a function of list context

We note that participants were generally faster and slightly less accurate in the blocked than in the random condition. However, although participants in the blocked condition were generally faster than those in the random condition, they were slower at the higher end of list difficulty than those in the random condition. We also note that, although participants were slightly less accurate in the blocked than in the random condition, the reduced accuracy occurred at both extreme difficulty levels (that is, both where their RTs were faster and where they were slower than in the random condition), indicating that the faster RTs were not simply a function of a speed–accuracy trade-off. Finally, it is important to note that, consistent with our prediction, the ranges and standard deviations among the item mean RTs were larger in the blocked than in the random condition.

Analytic strategy

One set of analyses was conducted using a two-level multilevel modeling framework (in HLM, ver. 7.20; Raudenbush, Bryk, Cheong, & Congdon, 2000), with sets of words (Level 1) nested within each participant (Level 2). Multilevel modeling is a variation of traditional multiple regression that serves to address the violation of independence inherent in nested data, as in the present study. Given that combinations of words were presented to participants in either random or blocked fashion, we could not assume that the variability across sets was random. Multilevel modeling allowed us to nest the various combinations of words within each individual, so that the variability within participants (i.e., across sets) could be separated from the variability between participants (as a function of condition). Multilevel modeling analyses were conducted separately for RTs and accuracy rates, for the items in which our specific predictors were tested as a function of presentation condition. These analyses began with an unconditional model (containing only the dependent variable), so that the proportion of variability between participants could be compared to the variability within participants (on the basis of the intraclass correlation), thus providing a justification for the analytic approach. This unconditional model also served as the baseline model to which subsequent models were compared. The models were built up by adding predictors and testing for significant improvements (using a χ² test) and proportional reductions in prediction error (interpreted as a pseudo-R²).
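Our analyses were run in HLM 7.20; purely to illustrate the logic of the unconditional model and the intraclass correlation, here is a rough Python equivalent using statsmodels (a mixed model with a random intercept per participant). The data frame and column names are ours and hypothetical.

```python
import statsmodels.formula.api as smf

def unconditional_icc(trials):
    """Fit an intercept-only model with a random intercept per participant and
    return the intraclass correlation: the share of total RT variance that
    lies between participants.

    trials: long-format DataFrame with 'rt' and 'participant' columns
    (hypothetical names).
    """
    fit = smf.mixedlm("rt ~ 1", data=trials, groups=trials["participant"]).fit()
    between = float(fit.cov_re.iloc[0, 0])  # random-intercept variance
    within = float(fit.scale)               # residual (within-participant) variance
    return between / (between + within)
```

A larger intraclass correlation indicates that more of the variability lies between participants, which is what justifies the multilevel approach.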

Again, the dependent variables were RT and accuracy. We conducted an initial analysis on RTs to determine whether a response criterion was operating at the extreme levels of difficulty within our dataset. For this analysis, only the RTs for the 100 words at the least difficult level and the 100 words at the most difficult level were analyzed. The first level was analyzed within participants and consisted of 12,400 observations (200 words × 62 participants). The second level consisted of the 62 participants who read aloud each word. Our model building began with an unconditional model without any predictors, to establish the proportions of variance within (Level 1) and between (Level 2) individuals. An analogous analysis was conducted on accuracy rates. One advantage of multilevel modeling is that effects at lower levels (whether significant or not) can be allowed to vary at higher levels (with this variability tested using a chi-square test), so that it can be accounted for. At Level 1 (differences between types of words), hypothesis testing began with level of difficulty. At Level 2 (between-group differences), the group variable of whether the stimuli were presented randomly or in a blocked fashion was included in the model. The Level-2 model also tested the interaction between group and level of difficulty.

For the analyses on the complete set of 2,500 words, the first level was analyzed within participants and consisted of 155,000 observations (2,500 words × 62 participants). The second level consisted of the between-participants analyses of the 62 participants who named each word. Our model building began with an unconditional model without any predictors, to establish the proportions of variance within (Level 1) and between (Level 2) individuals. At Level 1 (differences between types of words), hypothesis testing began with the initial-phoneme variables added as a block. Then length, log frequency, feedforward rime consistency, feedforward onset consistency, and orthographic neighborhood size were added to the model. In the next step, AoA and imageability were assessed. At Level 2 (between-group differences), the group variable of whether the stimuli were presented randomly or in a blocked fashion was included in the model. The Level-2 model also tested the interaction between group and each of our (sub)lexical and semantic predictor variables. All variables were grand-mean centered and entered as random (meaning that we assumed that there would be between-participants variability in the effects of the Level-1 predictors).

Finally, we conducted stepwise multiple-regression analyses on the item mean RTs, standardized item mean RTs (i.e., zRT), and accuracy levels collapsed across participants. These analyses included initial-phoneme characteristics in the first step, length, log frequency, feedforward rime consistency, feedforward onset consistency, and orthographic neighborhood size in the second step, and AoA and imageability in the final step. These results indicated that the relationships between a standard set of predictor variables and RTs (and zRTs) were generally stronger in the blocked than in the random condition.
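The sketch below illustrates the step structure of these item-level analyses (predictors entered in blocks, with the change in R² tracked at each step); the formulas and column names are ours for illustration, and the initial-phoneme block is abbreviated to a single placeholder term.

```python
import statsmodels.formula.api as smf

# items: one row per word, with the item-mean RT (or zRT, or accuracy) and the
# predictors named above (hypothetical column names).
step1 = "rt ~ onset_features"  # placeholder for the block of initial-phoneme variables
step2 = step1 + " + length + log_frequency + rime_consistency + onset_consistency + ortho_n"
step3 = step2 + " + aoa + imageability"

def stepwise_r2(items, formulas):
    """Fit each model in turn and report R-squared and its change from the
    previous step, mirroring blockwise entry of predictors."""
    previous = 0.0
    for formula in formulas:
        r2 = smf.ols(formula, data=items).fit().rsquared
        print(f"{formula}: R2 = {r2:.3f} (change = {r2 - previous:.3f})")
        previous = r2

# stepwise_r2(items, [step1, step2, step3])
```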

Analyses of presentation condition (random vs. blocked)

Multilevel analyses were conducted separately for RTs and accuracy rates, for the items in which specific predictors were tested as a function of presentation condition. In the first set of analyses, we examined the two extreme levels of difficulty, to test the prediction that response biases would be present at these extreme levels. If so, we would expect a crossover interaction between presentation condition and level of difficulty. The item mean RTs and accuracy for these conditions are presented in Table 4.

Table 4 Reaction times in milliseconds (with accuracy rates in parentheses) for the least difficult and most difficult conditions, as a function of the list context

In subsequent analyses, we focused on length, log frequency, feedforward rime consistency, feedforward onset consistency, N, AoA, and imageability.

Results for the extreme levels of difficulty

Reaction times (see Table 4)

The intraclass correlation for the unconditional model indicated that 64.09% of the variability in RTs was at the between-participants level (inversely, 35.91% of the variability was at the within-participants level), and that this variability in RTs between participants was statistically significant [χ²(61) = 80,789.62, p < .001]. A second model included level of difficulty. This model showed a significant main effect of level of difficulty (b = 49.65, SE = 2.99), t(61) = 16.63, p < .001, which improved the model’s prediction of RTs [Δχ²(2) = 1,579,029.52, p < .001] over the intercept alone, with a proportional reduction in prediction error (PRPE = 11.89%). Thus, more difficult words took longer to read aloud than less difficult words. We observed no main effect of presentation condition (b = −0.42, SE = 15.78), t(60) = 0.03, p = .979, but there was a significant interaction between level of difficulty and presentation condition (b = 20.94, SE = 5.35), t(60) = 3.92, p < .001, that improved the modeling of the effect of difficulty on RTs [Δχ²(1) = 175.56, p < .001] and reduced prediction error by 19.77%. In other words, the difference in RTs as a function of difficulty was significantly larger in the blocked than in the random condition.

Accuracy (see Table 4)

The intraclass correlation for the unconditional model indicated that 2.01% of the variability in accuracy rates occurred between participants (again, conversely, 97.99% of the variability was at the within-participants level), and that this variability in accuracy rates between participants was statistically significant [χ²(61) = 3,202.35, p < .001]. A second model included level of difficulty. This model showed a significant main effect of level of difficulty, reducing accuracy (b = −0.04, SE < 0.001), t(61) = 8.42, p < .001, which improved the model’s prediction of accuracy [χ²(2) = 107,176.78, p < .001] relative to the unconditional model. Neither the main effect of group nor the group by level of difficulty interaction was significant, both ps > .25.

Results for length, log frequency, feedforward rime consistency, feedforward onset consistency, and neighborhood size as predictor variables

Reaction times (see Table 5)

Table 5 Effects of length, frequency, rime consistency, onset consistency, neighborhood size, AoA, and imageability on reaction times

At Level 1 (differences between types of words), hypothesis testing began with the initial-phoneme variables added as a block. Then length, log frequency, feedforward rime consistency, feedforward onset consistency, and orthographic neighborhood size were added to the model. In the next step, AoA and imageability were assessed. As can be seen in Table 5, all of the predictor variables except imageability were significantly related to RTs. Including these variables provided a significantly better prediction than did using the intercept alone [χ²(35) = 12,099.33, p < .001], reducing prediction error by 5.63%. Most important were the tests that examined the interaction between presentation condition and each of our (sub)lexical and semantic predictor variables. Here, although the interaction between frequency and presentation condition was only marginally significant, t(60) = −1.93, p = .06, including this term did improve the fit of the model [χ²(1) = 7.86, p < .01; PRPE = 7.62%]. The same was true for AoA, t(60) = 1.99, p = .051 [χ²(1) = 9.79, p < .01; PRPE = 6.58%]. Orthographic N interacted with presentation condition and improved the fit of the model, t(60) = −2.54, p = .014 [χ²(1) = 12.25, p < .001; PRPE = 14.66%]. Although the interaction involving rime consistency was not significant, t(60) = −1.53, p = .13, it also improved the model [χ²(1) = 4.48, p < .05; PRPE = 3.62%]. No other interactions approached significance, all ps > .13.

Accuracy

As can be seen in Table 6, all of the predictor variables except length were significantly related to accuracy. Including these variables provided a significantly better prediction than did using the intercept alone [χ²(35) = 1,217.70, p < .001], reducing prediction error by 1.21%. In the accuracy data, we found no significant interactions between presentation group and any of the (sub)lexical or semantic predictor variables, all ps > .14 (PRPEs = 0.00%).

Table 6 Effects of length, frequency, rime consistency, onset consistency, neighborhood size, AoA, and imageability on accuracy

Stepwise regression analyses on item means for RTs, zRTs, and accuracy

The correlation matrix for the predictor variables used in this experiment is presented in Table 7. The results of the regression analyses are presented in Tables 8 and 9.

Table 7 Correlation matrix of variables assessed in the stepwise regression analyses
Table 8 Standardized regression coefficients, percentage reductions in prediction error (PRPE), and R², for RTs and zRTs
Table 9 Standardized regression coefficients, percentage reductions in prediction error (PRPE), and R², for accuracy

The results also demonstrate that the relationships between the (sub)lexical and semantic predictor set and RTs were stronger in the blocked than in the random condition. The (sub)lexical + semantic predictor set accounted for 2.8% more variance in raw RTs, and 2.6% more variance in zRTs, in the blocked than in the random condition. For accuracy, the predictor set accounted for a small amount of variance overall. In general, all of the (sub)lexical and semantic variables were significantly related to RTs, except length. Imageability was significantly related to accuracy in the random but not the blocked condition, and AoA was marginally related to accuracy in the random condition and significantly related in the blocked condition.

Analysis of CDP++ model

We also predicted that, to the extent that models of word processing are sensitive to the variables associated with RTs, they should account for more variance in the blocked than in the random condition. We tested this prediction with the CDP++ model (Perry et al., 2010). We chose this model because prior research had demonstrated that it typically accounts for substantially more variance in RTs than do other contemporary models, and thus it would provide a good test of our prediction (see Perry et al., 2010). We obtained RTs from the model for the 2,500 words in the experiment, and then eliminated RTs associated with incorrect codes. We also eliminated words that are considered irregular in British but not in American English, and vice versa (e.g., vase, aunt). This left us with RTs for 2,393 words. For this set of RTs, the CDP++ model accounted for 12.0% and 13.0% of the variance in RTs and zRTs, respectively, in the blocked condition, and 10.6% and 11.8% of the variance in RTs and zRTs in the random condition. When we entered the initial-phoneme characteristics in Step 1 and the model RTs in Step 2, the CDP++ model accounted for an additional 11.5% and 12.8% of the variance in RTs and zRTs, respectively, in the blocked condition, and 10.9% and 11.6% of the variance in RTs and zRTs in the random condition. Obviously, the overall change across conditions in the variance accounted for by our predictor set is greater than the change across conditions in the variance accounted for by the CDP++ model. We think that this outcome can best be explained in terms of the individual factors that showed a change across the two conditions and how sensitive the model is to these factors. Specifically, the results demonstrated that the blocked condition showed larger effects of frequency, AoA, orthographic neighborhood size, and feedforward consistency. Because the CDP++ model is sensitive to frequency and consistency, but not necessarily to N (see Perry, Ziegler, & Zorzi, 2007), and probably not to AoA (see Cortese & Schock, 2013), this modest change in its predictive power across conditions is not particularly surprising.

Discussion

We have provided strong evidence that the list homogenization effect identified by Lupker et al. (1997) applies to megastudies of reading aloud and affects the predictive power of the CDP++ model. First, across the item means, the range and standard deviation were larger in the blocked than in the random condition. This outcome confirmed the hypothesis that randomized presentation of stimuli restricts the range of RTs relative to blocking items by difficulty. Second, when we compared the extreme levels of difficulty across random versus blocked presentation, the level-of-difficulty effect was much larger in the blocked than in the random condition. Third, sublexical, lexical, and semantic variables generally showed stronger relationships to RTs in the blocked than in the random condition. More specifically, orthographic neighborhood size, frequency, feedforward rime consistency, and AoA were more strongly related to performance in the blocked than in the random condition. Because virtually all megastudies of reading aloud have employed random presentation, their power to uncover the relationships that variables have with RTs has likely been reduced. Finally, we demonstrated that the CDP++ model accounted for slightly more variance in RTs in the blocked than in the random condition.

The results of this experiment lead to questions about how best to represent item mean RTs in megastudies of reading aloud, as well as what type of presentation schedule should be utilized in future megastudies of reading aloud. Our view is that blocking items by difficulty may be preferable to randomized presentation, especially when a new variable is being examined. On the basis of our results, blocking should increase the power to uncover new relationships, as well as strengthen already-known relationships between predictor variables and RTs. However, because we utilized a somewhat crude method of equating difficulty with RT, it is not clear that our method of blocking stimuli on the basis of item RT means from another study (Cortese et al., 2015) was optimal. In fact, looking at Fig. 1, one can see that, although the overall range of RTs is extended in the blocked condition, there is considerably more variability among the items as a function of the level of difficulty, as we defined it.

Fig. 1 Reading-aloud reaction times (RTs) for the random and blocked conditions as a function of list difficulty

One might prefer a function that was smoother, like the one seen for random presentation, but with the extended range seen in our blocked condition. One may ask why the data in the blocked condition appear so variable, list by list. One possibility is that, although RTs may be strongly correlated with item difficulty, they may also be influenced by other factors not related to difficulty. Although we do not think that these factors would differentially affect the conditions of our experiment, they might add variability to the RTs that is unrelated to item difficulty. For example, the RT measure is affected by voice key artifacts that may have more to do with the sensitivity of the measurement device than the difficulty of the items (Rastle & Davis, 2002). In addition, one would expect accuracy also to be related to item difficulty, yet the correlation between RT and accuracy is moderate at best (r = −.34 for items in the randomized condition, and r = −.41 for items in the blocked condition). Therefore, using only RTs to represent item difficulty may not be the best way to organize the stimuli in the blocked condition. One possibility would be to define level of difficulty after voice key factors are removed, and also to take accuracy levels into consideration.

In addition, we note that some of the list-to-list variability in RTs in the blocked condition may be due to the randomization of block order. Thus, a particular block of words would be preceded by a block of very easy words for one participant, and by a block of more difficult words for another participant. The only way for the participant to determine the list difficulty would be to proceed through the list. Perhaps one would want to modify the experiment to include information about the difficulty of the items in any upcoming block of stimuli. For example, one might say something like “On the basis of previous research, the following list of words is typically perceived to be very easy/easy/moderately easy/moderately difficult/difficult/very difficult.” However, anything that we suggest about blocking procedures at this point would be purely speculative. More research will need to be done to identify a more optimal method of blocking that would lead to a smoother function between item difficulty and RT.

Finally, it is important to note that, although the overall R² differences between the blocked and random conditions (2.8% in raw RTs, 2.6% in zRTs) may seem somewhat small on the surface, they are actually quite substantial. As a comparison, consider that adding AoA in the final step of a regression model, as Cortese and Khanna (2008) did, increased R² by less than 0.5% above and beyond the other predictors. In our experiment, the difference in R² between the two conditions is larger than the contribution of any single predictor variable when that variable is added to the regression equation (performed on overall RTs) in a step after all of the other variables have been assessed. More specifically, word frequency accounted for a 2.0% change in R², and all of the other predictors accounted for less than that.

In sum, the experiment reported in this article indicates that participants reading aloud words in megastudies utilize a response criterion based on the perceived difficulty of the words in a list. When the words follow a standard random-presentation schedule, this criterion functions to restrict the range of RTs and reduces the strength of the relationships between item-level variables, such as frequency and AoA, and RT. Alternatively, blocking items together on the basis of average RT increases the range of RTs and strengthens these relationships. In addition, blocked presentation strengthens the predictive relationship between the CDP++ model RTs and human RTs. More research will need to be done to establish the optimal blocking technique.