About 7 % of the responses were empty cells, which were removed. Valid responses were defined as either a numeric AoA rating that was less than the responder’s age or an “x” response, which signified a “Don’t know” answer. AoA ratings that were equal to the responder’s age were relabeled as “Don’t know” responses (less than 0.5 % of all responses). About 1 % of the nonempty responses were removed because they did not match our definition of a valid response or exceeded the responder’s age. The participants were instructed that, to earn payment for a completed list, their ratings had to reach a minimum correlation with the control words in the list. This discouraged participants from simply entering random numbers in order to receive easy payment (a similar precaution is taken in laboratory studies, where participants are excluded if their ratings do not correlate with those of the other participants; e.g., Ghyselinck et al., 2000). Participants were paid if they provided valid numeric ratings for 30 or more of the 52 control words and if those ratings correlated at least .2 with the Bristol norms.
In the data analysis, we removed all target lists whose ratings for the set of control words correlated less than .4 with the Bristol norms. This led to the removal of 350 lists, or 126,700 ratings (15 % of the collected ratings). Finally, the distribution of AoA ratings had a positive skew; we therefore removed another 1 % of extremely large AoA ratings (ratings exceeding 25 years of age) to attenuate the disproportionate influence of outliers on statistical models. The resulting data set comprised 696,048 valid ratings, accounting for 83 % of the original data set. Of these, 615,967 were numerical (89 % of the valid ratings), and 76,211 (11 % of the valid ratings) were “Don’t knows.” The resulting pool comprised 1,729 responders, or 88 % of the original participant pool. Of the words included in our study, 2,300 (7.7 %) were not known to half of the respondents. For completeness, this article and the supplementary materials provide mean numeric ratings for all of the words, and we base our correlational and regression analyses on the full word list. For experiments with a small number of items, however, it is advisable to use the mean numeric ratings only if they are based on at least five numeric responses.
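The screening steps described in the two paragraphs above can be sketched in code. This is a minimal illustration, not the actual pipeline: the function names and data structures are assumptions, and only the thresholds (ratings above the responder’s age or above 25 years removed, equal-to-age ratings relabeled as “Don’t know,” lists below a .4 correlation with the Bristol norms discarded) come from the text.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def clean_response(raw, responder_age):
    """Classify one raw cell: a numeric AoA, 'dont_know', or None (invalid)."""
    if raw.strip().lower() == "x":           # explicit "Don't know" answer
        return "dont_know"
    try:
        aoa = float(raw)
    except ValueError:
        return None                          # not a valid response
    if aoa == responder_age:                 # relabeled as "Don't know"
        return "dont_know"
    if aoa > responder_age or aoa > 25:      # impossible rating or extreme outlier
        return None
    return aoa

def keep_list(control_ratings, bristol_norms):
    """Retain a target list only if its control-word ratings
    correlate at least .4 with the Bristol norms."""
    return pearson(control_ratings, bristol_norms) >= 0.4
```

Note that the order of the checks matters: an equal-to-age rating is relabeled before the outlier filters apply, mirroring the order of the steps reported above.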
All but eight of the words received 18 or more valid ratings. The correlation between the mean numeric ratings for the control words and the Bristol norms was r = .93 (N = 50, p < .0001). The correlation between the odd-numbered and even-numbered participants for the items with 10 or more numeric ratings (N = 26,532) was r = .843, which gives a very high split-half reliability estimate of 2 × .843/(1 + .843) = .915.
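The reliability estimate above is the Spearman-Brown prophecy formula, r_full = 2r / (1 + r), applied to the odd/even split-half correlation. A quick check of the arithmetic:

```python
def spearman_brown(split_half_r):
    """Spearman-Brown correction: reliability of the full test
    estimated from the correlation between its two halves."""
    return 2 * split_half_r / (1 + split_half_r)

# The odd/even split-half correlation reported above:
print(round(spearman_brown(0.843), 3))  # 0.915
```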
Some previous studies collecting AoA norms in blocks of words (e.g., Bird et al., 2001; Stadthagen-Gonzalez & Davis, 2006) used a linear transformation procedure to homogenize the means and standard deviations of the blocks (for details, see p. 600 of Stadthagen-Gonzalez & Davis, 2006). We applied this procedure to a random sample of five of our lists and found that the differences between the raw and the corrected ratings were negligible (usually less than 0.2). Therefore, we decided not to apply this transformation to our data.
Of the valid responders, 1,136 were female and 593 were male. Their ages ranged from 15 to 82 years, with 8 % of the responders being younger than 20 years, 47 % from 20 to 29 years old, 22 % from 30 to 39, 12 % from 40 to 49, and 11 % older than 49. Twelve of the participants (0.7 %) reported a single language other than English as their first language; another 31 responders (1.8 %) reported more than one first language, including English. Because these responders’ ratings did not differ from those of the rest, they were retained.
Education levels were labeled as follows: 1, Declined to answer or No high school; 2, High school graduate; 3, Some college, no degree; 4, Associate degree; 5, Bachelor’s degree; and 6, Master’s or higher degree. Table 1 shows the distribution of ratings and responders over the various categories. Most of the participants came from categories 3 (some college) and 5 (bachelor’s degree).
Does demography affect the numeric ratings?
Women gave slightly but significantly higher numeric AoA ratings (M = 10.2, SD = 4.4) than did men (M = 10.1, SD = 4.2; t = −10.27, df = 440,410, p < .0001). The numeric AoA ratings did not vary with the education levels of the responders, as is shown in the box plots of the AoA ratings in Fig. 1. This null effect in subjective judgments of AoA is surprising, given the wealth of developmental literature showing that early advantages in vocabulary size (e.g., larger numbers of word types learned earlier) are excellent predictors of future educational achievement (e.g., Biemiller & Slonim, 2001).
AoA correlated strongly with word frequency, and the relationship was log-linear (see below). To test whether this association was affected by education level, we divided education into low (Levels 1–3, up to but excluding the associate degree) and high (Levels 4–6). Figure 2 shows the functional relationships between the AoA ratings and log10 SUBTLEX frequency for both groups. There is a hint of an interaction (significant at p < .05, due to the very high number of observations), but the effect is very small: higher-educated individuals tended to give earlier AoAs for high-frequency words and later AoAs for low-frequency words than did lower-educated individuals, but both differences were well within 0.2 years.
Finally, there was a weak positive correlation between AoA ratings and the age of the participants (r = .07; t = 61.00, df = 615,965, p < .0001). On average, older participants gave higher AoA ratings than did younger participants, presumably because they had a broader age range to choose from.
Does demography affect the number of “don’t knows”?
For each word, we computed the ratio of numerical responses to total responses as an index of the responders’ familiarity with the word. The ratio correlated strongly with the log frequency of the word (r = .56; t = 509.9, df = 565,587, p < .0001), but no demographic variable was a significant predictor of the ratio. Perhaps most surprisingly, the average percentage of unknown words did not vary with education level, ranging from 12 % for the “no high school” level to 11 % for the “master’s or higher” level.
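As a minimal sketch (the function name and the example counts are hypothetical), the familiarity index for a word is simply the proportion of its valid responses that were numeric rather than “Don’t know”:

```python
def familiarity_index(n_numeric, n_dont_know):
    """Proportion of numeric responses among all valid responses
    to a word; higher values mean the word is more widely known."""
    total = n_numeric + n_dont_know
    return n_numeric / total if total else float("nan")

# e.g., a word rated numerically by 16 of its 20 responders:
print(familiarity_index(16, 4))  # 0.8
```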
Correlations with other AoA norms
Of course, the most important question is how strongly our Web-collected ratings correlate with those of typical laboratory studies, and whether we jeopardized the quality of the data by using less controlled sources. We can compare our mean ratings with those from three large-scale studies: Cortese and Khanna (2008) collected AoA ratings for 3,000 monosyllabic words from 32 psychology undergraduates from the College of Charleston. Bird et al. (2001) collected ratings for 2,700 words from 45 participants in the U.K.; most of their participants were between 50 and 80 years of age (mean age 61 years). Finally, Stadthagen-Gonzalez and Davis (2006) collected norms for 1,500 words from 100 undergraduate psychology students from Bristol and combined them with the Gilhooly and Logie (1980a, 1980b) ratings (collected in Aberdeen) for another 1,900 words.
Our data set had 2,544 words in common with that of Cortese and Khanna (2008). The correlation between our ratings and theirs is r = .93 (Fig. 3). A total of 1,787 words were shared with Bird et al. (2001), and these ratings correlated r = .83. Finally, 3,117 words were shared with the Bristol norms, which correlated r = .86 with our ratings.
On the basis of these correlations, we can safely conclude that our ratings are as valid as those previously collected under more controlled circumstances. Some small differences may be present in the AoA ratings between the U.S. and the U.K., given the higher correlation with the Cortese and Khanna (2008) ratings than with the Bird et al. (2001) and Stadthagen-Gonzalez and Davis (2006) ratings.
Correlation with the lexical decision data of the English Lexicon Project
Further validation of our AoA ratings was obtained by correlating them with the lexical-decision data of the English Lexicon Project (ELP). Our lists have 20,302 words in common with the ELP. For these words, we calculated the correlations with AoA, log frequency, word length in number of letters and syllables, Coltheart’s N, and OLD20 (values were from the ELP website). Because the correlations are higher with standardized response times (zRTs) than with raw response times (Brysbaert & New, 2009), we used the former behavioral measure. Table 2 summarizes the results.
As can be seen in Table 2, AoA has the second highest correlation with zRT (after log frequency) and the highest correlation with the percentage of correct responses. Surprisingly, the relationship of the mean AoA ratings with lexical-decision times was completely linear, with an estimated 27-ms increase in response time per increase by 1 year of AoA (see Fig. 4).
The importance of the AoA variable becomes yet clearer in stepwise multiple regression analyses. In these analyses, we took into account the finding that the effects of log frequency and word length on lexical-decision outcome variables are nonlinear, by using restricted cubic splines for these variables. Of the many analyses that we ran (which can easily be replicated by any interested reader, as all of the values are freely available), we list below the ones that highlight the predictive power of AoA. For their interpretation, it is important to realize that R² differences of even .01 typically (and in the present analyses) come with p values below the conventional thresholds of significance (because of the large numbers of observations).
AoA explains an extra 4 % of variance in zRTs after log word frequency (Freq), word length (in letters [Nlett], and syllables [Nsyl]), and similarity to other words (OLD20) are controlled for. For the accuracy data, the extra variance explained by AoA approaches 10 %. Relative to the influence of other variables (which usually explain less than 1 % additional variance; see the introduction), these are substantial effects.
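The incremental-variance logic of these analyses can be illustrated by comparing the R² values of nested regression models. The sketch below uses synthetic data and plain linear terms (the actual analyses used restricted cubic splines for frequency and length), so the variable names and numbers are illustrative only:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept term."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 5000
freq = rng.normal(size=n)               # stand-in for log frequency
length = rng.normal(size=n)             # stand-in for word length
aoa = 0.5 * freq + rng.normal(size=n)   # AoA partly overlaps with frequency
zrt = -0.6 * freq + 0.2 * length + 0.3 * aoa + rng.normal(size=n)

# R^2 of the baseline model vs. the model that adds AoA:
base = r_squared(np.column_stack([freq, length]), zrt)
full = r_squared(np.column_stack([freq, length, aoa]), zrt)
print(f"extra variance explained by AoA: {full - base:.3f}")
```

The difference `full - base` is the extra proportion of variance attributable to AoA over and above the variables already in the model, which is the quantity reported in the text.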
Are AoA ratings also predictive of inflected word forms?
Having access to AoA ratings of 30,000 lemmas is beneficial in itself, as this is a tenfold increase in the existing pool of AoA ratings. However, it would be even more beneficial if the ratings we collected for lemmas could also be used for the lemmas’ inflected forms. Given that each base noun has one inflected form (the plural) and that a regular base verb has three inflected forms (3rd person as well as present and past participles), the number of words to which our ratings apply would be considerably higher if the ratings also explained differences in lexical-decision performance to inflected word forms. A total of 10,011 inflected word forms in the ELP are associated with one of the lemmas rated in our study. For the correct interpretation of this finding, it is important to realize that the inflected forms do not include verb forms used more frequently as adjectives (such as “appalled”). These were included in our list of lemmas presented to the participants of the AoA study (see above). Table 3 shows the results for the inflected words.
As Table 3 suggests, there were strong correlations between lexical-decision performance on the inflected forms and the AoAs of the base words. The same was true for the frequencies of the base words (e.g., for the inflected form “played,” this would be the frequency of the word “play”). However, because the frequency of the inflected form correlated more strongly with the frequency of the lemma than with the AoA of the lemma, lemma frequency added little unique information once the form frequency was in the model, and AoA came out as the better predictor in the multiple regression analyses, as can be seen below:
Controlling lexical-decision performance for the inflected word forms on lemma AoA, in addition to word frequency, word length, and similarity to other words, gains 2.5 % of explained variance in standardized response times and more than 4.5 % in accuracy.
How does AoA relate to other ratings?
Our data also allow us to examine the relationship of AoA to other word variables. Clark and Paivio (2004) ran an analysis of 925 nouns for which they had information about many rated values, in addition to the usual objective measures (frequency, length, and similarity to other words). More specifically, they looked at the impact of 32 variables, including:
word frequency (Kučera & Francis, Thorndike & Lorge)
estimated word familiarity (two ratings from different studies)
word length (in letters and syllables)
word availability (the number of times a word is given as an associate to another word or is used in dictionary definitions)
number of meanings a word has
estimated context availability (how easy participants find it to think of a context in which the word can be used)
estimated concreteness and imageability (two ratings from different studies)
estimated AoA and number of childhood dictionaries in which the word is explained
emotionality, pleasantness, and goodness ratings of the words, and the degree of deviation from the means
how gender-laden the word is (two ratings from different studies)
number of high-frequency words starting with the same letters
subjective estimates of the number of words that begin with the same letters and sounds, rhyme with the words, sound similar, and look similar
pronounceability ratings of the words
estimated ease of giving a definition, and estimate of whether a word has different meanings
Factor analysis suggested that the 32 variables formed nine factors: frequency, length, familiarity, imageability, emotionality, word onset, gender-ladenness, pleasantness, and word ambiguity. The last factor was the weakest and on the edge of significance.
To see how the new AoA measure related to the variables investigated by Clark and Paivio (2004), we added three extra variables (log SUBTLEX frequency, our new AoA rating, and OLD20) to the list and looked at the correlations with zRT in the ELP lexical-decision task. Values were present for 896 of the original 925 words. Table 4 lists the correlations in decreasing order of absolute value. The correlation with zRT was strongest for word frequency, followed by the estimated pronounceability of the word, familiarity, word availability, and context availability. The lowest correlations were observed for the estimated similarity of the word to other words, the emotionality, and the gender-ladenness of the words. Also of interest, our AoA ratings correlated .90 with those of Clark and Paivio, and correlated slightly more strongly with zRTs than did the Clark and Paivio AoA ratings.
To examine the relationship between our AoA ratings and the many ratings reported by Clark and Paivio (2004), we repeated their factor analysis (using the factanal procedure in R with the default varimax rotation). Because we had slightly fewer observations (896 instead of 925), we failed to observe a significant contribution of the final factor (meaning ambiguity), and we therefore worked with an eight-factor model instead of the original nine-factor model. We also included the additional variables log SUBTLEX-US frequency, OLD20, and zRT in the ELP lexical-decision task. The latter variable allowed us to see on which factors the lexical-decision times load and to what extent these differ from the factors on which the other variables load.
The outcomes of the factor analysis are shown in Table 5. This analysis indicates that lexical-decision times loaded only on the first four factors (word frequency, length, familiarity, and imageability). They were not significantly related to the emotionality, word onset, gender-ladenness, or pleasantness of the words. Interestingly, AoA loaded on exactly the same factors, just as word frequency did. This is further evidence that AoA and word frequency are strongly related to lexical-decision times. For the Clark and Paivio (2004) set of nouns, we also see a strong influence of familiarity, which is surprising, given that in two previous analyses of monosyllabic words, familiarity no longer had a strong influence once a good frequency measure and an AoA measure were included (Brysbaert & Cortese, 2011; Ferrand et al., 2011).
AoA ratings and vocabulary growth
The availability of AoA ratings for a large number of content words also makes it possible to estimate the number of words thought to be learned at various ages—that is, a guesstimated vocabulary growth curve. We divided the mean AoA ratings into yearly bins, from 1 to 17, and computed the cumulative sum of the word types falling in each bin. In Fig. 5, this subjective estimate of vocabulary growth is compared to the estimates obtained via experimental testing of children’s vocabulary by Biemiller and Slonim (2001). Biemiller and Slonim presented multiple-choice questions requiring definitions of words from a broad frequency range to both a representative sample and a sample with an advantaged socioeconomic status. They tested children from Grades 1, 2, 4, and 5 and estimated the number of words acquired from infancy to Grade 5 (see Tables 10 and 11 in Biemiller & Slonim, 2001). We relabeled Grades 1–5 as ages 6–10, respectively.
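The binning procedure described above amounts to counting word types per yearly AoA bin and cumulating the counts. A minimal sketch with hypothetical ratings (the function name and example values are illustrative):

```python
import math
from collections import Counter

def vocabulary_growth(mean_aoas, max_age=17):
    """Cumulative number of word types with mean AoA up to each age.

    mean_aoas: mean AoA rating per word (hypothetical input);
    returns a list whose entry i is the count of words with
    mean AoA falling in the yearly bins up to age i + 1.
    """
    # Assign each word to a yearly bin, clamped to the 1..max_age range.
    bins = Counter(min(max(math.ceil(a), 1), max_age) for a in mean_aoas)
    curve, total = [], 0
    for age in range(1, max_age + 1):
        total += bins[age]   # Counter returns 0 for empty bins
        curve.append(total)
    return curve

# e.g., five words with mean AoAs of 2.4, 5.1, 5.8, 9.0, and 16.2 years:
print(vocabulary_growth([2.4, 5.1, 5.8, 9.0, 16.2]))
```

Plotting such a curve against age yields the sigmoid shape discussed below.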
Figure 5 shows the subjectively estimated vocabulary growth curve on the basis of the AoA ratings (solid line). As can be seen, this is a sigmoid curve typical of learning tasks. Figure 5 further includes the estimates of vocabulary size both for the representative (or normed) sample (thick dashed line) and the group with an advantaged socioeconomic status (thick dotted line), as reported by Biemiller and Slonim (2001). For each group, we also included confidence intervals (based on the estimated numbers of lemmas known to the 0–25th and 75th–100th percentiles of the groups).
Several aspects of the comparison between the estimated and measured vocabulary growth are noteworthy. First, our responders placed the main weight of word learning in the elementary school years, from ages 6 to 12. This underestimates growth in the years 2–5 (the AoA estimates are lower than those in Biemiller & Slonim, 2001) and overestimates growth after the age of 9 (the AoA estimates are higher than those in Biemiller & Slonim, 2001). Responders also reported that hardly any words entered their vocabulary before the age of 3 or after the age of 14. Only a small percentage (1.2 %) of the mean AoA ratings were below 4 years of age, even though receptive vocabulary is not negligible in these age cohorts. This result is in line with the well-described phenomenon of infantile amnesia, the inability of adults to retrieve episodic memories (including lexical memories) formed before a certain age (de Boysson-Bardies & Vihman, 1991). Even the more educated responders (bachelor’s, master’s, or PhD degree), who are likely to have substantially broadened their vocabulary during their years in higher education, reported only a small percentage of words (3 %–5 %) as acquired after the age of 15.