The number of occurrences of a word within a corpus is one of the best predictors of word processing time (Howes & Solomon, 1951). High-frequency words are processed more accurately and more rapidly than low-frequency words, both in comprehension and in production (Baayen, Feldman, & Schreuder, 2006; Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Monsell, 1991; Yap & Balota, 2009). This word frequency effect has been observed in many different tasks, such as lexical decision (Andrews & Heathcote, 2001; Balota et al., 2004), perceptual identification (Grainger & Jacobs, 1996; Howes & Solomon, 1951), pronunciation (Balota & Chumbley, 1985; Forster & Chambers, 1973), and semantic categorization (Andrews & Heathcote, 2001; Taft & van Graan, 1998). The effect is also robust, having been found in many languages.

For a long time, corpora were compiled from written texts, principally books (Kučera & Francis, 1967; Thorndike & Lorge, 1944). Book-based corpora were created in different languages: Brulex (Content, Mousty, & Radeau, 1990) and Frantext, used in the Lexique database in French (New, Pallier, Ferrand, & Matos, 2001); Celex in English, German, and Dutch (Baayen, Piepenbrock, & van Rijn, 1993); and Kučera & Francis in English (Kučera & Francis, 1967). Some of these corpora continued to be used despite their age and despite some criticisms (see Brysbaert & New, 2009, for a discussion of the Kučera & Francis corpus).

More recently, another source of corpora was found to be reliable: movie subtitles. Subtitle-based frequencies were first computed in French by New, Brysbaert, Véronis, and Pallier (2007), who showed two main results. First, the subtitle-based frequencies were a better predictor of reaction times than the book-based frequencies. Second, the subtitle-based frequencies were complementary to the book-based frequencies: For instance, words typical of everyday spoken language were much more frequent in the subtitle-based than in the book-based corpora. Because the book-based and subtitle-based frequencies proved complementary in the analyses (together they explained more variance than separately), the authors concluded that book-based frequencies can be good estimates of written language and subtitle-based frequencies good estimates of spoken language. Subtitle-based frequencies were subsequently created, and these results replicated, in other languages, such as English (Brysbaert & New, 2009), Dutch (Keuleers, Brysbaert, & New, 2010), Chinese (Cai & Brysbaert, 2010), Greek (Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010), Spanish (Cuetos, Glez-Nosti, Barbon, & Brysbaert, 2011), German (Brysbaert, Buchmeier, Conrad, Jacobs, Bölte, & Böhl, 2011), British English (van Heuven, Mandera, Keuleers, & Brysbaert, 2014), and Polish (Mandera, Keuleers, Wodniecka, & Brysbaert, 2015).

Another source that has yielded good frequency measures is the Internet, which presents two advantages. First, it is easier to build a large corpus from the Internet than from books (there is no need to scan documents). Second, the language used on the Internet is more varied than the language found in books. Lund and Burgess (1996) proposed a corpus (named HAL) based on approximately 160 million words taken from Usenet newsgroups. Burgess and Livesay (1998) found that the word frequencies from HAL were a better predictor of lexical decision times than the Kučera and Francis (1967) frequencies. Balota, Cortese, Sergent-Marshall, Spieler, and Yap (2004) reached the same conclusion, and they recommended the HAL frequencies (Balota et al., 2007). According to Balota et al. (2004), the poor performance of the Kučera and Francis frequencies was largely due to the small size of the corpus. To investigate the importance of corpus size, Brysbaert and New (2009) selected sections of various sizes (from 0.5 million to 88 million words) from the British National Corpus (Leech, Rayson, & Wilson, 2001) and correlated the word frequencies in the different sections with the reaction times in the English Lexicon Project (Balota et al., 2007). The percentage of variance accounted for in the lexical decision times reached its peak when the section size was 16 million words, especially for low-frequency words. The conclusion was that a corpus of 16 million words seems to be sufficient.

Another important variable for explaining word processing times is contextual diversity (Adelman, Brown, & Quesada, 2006), defined as the number of documents in a corpus in which a given word appears. Adelman et al. showed that contextual diversity was a better predictor than word frequency in naming and lexical decision tasks, a result confirmed with subtitle-based frequencies (Brysbaert & New, 2009).
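The distinction between the two counts can be illustrated with a toy sketch (our own illustration, not code from any of the cited studies): word frequency counts every occurrence of a word in the corpus, whereas contextual diversity counts only the number of documents containing it.

```python
from collections import Counter

def frequency_and_cd(documents):
    """Compute raw word frequency and contextual diversity (CD).

    Frequency counts every occurrence of a word across the corpus;
    CD counts the number of documents in which the word appears
    at least once.
    """
    freq = Counter()
    cd = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        freq.update(tokens)
        cd.update(set(tokens))  # each document contributes at most 1 to CD
    return freq, cd

# Toy corpus: "the" occurs three times but in only two documents.
docs = ["the cat sat", "the dog saw the cat", "birds sing"]
freq, cd = frequency_and_cd(docs)
```

In this toy corpus, "the" has a frequency of 3 but a contextual diversity of only 2, which is exactly the dissociation that makes the two measures separable predictors.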

Nowadays, a great number of languages still lack reliable word frequency norms. Moreover, people now spend a great deal of time on the Internet. For instance, Americans spend 23 hours per week texting, a figure that has increased since the previous year, and the proportion of Internet users is rising in most countries.

The goal of this article is twofold. First, we wanted to make available new word frequencies based on Twitter, blogs, and newspapers for 66 different languages. The distinction between the three sources (Twitter, blogs, newspapers) is justified because their language constraints differ: Blog and newspaper frequencies are similar in that neither source is limited in length, whereas Twitter and blog frequencies are similar in that both allow more informal language than newspapers, but they differ in length (a tweet is limited to 140 characters). We also hypothesized that Twitter would be closer to spoken language, since anybody can produce short text messages that appear on Twitter, while blogs would be closer to written language, since writing a blog requires more investment than writing on Twitter. Second, we wanted to test whether these Web frequencies are as reliable as the already established frequencies. To test the reliability of these new frequencies, we used reaction times from megastudies available in French, English, Dutch, Malay, and simplified Mandarin Chinese.

Method

In this section, we describe how the Twitter, blog, and newspaper corpora were collected. Then we describe the word frequencies used from books and subtitles. Finally, we present the five megastudies (French, English, Dutch, Malay, and Chinese) that we used to validate all of these frequencies.

New frequencies and contextual diversity from Twitter, blogs, and newspapers

The three new frequencies were calculated from a freely available collection of corpora for various languages created by Hans Christensen. The documents were collected from public Web pages by a Web crawler, and each entry is tagged with its date of publication. The main characteristics of the three corpora, such as their sizes for the different languages, are presented in Table 1 for the 66 languages. All sources taken together, 51 of the 66 languages are based on a corpus containing more than 10 million words, and 39 of the 66 on a corpus containing more than 16 million words. We downloaded the corpora and converted them to lowercase. We then calculated the frequencies of all the different words in the corpora, and the lists were filtered with the spellchecker Hunspell 1.3.2, or with Aspell 0.60.6.1 (nine languages) when no Hunspell dictionary was available for the language. This spellchecker filtering was important, because the original word lists contained many entries with orthographic or typographic errors, as well as foreign-language entries. The corpora contain foreign-language words for two reasons: The automatic language identifier may have confused two similar languages, or parts of a text mainly in one language may be in a different language. Twelve of the languages could not be filtered because no reliable spellchecker could be found for them. Because Chinese and Japanese do not separate words with spaces, we tokenized Chinese using the Stanford Word Segmenter 3.5.1, and Japanese with Kuromoji 0.7.7.
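The filtering step can be sketched as follows (a minimal illustration, not the actual pipeline: the real filtering used Hunspell/Aspell dictionaries, for which a simple set of valid word forms stands in here):

```python
def spellcheck_filter(raw_freqs, is_valid):
    """Keep only frequency entries accepted by a spellchecker.

    `is_valid` stands in for a real Hunspell/Aspell lookup; for
    illustration, any predicate over word forms will do.
    """
    return {word: n for word, n in raw_freqs.items() if is_valid(word)}

# "teh" is a typo and "maison" a foreign-language intrusion; both are
# dropped by the (stand-in) English dictionary.
raw = {"the": 120, "teh": 3, "maison": 5}
english = {"the", "cat", "house"}
clean = spellcheck_filter(raw, english.__contains__)  # {"the": 120}
```

The same predicate-based design lets the unfiltered list be kept alongside the filtered one, which is how both versions of the frequency files described below can be produced from one pass over the corpus.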

Table 1 Numbers of words (in millions) and numbers of documents in the three new corpora for each language

Existing frequencies (which we call the “classic frequencies” from now on)

French

We used the book-based frequency (Frantext) and the subtitle-based frequency (Subtlex-FR) from Lexique 3.80 (New, Pallier, Brysbaert, & Ferrand, 2004; website: www.lexique.org).

English

We used the HAL frequencies (Burgess & Livesay, 1998) and the Subtlex-US frequencies (Brysbaert & New, 2009).

Dutch

We used the Celex frequencies (Baayen et al., 1993) and the Subtlex-NL frequencies (Keuleers et al., 2010).

Malay

We used frequency counts from two corpora based, respectively, on a Malaysian newspaper and a Singaporean newspaper (Yap, Rickard Liow, Jalil, & Faizal, 2010).

Chinese

Because the Chinese Lexicon Project (see the section about megastudies below) makes available lexical decision reaction times for single Chinese characters, and because Chinese characters can be words on their own or can combine with other characters into words, we used two different frequencies: the word frequencies (counting only occurrences of a character presented as a single word) and the character frequencies, which include all occurrences of a character (Cai & Brysbaert, 2010).

Megastudies in French, English, Dutch, Malay, and Chinese

In order to validate the new frequencies empirically, we ran analyses on megastudies, which are large databases of descriptive and behavioral data. The first such study was published by Balota et al. (2007). In the English Lexicon Project (ELP), they collected naming times and lexical decision times for over 40,000 English words from several hundred participants. Here we tested our new frequencies on English behavioral data coming from the ELP, on French behavioral data (the French Lexicon Project, or FLP; Ferrand et al., 2010), on Dutch (Keuleers, Diependaele, & Brysbaert, 2010), on Malay (the Malay Lexicon Project, or MLP; Yap et al., 2010), and on Chinese (Sze, Rickard Liow, & Yap, 2014).

Results

Classic frequencies versus new frequencies

We conducted linear regression analyses to predict standardized lexical decision times from the classic frequencies, the new frequencies, and all frequencies together. The frequency measures used were the log10-transformed raw frequencies plus 1 (Baayen et al., 2006), entered as a third-degree polynomial because Balota et al. (2004) showed a nonlinear relationship between log frequency and lexical decision times. In each regression analysis, we included the number of letters, the number of syllables, and an orthographic neighborhood measure (OLD20; Yarkoni, Balota, & Yap, 2008) as control variables. The number of letters was entered as a second-degree polynomial because New, Ferrand, Pallier, and Brysbaert (2006) showed that the relationship between lexical decision times and word length is not linear. All words with an error rate greater than 33 % were excluded from the analyses. In the end, we conducted our analyses on 35,658 words in French, 32,088 words in English, 11,855 words in Dutch, 1,363 words in Malay, and 2,277 words in Chinese.
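This regression specification can be sketched in plain NumPy (our own illustration, not the authors' analysis code; the variable names are ours): the design matrix combines a third-degree polynomial of log10(frequency + 1) with a quadratic length term and the other controls.

```python
import numpy as np

def design_matrix(raw_freq, n_letters, n_syllables, old20):
    """Build the predictors: a degree-3 polynomial of log10(freq + 1),
    a degree-2 polynomial of word length, plus syllables and OLD20."""
    logf = np.log10(np.asarray(raw_freq, dtype=float) + 1.0)
    L = np.asarray(n_letters, dtype=float)
    return np.column_stack([
        np.ones_like(logf),           # intercept
        logf, logf**2, logf**3,       # nonlinear frequency effect
        L, L**2,                      # nonlinear length effect
        np.asarray(n_syllables, dtype=float),
        np.asarray(old20, dtype=float),
    ])

def fit_ols(X, rt):
    """Ordinary least squares via lstsq; returns coefficients and R^2."""
    beta, *_ = np.linalg.lstsq(X, rt, rcond=None)
    resid = rt - X @ beta
    r2 = 1.0 - (resid @ resid) / ((rt - rt.mean()) ** 2).sum()
    return beta, r2
```

Entering the polynomial terms as separate columns is what lets a single linear model capture the nonlinear frequency and length effects reported by Balota et al. (2004) and New et al. (2006).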

Table 2 shows the percentages of variance in reaction times explained by the different frequency measures. For each analysis, the value presented in Table 2 is the adjusted R² multiplied by 100 (to express it as a percentage). In addition to the adjusted R² values, we also used inferential tests to compare the different linear models. More precisely, when two nested models were compared, we used an analysis-of-variance test; when two nonnested models were compared, we used Vuong’s test (Vuong, 1989).
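The two comparison quantities used for the nested cases can be written out as follows (a sketch only: the adjusted R² formula is standard, the F statistic applies to nested-model comparisons, and Vuong's test for nonnested models is more involved and not shown):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (excluding the
    intercept); it penalizes models for adding extra predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def nested_f_test(ss_res_reduced, ss_res_full, df_reduced, df_full):
    """F statistic comparing a full model against a nested reduced model.

    df_* are residual degrees of freedom; the p-value would come from an
    F(df_reduced - df_full, df_full) distribution (e.g., scipy.stats.f.sf).
    """
    num = (ss_res_reduced - ss_res_full) / (df_reduced - df_full)
    return num / (ss_res_full / df_full)
```

The adjustment matters here because the five-frequency models contain more predictors than the two- or three-frequency models, so raw R² alone would favor them mechanically.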

Table 2 Percentages of variance in reaction times (RT) explained by the different frequency measures

In French, the first line in Table 2 shows the results of the regression analysis when the classic frequencies were used as predictors. The second line presents the results when the three new frequencies were used, and the third line presents the results when all five frequencies were used. The first result to note is that the three new frequencies explained more variance (48.46 %) than the classic frequencies (47.56 %), a difference that was significant (R² change = 0.9, p < .0001). The second result is that the greatest percentage of variance was explained when all five frequencies were entered as predictors in the regression model (50.27 %): This model performed significantly better than those based on the two classic frequencies (R² change = 2.71, p < .0001) or the three new frequencies (R² change = 1.81, p < .0001).

In English, we observed a pattern similar to the one in French: Our new frequencies explained as much variance (68.06 %) as the classic frequencies (67.99 %), the difference being nonsignificant (R² change = 0.07, p = .78). The model containing all five frequencies explained more variance (69 %) than the models with the new or the classic frequencies alone (R² change = 1.01 for the classic frequencies and 0.94 for the new frequencies, ps < .0001).

In Dutch, contrary to the results in French and English, the classic frequencies explained more variance (40.2 %) than our new frequencies (38.92 %), and this difference was significant (R² change = 1.28, p = .004). However, as in French and English, the model containing all five frequencies was better, explaining more variance (42.81 %) than the two other models (R² change = 2.61 for the classic frequencies and 3.89 for the new frequencies, ps < .0001). A possible explanation for these slightly worse results in Dutch is that the language detectors used when collecting the new corpora often confused Dutch and Flemish (the Dutch language spoken in Flanders), which would be less the case for the subtitle corpus.

In Malay, our new frequencies explained considerably more variance (65.87 %) than the classic ones (63.2 %), a difference that was significant (R² change = 2.67, p < .01). Again, the model including all five frequencies explained more variance (67.44 %) than the other models (R² change = 4.24 for the classic frequencies and 1.57 for the new frequencies, ps < .0001).

Finally, in Chinese, we ran the analyses using both character frequencies and word frequencies. For the character frequencies, the variance explained by our new frequencies (52.86 %) was greater than that explained by the classic frequencies (47.56 %), a difference that was significant (R² change = 5.3, p < .0001). As in the other languages, the model with all five frequencies explained significantly more variance (53.86 %) than the other models (R² change = 6.3 for the classic frequencies and 1.0 for the new frequencies, ps < .0001). For the word frequencies, we observed the same pattern (all of the differences were significant except for the difference between the new and classic frequencies, which was almost significant). It is worth noting that the variance explained was much greater with the character frequencies than with the word frequencies, suggesting that, as in alphabetic languages, whenever a word is presented the characters in this word are activated (Baayen, Dijkstra, & Schreuder, 1997; New, Brysbaert, Segui, Ferrand, & Rastle, 2004; New & Grainger, 2011). For this reason, the subsequent Chinese analyses in this article are presented using the character frequencies only.

Because our new frequencies explained similar or greater amounts of variance than the classic frequencies in all five of these languages, we can conclude that these new frequencies are reliable alternatives to the frequencies used until now. Furthermore, the fact that the model including all five frequencies explained more variance than the other models in all five languages means that the two frequency sources are complementary. In Malay and Chinese, our new frequencies gave markedly better results than the classic frequencies. One possible explanation is that these are the two most recent megastudies. We will return to this issue in the Discussion.

Relationship between classic frequencies and new frequencies

Table 3 shows the Pearson correlation coefficients for each pair of the five frequencies (except for Chinese, with only four frequencies). All correlations were significant (ps < .0001) in all languages. In French, the new frequency that correlated the strongest with Subtlex-FR was Twitter (.920), and the one that correlated the strongest with Frantext was blogs (.972). The same pattern was found in English: Among the three new frequencies, the one most correlated with Subtlex-US was Twitter (.925), and the one most correlated with HAL was blogs (.988). In Dutch, the new frequency that correlated the strongest with Subtlex-NL was Twitter (.877), and the one that correlated the strongest with Celex was news (.974). In Malay, the new frequency that correlated the strongest with both MlNews and SgNews was news; this last result is easily explained, since all three of these frequencies come from newspaper corpora. In Chinese, the new frequency that correlated the strongest with Subtlex-CH was Twitter (.879). Overall, the results showed that when one wants a frequency similar to book or HAL frequencies, blog frequencies are the better choice, and when one wants a frequency similar to spoken or subtitle frequencies, Twitter frequencies are the better choice.

Table 3 Correlations for each pair of the five frequencies

Twitter versus blogs versus newspapers

Among the three new frequencies, are all three useful, or could some of them be omitted? To answer this question, we ran regression analyses to test whether the adjusted R² decreased when one of the three frequencies was removed. Table 4 presents the percentages of variance in reaction times explained by the new frequency measures. For each analysis, the value presented in Table 4 is the adjusted R² multiplied by 100 (to express it as a percentage).

Table 4 Regression results to compare the three new frequencies

In French, the first line in Table 4 presents the percentage of variance in reaction times explained by the three new frequencies as predictors (48.46 %). The second line presents the regression model with only Twitter and blogs as predictors: The percentage of variance explained is virtually the same (48.4 %), although, because of the large number of observations, the difference was significant (p < .0001). When Twitter (48.2 %) or blog (46.19 %) frequencies were removed from the model (third and fourth lines), we observed larger drops in the variance explained (ps < .0001). In English, we observed a similar pattern: The variances explained were virtually the same with the three frequencies (68.06 %) and with the newspaper frequencies removed (68 %), even though the difference was significant (p < .0001). When Twitter (67.18 %) or blog (67.51 %) frequencies were removed, we observed larger drops in the variance explained (ps < .0001). In Dutch, the variances explained were almost the same with the three frequencies (38.92 %) and without the newspaper frequencies (38.86 %), but the difference was significant (p = .003). Without Twitter (38.33 %) or without blog frequencies (37.04 %), the drops in variance were greater (ps < .0001). In Malay, again, the variances explained were very similar with all three frequencies (65.87 %) and with a model based only on blog and Twitter frequencies (65.91 %), and this difference was not significant (p = .64). In contrast, the variance explained dropped when Twitter (65.3 %) or blog (64.8 %) frequencies were removed from the model (ps < .0001). Finally, in Chinese, although all differences were significant (all ps < .01), the differences between the three-frequency model (52.86 %) and the models without newspapers (52.66 %) or without Twitter (52.53 %) were small, whereas the drop in variance was greater when blogs were removed (51.53 %).
Overall, these results indicate that in the five languages, the newspaper frequencies are not crucial if we already use blog and Twitter frequencies.

Frequency versus contextual diversity

In the book and subtitle corpora, contextual diversity (CD) is a better predictor of word processing than word frequency (Adelman et al., 2006; Brysbaert & New, 2009). We decided to verify whether we could replicate this result with our new corpora. To do so, we ran regression analyses comparing the word frequency and the CD measure for each of the three new corpora. Table 5 presents the percentages of variance in reaction times explained by the different frequency and CD measures. For each analysis, the value presented in Table 5 is the adjusted R² multiplied by 100 (to express it as a percentage).

Table 5 Regression results to compare the frequency and contextual diversity (CD) measures

In French, the first line shows the percentage of variance explained when Twitter frequency or Twitter CD was used as a predictor: The values were extremely similar (43.77 for Twitter frequency and 43.78 for Twitter CD), and the difference was not significant (p = .14). The same pattern was observed for blogs (except that the difference was significant: p < .0001) and for news (p = .4). In English and in Malay, none of the differences was significant (ps > .05). In Dutch, all differences were significant (ps < .0001), but for each predictor, frequency outperformed CD. Finally, in Chinese, the variances explained by the frequency and CD measures were very similar, and none of the differences was significant. From these results, we can conclude that for the three new corpora (Twitter, blogs, and news), CD is not a better predictor than word frequency.

Effect of time on the classic and new frequencies

Twitter and blogs are recent tools compared to books and even subtitles, since subtitles have been collected from movies released from the beginning of sound cinema to the present day; Twitter, for instance, was created only in 2006. As a consequence, we predicted that our new frequencies would improve with time and would predict reaction times better and better. A first way to test this prediction is to compare reaction times collected in different years. We compared the performance of the Subtlex, Twitter, and blog frequencies on two megastudies in American English that used the same methodology: the reaction times for young adults in Balota et al. (2004) and the reaction times in the ELP (Balota et al., 2007). The results are presented in Table 6.

Table 6 Regression results to compare Subtlex, Twitter, and blog frequencies in 2004 and 2007 (2,511 words)

The results show that the variance explained by Subtlex did not improve between 2004 and 2007, whereas the variance explained by the Twitter (+~1 %) and blog (+~2 %) frequencies did.

Discussion

The goals of this study were to empirically validate new frequencies derived from Twitter, blog, and newspaper corpora, and to make these new frequencies available for 66 languages, many of which had no good frequency source until now.

Our results showed that these new frequencies predict lexical decision reaction times as well as, or even better than, the frequencies used until now, such as book-based and subtitle-based frequencies. This result was found in French, English, Dutch, Malay, and Chinese; therefore, we can reasonably infer that the new frequencies can be used in other languages as well. For the great number of languages that do not yet have reliable frequencies, we provide frequencies calculated from Twitter, blog, and newspaper corpora. Furthermore, we showed that, for the five languages we analyzed, the newspaper frequencies did not explain much variance beyond that already explained by the Twitter and blog frequencies.

Another advantage of these frequencies is that the original corpora are available for download for people looking for extra information, such as the contexts in which the words occur or the frequency of a chain of words.

Surprisingly, in our analyses we did not observe that contextual diversity was a better predictor of latencies than word frequency, even though that result has been replicated several times (for lexical decision tasks, see, e.g., Keuleers et al., 2010, or Cai & Brysbaert, 2010). A possible explanation is that the documents in our corpora were too small: For copyright reasons, a document in our corpus contains only a few sentences (mostly one to three, often just one). From this result, we can conclude that when documents are this small, the difference between frequency and CD disappears, so both indices can be used interchangeably.

Moreover, these new frequencies were collected in identical ways across the different languages, which can be useful for cross-language studies: It allows researchers to control frequency in very similar ways across possibly very different languages. For example, studies of bilingualism could benefit from Worldlex: If bilinguals have to read the same words in two languages, Worldlex is a useful tool for controlling the frequencies across the two languages.

Although the Twitter and blog frequencies are complementary, it is worth noting that the blog frequencies were always as good as or better than the Twitter frequencies. A possible explanation is that, because tweets are often messages typed on a cell phone, the Twitter frequencies would be particularly affected by word length, whereas this bias would not be present for the blog frequencies. To test this prediction, we compared Twitter and blog frequencies for different word lengths and found no evidence in its favor (at least for words of fewer than 12 letters). To summarize, for a given language, if blog frequencies are better than Twitter frequencies, this difference holds regardless of word length.

Finally, the sources of these new corpora are very recent (especially Twitter and blogs), and despite their young age, the lexical frequencies are already very good at predicting reaction times in lexical decision tasks. If Twitter and blogs continue to grow in popularity, these frequencies may become better and better. Indeed, our analysis comparing English megastudies from 2004 and 2007 suggested that our new frequencies improve with time. This interpretation could also explain why the new frequencies predicted reaction times much better than the classic frequencies in Malay and Chinese, since those two megastudies are the most recent.

Availability

The WorldLex Twitter, blog, and newspaper frequencies are available at http://worldlex.lexique.org. For each language, a word frequency file was created containing the following information: the raw frequency, the frequency per million words, the contextual diversity, and the percentage of contextual diversity. This information is available for the blog, Twitter, and newspaper corpora (except, of course, when one of these corpora was not available). For each language, there are two files: one with the frequencies of all the character strings in the corpus, and one with only the strings validated by the spellchecker for that language. As already mentioned, the advantage of the spellchecked file is that it contains many fewer orthographic or typographic errors than the raw-frequency file, and no foreign-language words. However, we have also made the unfiltered file available, because it can contain proper names or new words that were removed by the spellchecker. Except for such specific uses, we advise researchers to use the spellchecked files.
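The derived columns follow directly from the raw counts; as a sketch (the exact column headers in the files are not specified here, so the function names below are ours):

```python
def per_million(raw_count, corpus_size_in_words):
    """Frequency per million words from a raw occurrence count."""
    return raw_count * 1_000_000 / corpus_size_in_words

def cd_percentage(doc_count, total_documents):
    """Contextual diversity expressed as a percentage of the documents
    in the corpus."""
    return 100.0 * doc_count / total_documents

# A word occurring 480 times in a 16-million-word corpus:
fpmw = per_million(480, 16_000_000)  # 30.0 occurrences per million
```

Normalizing to occurrences per million words and to a percentage of documents is what makes the values comparable across corpora of very different sizes, such as the Twitter and newspaper corpora of a given language.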