Word frequency influences most aspects of language processing. Together with word length, it is among the strongest predictors of lexical decision times (i.e., the time needed to decide whether a string of letters is a word or a nonword in a lexical decision task, or LDT), where high-frequency words are responded to faster than low-frequency words (e.g., Brysbaert et al., 2011). It is also important in word naming, although to a lesser extent (e.g., Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004). Moreover, word frequency is also relevant in memory research, in which high-frequency words are recalled better but recognized more poorly than low-frequency words (e.g., Cortese, Khanna, & Hacker, 2010). Consequently, this variable is routinely controlled for in psycholinguistic and memory studies. To do that, researchers rely on frequency databases. Such databases have traditionally been elaborated by counting the number of appearances of words within corpora obtained from written texts, including books, newspapers, and other periodical sources, such as magazines. Among these corpora, most research in English has been based on the database of Kučera and Francis (1967), the frequency estimates coming from the British National Corpus (BNC; Kilgarriff, 2006), and the CELEX Lexical Database (Baayen, Piepenbrock, & Gulikers, 1995). There are similar databases in other languages, including Spanish (Alameda & Cuetos, 1995; Sebastián-Gallés, Martí, Carreiras, & Cuetos, 2000), French (Imbs, 1971; New, Pallier, Brysbaert, & Ferrand, 2004), and Italian (De Mauro, Mancini, Vedovelli, & Voghera, 1993).

Corpora elaborated from written texts have some limitations that can affect the validity of the frequency estimates based on them. As some authors have pointed out, the topics included in those corpora are not necessarily related to daily life. Furthermore, they tap into a linguistic style that is not representative of the language commonly used by speakers (Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010). This is because writers try to use a more formal register and to avoid repeating words. This yields greater lexical diversity, which can lead to an overestimation of the frequency of words that are rarely used by speakers, as well as an underestimation of more frequently used words (Cuetos, González-Nosti, Barbón, & Brysbaert, 2011).

In an effort to overcome such limitations, some authors have turned to the Internet as a source of materials for elaborating frequency norms that are more suited to everyday language. For instance, Burgess and Livesay (1998) created the HAL corpus, which includes more than 130 million words, by gathering texts from online Usenet newsgroups. More recent Internet-based frequency norms have been based on even larger corpora, containing up to 500 billion words (Brants & Franz, 2006; Shaoul & Westbury, 2013).

During the last decade, some corpora have been elaborated from another type of digitized text—in particular, film and television subtitles. The first authors using such an approach were New, Brysbaert, Veronis, and Pallier (2007), who developed a database of French frequencies based on a corpus of 52 million words coming from 9,474 different films and television series. The authors compared these new measures to preexisting frequency measures obtained from written (New et al., 2004) and spoken (Reference Corpus of Spoken French; Equipe DELIC, 2004) corpora in relation to lexical decision times, either taken from a previous experiment (Bonin, Chalard, Méot, & Fayol, 2001) or collected by themselves. The authors observed that subtitle frequencies were better predictors of lexical decision times than were the more traditional frequency measures. Two years later, Brysbaert and New (2009) replicated the findings of New et al. (2007) with subtitle frequencies for American English (SUBTLEX-US). These authors used reaction times (RTs) obtained from a large pool of words (40,000), observing that frequency estimates derived from subtitles predicted lexical decision performance even better than did Internet-based frequencies.

Since those pioneering studies, frequency estimates based on film and television subtitles have been elaborated in several languages, such as Dutch (SUBTLEX-NL; Keuleers, Brysbaert, & New, 2010), Greek (SUBTLEX-GR; Dimitropoulou et al., 2010), Chinese (SUBTLEX-CH; Cai & Brysbaert, 2010), Spanish (SUBTLEX-ESP: Cuetos et al., 2011; EsPal’s subtitle corpus collection: Duchon, Perea, Sebastián-Gallés, Martí, & Carreiras, 2013), German (SUBTLEX-DE; Brysbaert et al., 2011), Brazilian Portuguese (SUBTLEX-PT-BR; Tang, 2012), Albanian (SUBTLEX-AL; Avdyli & Cuetos, 2013), British English (SUBTLEX-UK; van Heuven, Mandera, Keuleers, & Brysbaert, 2014), European Portuguese (SUBTLEX-PT; Soares et al., 2015), and Polish (SUBTLEX-PL; Mandera, Keuleers, Wodniecka, & Brysbaert, 2015). The authors of these studies have commonly obtained the raw material needed to elaborate the corpora from specialized websites. The number of subtitles considered has varied widely across databases, from around 6,000 subtitle files in the SUBTLEX-GR corpus (Dimitropoulou et al., 2010) to around 45,000 broadcasts in the SUBTLEX-UK corpus (van Heuven et al., 2014). Similarly, the final numbers of words (tokens) included in the various corpora differ, ranging from two million words in SUBTLEX-AL (Avdyli & Cuetos, 2013) to more than 450 million words in EsPal’s subtitle corpus collection (Duchon et al., 2013).

A clear advantage of subtitle frequency norms is that they are derived from large corpora. Corpora containing large sets of words are preferable over small corpora for several reasons. One of these is that they provide better frequency estimates overall, because the standard error of the word counts diminishes as the sample size increases (Lee, 2003). The other is that larger corpora include a greater representation of low-frequency words than do smaller corpora, thus leading in particular to more accurate frequency estimates of such words. According to Brysbaert and New (2009), a corpus of 16–30 million words allows for reliable frequency estimates. Such a requirement is fulfilled by the majority of the above-mentioned databases. However, it seems that the size of the corpora from which subtitle frequency norms derive cannot entirely explain their advantage in predicting word recognition performance. Indeed, such an advantage has been demonstrated even when the subtitle frequency estimates have been compared with frequency estimates obtained from written sources involving large sets of words (e.g., Brysbaert et al., 2011). Instead, it seems that corpus representativeness (i.e., the extent to which the linguistic samples included in the corpus represent the language as a whole; Sinclair, 2005) can be more relevant. Film and television subtitles have a higher capacity to capture the language that people use in everyday life than do written corpora, because they are transcripts of oral communication. Furthermore, the majority of the studies on the effects of frequency on language processing have relied on young undergraduate students (Dimitropoulou et al., 2010), which is a population that spends more time watching TV than reading (Instituto Nacional de Estadística, 2011) and that mostly uses informal language. Thus, it becomes apparent that film and television subtitles can be a very appropriate source of word selection for the elaboration of frequency norms.

To test for the adequacy of such norms, all of the studies above validated their frequency estimates by comparing them to more traditional measures (i.e., frequencies derived from written corpora) in word recognition tasks. These studies have mainly relied on the LDT, because it is a task highly sensitive to word frequency (e.g., Balota et al., 2004; Brysbaert & New, 2009; Cortese & Khanna, 2007; Yap & Balota, 2009). To do this, distinct approaches have been adopted. In some cases, the authors have taken advantage of already published experiments and have correlated or regressed the RTs from those experiments with distinct frequency measures (e.g., Brysbaert & New, 2009; Cai & Brysbaert, 2010, Exp. 1; Cuetos et al., 2011; Duchon et al., 2013; New et al., 2007; van Heuven et al., 2014). In a few cases, those experiments were megastudies involving large sets of words. As an example, van Heuven et al. performed correlations between the SUBTLEX-UK frequencies, the BNC frequencies (Kilgarriff, 2006), or the CELEX frequencies (Baayen et al., 1995) and the RTs and accuracy from the British Lexicon Project (BLP, which includes over 28,000 words; Keuleers, Lacey, Rastle, & Brysbaert, 2012). Similarly, Brysbaert and New (2009) used the lexical decision times from the more than 40,000 words included in the English Lexicon Project (Balota et al., 2007) to validate their American English frequency norms in comparison to frequencies obtained from distinct written sources: the Kučera and Francis norms (Kučera & Francis, 1967), CELEX (Baayen et al., 1995), the HAL corpus (Burgess & Livesay, 1998), the BNC (Leech, Rayson, & Wilson, 2001), and the Zeno corpus (Zeno, Ivens, Millard, & Duvvuri, 1995).

It should be noted, however, that word recognition megastudies are not available in all languages. In such cases, the preferred option by researchers has been to conduct an experiment involving a few hundred words as a validation of their frequency estimates in comparison to written-based frequencies (Avdyli & Cuetos, 2013; Brysbaert et al., 2011; Cai & Brysbaert, 2010, Exp. 2; Dimitropoulou et al., 2010; Mandera et al., 2015; Soares et al., 2015). As Brysbaert et al. pointed out, this is a good approach, provided that the experimental words are sampled from the entire frequency range. The experimental designs chosen have varied across studies. For instance, Brysbaert et al. did several lexical decision experiments manipulating the difficulty of the task by including distinct types of pseudowords. A different approach was followed by both Mandera et al. and Dimitropoulou et al., who selected the words for which the written-based corpora and the subtitle-based corpora gave the most divergent frequency estimates. Regardless of the approach taken (i.e., data coming from megastudies or from experiments involving a few hundred words), the conclusion of the reviewed studies has been unanimous: Frequency estimates derived from subtitles clearly outperform those derived from written texts in predicting word recognition performance.

Apart from demonstrating the superior predictive capacity of subtitle-derived measures in word recognition experiments, the above-reviewed studies have consistently revealed that one particular frequency measure is even more predictive: so-called contextual diversity (CD), defined by Adelman, Brown, and Quesada (2006) as the proportion of contexts in which a word appears in a lexical database. In an influential article, those authors demonstrated that CD predicted RTs in both lexical decision tasks and naming tasks better than the traditional word frequency measure (i.e., the number of times a word appears in a lexical database). Although these authors derived their measures from written corpora, the results of the studies that have relied on subtitles, as mentioned above, have reached the same conclusion: CD (defined here as the number of subtitles in which a particular word appears) is a better predictor of word recognition RTs than is word frequency (Cai & Brysbaert, 2010; Dimitropoulou et al., 2010; Keuleers et al., 2012; van Heuven et al., 2014).

The facilitative effect of the number of contexts in which a word appears is a widespread finding. Indeed, RTs are faster for high-CD words (i.e., words appearing in many contexts) than for low-CD words in single-word identification tasks (e.g., Perea, Soares, & Comesaña, 2013; Vergara-Martínez, Comesaña, & Perea, 2017). Furthermore, fixation durations are shorter for the former than for the latter in sentence-reading tasks (Plummer, Perea, & Rayner, 2014). An advantage for high-CD words has also been observed in the acquisition of new vocabulary (Rosa, Tapia, & Perea, 2017; see also Huang, 2017, for a review). As for the nature of the CD effect, considering that word frequency and CD are highly correlated, a common mechanism for both effects might be expected. Alternatively, it might be that CD effects have an origin distinct from that of word frequency effects. This is what Vergara-Martínez et al. have recently demonstrated in an event-related potential study. Concretely, these authors showed that the two effects have distinct neural signatures. In particular, the direction of the N400 component associated with CD effects resembles that obtained with the manipulation of “semantic richness” variables (e.g., Rabovsky, Sommer, & Abdel Rahman, 2012), suggesting that encountering words in different contexts may increase their semantic richness (Vergara-Martínez et al., 2017). Considering all of the factors above, it is relevant to include both traditional word frequency measures and CD measures in databases of frequency estimates.

In this article, we introduce SUBTLEX-CAT, a database of subtitle frequency estimates for Catalan. Catalan is a Romance language spoken by approximately nine million people (Simons & Fennig, 2018). It is a co-official language in Catalonia, Valencia, and the Balearic Islands, together with Spanish. Hence, virtually all Catalan speakers are bilinguals of Catalan and Spanish. Due to the particularities of this population—they are highly proficient bilinguals who commonly acquire both languages during early childhood and live immersed in both languages—it has been the focus of interest of a large number of psycholinguistic studies dealing with topics including bilingual memory (e.g., Ferré, Sánchez-Casas, & Guasch, 2006; Guasch, Sánchez-Casas, Ferré, & García-Albea, 2008, 2011; Moldovan, Demestre, Ferré, & Sánchez-Casas, 2016); parallel activation of languages in bilinguals (e.g., Comesaña et al., 2015; Guasch, Ferré, & Haro, 2017); emotional processing in the two languages (e.g., Ferré, Anglada-Tort, & Guasch, 2018; Ferré, García, Fraga, Sánchez-Casas, & Molero, 2010; Ferré, Sánchez-Casas, & Fraga, 2013); the linguistic, cognitive, and neural consequences of bilingualism (e.g., Branzi, Calabria, Boscarino, & Costa, 2016; Calabria, Branzi, Marne, Hernández, & Costa, 2015; Kandel et al., 2016; Martin et al., 2013; Rodríguez-Pujadas et al., 2013); and language deterioration in demented bilinguals (e.g., Calabria et al., 2017; Calabria, Marne, Romero-Pinel, Juncadella, & Costa, 2014), among others.

Considering the interest in Catalan–Spanish bilinguals, it is very relevant to have Catalan frequency norms that allow researchers to perform monolingual experiments in Catalan as well as bilingual experiments involving both Catalan and Spanish. To date, researchers have relied on the Corpus Textual Informatitzat de la Llengua Catalana (CTILC; Rafel, 1998) to obtain frequency values for Catalan words. It is a corpus composed of 51,253,669 words, compiled from sources whose publication dates range from 1833 to 1988. As a corpus obtained from written texts, it has the limitations mentioned above. Furthermore, considering the year range covered, the frequency estimates for some of the words may now be outdated. In relation to that problem, it is worth mentioning the study of Brysbaert and New (2009), who observed that word frequency estimates based on subtitles from before 1990 explained 3% less of the variance in lexical decision times for young participants than did post-1990 subtitles. Interestingly, the effect is reversed for older participants. This finding indicates that, to obtain language-representative norms, it is highly desirable to derive frequency estimates from recent sources if the aim is to test young participants (which is the most common case in psycholinguistic studies).

Another recently published corpus is worth mentioning here: WorldLex (Gimenes & New, 2016), which compiles word frequency and contextual diversity norms for 66 languages (including Catalan), based on Internet sources. The Catalan corpus consists of 19.7 million words gathered in 2013 from blog posts (8.2 million), Twitter (6.5 million), and newspapers (5 million), 25 years later than the newest CTILC documents.

In this study, we developed SUBTLEX-CAT, which is based on the subtitles of television programs broadcast from 2011 to 2016 by four Catalan TV channels. In developing the database, we used a procedure similar to that followed for the subtitle databases elaborated in other languages. Our database, however, has some particularities that are worth mentioning here. First, with 278.6 million words, it is one of the largest subtitle corpora available in any language. Second, all of the subtitles were transcribed by professional translators or transcribers, whereas in most previous SUBTLEX databases the subtitles were collected from websites constructed by the Internet community. Moreover, more than half of the subtitles correspond to broadcasts originally filmed in Catalan, whereas in the previous non-English subtitle corpora, the majority of the content consisted of adaptations of broadcasts (i.e., movies and TV series) originally filmed in English. Third, the variety of content is considerably higher than in most SUBTLEX databases, which have exclusively included films and fiction TV shows (with, to our knowledge, the sole exception of the UK version). In our corpus, this kind of content represents only about a quarter of the subtitles. The SUBTLEX-CAT corpus includes a wider range of content (i.e., documentaries, news, talk shows, debates, sports broadcasts, etc.). We consider it especially interesting that live shows were also subtitled and included in the corpus, thus offering a more natural approach to oral language than scripted shows. The last advantage of our corpus is the availability of metadata provided for most files, including production type, year, title, duration, and original language, among other information, which has allowed us to examine the corpus in more depth. Moreover, it has allowed us to construct several subcorpora by selecting the properties of the subtitles we wanted to include and to compare their performance as predictors in experimental psycholinguistic tasks.

Apart from the characteristics of SUBTLEX-CAT, some particularities of the Catalan-speaking environment should be noted. In particular, although Catalan is spoken in three Spanish autonomous communities (i.e., Catalonia, Valencia, and the Balearic Islands), there are several differences between the dialectal variants spoken in these regions. These differences are mainly lexical and phonological. Because the subtitles were obtained from Catalan broadcast television, the variety reflected in the subtitles is the normative Catalan spoken in the autonomous community of Catalonia. Thus, any researcher intending to use the SUBTLEX-CAT database should be aware of that. On the other hand, researchers using the present norms should also be aware that Spanish is present to a certain extent in the corpus. The reason is that, although the subtitles were written by professional translators, bilingual conversations took place in a small number of the broadcasts. This reflects the pattern of speech in Catalonia, where bilingual conversations are common. Hence, although the corpus did not include any entire show in Spanish, it was not possible to eliminate all of the Spanish from the corpus. In any case, the presence of non-Catalan words in our corpus was very low.

In the following sections, we first detail the procedures followed to develop the corpus and compute the frequency measures. After that, we present a validation of the database. To do this validation, we tested the predictive capacity of frequency measures derived from SUBTLEX-CAT in comparison to those derived from other sources in two lexical decision experiments.

SUBTLEX-CAT

Corpus description

SUBTLEX-CAT was constructed from 84,141 subtitles in Catalan provided by the CCMA (the Catalan Audio-Visual Media Corporation), which include material from four broadcast TV channels (TV3, 3/24, Canal 33, and Esports 3). The shows corresponding to the subtitles provided were broadcast at least once between October 2011 and July 2016, although the shows could have been produced before this (see Fig. 1). In total, the subtitles in the corpus include 278.6 million word occurrences (tokens) of 751,078 different word forms (types) extracted from a total of 44,784 hours of television programming. The corpus includes 51.5% TV shows whose original language is Catalan, followed by 30.8% in English, 7.5% in Japanese, 6.4% in French, 1.4% in German, 1.2% in Spanish, and 1.3% in other languages.

Fig. 1

Histogram of the years of production of the shows whose subtitles were included in the corpus

Text cleaning

Prior to the final word count, a minimal cleaning of the subtitles was necessary. Since all the subtitles came from the same source, they had a very consistent structure. First, all the time codes and style tags were removed in order to leave only the raw text. Then, all the stage directions that were used to describe some aspect of the scene (e.g., to indicate who was speaking in a conversation in which the characters were off-screen) were also eliminated. Furthermore, in the subtitles provided, all song lyrics were subtitled and marked with a particular tag. Because episodes of a given TV show always begin with the same theme song, we decided to remove all song lyrics; otherwise, the frequencies of the words in the lyrics would be overestimated. Additionally, all words were converted to lowercase before the frequencies were added up, regardless of whether they appeared in the middle of a sentence or at its beginning. Finally, because punctuation marks were not going to be taken into account, they were deleted.
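For illustration, the following is a minimal sketch of this kind of cleaning step, assuming SRT-like subtitle files; the regular expressions, tag conventions, and function names are ours and do not reproduce the exact program used for the corpus.

```python
import re

def clean_subtitle(text):
    """Minimal illustration of the cleaning steps described above (not the actual script)."""
    # Remove SRT-style time codes such as "00:01:23,450 --> 00:01:25,900" and cue numbers
    text = re.sub(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}", " ", text)
    text = re.sub(r"^\d+\s*$", " ", text, flags=re.MULTILINE)
    # Remove style tags (e.g., <i>...</i>) and bracketed stage directions
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", text)
    # Lowercase everything so sentence-initial and sentence-internal occurrences are counted together
    text = text.lower()
    # Delete punctuation, keeping letters, digits, apostrophes, and hyphens (needed for clitic splitting)
    text = re.sub(r"[^\w'’\- ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```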

A special remark is worth making in reference to a particularity of the Catalan language. In Catalan, some pronouns (i.e., clitic pronouns) can be attached to either the front or the end of a verb with the characters ’ or - (e.g., dóna-m’ho “give it to me,” where both m and ho are pronouns). In addition, the article can sometimes be attached to the front of a noun with the character ’ (e.g., l’home “the man,” where l refers to the article el “the”). In such cases, we decided to split these words into their constituents, because such words can in fact be considered compounds. In doing so, we ensured that the frequencies of the nouns, pronouns, and inflected verb forms were computed correctly.
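A minimal sketch of such a splitting step, under the assumption that tokens are simply split at apostrophes and hyphens (the rules applied to the actual corpus may be more detailed):

```python
import re

def split_clitics(token):
    """Split apostrophe- and hyphen-attached clitics and articles into their constituents.
    Illustrative only; the splitting applied to the corpus may handle more cases."""
    return [part for part in re.split(r"['’-]", token) if part]

print(split_clitics("dóna-m'ho"))  # ['dóna', 'm', 'ho']
print(split_clitics("l'home"))     # ['l', 'home']
```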

After text cleaning, but before final processing, we extracted a list of all the characters still remaining in the subtitle files (see Appendix 1). The list includes all the valid letters in Catalan, plus all their diacritical marks and the ten digits. These data show that the presence of nonlegal characters in Catalan in the corpus is marginal (0.04%). Furthermore, 95.84% of the appearances of such characters correspond either to the letter “ñ,” a nonlegal character in Catalan but present in several Spanish surnames and words, or to the character “á”, which is a typical typo in Catalan, because the letter “a” can only carry the accent mark in the opposite direction (i.e., “à”).

Text cleaning and the final word count were both carried out with a computer program devised for this purpose in the Python language.

Division of the corpus

We will henceforth refer to the corpus that includes all 84,141 subtitle files and 278.6 million words as SUBTLEX-CATALL. Only the text-cleaning procedure described in the previous section was applied to the files.

Furthermore, thanks to the metadata provided with the subtitles, it was possible to build a second corpus, including only the subtitles from films and fiction TV series. We will refer to this corpus and the metrics derived from it as SUBTLEX-CATFILM. This corpus consisted of 15,097 subtitles that include 73.7 million words (26.5% of the original corpus), greatly exceeding both the sizes of many existing subtitle corpora in other languages and the requirement of a corpus of 16–30 million words for reliable word frequency norms (Brysbaert & New, 2009). Our goal was to build a corpus that was as similar as possible to those in which the subtitles were downloaded from the Internet. It is possible that fictional content, such as films and TV series depicting everyday situations, more closely resembles the natural language to which speakers are exposed than do other kinds of content. Many nonfictional contents are very topic-specific (e.g., concerning savannah lions, modernist architecture, folk music, . . .), whereas others are highly related to people’s everyday life (news) or are very topic-diverse (talk shows). Whether the inclusion of more diverse kinds of content improves or worsens the frequency measures is an empirical issue that will be addressed in the next sections.

Since we were provided with metadata for each subtitle file, we built other, additional subcorpora apart from SUBTLEX-CATALL and SUBTLEX-CATFILM. For instance, we built a subcorpus that eliminated around 20% of the subtitle files, namely those that were very short (less than 10 min) or that contained a large number of words in languages other than Catalan (mostly Spanish). We also computed frequency estimates for subcorpora that included only broadcasts originally in Catalan, or only live shows, or that excluded shows for children, and so forth. It is worth mentioning that none of the metrics derived from these subcorpora outperformed those derived from SUBTLEX-CATALL in predicting lexical decision times and accuracy. For that reason, the analyses made with those subcorpora are not included here.

Obtaining the frequency measures

Following the previous SUBTLEX databases, we collected data for word frequency and contextual diversity. Word frequency (WF) has traditionally been expressed as the number of occurrences per million words in the corpus (relative frequency). Over the years, it has been increasingly common to use the logarithm of the relative frequency, plus one (Howes & Solomon, 1951), to mitigate the exponential nature of word frequency expressed by Zipf’s (1949) law. Following van Heuven et al. (2014), we used the Zipf measure for WF, computed as 3 plus log10 of the relative frequency. This measure not only improves the parametric properties of the WF values, because it is a logarithmic transformation, but it also offers a scale from 0 to 7 with a more natural and straightforward interpretation. Zipf values were computed for all the corpora considered in the following experiments.

Contextual diversity (CD) can be defined as the proportion of different contexts in which a word appears (Adelman et al., 2006). Following Brysbaert and New (2009), we considered each subtitle as a context and counted the number of contexts per million in which each word appeared. For the same reasons as for WF, the logarithm of the CD + 1 was used in all the analyses.
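As a concrete illustration, the sketch below shows how the two measures described above could be computed from raw counts; the constants are the corpus totals reported in this article, and the function names are ours rather than those of the published scripts.

```python
import math

CORPUS_TOKENS = 278_617_824   # total word tokens in the SUBTLEX-CAT corpus
CORPUS_DOCS = 84_141          # total number of subtitle files (contexts)

def zipf(abs_wf, corpus_tokens=CORPUS_TOKENS):
    """Zipf value: 3 plus the base-10 log of the frequency per million words."""
    rel_wf = abs_wf / corpus_tokens * 1_000_000
    return math.log10(rel_wf) + 3

def log_cd(abs_cd, corpus_docs=CORPUS_DOCS):
    """Base-10 log of (contextual diversity per million contexts + 1)."""
    rel_cd = abs_cd / corpus_docs * 1_000_000
    return math.log10(rel_cd + 1)

# A word with 27,862 occurrences spread over 8,414 subtitle files:
# zipf(27_862) ≈ 5.0 (a high-frequency word); log_cd(8_414) ≈ 5.0
```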

Apart from the frequency measures, we also included other indexes of interest in psycholinguistic studies, such as number of syllables, Coltheart’s N (Coltheart, Davelaar, Jonasson, & Besner, 1977), number of higher-frequency neighbors, OLD20 (Yarkoni, Balota, & Yap, 2008), and mean bigram frequencies (by types and by tokens). Coltheart’s N and OLD20 were computed using the vwr package in R (Keuleers, 2013). Coltheart’s N refers to the number of words present in the corpus that can be obtained by changing just one letter of a given word (all such words being of the same length as the original). The number of higher-frequency neighbors, in turn, is the number of such words that have a higher lexical frequency than that of the given word. OLD20 is the average Levenshtein distance between a word and its 20 nearest neighbors (taking into account single-letter substitutions, insertions, and deletions). Finally, to compute the mean bigram frequencies, we first obtained a list of all the bigrams existing in the corpus. For instance, the word casa (“house”) contains three bigrams: ca, as, and sa. Then we counted the number of words in which each bigram appeared (type bigram frequency). We also computed the sum of the relative frequencies of all the words in which each bigram appeared (token bigram frequency). Thus, the mean bigram frequency of a word is the sum of the type or token bigram frequencies of its bigrams, divided by the number of bigrams that word has. It is worth mentioning that when computing these last two variables, and also the variables relating to neighborhood, we considered characters with diacritic marks as being different from their base characters (e.g., the à character was considered different from the a character).
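The bigram counts lend themselves to a short illustration. The sketch below computes type and token mean bigram frequencies from a word-to-relative-frequency mapping; it is our own minimal reconstruction of the description above (in particular, how repeated bigrams within a word are handled is our assumption), not the script used for the database.

```python
from collections import defaultdict

def bigrams(word):
    """Adjacent letter pairs of a word, e.g. 'casa' -> ['ca', 'as', 'sa']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def mean_bigram_frequencies(rel_wf):
    """Given a {word: relative frequency per million} dict, return {word: (MBF_To, MBF_Ty)}.
    Diacritics are treated as distinct characters, as described in the text."""
    type_freq = defaultdict(int)     # number of words in which each bigram appears
    token_freq = defaultdict(float)  # summed relative frequency of those words
    for word, wf in rel_wf.items():
        for bg in set(bigrams(word)):   # count each word once per bigram
            type_freq[bg] += 1
            token_freq[bg] += wf
    mbf = {}
    for word in rel_wf:
        bgs = bigrams(word)
        if not bgs:                     # single-letter words have no bigrams
            mbf[word] = (0.0, 0.0)
            continue
        mbf[word] = (sum(token_freq[b] for b in bgs) / len(bgs),  # MBF_To (token)
                     sum(type_freq[b] for b in bgs) / len(bgs))   # MBF_Ty (type)
    return mbf

# Example: mean_bigram_frequencies({"casa": 120.5, "cas": 30.2, "sala": 15.0})
```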

Supplementary materials

The norms are available as a Microsoft Excel .xlsx file, as supplementary materials to this article, from https://psico.fcep.urv.cat/projectes/gip/papers/SUBTLEX-CAT.xlsx, or can be accessed online using the web tool NIM (Guasch, Boada, Ferré, & Sánchez-Casas, 2013; https://psico.fcep.urv.cat/utilitats/nim/). For correct computation of the variables related to lexical neighborhood and bigram frequencies, the raw database was trimmed. Concretely, the final set only contains those words included in at least one of the following online Catalan dictionaries or word lists: the online dictionary of the Institut d’Estudis Catalans (1995, DIEC2; http://mdlc.iec.cat/), the online dictionary of the Enciclopèdia Catalana (http://www.diccionari.cat/), and the Catalan Softcatalà spellchecker (https://www.softcatala.org/corrector/). Additionally, we removed those words that contain either numbers or nonlegal characters in Catalan. Thus, the final supplementary materials include data for 202,204 Catalan words, along with the following information:

  • Num_letters: the length of a word counted as number of letters.

  • Num_Syl: the number of syllables of the word.

  • Abs_WF: the absolute number of appearances of a word in the corpus.

  • Rel_WF: the relative WF per million—that is, (Abs_WF/ 278,617,824)*1,000,000.

  • Zipf: the value of a word on the Zipf scale, computed as log10(Rel_WF) + 3.

  • Abs_CD: the absolute number of subtitle files in which a word appears.

  • Rel_CD: the relative CD per million—that is, (Abs_CD/84,141)*1,000,000.

  • log(Rel_CD+1): the base-10 logarithm of Rel_CD + 1.

  • N: the number of words in the corpus that can be obtained by changing one letter from a word (i.e., Coltheart’s N).

  • NHF: the number of words among the Coltheart’s N neighbors that have a higher lexical frequency than the given word.

  • OLD20: the mean Levenshtein distance between a word and its 20 nearest neighbors.

  • MBF_To: mean token frequency of the bigrams of the word.

  • MBF_Ty: mean type frequency of the bigrams of the word.

Validation

The LDT has proven to be especially sensitive to frequency measures and is commonly used to compare the proportions of variance explained by several lexical predictors. Due to the lack of previously published lexical decision studies in Catalan, we designed two different experiments in order to test the predictive power of the different measures of WF and CD. We compared the SUBTLEX-CAT frequency estimates to the already available CTILC and WorldLex frequency norms. In Experiment 1, we used a set of randomly selected words for each frequency level. In Experiment 2, we selected a set of stimuli that gave the most divergent frequency estimates between CTILC, SUBTLEX-CATFILM, and SUBTLEX-CATALL. Of note, due to the characteristics of the speaking environment in Catalonia, all the participants were bilinguals of Catalan and Spanish. Although they were highly proficient in both languages, they considered themselves to be either Catalan- or Spanish-dominant. For that reason, we analyzed the results considering all the participants together as well as dividing them by dominance.

Experiment 1

Method

Participants

A total of 47 Catalan college students (38 women and nine men) from the Universitat Rovira i Virgili (Tarragona, Spain), aged between 18 and 30 years (M = 21.3, SD = 2.5), took part in the experiment. All participants had normal or corrected-to-normal vision, and they participated in exchange for extra course credit. Participants were living in Catalonia at the time of testing, had acquired the Catalan language before the age of 6 (M = 2.86, SD = 1.48), and rated their proficiency in Catalan as very high on a 7-point Likert scale (M = 6.72, SD = 0.42). All the participants were bilinguals of Catalan and Spanish. According to their responses to a language history questionnaire, we divided the participants into two groups: 26 reported Catalan as their dominant language, and 21 considered Spanish their dominant language. Additionally, participants rated their proficiency, age of acquisition, and preference and frequency of use for listening, speaking, reading, and writing in both Catalan and Spanish. All the details about the characteristics of the Catalan-dominant and Spanish-dominant bilinguals may be found in Appendix 2.

Materials

The stimuli consisted of 475 Catalan words selected from the set of words in common between CTILC and SUBTLEX-CAT. Because few studies have used Catalan in a lexical decision task, we used a pseudo-random method to select the final list of stimuli. We first removed all words longer than 15 letters or shorter than three letters. We then divided the remaining set into five different lists according to the words’ Zipf values in the SUBTLEX-CAT corpus (between 5 and 6, between 4 and 5, between 3 and 4, between 2 and 3, and less than 2). The lists were then randomized, and the first 95 words of each list were selected. Misspelled words and proper nouns were removed and replaced by the next word on that list. The final 475-word list had a mean word length of 8.2 letters (SD = 2.54) and a mean of 3.2 syllables (SD = 1.19) (see Table 1).
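A minimal sketch of this kind of stratified pseudo-random sampling follows; the function and variable names are ours, and the manual replacement of misspelled words and proper nouns is not reproduced.

```python
import random

def sample_words(zipf_values, per_tier=95, min_len=3, max_len=15, seed=0):
    """Stratified pseudo-random sampling of stimuli by Zipf tier, as described above.
    zipf_values is a {word: Zipf value} dict."""
    tiers = [(5, 6), (4, 5), (3, 4), (2, 3), (float("-inf"), 2)]
    rng = random.Random(seed)
    selected = []
    for low, high in tiers:
        candidates = [w for w, z in zipf_values.items()
                      if min_len <= len(w) <= max_len and low <= z < high]
        rng.shuffle(candidates)
        selected.extend(candidates[:per_tier])  # first 95 words of the shuffled tier
    return selected
```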

Table 1 Distribution of mean word length, WF, and CD in the randomly sampled words for each of the five tiers (standard deviations in parentheses)

Finally, a total of 475 nonwords were either selected from previous studies or constructed by replacing one or two letters from an existing Catalan word. All of them were pronounceable and legal letter sequences in Catalan and did not correspond to any word in the corpora.

Procedure

The lexical decision experiment was run in the laboratory using the DMDX software (Forster & Forster, 2003). Participants were individually tested in separate soundproof booths. Each trial began with a fixation cross in the center of the screen for 500 ms, followed by the presentation of each stimulus for 2,000 ms or until a response was given. Participants were instructed to respond as quickly and accurately as possible by pressing a "yes" or "no" button on a keypad. They were told to select "yes" if they believed that the word existed in the Catalan language, and "no" if they believed that it did not. After participants had responded, the next fixation point appeared. After a ten-trial practice block, the experimental trials were presented in a continuous running mode, pausing every 100 trials (approximately every 5 min) until the participant resumed the task. The order of item presentation was randomized for every participant. It took around 45 min to complete the task. After the experiment, participants filled out the language history questionnaire.

Results and discussion

Incorrect responses (approximately 12.9% of the data) and RTs shorter than 250 ms or longer than 1,500 ms (1% of the data) were excluded from the latency analyses. Trials with response latencies three standard deviations above or below each participant’s average were also removed (1.4% of the data). Data from nine items with accuracy rates under 75% were also excluded. Therefore, all further analyses included 466 items.
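For readers who want to apply the same exclusion criteria to their own data, the following is a minimal sketch of the trimming steps, assuming a long-format trial table; the column names and the exact ordering of the steps are our assumptions.

```python
import pandas as pd

def trim_ldt_data(df, min_rt=250, max_rt=1500, min_item_acc=0.75):
    """Apply the exclusion criteria described above to a trial table with columns
    'participant', 'item', 'rt' (in ms), and 'correct' (0/1)."""
    # Exclude items with accuracy below 75% from all analyses
    item_acc = df.groupby("item")["correct"].mean()
    df = df[df["item"].isin(item_acc[item_acc >= min_item_acc].index)]
    # For the latency analyses, keep correct responses within the absolute RT cutoffs
    lat = df[(df["correct"] == 1) & (df["rt"] >= min_rt) & (df["rt"] <= max_rt)].copy()
    # Remove trials more than 3 SDs above or below each participant's mean RT
    stats = lat.groupby("participant")["rt"].agg(["mean", "std"])
    lat = lat.join(stats, on="participant")
    lat = lat[(lat["rt"] - lat["mean"]).abs() <= 3 * lat["std"]]
    return df, lat.drop(columns=["mean", "std"])
```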

We used separate linear regression analyses in which RTs or accuracy rates were the dependent variables and the different frequency measures were the predictors. All word frequency measures were transformed to the Zipf scale (van Heuven et al., 2014). The square of the Zipf value for each word was also included as a predictor in the model, to account for effects at very high frequencies, following Balota et al. (2004). We also constructed regression models using log10(CD + 1) and its squared value instead of WF. In each regression analysis, we included the number of letters, the number of syllables, Coltheart’s N, NHF, OLD20, MBF token, and MBF type as control variables. We carried out the analyses considering all the participants together, as well as separating them by dominance. We report adjusted R2 × 100 as the percentage of explained variance in the analyses that follow.
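A minimal sketch of one such model, fitted on item-level means, is shown below; the column names (mean_RT, Zipf_ALL, and so on) are illustrative, and this is not the authors' analysis script.

```python
import statsmodels.formula.api as smf

CONTROLS = "Num_letters + Num_Syl + N + NHF + OLD20 + MBF_To + MBF_Ty"

def explained_variance(items, predictor):
    """Adjusted R-squared (x 100) of a model predicting mean item RTs from a frequency or
    CD measure, its square, and the control variables; 'items' has one row per word."""
    formula = f"mean_RT ~ {predictor} + I({predictor} ** 2) + {CONTROLS}"
    model = smf.ols(formula, data=items).fit()
    return model.rsquared_adj * 100

# e.g., compare explained_variance(items, "Zipf_ALL") with explained_variance(items, "log_CD_ALL")
```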

Considering all participants together, the analyses revealed that both the WF and CD measures from the two SUBTLEX-CAT corpora improved the percentage of explained variance when compared with the CTILC frequency metrics. The SUBTLEX-CATFILM word frequency explained 2.7% and 5% more variability in the latency and accuracy analyses, respectively. This result shows that frequency estimates from film and fiction TV series subtitles are reliable predictors of lexical access, in line with the findings from other SUBTLEX databases. Interestingly, SUBTLEX-CATALL benefited either from the larger corpus size or the variety of contents, outperforming SUBTLEX-CATFILM by an additional 3%–4% for both RTs and accuracy. As in previous studies, CD measures were better predictors of response latencies and accuracy than were their WF counterparts, but the increase was lower here, not reaching 1% in any case (see Table 2). The three measures of WF and CD obtained from WorldLex (from blogs, Twitter, and newspapers) were also added to the control variables as predictors in the corresponding analysis. The metrics from WorldLex were effective predictors of RTs, improving on those from the CTILC and SUBTLEX-CATFILM, but explaining 2% less variance than SUBTLEX-CATALL for both WF and CD. Nonetheless, WorldLex predicted an additional 0.6% of error rate variance as compared to SUBTLEX-CAT. All analyses comparing the two dominance groups revealed the same pattern: a higher proportion of variance explained in the Catalan-dominant than in the Spanish-dominant group for both the RT (ranging between 2.4% and 4.7%) and error (between 0.7% and 1.7%) analyses.

Table 2 Percentages of explained variance in response times and accuracy by each group of predictors

The results of Experiment 1 clearly show that the subtitle metrics explained more variability in the dependent variables than did the frequency measures based on written texts (i.e., the CTILC norms). The best tested model included the control variables and CDALL, explaining 63.2% of the RT variance and 33.1% of the accuracy variance. Those values represent a 7.3% and 8.8% increase, respectively, as compared to the CTILC frequency estimates. CDALL also explained 2% more variance in RTs than did CDWORLDLEX. Notwithstanding, WorldLex CD was a better predictor of accuracy than was SUBTLEX-CAT, by 0.6%. These results are along the same lines as those described in the previous literature (e.g., Avdyli & Cuetos, 2013; Brysbaert et al., 2011; Cai & Brysbaert, 2010; Cuetos et al., 2011; Dimitropoulou et al., 2010; Duchon et al., 2013; Keuleers et al., 2010; Mandera et al., 2015; Soares et al., 2015; Tang, 2012; van Heuven et al., 2014).

Experiment 2

In Experiment 1, we found that all of the SUBTLEX-CAT measures were better predictors of lexical decision performance than were the previously available norms (CTILC and WorldLex). In Experiment 2, we aimed to explore the differences between the available frequency measures by selecting those words giving the most divergent frequency estimates across the SUBTLEX-CAT and CTILC corpora (among WFCTILC, WFFILM, and WFALL, on the one hand, and between CDFILM and CDALL, on the other). In this way, we tried to replicate the results of Experiment 1 with a more constrained item sample.

Method

Participants

A total of 55 Catalan college students (45 women and 10 men) from the Universitat Rovira i Virgili (Tarragona, Spain), aged between 18 and 28 years (M = 20.38, SD = 2.93), took part in the experiment. All participants had normal or corrected-to-normal vision, and they participated in exchange for extra course credit. The participants were living in Catalonia at the time of testing and had acquired the Catalan language before the age of 6 (M = 2.88, SD = 1.33). Participants filled out the same language history questionnaire as in Experiment 1. Their average proficiency in Catalan was very high on a 1-to-7 scale (M = 6.6, SD = 0.58). There were 36 Catalan-dominant bilinguals, and the remaining 19 participants were Spanish-dominant bilinguals (see Appendix 2).

Materials

We began the selection of the critical words with a list of all the words that appeared at least once in the CTILC and in each of the two SUBTLEX-CAT corpora, and removed all the words shorter than three or longer than ten letters. To select those words that showed the greatest differences across the corpora, we carried out several regression analyses among the three measures of WF, on the one hand, and between the two CD measures, on the other, and selected the words with the highest residuals (i.e., the ones that differed most between measures). We also excluded inflected verbal forms, proper nouns, prepositions, conjunctions, diminutives, numbers, place names, demonyms, and compound words. The final set of 475 items contained those words that maximized the differences across the corpora in the metrics of lexical frequency and CD. Table 3 summarizes the descriptive statistics for the variables included in the analyses. Examples of the ten words with the highest and lowest residuals in each regression analysis are presented in Appendix 3. Finally, the same 475 pseudowords as in Experiment 1 were used as "no" responses in the lexical decision task.
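This word selection can be illustrated with a short sketch: regress one measure on another and keep the words with the largest absolute residuals. The data frame and column names are illustrative, not taken from the authors' materials.

```python
import statsmodels.formula.api as smf

def most_divergent_words(items, measure_a, measure_b, n=100):
    """Regress one frequency measure on another and return the words with the largest
    absolute residuals, i.e., those on which the two corpora disagree the most."""
    model = smf.ols(f"{measure_a} ~ {measure_b}", data=items).fit()
    return items.assign(abs_resid=model.resid.abs()).nlargest(n, "abs_resid")["word"]

# e.g., most_divergent_words(items, "Zipf_ALL", "Zipf_CTILC")
```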

Table 3 Descriptive statistics for the variables considered in Experiment 2, including the maximum and minimum values, mean, and standard deviation

Procedure

We used the same procedure as in Experiment 1.

Results and discussion

Incorrect responses (approximately 8.15% of the data) and RTs shorter than 250 ms or longer than 1,500 ms (1% of the data) were excluded from the latency analyses. Trials with response latencies three standard deviations above or below each participant’s average were also removed (1.5% of the data). The data from one item with an accuracy rate under 75% were also removed, so all subsequent analyses included 474 items. As in Experiment 1, linear regression models were constructed to predict latencies and accuracy from the same control variables and from the distinct WF and CD measures considered in the present study.

Table 4 shows the percentages of variance in the latency and error data explained by the predictors of each model. Contrary to what was observed in Experiment 1, WFFILM and CDFILM did not perform better than WFCTILC. WFWORLDLEX explained roughly 2% more variance than the CTILC norms for both RTs and accuracy, and this difference increased to 3% when CDWORLDLEX was used. More importantly, WFALL explained an extra 4% of the variance in RTs and 1.5% in accuracy, when compared with CTILC. In line with the previous results, CDALL was the best RT predictor, surpassing WFCTILC by 5% and CDWORLDLEX by 2%. Accuracy was predicted slightly better by WorldLex than by SUBTLEX-CAT (by 0.2% for CD). The pattern of RT results was similar to that from Experiment 1 when comparing Catalan- and Spanish-dominant bilinguals: the frequency norms were better predictors for the former than for the latter, for both RTs (differences ranging between 6.4% and 8.6%) and accuracy (between 1.4% and 3.4%). The only noticeable difference between the two groups in Experiment 2 was that, in the accuracy analysis, the WorldLex metrics outperformed SUBTLEX-CAT in the Spanish-dominant group, whereas the advantage was reversed in the Catalan-dominant group.

Table 4 Percentages of explained variance of response times and accuracy by each group of predictors

The comparison of the results of Experiments 1 and 2 reveals that all the available frequency norms predicted RTs better in Experiment 2 than in Experiment 1. A possible cause of this advantage could be the more constrained item sample in Experiment 2, which resulted in a list of words known by more participants, as can be observed in the overall error rates (13% vs. 8%) and in the numbers of items removed for accuracy below 75% (nine vs. one). However, the same was not true for accuracy, in that the frequency estimates were better predictors in Experiment 1 than in Experiment 2. Despite this difference, both experiments show very similar patterns of results: The best predictors of lexical decision RTs were the metrics obtained from the whole subtitle corpus, followed by WorldLex, the film subtitle corpus, and the CTILC metrics. When the two dominance groups were analyzed separately, the same pattern was observed for the Spanish-dominant groups in both experiments. Nevertheless, the analysis of accuracy in the Catalan-dominant group revealed some differences between experiments: whereas in Experiment 1 CDWORLDLEX was the best predictor, in Experiment 2 SUBTLEX-CAT explained more error variance. What is more interesting is that, in both experiments, performance was predicted better by the frequency norms for Catalan-dominant bilinguals than for Spanish-dominant bilinguals.

Conclusion

With SUBTLEX-CAT, we present word frequency and contextual diversity measures for the Catalan language, based on one of the largest subtitle corpora in any language to date. We replicated two common findings with subtitle-based word frequency metrics: first, the superiority of these norms over the previously available norms, and second, the advantage of contextual diversity measures when compared with plain word frequency estimates. As to the first point, the superiority of SUBTLEX-CAT over norms derived from written texts (CTILC) was observed in two separate lexical decision experiments. In addition to their different origins, another distinctive aspect of the two corpora to highlight is the temporal difference between them, as CTILC is about 20 years older than SUBTLEX-CAT. This fact can be qualitatively corroborated by the presence of relatively frequent words in the present corpus that are not present in the previous one (e.g., videojoc, video game; càsting, casting; alienígena, alien; zombi, zombie . . .). Along the same lines, when comparing the subtitle-based frequency measures to those coming from other online sources, such as Twitter, blogs, and newspapers, SUBTLEX-CAT was a better predictor of lexical access than were the WorldLex estimates. Both corpora include texts that are supposed to better reflect the everyday use of language than do corpora based only on written texts (i.e., CTILC). Both SUBTLEX and WorldLex are also based on newer materials than those in CTILC. However, it remains unclear whether the advantage of subtitle-based norms simply arises from the larger corpus size (278 million vs. 20 million words).

Furthermore, we empirically tested the benefits of including nonfictional content in a subtitle corpus. We observed that metrics derived from a corpus that includes exclusively subtitles from films and fiction TV series perform worse than those derived from a larger and more varied corpus that also includes other types of nonfictional broadcasts. Although the observed improvement is modest and might be produced simply by the difference in size between the two corpora, it is more likely that the cause is the greater representativeness of the larger corpus. In fact, New et al. (2007) already pointed out that corpora that include only fictional TV materials are biased, and therefore the sampled materials should be made more representative of the natural language. This recommendation has already been followed by van Heuven et al. (2014) for British English, using subtitles from the BBC. To our knowledge, SUBTLEX-CAT is the second subtitle-based frequency database of this “second generation” to include broadcasts from all kinds of fictional and nonfictional TV programming. We hope that this new database will facilitate psycholinguistic research in the Catalan language.