SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan

  • Roger Boada
  • Marc Guasch
  • Juan Haro
  • Josep Demestre
  • Pilar Ferré


SUBTLEX-CAT is a word frequency and contextual diversity database for Catalan, obtained from a 278-million-word corpus based on subtitles supplied from broadcast Catalan television. Like all previous SUBTLEX corpora, it comprises subtitles from films and TV series. In addition, it includes a wider range of TV shows (e.g., news, documentaries, debates, and talk shows) than has been included in most previous databases. Frequency metrics were obtained for the whole corpus, on the one hand, and only for films and fiction TV series, on the other. Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates, computed from either written texts or texts from the Internet. Furthermore, the metrics obtained from the whole corpus were better predictors than the ones obtained from films and fiction TV series alone. In both experiments, the best predictor of response times and accuracy was contextual diversity.
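As a rough illustration of the two kinds of metric named above, the following Python sketch computes word frequency and contextual diversity from a small set of subtitle documents. It is not the authors' pipeline. It assumes that contextual diversity is operationalized as the number (and percentage) of documents in which a word occurs, which is the usual definition in the SUBTLEX literature, and it uses a deliberately naive tokenizer that splits Catalan elisions such as "l'home" into two tokens. The function name `subtitle_metrics` and the output field names are hypothetical.

```python
import re
from collections import Counter

# Naive tokenizer (an assumption): runs of Unicode letters only.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def subtitle_metrics(documents):
    """Compute frequency and contextual diversity for each word.

    documents: list of strings, one per subtitle file (hypothetical input).
    Returns a dict mapping each word to its raw count, frequency per
    million tokens, and contextual diversity as a count and a percentage
    of documents.
    """
    freq = Counter()       # total occurrences across the corpus
    doc_count = Counter()  # number of documents containing the word

    for text in documents:
        tokens = TOKEN_RE.findall(text.lower())
        freq.update(tokens)
        doc_count.update(set(tokens))  # count each word once per document

    total_tokens = sum(freq.values())
    n_docs = len(documents)
    if total_tokens == 0:
        return {}

    return {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total_tokens,
            "cd_count": doc_count[word],
            "cd_percent": 100 * doc_count[word] / n_docs,
        }
        for word, count in freq.items()
    }


if __name__ == "__main__":
    docs = ["El gat menja.", "El gos dorm. El gos corre.", "Bon dia"]
    metrics = subtitle_metrics(docs)
    print(metrics["el"])
    print(metrics["gos"])
```

In this toy corpus "gos" occurs twice but only in one document, so its frequency and its contextual diversity diverge. That divergence is what allows the two metrics to be compared as predictors.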


Word frequency, Contextual diversity, Catalan language, Subtitles


Author note

This research was funded by grant PCIN-2015-165-C02-02S from MINECO/FEDER and by the Research Promotion Program of the Universitat Rovira i Virgili (2017PFR-URV-B2-32). We thank the Catalan Audio-Visual Media Corporation for kindly providing us with the subtitles.



Copyright information

© The Psychonomic Society, Inc. 2019

Authors and Affiliations

  • Roger Boada (1), email author
  • Marc Guasch (1)
  • Juan Haro (1)
  • Josep Demestre (1)
  • Pilar Ferré (1)

  1. Department of Psychology and Research Center for Behavior Assessment, Universitat Rovira i Virgili, Tarragona, Spain
