Behavior Research Methods, Volume 47, Issue 2, pp 471–483

SUBTLEX-PL: subtitle-based word frequency estimates for Polish

  • Paweł Mandera
  • Emmanuel Keuleers
  • Zofia Wodniecka
  • Marc Brysbaert


We present SUBTLEX-PL, Polish word frequencies based on movie subtitles. In two lexical decision experiments, we compare the new measures with frequency estimates derived from another Polish text corpus that includes predominantly written materials. We show that the frequencies derived from the two corpora perform best in predicting human performance in a lexical decision task if used in a complementary way. Our results suggest that the two corpora may have unequal potential for explaining human performance for words in different frequency ranges and that corpora based on written materials severely overestimate frequencies for formal words. We discuss some of the implications of these findings for future studies comparing different frequency estimates. In addition to frequencies for word forms, SUBTLEX-PL includes measures of contextual diversity, part-of-speech-specific word frequencies, frequencies of associated lemmas, and word bigrams, providing researchers with necessary tools for conducting psycholinguistic research in Polish. The database is freely available for research purposes and may be downloaded from the authors’ university Web site at


  • Word frequencies
  • Polish language
  • Lexical decision
  • Visual word recognition



This study was supported by an Odysseus grant awarded by the Government of Flanders to M.B. and a subsidy from the Foundation for Polish Science (FOCUS program) awarded to Z.W. We thank Jon Andoni Duñabeitia, Gregory Francis, and an anonymous reviewer for insightful comments on an earlier draft of the manuscript, Adam Przepiórkowski for providing access to the BS–NCP word frequencies, and Jakub Szewczyk for his help with syllabification of Polish words.

Supplementary material (354.9 MB)
ESM 1 (ZIP, 363389 kB)
  • Word frequency files
  • Lemma master files
  • Word bigram frequencies
  • Mappings between original TaKIPI tagset and simplified tagset used in frequency norms



Copyright information

© Psychonomic Society, Inc. 2014

Authors and Affiliations

  • Paweł Mandera (1)
  • Emmanuel Keuleers (1)
  • Zofia Wodniecka (2)
  • Marc Brysbaert (1)

  1. Department of Experimental Psychology, Ghent University, Gent, Belgium
  2. Institute of Psychology, Jagiellonian University, Kraków, Poland
