Behavior Research Methods, Volume 47, Issue 2, pp 471–483

Subtlex-pl: subtitle-based word frequency estimates for Polish

  • Paweł Mandera
  • Emmanuel Keuleers
  • Zofia Wodniecka
  • Marc Brysbaert

Abstract

We present SUBTLEX-PL, Polish word frequencies based on movie subtitles. In two lexical decision experiments, we compare the new measures with frequency estimates derived from another Polish text corpus that includes predominantly written materials. We show that the frequencies derived from the two corpora perform best in predicting human performance in a lexical decision task if used in a complementary way. Our results suggest that the two corpora may have unequal potential for explaining human performance for words in different frequency ranges and that corpora based on written materials severely overestimate frequencies for formal words. We discuss some of the implications of these findings for future studies comparing different frequency estimates. In addition to frequencies for word forms, SUBTLEX-PL includes measures of contextual diversity, part-of-speech-specific word frequencies, frequencies of associated lemmas, and word bigrams, providing researchers with necessary tools for conducting psycholinguistic research in Polish. The database is freely available for research purposes and may be downloaded from the authors’ university Web site at http://crr.ugent.be/subtlex-pl.

Keywords

Word frequencies, Polish language, Lexical decision, Visual word recognition

Supplementary material

13428_2014_489_MOESM1_ESM.zip (354.9 MB)
ESM 1: Word frequency files; lemma master files; word bigram frequencies; mappings between the original TaKIPI tagset and the simplified tagset used in the frequency norms (ZIP 363389 kb)

Copyright information

© Psychonomic Society, Inc. 2014

Authors and Affiliations

  • Paweł Mandera (1)
  • Emmanuel Keuleers (1)
  • Zofia Wodniecka (2)
  • Marc Brysbaert (1)
  1. Department of Experimental Psychology, Ghent University, Gent, Belgium
  2. Institute of Psychology, Jagiellonian University, Kraków, Poland
