Subtlex-pl: subtitle-based word frequency estimates for Polish

Abstract

We present SUBTLEX-PL, Polish word frequencies based on movie subtitles. In two lexical decision experiments, we compare the new measures with frequency estimates derived from another Polish text corpus that includes predominantly written materials. We show that the frequencies derived from the two corpora perform best in predicting human performance in a lexical decision task if used in a complementary way. Our results suggest that the two corpora may have unequal potential for explaining human performance for words in different frequency ranges and that corpora based on written materials severely overestimate frequencies for formal words. We discuss some of the implications of these findings for future studies comparing different frequency estimates. In addition to frequencies for word forms, SUBTLEX-PL includes measures of contextual diversity, part-of-speech-specific word frequencies, frequencies of associated lemmas, and word bigrams, providing researchers with necessary tools for conducting psycholinguistic research in Polish. The database is freely available for research purposes and may be downloaded from the authors’ university Web site at http://crr.ugent.be/subtlex-pl.



Notes

  1. \( z_i = \log_{10}\left(\frac{c_i + 1}{\sum_{k=1}^{n} c_k + n}\right) + 9 \) (van Heuven, Mandera, Keuleers, & Brysbaert, 2014), where \( z_i \) is the Zipf value for word \( i \), \( c_i \) is its raw frequency, and \( n \) is the size of the vocabulary.

  2. For the mapping between original and simplified tags, see the supplementary materials.

  3. A nonfinal version of SUBTLEX-PL, based on nearly 50 million tokens, was used when choosing stimuli for the experiment.

  4. As an example, consider a list of 200,000 words and a list of 400,000 words. A typical characteristic of word frequency distributions is that about half of the words in each list will have a frequency of one. In that case, the base probability that any word found in both lists would have a frequency of 1 in the first list would be 1/100,000, while it would be 1/200,000 for the second list.

  5. For practical reasons, we assume that a lemma is equivalent to the concatenation of the base form of a word and its associated part-of-speech tag.
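The Zipf transformation in note 1 can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the authors' code: the function name `zipf_values` and the toy counts are hypothetical, and the vocabulary size n is taken to be the number of distinct words in the count table.

```python
import math


def zipf_values(counts):
    """Return Laplace-smoothed Zipf values for a dict of raw word counts.

    Implements z_i = log10((c_i + 1) / (sum_k c_k + n)) + 9 from note 1,
    assuming n is the number of distinct words in `counts`.
    """
    n = len(counts)                       # vocabulary size (assumption: number of types)
    denominator = sum(counts.values()) + n  # total tokens plus one pseudo-count per type
    return {
        word: math.log10((count + 1) / denominator) + 9
        for word, count in counts.items()
    }


# Hypothetical toy counts, chosen so that the denominator is exactly 1,000.
toy_counts = {"nie": 889, "kot": 99, "ryba": 9}
zipf = zipf_values(toy_counts)

for word, value in zipf.items():
    print(word, round(value, 3))
```

With these toy counts, "kot" has a smoothed relative frequency of 100/1,000 and therefore a Zipf value of about 8, while "ryba" (10/1,000) comes out at about 7. In a real corpus the denominator is in the tens of millions, so typical values fall in a much lower range.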

References

  1. Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823. doi:10.1111/j.1467-9280.2006.01787.x


  2. Arnon, I., & Snider, N. (2010). More than words: Frequency effects for multi-word phrases. Journal of Memory and Language, 62(1), 67–82. doi:10.1016/j.jml.2009.09.005


  3. Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283–316. doi:10.1037/0096-3445.133.2.283


  4. Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459. Retrieved from http://link.springer.com/article/10.3758/BF03193014

  5. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.


  6. Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3), 296–322.


  7. Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology (formerly Zeitschrift für Experimentelle Psychologie), 58(5), 412–424. doi:10.1027/1618-3169/a000123


  8. Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. doi:10.3758/BRM.41.4.977


  9. Brysbaert, M., & Diependaele, K. (2013). Dealing with zero word frequencies: A review of the existing rules of thumb and a suggestion for an evidence-based choice. Behavior Research Methods, 45(2), 422–430. doi:10.3758/s13428-012-0270-5


  10. Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5(6), e10729. doi:10.1371/journal.pone.0010729


  11. Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3), e57410. doi:10.1371/journal.pone.0057410


  12. Cuetos Vega, F., González Nosti, M., Barbón Gutiérrez, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicológica: Revista de metodología y psicología experimental, 32(2), 133–143. Retrieved from http://dialnet.unirioja.es/servlet/articulo?codigo=3663992


  13. Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behavior: The case of Greek. Frontiers in Psychology, 1. doi:10.3389/fpsyg.2010.00218

  14. Gale, W., & Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2, 217–237. Retrieved from http://www.grsampson.net/AGtf.html


  15. Hubert, M., & Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12), 5186–5201. doi:10.1016/j.csda.2007.11.008


  16. Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633. doi:10.3758/BRM.42.3.627


  17. Keuleers, E., Brysbaert, M., & New, B. (2010a). SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650. doi:10.3758/BRM.42.3.643


  18. Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1. doi:10.3389/fpsyg.2010.00174

  19. Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2011). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287–304. doi:10.3758/s13428-011-0118-4


  20. Kilgarriff, A. (2006). BNC database and word frequency lists. Retrieved May 25, 2014, from http://www.kilgarriff.co.uk/bnc-readme.html

  21. Korpus Języka Polskiego Wydawnictwa Naukowego PWN. (n.d.). Retrieved January 9, 2014, from http://korpus.pwn.pl/

  22. Kurcz, I., Lewicki, A., Sambor, J., Szafran, K., & Woroniczak, J. (1990). Słownik frekwencyjny polszczyzny współczesnej. Kraków: Instytut Języka Polskiego PAN.

  23. Lewis, M. P., Simons, G., & Fennig, C. D. (Eds.). (2013). Ethnologue: Languages of the World (17th ed.). Dallas, Texas: SIL International. Online version: http://www.ethnologue.com

  24. New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28(4). doi:10.1017/S014271640707035X

  25. Piasecki, M. (2007). Polish tagger TaKIPI: Rule based construction and optimisation. Task Quarterly, 11(1–2), 151–167.


  26. Przepiórkowski, A. (2012). Narodowy Korpus Języka Polskiego: praca zbiorowa. Warszawa: Wydawnictwo Naukowe PWN.

  27. Przepiórkowski, A., & Instytut Podstaw Informatyki. (2004). The IPI PAN corpus: preliminary version. Warszawa: IPI PAN.

  28. Schreuder, R., & Baayen, R. H. (1997). How complex simplex words can be. Journal of Memory and Language, 37(1), 118–139. doi:10.1006/jmla.1997.2510


  29. Siyanova-Chanturia, A., Conklin, K., & van Heuven, W. J. B. (2011). Seeing a phrase “time and again” matters: The role of phrasal frequency in the processing of multiword sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(3), 776–784. doi:10.1037/a0022531


  30. Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3(3), 271–295.


  31. Woliński, M. (2006). Morfeusz — a Practical Tool for the Morphological Analysis of Polish. In M. Kłopotek, S. Wierzchoń, & K. Trojanowski (Eds.), Intelligent Information Processing and Web Mining (Vol. 35, pp. 511–520). Springer Berlin Heidelberg. Retrieved from doi:10.1007/3-540-33521-8_55

  32. van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 0(ja), 1–36. doi:10.1080/17470218.2013.850521


  33. Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2), 307–333.



Acknowledgments

This study was supported by an Odysseus grant awarded by the Government of Flanders to M.B. and a subsidy from the Foundation for Polish Science (FOCUS program) awarded to Z.W. We thank Jon Andoni Duñabeitia, Gregory Francis, and an anonymous reviewer for insightful comments on an earlier draft of the manuscript, Adam Przepiórkowski for providing access to the BS–NCP word frequencies, and Jakub Szewczyk for his help with syllabification of Polish words.

Author information


Corresponding author

Correspondence to Paweł Mandera.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

  • Word frequency files
  • Lemma master files
  • Word bigram frequencies
  • Mappings between original TaKIPI tagset and simplified tagset used in frequency norms

(ZIP 363389 kb)


About this article


Cite this article

Mandera, P., Keuleers, E., Wodniecka, Z. et al. Subtlex-pl: subtitle-based word frequency estimates for Polish. Behav Res 47, 471–483 (2015). https://doi.org/10.3758/s13428-014-0489-4


Keywords

  • Word frequencies
  • Polish language
  • Lexical decision
  • Visual word recognition