Behavior Research Methods, Volume 50, Issue 1, pp 115–133

When is best-worst best? A comparison of best-worst scaling, numeric estimation, and rating scales for collection of semantic norms



Abstract

Large-scale semantic norms have become both prevalent and influential in recent psycholinguistic research. However, little attention has been directed towards understanding the methodological best practices of such norm collection efforts. We compared the quality of semantic norms obtained through rating scales, numeric estimation, and a less commonly used judgment format called best-worst scaling. We found that best-worst scaling usually produces norms with higher predictive validities than the other response formats, while requiring less data to be collected overall. We also found evidence that the various response formats may produce qualitatively, rather than just quantitatively, different data. This raises the issue of potential response format bias, which has not been addressed by previous efforts to collect semantic norms, likely because such efforts have relied on a single type of response format for a single type of semantic judgment. We have made available software for creating best-worst stimuli and scoring best-worst data. We have also made available new norms for age of acquisition, valence, arousal, and concreteness collected using best-worst scaling. These norms include entries for 1,040 words, of which 1,034 are also contained in the ANEW norms (Bradley & Lang, Affective norms for English words (ANEW): Instruction manual and affective ratings (pp. 1–45). Technical Report C-1, The Center for Research in Psychophysiology, University of Florida, 1999).
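In a best-worst trial, a participant sees a small tuple of items and picks the one that is most extreme ("best") and least extreme ("worst") on the dimension of interest. The companion software described in the article (and in Hollis, 2017) implements more sophisticated scoring for unbalanced designs; as a minimal illustration of the format only, and not the authors' procedure, the simple count-based "best minus worst" score can be sketched as follows (all item and function names here are hypothetical):

```python
from collections import defaultdict

def best_worst_scores(trials):
    """Score items from best-worst trials.

    Each trial is a tuple (items, best, worst): the items shown,
    the one judged 'best', and the one judged 'worst'.
    Returns item -> (#best - #worst) / #appearances, in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, b, w in trials:
        for item in items:
            shown[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Hypothetical valence trials: 4 words shown, most and least positive chosen.
trials = [
    (("joy", "table", "war", "rain"), "joy", "war"),
    (("joy", "rain", "war", "table"), "joy", "war"),
    (("table", "rain", "joy", "war"), "rain", "war"),
]
scores = best_worst_scores(trials)
```

Here "war" is chosen worst on every appearance (score -1.0), while "joy" is chosen best on two of its three appearances (score 2/3), yielding an interval-like ordering of items from a series of purely comparative judgments.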


Keywords: Semantics · Semantic judgment · Best-worst scaling · Rating scales · Numeric estimation



This work was funded by a Discovery grant to the second author from the Natural Sciences and Engineering Research Council of Canada.

Supplementary material

13428_2017_1009_MOESM1_ESM.csv (71 kb)
ESM 1 (CSV 71 kb)
13428_2017_1009_MOESM2_ESM.tgz (332 kb)
ESM 2 (TGZ 332 kb)


References

  1. Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438.
  2. Baayen, R. H., Milin, P., & Ramscar, M. (2016). Frequency in lexical processing. Aphasiology, 30(11), 1174–1220.
  3. Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459.
  4. Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1) (pp. 238–247).
  5. Barsalou, L. W. (1999). Perceptions of perceptual symbols. Behavioral and Brain Sciences, 22(4), 637–660.
  6. Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings (pp. 1–45). Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.
  7. Brysbaert, M., & Biemiller, A. (2017). Test-based age-of-acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 49(4), 1520–1523.
  8. Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911.
  9. Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80–84.
  10. Connell, L., & Lynott, D. (2012). Strength of perceptual experience predicts word processing performance better than concreteness or imageability. Cognition, 125(3), 452–465.
  11. Dale, E., & O'Rourke, J. (1981). The living word vocabulary, the words we know: A national vocabulary inventory. Chicago: World Book.
  12. Estes, Z., & Adelman, J. S. (2008). Automatic vigilance for negative words in lexical decision and naming: Comment on Larsen, Mercer, and Balota (2006). Emotion, 8, 441–444.
  13. Goodman, J. C., Dale, P. S., & Li, P. (2008). Does frequency count? Parental input and the acquisition of vocabulary. Journal of Child Language, 35(3), 515–531.
  14. Herdağdelen, A., & Marelli, M. (2017). Social media and language processing: How Facebook and Twitter provide the best frequency estimates for studying word recognition. Cognitive Science, 41(4), 976–995.
  15. Hollis, G., & Westbury, C. (2016). The principals of meaning: Extracting semantic dimensions from co-occurrence models of semantics. Psychonomic Bulletin & Review, 23(6), 1744–1756.
  16. Hollis, G., Westbury, C., & Lefsrud, L. (2017). Extrapolating human judgments from skip-gram vector representations of word meaning. The Quarterly Journal of Experimental Psychology, 70(8), 1603–1619.
  17. Hollis, G. (2017). Scoring best-worst data in unbalanced, many-item designs, with applications to crowdsourcing semantic judgments. Behavior Research Methods, 1–19.
  18. Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287–304.
  19. Kiritchenko, S., & Mohammad, S. M. (2016). Capturing reliable fine-grained sentiment associations by crowdsourcing and best-worst scaling. In HLT-NAACL (pp. 811–817).
  20. Kiritchenko, S., & Mohammad, S. M. (2017). Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 465–470). Vancouver, Canada.
  21. Kousta, S. T., Vinson, D. P., & Vigliocco, G. (2009). Emotion words, regardless of polarity, have a processing advantage over neutral words. Cognition, 112(3), 473–481.
  22. Kuperman, V., Estes, Z., Brysbaert, M., & Warriner, A. B. (2014). Emotion and language: Valence and arousal affect word recognition. Journal of Experimental Psychology: General, 143(3), 1065.
  23. Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990.
  24. Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211.
  25. Mandera, P., Keuleers, E., & Brysbaert, M. (2015). How useful are corpus-based methods for extrapolating psycholinguistic variables? The Quarterly Journal of Experimental Psychology, 68(8), 1623–1642.
  26. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111–3119.
  27. Morrison, C. M., Chappell, T. D., & Ellis, A. W. (1997). Age of acquisition norms for a large set of object names and their relation to adult estimates and other variables. The Quarterly Journal of Experimental Psychology: Section A, 50(3), 528–559.
  28. Paivio, A. (1990). Mental representations: A dual coding approach. New York: Oxford University Press.
  29. Pexman, P. M., Heard, A., Lloyd, E., & Yap, M. J. (2017). The Calgary Semantic Decision Project: Concrete/abstract decision data for 10,000 English words. Behavior Research Methods, 49(2), 407–417.
  30. Pollock, L. (2017). Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study. Behavior Research Methods, 1–19.
  31. Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64–99). New York: Appleton-Century-Crofts.
  32. Schwanenflugel, P. J., Harnishfeger, K. K., & Stowe, R. W. (1988). Context availability and lexical decisions for abstract and concrete words. Journal of Memory and Language, 27(5), 499–520.
  33. Stadthagen-Gonzalez, H., Imbault, C., Sánchez, M. A. P., & Brysbaert, M. (2017). Norms of valence and arousal for 14,031 Spanish words. Behavior Research Methods, 49(1), 111–123.
  34. Vigliocco, G., Meteyard, L., Andrews, M., & Kousta, S. (2009). Toward a theory of semantic representation. Language and Cognition, 1(2), 219–247.
  35. Vinson, D., Ponari, M., & Vigliocco, G. (2014). How does emotional content affect lexical processing? Cognition & Emotion, 28(4), 737–746.
  36. Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207.
  37. Westbury, C. (2016). Pay no attention to that man behind the curtain: Explaining semantics without semantics. The Mental Lexicon, 11(3), 350–374.

Copyright information

© Psychonomic Society, Inc. 2017

Authors and Affiliations

  1. Department of Computing Science, University of Alberta, Edmonton, Canada
  2. Department of Psychology, University of Alberta, Edmonton, Canada
