When is best-worst best? A comparison of best-worst scaling, numeric estimation, and rating scales for collection of semantic norms
Large-scale semantic norms have become both prevalent and influential in recent psycholinguistic research. However, little attention has been directed towards understanding the methodological best practices of such norm collection efforts. We compared the quality of semantic norms obtained through rating scales, numeric estimation, and a less commonly used judgment format called best-worst scaling. We found that best-worst scaling usually produces norms with higher predictive validities than other response formats, and does so while requiring less data to be collected overall. We also found evidence that the various response formats may be producing qualitatively, rather than just quantitatively, different data. This raises the issue of potential response format bias, which has not been addressed by previous efforts to collect semantic norms, likely because of previous reliance on a single type of response format for a single type of semantic judgment. We have made available software for creating best-worst stimuli and scoring best-worst data. We have also made available new norms for age of acquisition, valence, arousal, and concreteness collected using best-worst scaling. These norms include entries for 1,040 words, of which 1,034 are also contained in the ANEW norms (Bradley & Lang, 1999).
Keywords: Semantics, Semantic judgment, Best-worst scaling, Rating scales, Numeric estimation
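To make the best-worst response format concrete, here is a minimal sketch of one simple way to score it: on each trial a participant sees a small set of words and picks the "best" and "worst" along some dimension, and each word is scored as (times chosen best minus times chosen worst) divided by times shown. This counting rule is an assumption chosen for illustration and is not necessarily the scoring method implemented in the software mentioned in the abstract. The function name `best_worst_counts` and the example trials are hypothetical.

```python
from collections import Counter


def best_worst_counts(trials):
    """Score items by (times best - times worst) / times shown.

    `trials` is an iterable of (items, best, worst) tuples, where `items`
    is the tuple of words shown together on one trial.
    Illustrative counting rule only; other scoring methods exist.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, b, w in trials:
        # Both choices must come from the displayed set and must differ.
        if b not in items or w not in items or b == w:
            raise ValueError(f"invalid trial: {items!r}, best={b!r}, worst={w!r}")
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    # Scores fall in [-1, 1]; 0 means chosen best and worst equally often.
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical valence judgments (made-up data, for illustration only).
trials = [
    (("joy", "table", "grief", "rain"), "joy", "grief"),
    (("joy", "rain", "table", "war"), "joy", "war"),
    (("grief", "war", "table", "rain"), "table", "war"),
]
scores = best_worst_counts(trials)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:+.2f}")
```

Because each trial yields information about several words at once, a format like this can in principle rank many items from relatively few judgments, which is the kind of efficiency the abstract refers to.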
This work was funded by a Discovery grant to the second author from the Natural Sciences and Engineering Research Council of Canada.
- Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438.
- Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1): 238–247.
- Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings (pp. 1–45). Technical report C-1, the center for research in psychophysiology, University of Florida.
- Brysbaert, M., & Biemiller, A. (2017). Test-based age-of-acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 49(4), 1520–1523.
- Dale, E., & O'Rourke, J. (1981). The Living Word Vocabulary, the Words We Know: A National Vocabulary Inventory. Chicago: World Book.
- Hollis, G. (2017). Scoring best-worst data in unbalanced, many-item designs, with applications to crowdsourcing semantic judgments. Behavior Research Methods, 1–19.
- Kiritchenko, S., & Mohammad, S. M. (2016). Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best-Worst Scaling. In HLT-NAACL (pp. 811–817) http://aclweb.org/anthology/N/N16/.
- Kiritchenko, S., & Mohammad, S. M. (2017). Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of The Annual Meeting of the Association for Computational Linguistics (pp. 465–470). Vancouver, Canada. http://www.aclweb.org/anthology/P/P17/
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111–3119.
- Paivio, A. (1990). Mental representations: A dual coding approach. New York: Oxford University Press.
- Pollock, L. (2017). Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study. Behavior Research Methods, 1–19. https://doi.org/10.3758/s13428-017-0938-y
- Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64–99). New York: Appleton-Century-Crofts.