Best–worst scaling is a judgment format in which participants are presented with K items and must choose, along some underlying latent dimension, the best and the worst items in the set. Best–worst scaling has recently been used in natural-language processing and psychology to collect lexical semantic norms. In such applications, four items have always been presented on each trial. The present study provides reasons to expect that set sizes other than four might yield better estimates of the underlying latent values. Results from simulation experiments and behavioral research confirm this: Both suggest that, in the general case, presenting six items per trial reduces error in the latent value estimates more effectively than the customary four.
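To make the simulation logic concrete, the sketch below illustrates one way to compare trial sizes while holding the total number of item presentations roughly constant. It is not the article's simulation code: the normally distributed latent values, the Gaussian judgment noise, and the simple best-minus-worst count scoring are assumptions introduced here purely for illustration.

```python
import numpy as np

def simulate_bws(n_items=200, items_per_trial=4, n_trials=2000,
                 noise_sd=1.0, seed=0):
    """Simulate best-worst scaling trials and recover latent values
    with best-minus-worst count scoring.

    Assumptions (not from the article): latent values are standard
    normal, each judgment reflects the latent value plus Gaussian
    noise, and items are assigned to trials at random.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(n_items)          # true latent values
    best_counts = np.zeros(n_items)
    worst_counts = np.zeros(n_items)
    appearances = np.zeros(n_items)

    for _ in range(n_trials):
        trial = rng.choice(n_items, size=items_per_trial, replace=False)
        # Noisy perception of each item's latent value on this trial
        perceived = latent[trial] + rng.normal(0, noise_sd, items_per_trial)
        best_counts[trial[np.argmax(perceived)]] += 1
        worst_counts[trial[np.argmin(perceived)]] += 1
        appearances[trial] += 1

    # Best-minus-worst scores, normalized by how often each item appeared
    scores = (best_counts - worst_counts) / np.maximum(appearances, 1)
    # Correlation between true and recovered values indexes estimation error
    return np.corrcoef(latent, scores)[0, 1]

# Vary trial size while keeping total item presentations near 12,000
for k in (4, 6, 8):
    r = simulate_bws(items_per_trial=k, n_trials=12000 // k)
    print(f"{k} items per trial: r(true, estimated) = {r:.3f}")
```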
Harald Baayen and two anonymous reviewers are thanked for their generous and helpful critiques of the presentation and relevance of the vector learning model. Jason Hicks’ pointed guidance as editor played an essential role in carving this manuscript out of the muck that was its prior versions. This manuscript would never have come to fruition without years of discussion about naive discriminative learning with Chris Westbury.