The role of number of items per trial in best–worst scaling experiments

  • Geoff Hollis


Best–worst scaling is a judgment format in which participants are presented with K items and must choose the best and the worst item from that set, along some underlying latent dimension. Best–worst scaling has seen recent use in natural-language processing and psychology for collecting lexical semantic norms. In such applications, four items have always been presented on each trial. The present study offers reasons why values of K other than 4 might yield better estimates of the latent values. Results from simulation experiments and behavioral research confirm this: both suggest that, in the general case, six items per trial yields lower error in the latent value estimates.
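The trial structure described above — sample K items, record which one is judged best and which worst — can be sketched with a small simulation. This is an illustrative sketch only: it uses the simple best-minus-worst counting score (one of several scoring methods in the best–worst scaling literature, not necessarily the one used in this study), a Gaussian noise model of perception, and hypothetical function names.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def simulate_bws(latent, k=4, n_trials=2000, noise=1.0, seed=0):
    """Simulate best-worst trials over items with known latent values.

    On each trial, k items are drawn at random; the 'best' choice is the
    item with the highest noisy perceived value and the 'worst' the
    lowest. Each item is scored as
    (times chosen best - times chosen worst) / times presented.
    """
    rng = random.Random(seed)
    items = list(range(len(latent)))
    best = [0] * len(latent)
    worst = [0] * len(latent)
    seen = [0] * len(latent)
    for _ in range(n_trials):
        trial = rng.sample(items, k)
        # Perceived value = true latent value plus Gaussian judgment noise.
        perceived = {i: latent[i] + rng.gauss(0, noise) for i in trial}
        best[max(trial, key=perceived.get)] += 1
        worst[min(trial, key=perceived.get)] += 1
        for i in trial:
            seen[i] += 1
    return [(best[i] - worst[i]) / max(seen[i], 1) for i in items]

# 20 items with evenly spaced true latent values.
latent = [i / 2.0 for i in range(20)]
scores_k4 = simulate_bws(latent, k=4)
scores_k6 = simulate_bws(latent, k=6)
```

Comparing `pearson(scores_k4, latent)` against `pearson(scores_k6, latent)` across many seeds is one way to probe the paper's question of how K affects estimation error; a single run is noisy, so any comparison should be averaged over repetitions.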


Keywords: Semantic judgment · Best–worst scaling · Rating scales



Harald Baayen and 2 anonymous reviewers are thanked for their generous and helpful critiques on the presentation and relevance of the vector learning model. Jason Hicks’ pointed guidance as editor played an essential role in carving this manuscript out of the muck that was its prior versions. This manuscript would have never come to fruition without years of discussion on naive discriminative learning with Chris Westbury.



Copyright information

© The Psychonomic Society, Inc. 2019

Authors and Affiliations

  1. Department of Computing Science, University of Alberta, Edmonton, Canada
