Information Retrieval, Volume 16, Issue 2, pp 138–178

An analysis of human factors and label accuracy in crowdsourcing relevance judgments

  • Gabriella Kazai
  • Jaap Kamps
  • Natasa Milic-Frayling


Crowdsourcing relevance judgments for the evaluation of search engines is used increasingly to overcome the issue of scalability that hinders traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals (the crowd) who are motivated by different incentives and complete the tasks with varying levels of attention and success. This increases the need for a careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon's Mechanical Turk, conducted to explore the 'human' characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task, and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured by agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines to crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from a peculiarity of the crowdsourcing environment: the sample of individuals engaged in specific work conditions is inherently influenced by the conditions themselves.
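The quality measure described above, agreement with a gold set, can be computed in a few lines. The sketch below is illustrative only and is not the paper's actual analysis code; the function names and the majority-vote aggregation step (a common way to combine the multiple assignments per HIT) are assumptions, not details taken from the study.

```python
from collections import Counter

def accuracy_against_gold(labels, gold):
    """Fraction of documents whose crowd label matches the gold-set judgment."""
    matched = sum(1 for doc, lab in labels.items() if gold.get(doc) == lab)
    return matched / len(labels)

def majority_vote(assignments):
    """Aggregate one label per document by majority vote over workers' labels.

    `assignments` is a list of {document_id: label} dicts, one per worker.
    """
    votes = {}
    for worker_labels in assignments:
        for doc, lab in worker_labels.items():
            votes.setdefault(doc, []).append(lab)
    return {doc: Counter(labs).most_common(1)[0][0] for doc, labs in votes.items()}
```

With three assignments per HIT, as in this study's setup, one would aggregate the three workers' labels per document and score the aggregate against the gold set, as well as scoring each worker individually.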


Keywords: Crowdsourcing · Relevance judgments · Study of human factors



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Gabriella Kazai (1)
  • Jaap Kamps (2)
  • Natasa Milic-Frayling (1)
  1. Microsoft Research, Cambridge, UK
  2. University of Amsterdam, Amsterdam, The Netherlands