A Study on the Influence of the Number of MTurkers on the Quality of the Aggregate Output
Recent years have seen increased interest in crowdsourcing as a way of obtaining information from a large group of workers at reduced cost. In general, there are arguments both for and against using multiple workers to perform a task. On the positive side, multiple workers bring different perspectives to the process, which may result in a more accurate aggregate output since the biases of individual judgments might offset each other. On the other hand, a larger population of workers is more likely to include poor workers, which might bring down the quality of the aggregate output.
In this paper, we empirically investigate how the number of workers on the crowdsourcing platform Amazon Mechanical Turk influences the quality of the aggregate output in a content-analysis task. We find that both the expected error in the aggregate output and the risk of a poor combination of workers decrease as the number of workers increases.
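To make this intuition concrete, consider the following minimal Monte Carlo sketch (the gold-standard score, bias and noise scales, and crowd sizes are all illustrative assumptions, not the paper's data). It averages noisy worker judgments and reports both the expected error of the aggregate and the tail risk of drawing a poor combination of workers as the crowd grows:

```python
# Minimal sketch (assumed model, not the paper's code): each worker's judgment
# is the gold-standard score plus a worker-specific bias plus per-task noise.
import numpy as np

rng = np.random.default_rng(0)
gold = 7.0                                # hypothetical gold-standard score
pool_biases = rng.normal(0.0, 1.0, 200)   # each worker's systematic bias

def aggregate_error(n, trials=5000):
    """Absolute error of the mean judgment over random n-worker crowds."""
    errors = np.empty(trials)
    for t in range(trials):
        biases = rng.choice(pool_biases, size=n, replace=False)
        judgments = gold + biases + rng.normal(0.0, 0.5, n)  # bias + noise
        errors[t] = abs(judgments.mean() - gold)
    return errors

for n in (1, 3, 5, 10, 20):
    e = aggregate_error(n)
    # The mean tracks the expected error; the 95th percentile tracks the
    # risk of drawing a particularly poor combination of workers.
    print(f"n={n:2d}  expected error={e.mean():.3f}  "
          f"95th percentile={np.percentile(e, 95):.3f}")
```

Under these assumptions, both quantities shrink as n grows, since the individual biases increasingly offset each other in the average, mirroring the trend described above.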
Moreover, our results show that restricting the population of workers to up to the overall top 40% of workers is likely to produce more accurate aggregate outputs, whereas removing up to the overall worst 40% of workers can actually make the aggregate output less accurate. This result holds because top-performing workers are consistent across multiple tasks, whereas the worst-performing workers tend to be inconsistent. Our results thus provide valuable insights into how to design more effective crowdsourcing processes.
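The role of consistency can be explored with a toy model (again an assumption on our part, not the paper's experiment): each worker's per-task error is drawn from Normal(0, sigma_i), so low-sigma workers are consistently accurate while high-sigma workers are only inconsistently poor, which makes a single calibration task a weak signal for identifying them. The sketch below compares restricting the crowd to the top 40% against removing the worst 40%, with workers ranked by their observed calibration-task error:

```python
# Toy model (illustrative assumptions throughout): worker i's error on any
# task is Normal(0, sigma_i). Workers are ranked on one calibration task,
# then aggregated on a fresh test task under three selection policies.
import numpy as np

rng = np.random.default_rng(1)
n_workers, trials = 50, 5000
sigmas = np.linspace(0.2, 3.0, n_workers)  # assumed spread of worker noise
k = int(0.4 * n_workers)                   # 40% of the pool

policies = {"top 40% only": [], "worst 40% removed": [], "full crowd": []}
for _ in range(trials):
    calib = np.abs(rng.normal(0.0, sigmas))  # errors on a calibration task
    test = rng.normal(0.0, sigmas)           # signed errors on a new task
    order = np.argsort(calib)                # rank workers by observed error
    policies["top 40% only"].append(abs(test[order[:k]].mean()))
    policies["worst 40% removed"].append(abs(test[order[:-k]].mean()))
    policies["full crowd"].append(abs(test.mean()))

for name, errs in policies.items():
    print(f"{name:18s} mean |aggregate error| = {np.mean(errs):.3f}")
```

In this toy model, the worst performers are the hardest to identify reliably from a single task, so removing them filters the pool less effectively than selecting the consistent top performers, illustrating (though not reproducing) the asymmetry reported above.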
The authors acknowledge Craig Boutilier, Pascal Poupart, Daniel Lizotte, and Xi Alice Gao for useful discussions. The authors thank Carol Acton, Katherine Acheson, Stefan Rehm, Susan Gow, and Veronica Austen for providing gold-standard outputs for our experiment. The authors also thank the Natural Sciences and Engineering Research Council of Canada for funding this research.