Abstract
Recently, there has been a surge of research on human computation via crowdsourcing. Multiple-choice (or labeling) questions are a common type of problem solved by this approach. One application is crowd labeling, which seeks true labels for large machine learning datasets. Since crowd workers are not necessarily experts, the labels they provide are noisy and error-prone. This challenge is usually addressed by collecting multiple labels for each sample and then aggregating them to estimate the true label. Although this mechanism yields high-quality labels, it is not cost-effective. Consequently, current efforts aim to maximize the accuracy of the estimated true labels while fixing the number of acquired labels.
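The simplest aggregation strategy alluded to above is a per-item majority vote over the redundant labels. As an illustrative sketch (not a method from the paper), assuming labels arrive as one list of worker answers per item:

```python
from collections import Counter

def majority_vote(labels_per_item):
    """Aggregate redundant crowd labels by picking the most
    frequent label collected for each item (ties broken by
    the order in which labels were received)."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in labels_per_item]

# e.g. two items, three noisy workers each
estimates = majority_vote([["cat", "cat", "dog"], ["dog", "dog", "dog"]])
```

Majority voting ignores worker reliability entirely, which is precisely the limitation the model-based methods surveyed here address.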
This paper surveys methods that aggregate redundant crowd labels to estimate the unknown true labels. It presents a unified statistical latent model in which the differences among popular methods in the field correspond to different choices of the model's parameters. It then surveys algorithms for performing inference on these models, and discusses adaptive methods that iteratively collect labels based on previously collected labels and the estimated models. Finally, the paper compares the most prominent methods and provides guidelines for future work on the current open issues.
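A canonical instance of such a latent model is the Dawid–Skene approach, which alternates between estimating soft true labels and per-worker confusion matrices via EM. The following is a minimal sketch under simplifying assumptions (a dense item-by-worker integer label matrix with -1 marking missing answers; the paper's unified model is more general):

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Dawid-Skene-style EM on a (n_items, n_workers) label matrix.
    labels[i, j] is worker j's label for item i, or -1 if missing.
    Returns posterior probabilities over true labels, (n_items, n_classes)."""
    n_items, n_workers = labels.shape
    # Initialize soft true labels T from per-item vote fractions.
    T = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for j in range(n_workers):
            if labels[i, j] >= 0:
                T[i, labels[i, j]] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and a confusion matrix per worker,
        # conf[j, k, l] ~ P(worker j answers l | true class k).
        prior = T.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)  # smoothing
        for j in range(n_workers):
            for i in range(n_items):
                if labels[i, j] >= 0:
                    conf[j, :, labels[i, j]] += T[i]
            conf[j] /= conf[j].sum(axis=1, keepdims=True)
        # E-step: posterior over each item's true label given all answers.
        logT = np.tile(np.log(prior), (n_items, 1))
        for i in range(n_items):
            for j in range(n_workers):
                if labels[i, j] >= 0:
                    logT[i] += np.log(conf[j, :, labels[i, j]])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T
```

Unlike majority voting, this weights each worker's answers by an estimated reliability, so a consistently wrong worker can even be "flipped" rather than merely outvoted.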
Notes
To access the source codes, datasets, and all other results, please refer to the first author’s web page at http://ce.sharif.edu/~muhammadi.
Muhammadi, J., Rabiee, H.R. & Hosseini, A. A unified statistical framework for crowd labeling. Knowl Inf Syst 45, 271–294 (2015). https://doi.org/10.1007/s10115-014-0790-7