A unified statistical framework for crowd labeling

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Recently, there has been a surge in research on human computation via crowdsourcing. Multiple-choice (labeling) questions are a common type of problem solved by this approach. As one application, crowd labeling is used to obtain true labels for large machine learning datasets. Since crowd workers are not necessarily experts, the labels they provide can be noisy and erroneous. This challenge is usually addressed by collecting multiple labels for each sample and then aggregating them to estimate the true label. Although this mechanism yields high-quality labels, it is not cost-effective. Consequently, current efforts aim to maximize the accuracy of the estimated true labels while keeping the number of acquired labels fixed.
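To make the aggregation step concrete, here is a minimal sketch of the simplest aggregation rule, majority voting, applied to redundantly collected labels. The function name and the dictionary-based layout of the label sets are illustrative assumptions, not constructs from the paper.

```python
from collections import Counter

def majority_vote(labels_per_item):
    """Estimate each item's true label as its most frequent crowd label.

    labels_per_item: dict mapping item id -> list of labels collected
    from different workers. Ties are broken by Counter's ordering.
    """
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in labels_per_item.items()
    }

# Three items, each redundantly labeled by three hypothetical workers.
crowd_labels = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
    "img_3": ["cat", "dog", "dog"],
}
print(majority_vote(crowd_labels))
# {'img_1': 'cat', 'img_2': 'dog', 'img_3': 'dog'}
```

Majority voting treats all workers as equally reliable; the methods surveyed in this paper improve on it by modeling worker quality explicitly.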

This paper surveys methods for aggregating redundant crowd labels to estimate unknown true labels. It presents a unified statistical latent model in which the differences among popular methods in the field correspond to different choices of the model's parameters. It then surveys algorithms for performing inference on these models, and discusses adaptive methods that iteratively collect labels based on previously collected labels and the estimated models. Finally, it compares the most prominent methods and provides guidelines for future work on current open issues.
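To make the latent-model view concrete, the sketch below implements a simplified EM algorithm in the style of Dawid and Skene's classic latent class model: each item has a latent true label, each worker a confusion matrix, and the algorithm alternates between estimating posterior label distributions (E-step) and re-estimating the class prior and worker confusions (M-step). The dense vote matrix, initialization, iteration count, and smoothing constant are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def dawid_skene_em(votes, n_classes, n_iters=50, smoothing=0.01):
    """Simplified Dawid-Skene-style EM for aggregating crowd labels.

    votes: int array of shape (n_items, n_workers); votes[i, j] is
    worker j's label for item i (for brevity, every worker is assumed
    to label every item).

    Returns (posteriors, confusions), where posteriors[i, k] estimates
    P(true label of item i = k) and confusions[j, k, l] estimates
    P(worker j reports l | true label is k).
    """
    n_items, n_workers = votes.shape

    # Initialize posteriors from per-item vote frequencies (a soft majority vote).
    posteriors = np.stack(
        [(votes == k).sum(axis=1) for k in range(n_classes)], axis=1
    ).astype(float)
    posteriors /= posteriors.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: re-estimate the class prior and each worker's confusion matrix.
        prior = posteriors.mean(axis=0)
        confusions = np.full((n_workers, n_classes, n_classes), smoothing)
        for j in range(n_workers):
            for l in range(n_classes):
                confusions[j, :, l] += posteriors[votes[:, j] == l].sum(axis=0)
        confusions /= confusions.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true label given all votes.
        log_post = np.tile(np.log(prior), (n_items, 1))
        for j in range(n_workers):
            # confusions[j, :, votes[:, j]] has shape (n_items, n_classes);
            # entry [i, k] is P(worker j reports votes[i, j] | true label k).
            log_post += np.log(confusions[j, :, votes[:, j]])
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        posteriors = np.exp(log_post)
        posteriors /= posteriors.sum(axis=1, keepdims=True)

    return posteriors, confusions
```

An adaptive scheme built on such a model might, for instance, request the next label for the item whose current posterior is most uncertain (e.g., highest entropy); the adaptive methods discussed in the paper formalize such label-collection policies.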


Notes

  1. http://mturk.com.

  2. http://crowdflower.com.

  3. http://castingwords.com.

  4. http://crowdspring.com.

  5. http://microworkers.com.

  6. http://mobileworks.com.

  7. To access the source codes, datasets, and all other results, please refer to the first author’s web page at http://ce.sharif.edu/~muhammadi.

    Table 3: Results on the 2-class Duchenne dataset
    Table 4: Results on the 3-class RTE dataset
    Table 5: Results on the 4-class TEMP dataset


Author information

Corresponding author

Correspondence to Hamid R. Rabiee.


About this article

Cite this article

Muhammadi, J., Rabiee, H.R. & Hosseini, A. A unified statistical framework for crowd labeling. Knowl Inf Syst 45, 271–294 (2015). https://doi.org/10.1007/s10115-014-0790-7
