
Soft Computing, Volume 19, Issue 6, pp 1541–1552

How clumpy is my image?

Scoring in crowdsourced annotation tasks
  • Hugo Hutt
  • Richard Everson
  • Murray Grant
  • John Love
  • George Littlejohn
Focus

Abstract

The use of citizen science to obtain annotations from multiple annotators has been shown to be an effective method for annotating datasets for which computational methods alone are not feasible. The way in which the annotations are obtained is an important consideration that affects the quality of the resulting consensus annotation. In this paper, we examine three separate approaches to obtaining consensus scores for instances, rather than merely binary classifications. To obtain a consensus score, annotators were asked to make annotations in one of three paradigms: classification, scoring and ranking. We describe a web-based citizen science experiment that implements the three approaches as crowdsourced annotation tasks. The tasks are evaluated in terms of accuracy and agreement among the participants, using both simulated and real-world data from the experiment. The results show a clear difference in performance among the three tasks, with the ranking task obtaining the highest accuracy and agreement among the participants. We show how a simple evolutionary optimiser may be used to improve performance by reweighting the importance of individual annotators.
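The annotator-reweighting step lends itself to a brief illustration. The Python sketch below is not the authors' implementation: it forms a consensus as a weighted average of per-annotator scores and tunes the weights with a simple (1+1) evolution strategy. The objective (Spearman rank correlation against a small gold-standard subset of images) and all names and parameters are assumptions made for the example.

```python
# Minimal sketch (not the paper's method): weighted-average consensus scoring
# with annotator weights tuned by a (1+1) evolution strategy. The fitness,
# Spearman correlation against a small gold-standard subset, is an
# illustrative assumption.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def consensus(scores, weights):
    """Weighted mean score per image; scores has shape (n_annotators, n_images)."""
    w = np.clip(weights, 0.0, None)
    w = w / (w.sum() + 1e-12)
    return w @ scores

def fitness(scores, weights, gold_idx, gold_scores):
    """Agreement of the weighted consensus with the gold-standard scores."""
    rho, _ = spearmanr(consensus(scores, weights)[gold_idx], gold_scores)
    return rho

def evolve_weights(scores, gold_idx, gold_scores, generations=500, sigma=0.1):
    """(1+1)-ES: mutate the weight vector, keep the child only if fitness improves."""
    n_annotators = scores.shape[0]
    parent = np.ones(n_annotators)
    best = fitness(scores, parent, gold_idx, gold_scores)
    for _ in range(generations):
        child = np.clip(parent + rng.normal(0.0, sigma, n_annotators), 0.0, None)
        f = fitness(scores, child, gold_idx, gold_scores)
        if f >= best:
            parent, best = child, f
    return parent, best

# Toy data: 10 annotators score 50 images, with varying per-annotator noise.
true_scores = rng.uniform(0, 1, 50)
noise = rng.uniform(0.05, 0.6, 10)
scores = true_scores + rng.normal(0, 1, (10, 50)) * noise[:, None]

gold_idx = np.arange(10)  # pretend the first 10 images have expert scores
weights, rho = evolve_weights(scores, gold_idx, true_scores[gold_idx])
print("learned weights:", np.round(weights / weights.sum(), 3))
print("Spearman rho on gold subset:", round(rho, 3))
```

In this toy setup the optimiser tends to downweight the noisiest annotators, which is the qualitative effect the abstract describes; the paper's actual optimiser and objective may differ.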

Keywords

Web-based citizen science, Classification, Consensus score, Crowdsourced annotation tasks, Evolutionary optimiser, Image clump, Ranking, Scoring, Internet, Evolutionary computation, Image classification, Pattern clustering, Microscopy, Correlation


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Hugo Hutt (1)
  • Richard Everson (1)
  • Murray Grant (2)
  • John Love (2)
  • George Littlejohn (2)
  1. Computer Science, The University of Exeter, Exeter, UK
  2. Biosciences, The University of Exeter, Exeter, UK
