How clumpy is my image?
Citizen science, in which annotations are collected from multiple annotators, has been shown to be an effective way to annotate datasets for which computational methods alone are not feasible. How the annotations are obtained is an important consideration because it affects the quality of the resulting consensus annotation. In this paper, we examine three approaches to obtaining consensus scores for instances, rather than merely binary classifications. To obtain a consensus score, annotators are asked to annotate in one of three paradigms: classification, scoring and ranking. We describe a web-based citizen science experiment that implements the three approaches as crowdsourced annotation tasks. The tasks are evaluated in terms of the accuracy of, and agreement among, the participants, using both simulated data and real-world data from the experiment. The results show a clear difference in performance between the three tasks, with the ranking task obtaining the highest accuracy and agreement among the participants. We also show how a simple evolutionary optimiser may be used to improve performance by reweighting the importance of annotators.
Keywords: Web-based citizen science; Classification; Consensus score; Crowdsourced annotation tasks; Evolutionary optimiser; Image clump; Ranking; Scoring; Internet; Evolutionary computation; Image classification; Pattern clustering; Microscopy; Correlation
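The idea of reweighting annotators with a simple evolutionary optimiser can be illustrated with a minimal sketch. It is not necessarily the optimiser used in the paper: it assumes a (1+1) evolution strategy, a weighted-mean consensus score, a mean-squared-error objective, and a set of reference scores for a few images. The function names, toy data and parameters (`generations`, `sigma`, `seed`) are all hypothetical.

```python
import random


def consensus(scores, weights):
    """Weighted mean score per image; scores[a][i] is annotator a's score for image i."""
    total = sum(weights)
    n_images = len(scores[0])
    return [
        sum(w * s[i] for w, s in zip(weights, scores)) / total
        for i in range(n_images)
    ]


def error(scores, weights, reference):
    """Mean squared error between the weighted consensus and the reference scores."""
    c = consensus(scores, weights)
    return sum((x - y) ** 2 for x, y in zip(c, reference)) / len(reference)


def optimise_weights(scores, reference, generations=500, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the weights, keep the child if it is no worse."""
    rng = random.Random(seed)
    parent = [1.0] * len(scores)  # start from equal weights
    parent_err = error(scores, parent, reference)
    for _ in range(generations):
        # Gaussian mutation, clipped so every weight stays strictly positive
        child = [max(1e-6, w + rng.gauss(0.0, sigma)) for w in parent]
        child_err = error(scores, child, reference)
        if child_err <= parent_err:
            parent, parent_err = child, child_err
    return parent, parent_err


# Toy data (assumed): two careful annotators and one noisy annotator.
reference = [0.1, 0.4, 0.5, 0.8, 0.9]
scores = [
    [0.1, 0.4, 0.6, 0.8, 0.9],
    [0.2, 0.3, 0.5, 0.7, 1.0],
    [0.9, 0.1, 0.8, 0.2, 0.5],
]

equal_err = error(scores, [1.0, 1.0, 1.0], reference)
weights, tuned_err = optimise_weights(scores, reference)
print("equal-weight error:", equal_err)
print("tuned error:", tuned_err)
print("weights:", weights)
```

Because a child replaces the parent only when its error is no larger, the tuned error can never exceed the equal-weight error. In practice reference scores exist for only a subset of images, so the tuned weights would need to be checked on held-out data.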