
Perspectives on crowdsourcing annotations for natural language processing

Original Paper · Language Resources and Evaluation

Abstract

Crowdsourcing has emerged as a new method for obtaining annotations with which to train machine learning models. While many variants of the process exist, they differ largely in how they motivate subjects to contribute and in the scale of their applications. To date, no study has helped the practitioner decide what form an annotation application should take to best reach its objectives within the constraints of a project. To fill this gap, we provide a faceted analysis of crowdsourcing from a practitioner’s perspective and show how our facets apply to existing published crowdsourced annotation applications. We then summarize how the major crowdsourcing genres occupy different parts of this multi-dimensional space, leading to our recommendations on the opportunities crowdsourcing offers to future annotation efforts.

Notes

  1. http://wing.comp.nus.edu.sg/crowdsourcing-lrej/.

  2. In MTurk, the notion of a “qualification test” can be viewed this way; a programmatic sketch follows these notes.

  3. http://en.wikipedia.org/wiki/Main_Page.

  4. https://www.mturk.com/mturk/welcome.

  5. These statistics for worker base size were current as of November 2011.

  6. e.g., elance.com and rentacoder.com.

  7. editz.com, formerly goosegrade.com.

  8. http://www.inforsense.com.

  9. http://www.2pirad.com.

  10. http://www.ificlaims.com.

  11. http://www.crowdflower.com.

  12. http://www.samasource.com.

  13. http://www.cloudcrowd.com.

  14. cf. Wordnik http://www.wordnik.com/ and Quora http://www.quora.com.

  15. It would have been an interesting exercise to crowdsource the ratings task itself and obtain a sample size large enough for statistically significant results, but our time and budget did not allow this.
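As an aside to note 2: the MTurk qualification test mentioned there can be set up programmatically. The following is a minimal sketch, assuming the boto3 MTurk client; the qualification name and the single screening question are hypothetical illustrations, not taken from the paper.

    import boto3

    # Sketch: create an MTurk qualification test that workers must pass
    # before they can accept annotation HITs (cf. note 2).
    mturk = boto3.client("mturk", region_name="us-east-1")

    # Hypothetical one-question screening form in MTurk's QuestionForm XML.
    QUESTION_FORM = """\
    <QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
      <Question>
        <QuestionIdentifier>screen-1</QuestionIdentifier>
        <QuestionContent><Text>Label the sentiment of: "Great service!"</Text></QuestionContent>
        <AnswerSpecification>
          <SelectionAnswer>
            <Selections>
              <Selection><SelectionIdentifier>pos</SelectionIdentifier><Text>Positive</Text></Selection>
              <Selection><SelectionIdentifier>neg</SelectionIdentifier><Text>Negative</Text></Selection>
            </Selections>
          </SelectionAnswer>
        </AnswerSpecification>
      </Question>
    </QuestionForm>"""

    # With no AnswerKey supplied, test submissions are reviewed manually.
    qual = mturk.create_qualification_type(
        Name="annotation-screening-test",  # hypothetical name
        Description="Screening test for a sentiment annotation task.",
        QualificationTypeStatus="Active",
        Test=QUESTION_FORM,
        TestDurationInSeconds=300,
    )
    print(qual["QualificationType"]["QualificationTypeId"])

A HIT that lists the returned QualificationTypeId among its QualificationRequirements is then shown only to workers who have passed the test, which is one way the screening described in note 2 can be enforced in practice.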


Acknowledgments

We would like to thank the many colleagues who took time out of their tight schedules to help review and improve this paper, including Yee Fan Tan, Jesse Prabawa Gozali, Jun-Ping Ng, Jin Zhao and Ziheng Lin. This research was done under CSIDM Project No. CSIDM-200805, partially funded by a grant from the National Research Foundation (NRF) administered by the Media Development Authority (MDA) of Singapore.

Author information

Correspondence to Min-Yen Kan.

Cite this article

Wang, A., Hoang, C. D. V., & Kan, M.-Y. (2013). Perspectives on crowdsourcing annotations for natural language processing. Language Resources and Evaluation, 47, 9–31. https://doi.org/10.1007/s10579-012-9176-1
