
Language Resources and Evaluation, Volume 47, Issue 1, pp 9–31

Perspectives on crowdsourcing annotations for natural language processing

  • Aobo Wang
  • Cong Duy Vu Hoang
  • Min-Yen Kan
Original Paper

Abstract

Crowdsourcing has emerged as a new method for obtaining annotations to train machine learning models. While many variants of this process exist, they differ largely in how they motivate subjects to contribute and in the scale of their applications. To date, there has been no study that helps the practitioner decide what form an annotation application should take to best reach its objectives within the constraints of a project. To fill this gap, we provide a faceted analysis of crowdsourcing from a practitioner’s perspective, and show how our facets apply to existing published crowdsourced annotation applications. We then summarize how the major crowdsourcing genres fill different parts of this multi-dimensional space, which leads to our recommendations on the potential opportunities crowdsourcing offers to future annotation efforts.

Keywords

Human computation · Crowdsourcing · NLP · Wikipedia · Mechanical Turk · Games with a purpose · Annotation

Notes

Acknowledgments

We would like to thank the many colleagues who took time out of their tight schedules to help review and improve this paper, including Yee Fan Tan, Jesse Prabawa Gozali, Jun-Ping Ng, Jin Zhao, and Ziheng Lin. This research was done under CSIDM Project No. CSIDM-200805, partially funded by a grant from the National Research Foundation (NRF) administered by the Media Development Authority (MDA) of Singapore.


Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. AS6 04-13, Computing 1, 13 Computing Drive, National University of Singapore, Singapore
  2. Human Language Technology Department, Institute for Infocomm Research (I²R), A*STAR, Singapore
  3. AS6 05-12, Computing 1, 13 Computing Drive, National University of Singapore, Singapore
