Abstract
Recent years have brought tremendous advances in automated information extraction. Still, problem scenarios remain where even state-of-the-art algorithms do not provide a satisfying solution. In these cases, another emerging trend can be exploited to achieve the required extraction quality: explicit crowdsourcing of human intelligence tasks. In this paper, we discuss the synergies between information extraction and crowdsourcing. In particular, we methodically identify and classify the challenges and fallacies that arise when combining both approaches. Furthermore, we argue that harnessing the full potential of either approach requires true hybrid techniques. To demonstrate this point, we showcase such a hybrid technique, which tightly interweaves information extraction with crowdsourcing and machine learning to vastly surpass the abilities of either technique alone.
Notes
For more detailed information, see http://crowdflower.com/docs/gold.
Cite this article
Lofi, C., Selke, J. & Balke, WT. Information Extraction Meets Crowdsourcing: A Promising Couple. Datenbank Spektrum 12, 109–120 (2012). https://doi.org/10.1007/s13222-012-0092-8
DOI: https://doi.org/10.1007/s13222-012-0092-8