Information Extraction Meets Crowdsourcing: A Promising Couple

  • Focus article (Schwerpunktbeitrag)
  • Published in: Datenbank-Spektrum

Abstract

Recent years have brought tremendous advances in automated information extraction. Still, problem scenarios remain in which even state-of-the-art algorithms do not provide a satisfactory solution. In these cases, another emerging trend can be exploited to achieve the required extraction quality: explicit crowdsourcing of human intelligence tasks. In this paper, we discuss the synergies between information extraction and crowdsourcing. In particular, we methodically identify and classify the challenges and fallacies that arise when combining both approaches. Furthermore, we argue that harnessing the full potential of either approach requires true hybrid techniques. To demonstrate this point, we showcase such a hybrid technique, which tightly interweaves information extraction with crowdsourcing and machine learning to vastly surpass the abilities of either technique alone.

Author information

Corresponding author

Correspondence to Christoph Lofi.

About this article

Cite this article

Lofi, C., Selke, J. & Balke, WT. Information Extraction Meets Crowdsourcing: A Promising Couple. Datenbank Spektrum 12, 109–120 (2012). https://doi.org/10.1007/s13222-012-0092-8

  • DOI: https://doi.org/10.1007/s13222-012-0092-8
