Volume 12, Issue 2, pp 109–120

Information Extraction Meets Crowdsourcing: A Promising Couple

  • Christoph Lofi
  • Joachim Selke
  • Wolf-Tilo Balke


Recent years have brought tremendous advances in automated information extraction. Still, problem scenarios remain in which even state-of-the-art algorithms do not provide a satisfactory solution. In these cases, another emerging trend can be exploited to achieve the required extraction quality: explicit crowdsourcing of human intelligence tasks. In this paper, we discuss the synergies between information extraction and crowdsourcing. In particular, we methodically identify and classify the challenges and fallacies that arise when combining both approaches. Furthermore, we argue that harnessing the full potential of either approach requires true hybrid techniques. To demonstrate this point, we showcase such a hybrid technique, which tightly interweaves information extraction with crowdsourcing and machine learning to vastly surpass the abilities of either technique alone.


Majority Vote, Information Extraction, Perceptual Space, Malicious User, Genre Classification



Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Christoph Lofi (1), Email author
  • Joachim Selke (1)
  • Wolf-Tilo Balke (1)
  1. Institut für Informationssysteme, Technische Universität Braunschweig, Braunschweig, Germany
