Distributed and Parallel Databases

Volume 33, Issue 1, pp. 95–122

Crowdsourcing large scale wrapper inference

  • Valter Crescenzi
  • Paolo Merialdo
  • Disheng Qiu


We present a crowdsourcing system for the large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms that delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single-worker algorithm (\({\textsc {alf}}_{\eta }\)) and a multiple-worker algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach produces accurate wrappers at a low cost, even in the presence of workers with a significant error rate.


Keywords: Data extraction · Wrapper induction · Crowdsourcing
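The abstract's core idea in alfred, deciding at runtime how many workers to recruit from an estimated error rate, can be illustrated with a minimal sketch. This is not the paper's actual algorithm: it assumes independent workers sharing a single known error rate and simple majority voting on binary queries, and all function names are hypothetical.

```python
import math

def majority_vote_accuracy(error_rate: float, k: int) -> float:
    """Probability that a strict majority of k independent workers,
    each wrong with probability `error_rate`, answers a binary
    membership query correctly (ties count as failures)."""
    needed = k // 2 + 1  # votes required for a strict majority
    return sum(
        math.comb(k, c) * (1 - error_rate) ** c * error_rate ** (k - c)
        for c in range(needed, k + 1)
    )

def workers_needed(error_rate: float, quality_target: float,
                   max_workers: int = 25) -> int:
    """Smallest odd number of workers whose majority vote reaches
    the quality target; odd k avoids ties on binary queries."""
    for k in range(1, max_workers + 1, 2):
        if majority_vote_accuracy(error_rate, k) >= quality_target:
            return k
    return max_workers
```

For example, with workers estimated to err 20% of the time, `workers_needed(0.2, 0.95)` returns 7: five workers yield a majority-vote accuracy of about 0.942, still below the 0.95 target, while seven yield about 0.967. The actual system additionally estimates the error rate itself from the workers' answers rather than assuming it is known.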



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Dipartimento di Ingegneria, Università degli Studi Roma Tre, Rome, Italy
