Crowdsourcing large scale wrapper inference

Distributed and Parallel Databases

Abstract

We present a crowdsourcing system for the large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms that delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single-worker algorithm (\({\textsc {alf}}_{\eta }\)) and a multiple-workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach produces accurate wrappers at a low cost, even in the presence of workers with a significant error rate.
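To make the recruitment idea concrete, consider a deliberately simplified model (our own illustration, not alfred's actual estimator): workers answer binary queries independently with a common error rate eta, and answers are aggregated by majority vote. Under these assumptions, the number of workers needed to reach an accuracy target follows from a direct binomial calculation; the function names below are hypothetical.

```python
from math import comb

def majority_error(eta: float, n: int) -> float:
    """Probability that the majority vote of n independent workers,
    each wrong with probability eta on a binary query, is wrong
    (ties count as errors)."""
    return sum(comb(n, k) * eta**k * (1 - eta)**(n - k)
               for k in range(-(-n // 2), n + 1))  # k >= ceil(n/2)

def workers_needed(eta: float, target: float, max_n: int = 25) -> int:
    """Smallest odd number of workers whose majority vote reaches
    the accuracy target, given the estimated error rate eta."""
    for n in range(1, max_n + 1, 2):
        if 1 - majority_error(eta, n) >= target:
            return n
    return max_n
```

For example, with eta = 0.2 and a 95% target, seven workers suffice under this model; a real system would additionally re-estimate eta online, as alfred does.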

Notes

  1. Actually, we do not consider the whole set of pages \({U}\), but a much smaller sample set including the initial annotated page and at most 100 pages randomly chosen from \({U}\).

  2. The problem reduces to finding the smallest set of pages such that the union of the sets of rules differentiated from them equals the set of rules differentiated directly by \(U\).

  3. We choose the rule with the shortest path, but other strategies, such as those discussed in [33], could be applied.

  4. Notice that this approach can be seen as a special case of the previous one, as a task with ground truth information can be seen as a redundant task solved by a perfect worker.

  5. In Sect. 6 we consider workers producing t.s. for several attributes in a single task.

  6. In our implementation, we stop when all the error rates do not change, in absolute value, more than \(\Delta _{\eta }=10^{-4}\).

  7. We neglect any budget issue at alfred’s level, where the goal is to reach the quality target \(\lambda _r\); however, \({\textsc {alf}}_{\eta }\) bounds the budget per attribute spent for each worker to \(\lambda _{MQ}\).

  8. For the sake of simplicity we assume that \(|\mathcal {U}|\) is a multiple of \(N\). Otherwise, the tasks can be completed by inserting control attributes with known answers, which also help to better estimate the workers’ error rate.

  9. We rely on CrowdFlower, a popular meta-platform that offers services to recruit workers on AMT.

  10. We use Selenium (http://docs.seleniumhq.org/projects/webdriver) and Phantomjs (http://phantomjs.org).

  11. http://swde.codeplex.com.

  12. The datasets are available upon request.
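The reduction described in Note 2 is an instance of the minimum set cover problem. A standard greedy approximation (our own sketch with hypothetical names; the system may use a different strategy) repeatedly picks the page that differentiates the most not-yet-covered rules:

```python
def min_query_pages(pages: dict[str, set[str]],
                    all_rules: set[str]) -> list[str]:
    """Greedy set cover: pages maps each page to the set of rules it
    differentiates; return a small subset of pages whose rule sets
    jointly cover all_rules."""
    chosen: list[str] = []
    covered: set[str] = set()
    while covered != all_rules:
        page = max(pages, key=lambda p: len(pages[p] - covered))
        gained = pages[page] - covered
        if not gained:
            break  # remaining rules cannot be covered by any page
        chosen.append(page)
        covered |= gained
    return chosen
```

The greedy choice gives the classic logarithmic approximation guarantee for set cover, which is the best achievable in polynomial time unless P = NP.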
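The stopping criterion in Note 6 fits an EM-style estimation loop (cf. [11]): alternately infer posteriors over the true answers and re-estimate each worker's error rate, stopping when no estimate moves by more than \(\Delta _{\eta }\). The sketch below is our own simplification, assuming binary answers and a symmetric error model; all names are hypothetical.

```python
def estimate_error_rates(answers, tol=1e-4, max_iter=200):
    """EM-style estimation of per-worker error rates from redundant
    binary answers (answers[task][worker] in {0, 1}).  Iteration stops
    when no error rate changes by more than tol in absolute value,
    mirroring the Delta_eta = 1e-4 criterion."""
    workers = {w for resp in answers.values() for w in resp}
    eta = {w: 0.1 for w in workers}          # initial guess
    for _ in range(max_iter):
        # E-step: posterior probability that each task's true answer is 1
        post = {}
        for t, resp in answers.items():
            p1 = p0 = 1.0
            for w, a in resp.items():
                p1 *= (1 - eta[w]) if a == 1 else eta[w]
                p0 *= (1 - eta[w]) if a == 0 else eta[w]
            post[t] = p1 / (p1 + p0)
        # M-step: each worker's expected disagreement with the truth
        new_eta = {}
        for w in workers:
            num, den = 0.0, 0.0
            for t, resp in answers.items():
                if w in resp:
                    num += post[t] if resp[w] == 0 else 1.0 - post[t]
                    den += 1.0
            new_eta[w] = min(max(num / den, 1e-6), 0.5)
        converged = max(abs(new_eta[w] - eta[w]) for w in workers) < tol
        eta = new_eta
        if converged:
            break
    return eta
```

A worker who mostly agrees with the inferred truth drives its estimate toward zero, while a frequently disagreeing worker is pushed toward the uninformative 0.5 ceiling.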
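The padding scheme of Note 8 can be sketched as follows (a hypothetical helper; `gold` stands for an assumed pool of control attributes with known answers):

```python
def build_tasks(attributes, gold, n):
    """Partition the attributes into tasks of exactly n items; the last,
    partial task is padded with control attributes (known answers),
    which double as probes for estimating the worker's error rate."""
    tasks, batch = [], []
    for a in attributes:
        batch.append(a)
        if len(batch) == n:
            tasks.append(batch)
            batch = []
    if batch:  # pad the leftover task up to size n
        batch.extend(gold[:n - len(batch)])
        tasks.append(batch)
    return tasks
```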

References

  1. Angluin, D., Laird, P.: Learning from noisy examples. Mach. Learn. 2(4), 343–370 (1988)

  2. Angluin, D.: Queries revisited. Theor. Comput. Sci. 313(2), 175–194 (2004)

  3. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: SIGMOD 2003, pp. 337–348 (2003)

  4. Balcan, M.F., Hanneke, S., Vaughan, J.W.: The true sample complexity of active learning. Mach. Learn. 80(2–3), 111–139 (2010)

  5. Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Extraction and integration of partially overlapping web sources. PVLDB 6(10), 805–816 (2013)

  6. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)

  7. Crescenzi, V., Merialdo, P., Qiu, D.: A framework for learning web wrappers from the crowd. In: WWW 2013, pp. 261–272 (2013)

  8. Crescenzi, V., Merialdo, P., Qiu, D.: Wrapper generation supervised by a noisy crowd. In: DBCrowd, CEUR Workshop Proceedings 1025, pp. 8–13 (2013)

  9. Crescenzi, V., Merialdo, P.: Wrapper inference for ambiguous web pages. Appl. Artif. Intell. 22(1&2), 21–52 (2008)

  10. Dalvi, N.N., Kumar, R., Soliman, M.A.: Automatic wrappers for large scale web extraction. PVLDB 4(4), 219–230 (2011)

  11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977)

  12. Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide web. Commun. ACM 54(4), 86–96 (2011)

  13. Ferrara, E., Meo, P.D., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: a survey. CoRR (2012). arXiv:1207.0246

  14. Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A.J., Wang, C.: Diadem: domain-centric, intelligent, automated data extraction methodology. In: WWW 2012 (Companion Volume), pp. 267–270 (2012)

  15. Gao, C., Zhou, D.: Minimax optimal convergence rates for estimating ground truth from crowdsourced labels (2013). arXiv:1310.5764

  16. Gottlob, G., Koch, C., Baumgartner, R., Herzog, M., Flesca, S.: The lixto data extraction project: back and forth between theory and practice. In: PODS, pp. 1–12. ACM (2004)

  17. Gulhane, P., Madaan, A., Mehta, R.R., Ramamirtham, J., Rastogi, R., Satpal, S., Sengamedu, S.H., Tengli, A., Tiwari, C.: Web-scale information extraction with vertex. In: ICDE 2011, pp. 1209–1220. IEEE Computer Society (2011)

  18. Gulhane, P., Rastogi, R., Sengamedu, S.H., Tengli, A.: Exploiting content redundancy for web information extraction. PVLDB 3(1), 578–587 (2010)

  19. Hao, Q., Cai, R., Pang, Y., Zhang, L.: From one tree to a forest: a unified solution for structured web data extraction. In: SIGIR 2011, pp. 775–784. ACM (2011)

  20. Ipeirotis, P.G.: Analyzing the amazon mechanical turk marketplace. XRDS 17(2), 16–21 (2010)

  21. Irmak, U., Suel, T.: Interactive wrapper generation with minimal user effort. In: WWW 2006, pp. 553–563. ACM (2006)

  22. Karger, D.R., Oh, S., Shah, D.: Budget-optimal task allocation for reliable crowdsourcing systems. CoRR (2011). arXiv:1110.3564

  23. Karger, D.R., Oh, S., Shah, D.: Iterative learning for reliable crowdsourcing systems. In: NIPS 2011, pp. 1953–1961 (2011)

  24. Kushmerick, N.: Wrapper induction: efficiency and expressiveness. Artif. Intell. 118(1–2), 15–68 (2000)

  25. Lee, M.D., Steyvers, M., De Young, M., Miller, B.: Inferring expertise in knowledge and prediction ranking tasks. Topics Cogn. Sci. 4(1), 151–163 (2012)

  26. Li, H., Yu, B., Zhou, D.: Error rate bounds in crowdsourcing models. CoRR (2013). arXiv:1307.2674

  27. Liu, Q., Ihler, A.T., Steyvers, M.: Scoring workers in crowdsourcing: how many control questions are enough? In: NIPS 2013, pp. 1914–1922 (2013)

  28. Liu, Q., Steyvers, M., Ihler, A.: Scoring workers in crowdsourcing: how many control questions are enough? In: NIPS 2013, pp. 1914–1922 (2013)

  29. Liu, X., Lu, M., Ooi, B.C., Shen, Y., Wu, S., Zhang, M.: CDAS: a crowdsourcing data analytics system. PVLDB 5(10), 1040–1051 (2012)

  30. Marcus, A., Karger, D.R., Madden, S., Miller, R., Oh, S.: Counting with the crowd. PVLDB 6(2), 109–120 (2012)

  31. McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, vol. 2. Wiley, Hoboken (2008)

  32. Muslea, I., Minton, S., Knoblock, C.A.: Active learning with multiple views. J. Artif. Intell. Res. (JAIR) 27, 203–233 (2006)

  33. Parameswaran, A.G., Dalvi, N., Garcia-Molina, H., Rastogi, R.: Optimal schemes for robust web extraction. PVLDB 4(11), 980–991 (2011)

  34. Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison (2009)

  35. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? improving data quality and data mining using multiple, noisy labelers. In: KDD 2008, pp. 614–622, ACM (2008)

  36. Wong, T.L.: Learning to adapt cross language information extraction wrapper. Appl. Intell. 36(4), 918–931 (2012)

Corresponding author

Correspondence to Paolo Merialdo.

About this article

Cite this article

Crescenzi, V., Merialdo, P. & Qiu, D. Crowdsourcing large scale wrapper inference. Distrib Parallel Databases 33, 95–122 (2015). https://doi.org/10.1007/s10619-014-7163-9
