Crowdsourcing large scale wrapper inference

Distributed and Parallel Databases

Abstract

We present a crowdsourcing system for the large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms that delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single-worker algorithm (\({\textsc {alf}}_{\eta }\)) and a multiple-workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach produces accurate wrappers at a low cost, even in the presence of workers with a significant error rate.
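To make the recruitment idea concrete, consider a deliberately simplified model (our own illustration, not alfred's actual estimator): workers answer binary queries independently with a common error rate eta, and answers are aggregated by majority vote. Under these assumptions, the number of workers needed to reach an accuracy target follows from a direct binomial calculation; the function names below are hypothetical.

```python
from math import comb

def majority_error(eta: float, n: int) -> float:
    """Probability that the majority vote of n independent workers,
    each wrong with probability eta on a binary query, is wrong
    (ties count as errors)."""
    return sum(comb(n, k) * eta**k * (1 - eta)**(n - k)
               for k in range(-(-n // 2), n + 1))  # k >= ceil(n/2)

def workers_needed(eta: float, target: float, max_n: int = 25) -> int:
    """Smallest odd number of workers whose majority vote reaches
    the accuracy target, given the estimated error rate eta."""
    for n in range(1, max_n + 1, 2):
        if 1 - majority_error(eta, n) >= target:
            return n
    return max_n
```

For example, with eta = 0.2 and a 95% target, seven workers suffice under this model; a real system would additionally re-estimate eta online, as alfred does.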

Notes

  1. Actually, we do not consider the whole set of pages \({U}\), but a much smaller sample set including the initial annotated page and at most 100 pages randomly chosen from \({U}\).

  2. The problem reduces to finding the smallest set of pages such that the union of the sets of rules differentiated from them equals the set of rules differentiated directly by \(U\).

  3. We choose the rule with the shortest path, but other strategies, such as those discussed in [33], could be applied.

  4. Notice that this approach can be seen as a special case of the previous one, as a task with ground truth information can be seen as a redundant task solved by a perfect worker.

  5. In Sect. 6 we consider workers producing t.s. for several attributes in a single task.

  6. In our implementation, we stop when all the error rates do not change, in absolute value, more than \(\Delta _{\eta }=10^{-4}\).

  7. We neglect any budget issue at alfred’s level, where the goal is to reach the quality target \(\lambda _r\); however, \({\textsc {alf}}_{\eta }\) bounds the budget per attribute spent for each worker to \(\lambda _{MQ}\).

  8. For the sake of simplicity we assume that \(|\mathcal {U}|\) is a multiple of \(N\). Otherwise, the tasks can be completed by inserting control attributes with known answers, which also help to better estimate the workers’ error rate.

  9. We rely on CrowdFlower, a popular meta-platform that offers services to recruit workers on AMT.

  10. We use Selenium (http://docs.seleniumhq.org/projects/webdriver) and Phantomjs (http://phantomjs.org).

  11. http://swde.codeplex.com.

  12. The datasets are available upon request.
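The reduction described in Note 2 is an instance of the minimum set cover problem. A standard greedy approximation (our own sketch with hypothetical names; the system may use a different strategy) repeatedly picks the page that differentiates the most not-yet-covered rules:

```python
def min_query_pages(pages: dict[str, set[str]],
                    all_rules: set[str]) -> list[str]:
    """Greedy set cover: pages maps each page to the set of rules it
    differentiates; return a small subset of pages whose rule sets
    jointly cover all_rules."""
    chosen: list[str] = []
    covered: set[str] = set()
    while covered != all_rules:
        page = max(pages, key=lambda p: len(pages[p] - covered))
        gained = pages[page] - covered
        if not gained:
            break  # remaining rules cannot be covered by any page
        chosen.append(page)
        covered |= gained
    return chosen
```

The greedy choice gives the classic logarithmic approximation guarantee for set cover, which is the best achievable in polynomial time unless P = NP.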
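The stopping criterion in Note 6 fits an EM-style estimation loop (cf. [11]): alternately infer posteriors over the true answers and re-estimate each worker's error rate, stopping when no estimate moves by more than \(\Delta _{\eta }\). The sketch below is our own simplification, assuming binary answers and a symmetric error model; all names are hypothetical.

```python
def estimate_error_rates(answers, tol=1e-4, max_iter=200):
    """EM-style estimation of per-worker error rates from redundant
    binary answers (answers[task][worker] in {0, 1}).  Iteration stops
    when no error rate changes by more than tol in absolute value,
    mirroring the Delta_eta = 1e-4 criterion."""
    workers = {w for resp in answers.values() for w in resp}
    eta = {w: 0.1 for w in workers}          # initial guess
    for _ in range(max_iter):
        # E-step: posterior probability that each task's true answer is 1
        post = {}
        for t, resp in answers.items():
            p1 = p0 = 1.0
            for w, a in resp.items():
                p1 *= (1 - eta[w]) if a == 1 else eta[w]
                p0 *= (1 - eta[w]) if a == 0 else eta[w]
            post[t] = p1 / (p1 + p0)
        # M-step: each worker's expected disagreement with the truth
        new_eta = {}
        for w in workers:
            num, den = 0.0, 0.0
            for t, resp in answers.items():
                if w in resp:
                    num += post[t] if resp[w] == 0 else 1.0 - post[t]
                    den += 1.0
            new_eta[w] = min(max(num / den, 1e-6), 0.5)
        converged = max(abs(new_eta[w] - eta[w]) for w in workers) < tol
        eta = new_eta
        if converged:
            break
    return eta
```

A worker who mostly agrees with the inferred truth drives its estimate toward zero, while a frequently disagreeing worker is pushed toward the uninformative 0.5 ceiling.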
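The padding scheme of Note 8 can be sketched as follows (a hypothetical helper; `gold` stands for an assumed pool of control attributes with known answers):

```python
def build_tasks(attributes, gold, n):
    """Partition the attributes into tasks of exactly n items; the last,
    partial task is padded with control attributes (known answers),
    which double as probes for estimating the worker's error rate."""
    tasks, batch = [], []
    for a in attributes:
        batch.append(a)
        if len(batch) == n:
            tasks.append(batch)
            batch = []
    if batch:  # pad the leftover task up to size n
        batch.extend(gold[:n - len(batch)])
        tasks.append(batch)
    return tasks
```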

References

  1. Angluin, D., Laird, P.: Learning from noisy examples. Mach. Learn. 2(4), 343–370 (1988)

  2. Angluin, D.: Queries revisited. Theor. Comput. Sci. 313(2), 175–194 (2004)

  3. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: SIGMOD 2003, pp. 337–348 (2003)

  4. Balcan, M.F., Hanneke, S., Vaughan, J.W.: The true sample complexity of active learning. Mach. Learn. 80(2–3), 111–139 (2010)

  5. Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Extraction and integration of partially overlapping web sources. PVLDB 6(10), 805–816 (2013)

  6. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)

  7. Crescenzi, V., Merialdo, P., Qiu, D.: A framework for learning web wrappers from the crowd. In: WWW 2013, pp. 261–272 (2013)

  8. Crescenzi, V., Merialdo, P., Qiu, D.: Wrapper generation supervised by a noisy crowd. In: DBCrowd, CEUR Workshop Proceedings 1025, pp. 8–13 (2013)

  9. Crescenzi, V., Merialdo, P.: Wrapper inference for ambiguous web pages. Appl. Artif. Intell. 22(1&2), 21–52 (2008)

  10. Dalvi, N.N., Kumar, R., Soliman, M.A.: Automatic wrappers for large scale web extraction. PVLDB 4(4), 219–230 (2011)

  11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977)

  12. Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide web. Commun. ACM 54(4), 86–96 (2011)

  13. Ferrara, E., Meo, P.D., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: a survey. CoRR (2012). arXiv:1207.0246

  14. Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A.J., Wang, C.: Diadem: domain-centric, intelligent, automated data extraction methodology. In: WWW 2012 (Companion Volume), pp. 267–270 (2012)

  15. Gao, C., Zhou, D.: Minimax optimal convergence rates for estimating ground truth from crowdsourced labels (2013). arXiv:1310.5764

  16. Gottlob, G., Koch, C., Baumgartner, R., Herzog, M., Flesca, S.: The lixto data extraction project: back and forth between theory and practice. In: PODS, pp. 1–12. ACM (2004)

  17. Gulhane, P., Madaan, A., Mehta, R.R., Ramamirtham, J., Rastogi, R., Satpal, S., Sengamedu, S.H., Tengli, A., Tiwari, C.: Web-scale information extraction with vertex. In: ICDE 2011, pp. 1209–1220. IEEE Computer Society (2011)

  18. Gulhane, P., Rastogi, R., Sengamedu, S.H., Tengli, A.: Exploiting content redundancy for web information extraction. PVLDB 3(1), 578–587 (2010)

  19. Hao, Q., Cai, R., Pang, Y., Zhang, L.: From one tree to a forest: a unified solution for structured web data extraction. In: SIGIR 2011, pp. 775–784. ACM (2011)

  20. Ipeirotis, P.G.: Analyzing the amazon mechanical turk marketplace. XRDS 17(2), 16–21 (2010)

  21. Irmak, U., Suel, T.: Interactive wrapper generation with minimal user effort. In: WWW 2006, pp. 553–563. ACM (2006)

  22. Karger, D.R., Oh, S., Shah, D.: Budget-optimal task allocation for reliable crowdsourcing systems. CoRR (2011). arXiv:1110.3564

  23. Karger, D.R., Oh, S., Shah, D.: Iterative learning for reliable crowdsourcing systems. In: NIPS 2011, pp. 1953–1961 (2011)

  24. Kushmerick, N.: Wrapper induction: efficiency and expressiveness. Artif. Intell. 118(1–2), 15–68 (2000)

  25. Lee, M.D., Steyvers, M., De Young, M., Miller, B.: Inferring expertise in knowledge and prediction ranking tasks. Topics Cogn. Sci. 4(1), 151–163 (2012)

  26. Li, H., Yu, B., Zhou, D.: Error rate bounds in crowdsourcing models. CoRR (2013). arXiv:1307.2674

  27. Liu, Q., Ihler, A.T., Steyvers, M.: Scoring workers in crowdsourcing: how many control questions are enough? In: NIPS 2013, pp. 1914–1922 (2013)

  28. Liu, Q., Steyvers, M., Ihler, A.: Scoring workers in crowdsourcing: how many control questions are enough? In: NIPS 2013, pp. 1914–1922 (2013)

  29. Liu, X., Lu, M., Ooi, B.C., Shen, Y., Wu, S., Zhang, M.: CDAS: a crowdsourcing data analytics system. PVLDB 5(10), 1040–1051 (2012)

  30. Marcus, A., Karger, D.R., Madden, S., Miller, R., Oh, S.: Counting with the crowd. PVLDB 6(2), 109–120 (2012)

  31. McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, vol. 2. Wiley, Hoboken (2008)

  32. Muslea, I., Minton, S., Knoblock, C.A.: Active learning with multiple views. J. Artif. Intell. Res. (JAIR) 27, 203–233 (2006)

  33. Parameswaran, A.G., Dalvi, N., Garcia-Molina, H., Rastogi, R.: Optimal schemes for robust web extraction. PVLDB 4(11), 980–991 (2011)

  34. Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison (2009)

  35. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? improving data quality and data mining using multiple, noisy labelers. In: KDD 2008, pp. 614–622, ACM (2008)

  36. Wong, T.L.: Learning to adapt cross language information extraction wrapper. Appl. Intell. 36(4), 918–931 (2012)

Corresponding author

Correspondence to Paolo Merialdo.

About this article

Cite this article

Crescenzi, V., Merialdo, P. & Qiu, D. Crowdsourcing large scale wrapper inference. Distrib Parallel Databases 33, 95–122 (2015). https://doi.org/10.1007/s10619-014-7163-9
