OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Furche, Tim; Gottlob, Georg; Grasso, Giovanni; Schallhart, Christian; Sellers, Andrew

doi:10.1007/s00778-012-0286-6

Special Issue Paper
Published: 10 October 2012

Volume 22, pages 47–72 (2013)

The VLDB Journal

Abstract

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
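The memory claim in the abstract can be illustrated with a minimal Python sketch. It is not OXPath's evaluation algorithm, only the general page-at-a-time idea under simplifying assumptions: the `Page` type, the `fetch` callback, and the toy `SITE` dictionary are hypothetical stand-ins for a rendered browser page and its "next" link.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for one rendered page."""
    records: List[str]        # data extracted from this page
    next_url: Optional[str]   # link to the following page, if any


def extract_page_at_a_time(
    start_url: str, fetch: Callable[[str], Page]
) -> Iterator[str]:
    """Yield records while referencing only the current page."""
    url: Optional[str] = start_url
    while url is not None:
        page = fetch(url)          # the only page held at this point
        yield from page.records    # results are streamed out, not buffered
        url = page.next_url        # the old page can now be discarded


# Toy three-page "site" used in place of a real browser (assumption).
SITE = {
    "p1": Page(["a", "b"], "p2"),
    "p2": Page(["c"], "p3"),
    "p3": Page(["d", "e"], None),
}

results = list(extract_page_at_a_time("p1", SITE.__getitem__))
print(results)
```

Because the generator never keeps a reference to earlier pages, its working memory in this sketch depends on the size of a single page rather than on how many pages are visited.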

Notes

However, classical results [41] on rewriting reverse axes such as ancestor in XPath do not extend to OXPath.
Thus, \((\mathit{path})^*[qp] = \left(\bigcup_{i=0}^\infty \mathit{path}^i\right)[qp]\) always holds, but \((\mathit{path})^*[qp] = \bigcup_{i=0}^\infty \mathit{path}^i[qp]\) does not necessarily hold, since there \([qp]\) is applied to each \(i\)-th copy of \(\mathit{path}\).
Simple OXPath is the restriction of OXPath to simple OXPath expressions, but we allow a doc() action at the start of the expression to set the document to be queried.
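The distinction drawn in the note on (path)*[qp] can be made concrete with a small Python sketch. It assumes, purely for illustration, that [qp] behaves like a positional filter that keeps the last node of a sequence; the integer "nodes" and the helpers `next_page`, `power`, and `keep_last` are hypothetical and not part of OXPath.

```python
from typing import Callable, List

Step = Callable[[int], List[int]]


def next_page(n: int) -> List[int]:
    """Toy path step: node n leads to node n + 1, stopping at node 3."""
    return [n + 1] if n < 3 else []


def power(path: Step, i: int, start: int) -> List[int]:
    """Nodes reached by applying the path step i times from start."""
    nodes = [start]
    for _ in range(i):
        nodes = [m for n in nodes for m in path(n)]
    return nodes


def keep_last(nodes: List[int]) -> List[int]:
    """Assumed positional predicate: keep only the last node."""
    return nodes[-1:]


# path^0 .. path^4 from node 0; path^4 is already empty in this toy model.
iterations = [power(next_page, i, 0) for i in range(5)]

# Predicate applied once to the union of all iterations.
union = sorted({n for nodes in iterations for n in nodes})
filter_after_union = keep_last(union)

# Predicate applied to every iteration separately, then united.
union_of_filtered = sorted({n for nodes in iterations for n in keep_last(nodes)})

print(filter_after_union, union_of_filtered)
```

In this toy model the first form selects only the final node, while the second keeps one node per iteration, so the two results differ, which is the situation the note warns about.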

References

http://www.iopus.com/iMacros
http://www.newprosoft.com/web-content-extractor.htm
http://www.visualwebripper.com
http://www.web-harvest.sourceforge.net
http://www.w3.org/TR/CSS2/selector.html
Alba, A., Bhagwan, V., Grandison, T.: Accessing the deep web: when good ideas go bad. In: OOPSLA (2008)
Anton, T.: XPath—wrapper induction by generalizing tree traversal patterns. In: LWA (2005)
Anupam, V., Freire, J., Kumar, B., Lieuwen, D.: Automating web navigation with the webvcr. In: WWW (2000)
Arocena, G.O., Mendelzon, A.O.: Weboql: Restructuring documents, databases, and webs. In: ICDE (1998)
Badica, C., Badica, A., Popescu, E., Abraham, A.: L-wrappers: concepts, properties and construction: A declarative approach to data extraction from web sources. Soft Comput. 11(8), 753–772 (2007)
Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the Web. In: IJCAI (2007)
Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with Lixto. In: VLDB (2001)
Benedikt, M., Koch, C.: XPath leashed. CSUR 41, 3:1–3:54 (2009)
Bergman, M.K.: The deep web: Surfacing hidden value. J. Electron. Publ. 7(1), 1–17 (2001)
Bigham, J.P., Cavender, A.C., Kaminsky, R.S., Prince, C.M., Robison, T.S.: Transcendence: enabling a personal view of the deep web. In: IUI (2008)
Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Practice Experience 34, 711–726 (2004)
Bolin, M., Webber, M., Rha, P., Wilson, T., Miller, R.C.: Automation and customization of rendered web pages. In: UIST (2005)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1(1), 538–549 (2008)
Centeno, V.L., Kloos, C.D., Fernández, L.S., García, N.F.: Intelligent automated navigation through the deep web. In: Advances in Web Intelligence (2004)
Chang, C.-H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. TKDE 18(10), 1411–1428 (2006)
Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: automatic data extraction from data-intensive web sites. In: SIGMOD (2002)
Etzioni, O., Cafarella, M.J., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the Web: an experimental study. Artif. Intell. 165(1), 91–134 (2005)
Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A., Wang, C.: DIADEM: Domain-centric, intelligent, automated data extraction methodology. In: WWW (2012)
Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable, memory-efficient data extraction from web applications. PVLDB 4(11), 1016–1027 (2011)
Gottlob, G., Koch, C., Pichler, R.: Efficient algorithms for processing XPath queries. In: TODS (2005)
Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P., Tomkins, A., Zien, J.: How to build a webfountain: an architecture for very large-scale text analytics. IBM Syst. J. 43, 64–77 (2004)
He, B., Patel, M., Zhang, Z., Chang, K.C.-C.: Accessing the deep web. Commun. ACM 50(5), 94–101 (2007)
Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2(4), 219–229 (1999)
Kranzdorf, J., Sellers, A., Grasso, G., Schallhart, C., Furche, T.: Spotting the tracks on the OXPath. In: WWW (2012)
Leshed, G., Haber, E.M., Matthews, T., Lau, T.: Coscripter: automating & sharing how-to knowledge in the enterprise. In: CHI (2008)
Lin, J., Wong, J., Nichols, J., Cypher, A., Lau, T.A.: End-user programming of mashups with vegemite. In: IUI (2009)
Liu, L., Pu, C., Han, W.: Xwrap: an xml-enabled wrapper construction system for web information sources. In: ICDE (2000)
Liu, M., Ling, T.W.: A rule-based query language for html. In: DASFAA (2001)
Marx, M.: Conditional XPath. ACM Trans. Database Syst. 30(4), 929–959 (2005)
Marx, M., de Rijke, M.: Semantic characterizations of navigational XPath. ACM SIGMOD Rec. 34(2), 41–46 (2005)
Mendelzon, A.O., Mihaila, G.A., Milo, T.: Querying the world wide web. Int. J. Digit. Libr. 1(1), 54–67 (1997)
Mir, S., Staab, S., Rojas, I.: Web-prospector—an automatic, site-wide wrapper induction approach for scientific deep-web databases. In: BTW (2009)
Montoto, P., Pan, A., Raposo, J., Bellas, F., López, J.: Automating navigation sequences in ajax websites. In: ICWE (2009)
Myllymaki, J.: Effective web data extraction with standard xml technologies. Comput. Netw. 39(5), 635–644 (2002)
Olteanu, D., Meuss, H., Furche, T., Bry, F.: XPath: looking forward. In: EDBT-XML-Based Data Management, LNCS 2490 (2002)
Raposo, J., Pan, A., Álvarez, M., Hidalgo, J., Viña, A.: The wargo system: semi-automatic wrapper generation in presence of complex data access modes. In: DEXA (2002)
Safonov, A.: Web macros by example: users managing the www of applications. In: CHI, pp. 71–72. ACM (1999)
Sahuguet, A., Azavant, F.: Building light-weight wrappers for legacy web data-sources using w4f. In: VLDB, pp. 738–741 (1999)
Sawa, N., Morishima, A., Sugimoto, S., Kitagawa, H.: Wraplet: Wrapping your web contents with a lightweight language. In: SITIS, pp. 387–394 (2007)
Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using datalog with embedded extraction predicates. In: VLDB (2007)
Su, J.-Y., Sun, D.-J., Wu, I.-C., Chen, L.-P.: On design of browser-oriented data extraction system and plug-ins. J. Mar. Sci. Technol. 18(2), 189–200 (2010)
Wang, Y., Hornung, T.: Deep web navigation by example. Scalable Comput. Practice Experience 9, 281–292 (2008)

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Community’s 7th Framework Programme (FP7/2007–2013)/ERC grant agreement no. 246858 (DIADEM). This work was carried out in the wider context of the networking programme FoX—Foundations of XML, FET-Open grant agreement number FP7-ICT-233599. The views expressed in this article are solely those of the authors.

Author information

Authors and Affiliations

Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford, OX1 3QD, UK
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart & Andrew Sellers

Corresponding author

Correspondence to Tim Furche.

About this article

Cite this article

Furche, T., Gottlob, G., Grasso, G. et al. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal 22, 47–72 (2013). https://doi.org/10.1007/s00778-012-0286-6

Received: 10 February 2012
Revised: 04 June 2012
Accepted: 13 July 2012
Published: 10 October 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s00778-012-0286-6
