The VLDB Journal, Volume 22, Issue 1, pp 47–72

OXPath: A language for scalable data extraction, automation, and crawling on the deep web

  • Tim Furche
  • Georg Gottlob
  • Giovanni Grasso
  • Christian Schallhart
  • Andrew Sellers
Special Issue Paper

Abstract

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
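The idea behind page-at-a-time evaluation can be illustrated with a minimal sketch. It is not OXPath's actual algorithm or API: the `SITE` dictionary, `load_page`, and `extract_stream` are hypothetical names introduced only for this illustration, and a toy in-memory "site" stands in for pages rendered in a browser. The sketch shows the property the abstract describes: records are emitted as soon as they are extracted, so only the current page is held at any time, however many pages are visited.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a rendered page:
# (records found on the page, key of the next page or None).
Page = Tuple[List[str], Optional[str]]

# Toy paginated "site" (an assumption for illustration only);
# a real extraction run would render each page in a browser.
SITE: Dict[str, Page] = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c"], "p3"),
    "p3": (["d", "e"], None),
}


def load_page(key: str) -> Page:
    """Stand-in for fetching and rendering one page (the costly step)."""
    return SITE[key]


def extract_stream(start: str) -> Iterator[str]:
    """Yield records page by page, keeping only the current page in memory."""
    key: Optional[str] = start
    while key is not None:
        records, next_key = load_page(key)  # only this page is held
        for record in records:
            yield record  # emitted immediately, never accumulated
        key = next_key  # the previous page can now be discarded


if __name__ == "__main__":
    for record in extract_stream("p1"):
        print(record)
```

Because the generator never accumulates pages or results, its memory footprint depends on the size of a single page rather than on the length of the crawl; the caller decides whether to store or stream the yielded records.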

Keywords

Web extraction, Crawling, Data extraction, Automation, XPath, DOM, AJAX, Web applications

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Tim Furche (1)
  • Georg Gottlob (1)
  • Giovanni Grasso (1)
  • Christian Schallhart (1)
  • Andrew Sellers (1)

  1. Department of Computer Science, Oxford University, Oxford, UK
