WebSelF: A Web Scraping Framework

  • Jakob G. Thomsen
  • Erik Ernst
  • Claus Brabrand
  • Michael Schwartzbach
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7387)

Abstract

We present WebSelF, a framework for web scraping that models the scraping process and decomposes it into four conceptually independent, reusable, and composable constituents. We have validated the framework through a fully parameterized implementation that is flexible enough to capture previous work on web scraping. We conducted an experiment that evaluated several qualitatively different web scraping constituents (including previous work and combinations thereof) on about 11,000 HTML pages, drawn from daily versions of 17 web sites over a period of more than one year. Our framework solves three concrete problems with current web scraping, and our experimental results indicate that composing previous techniques with our new ones achieves a higher degree of accuracy, precision, and specificity than existing techniques alone.

Keywords

Selection Function; Validation Function; News Site; XPath Expression; Fast Path

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Jakob G. Thomsen (1)
  • Erik Ernst (1)
  • Claus Brabrand (2)
  • Michael Schwartzbach (1)

  1. Aarhus University, Denmark
  2. IT University of Copenhagen, Denmark
