Encyclopedia of Database Systems

2009 Edition

Web Data Extraction System

  • Robert Baumgartner
  • Wolfgang Gatterbauer
  • Georg Gottlob
Reference work entry
DOI: https://doi.org/10.1007/978-0-387-39940-9_1154



A web data extraction system is a software system that automatically and repeatedly extracts data from web pages with changing content and delivers the extracted data to a database or some other application. The task of web data extraction performed by such a system is usually divided into five different functions: (i) web interaction, which mainly comprises navigation to the (usually pre-determined) target web pages containing the desired information; (ii) support for wrapper generation and execution, where a wrapper is a program that identifies the desired data on target pages, extracts the data, and transforms it into a structured format; (iii) scheduling, which allows repeated application of previously generated wrappers to their respective target pages; (iv) data transformation, which includes filtering, transforming, refining, and integrating data extracted from one or more sources...
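
As a rough illustration of functions (ii) and (iv), the following minimal sketch shows what a hand-written wrapper and a simple transformation step might look like. It is not the method of any system named in this entry: the class `TableWrapper`, the functions `extract_records` and `transform`, the sample page, and its `title` and `price` columns are all hypothetical, and the page is an inline string rather than a fetched URL. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hypothetical wrapper: collects the cell text of every <tr> on a page."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Identify table rows, extract the cells, and return structured records."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    # Assumption: the first row holds the column names.
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def transform(records):
    """Transformation step: drop rows without a price, convert prices to float."""
    cleaned = []
    for rec in records:
        if not rec.get("price"):
            continue
        cleaned.append({"title": rec["title"], "price": float(rec["price"])})
    return cleaned


# Hypothetical target page, inlined so the sketch runs without network access.
SAMPLE_PAGE = """
<table>
  <tr><th>title</th><th>price</th></tr>
  <tr><td>Book A</td><td>12.50</td></tr>
  <tr><td>Book B</td><td></td></tr>
  <tr><td>Book C</td><td>8.00</td></tr>
</table>
"""

records = transform(extract_records(SAMPLE_PAGE))
print(records)
```

In a full system, web interaction would supply the page content and a scheduler would re-run the wrapper as the page changes. Both are left out here to keep the sketch short.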


Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Robert Baumgartner (1, 2)
  • Wolfgang Gatterbauer (3)
  • Georg Gottlob (4)
  1. Vienna University of Technology, Vienna, Austria
  2. Lixto Software GmbH, Vienna, Austria
  3. University of Washington, Seattle, USA
  4. Oxford University, Oxford, UK