Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Web Data Extraction System

  • Robert Baumgartner
  • Wolfgang Gatterbauer
  • Georg Gottlob
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_1154

Synonyms

Web information extraction system; Web macros; Web scraper; Wrapper generator

Definition

A web data extraction system is a software system that automatically and repeatedly extracts data from web pages with changing content and delivers the extracted data to a database or some other application. The task of web data extraction performed by such a system is usually divided into five different functions: (i) Web interaction, which comprises mainly the navigation to usually pre-determined target web pages containing the desired information; (ii) Support for wrapper generation and execution, where a wrapper is a program that identifies the desired data on target pages, extracts the data and transforms it into a structured format; (iii) Scheduling, which allows repeated application of previously generated wrappers to their respective target pages; (iv) Data transformation, which includes filtering, transforming, refining, and integrating data extracted from one or more sources and...
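To make the notions of a wrapper and of data transformation more concrete, the following is a minimal sketch in Python using only the standard library. It is not taken from any of the systems discussed in this entry. The sample page, the CSS class names (`offer`, `name`, `price`), and the helper names `OfferWrapper`, `extract_offers`, and `transform` are all assumptions made for illustration. A real system would obtain the page through its web interaction component and would typically generate the wrapper visually or by induction rather than by hand.

```python
from html.parser import HTMLParser

# Hypothetical target page (assumption); a real system would fetch it
# during the web interaction step.
SAMPLE_PAGE = """
<table id="offers">
  <tr class="offer"><td class="name">Widget</td><td class="price">$ 9.50</td></tr>
  <tr class="offer"><td class="name">Gadget</td><td class="price">$ 12.00</td></tr>
</table>
"""


class OfferWrapper(HTMLParser):
    """Hand-written wrapper: identifies offer rows and extracts two fields."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per <tr class="offer">
        self._field = None   # name of the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "tr" and css_class == "offer":
            self.records.append({})
        elif tag == "td" and css_class in ("name", "price") and self.records:
            self._field = css_class

    def handle_data(self, data):
        # Only text inside a recognized cell is kept.
        if self._field is not None:
            row = self.records[-1]
            row[self._field] = row.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None


def extract_offers(html):
    """Wrapper execution: run the wrapper over one page, return raw records."""
    wrapper = OfferWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


def transform(records):
    """Data transformation: drop incomplete rows and normalize prices to floats."""
    cleaned = []
    for row in records:
        if "name" in row and "price" in row:
            price = float(row["price"].replace("$", "").strip())
            cleaned.append({"name": row["name"].strip(), "price": price})
    return cleaned


offers = transform(extract_offers(SAMPLE_PAGE))
```

In this sketch, scheduling would amount to calling `extract_offers` repeatedly on freshly fetched copies of the page, and delivery would hand `offers` to a database or another application.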


Recommended Reading

  1. Anupam V, Freire J, Kumar B, Lieuwen D. Automating web navigation with the WebVCR. Comput Netw. 2000;33(1–6):503–17.
  2. Baumgartner R, Flesca S, Gottlob G. Visual web information extraction with Lixto. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 119–28.
  3. Crescenzi V, Mecca G, Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 109–18.
  4. Etzioni O, Cafarella MJ, Downey D, Kok S, Popescu A, Shaked T, Soderland S, Weld DS, Yates A. Web-scale information extraction in KnowItAll: (preliminary results). In: Proceedings of the 12th International World Wide Web Conference; 2004. p. 100–10.
  5. Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B. Towards domain-independent information extraction from web tables. In: Proceedings of the 16th International World Wide Web Conference; 2007. p. 71–80.
  6. Gottlob G, Koch C. Monadic datalog and the expressive power of languages for web information extraction. J ACM. 2004;51(1):74–113.
  7. Gottlob G, Koch C. A formal comparison of visual web wrapper generators. In: Proceedings of the 32nd International Current Trends in Theory and Practice of Computer Science; 2006. p. 30–48.
  8. Kuhlins S, Tredwell R. Toolkits for generating wrappers: a survey of software toolkits for automated data extraction from websites. In: Objects, Components, Architectures, Services, and Applications for a Networked World. International Conference NetObjectDays; 2003.
  9. Kushmerick N, Weld DS, Doorenbos RB. Wrapper induction for information extraction. In: Proceedings of the 15th International Joint Conference on AI; 1997. p. 729–37.
  10. Laender AHF, Ribeiro-Neto BA, da Silva AS. DEByE – data extraction by example. Data Knowl Eng. 2000;40(2):121–54.
  11. Liu L, Pu C, Han W. XWRAP: an XML-enabled wrapper construction system for web information sources. In: Proceedings of the 16th International Conference on Data Engineering; 2000. p. 611–21.
  12. Liu B, Grossman RL, Zhai Y. Mining web pages for data records. IEEE Intell Syst. 2004;19(6):49–55.
  13. Muslea I, Minton S, Knoblock CA. Hierarchical wrapper induction for semistructured information sources. Auton Agents Multi-Agent Syst. 2001;4(1/2):93–114.
  14. Pan A, Raposo J, Álvarez M, Montoto P, Orjales V, Hidalgo J, Ardao L, Molano A, Viña Á. The Denodo data integration platform. In: Proceedings of the 28th International Conference on Very Large Data Bases; 2002.
  15. Sahuguet A, Azavant F. Building intelligent web applications using lightweight wrappers. Data Knowl Eng. 2001;36(3):283–316.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Robert Baumgartner (1)
  • Wolfgang Gatterbauer (2)
  • Georg Gottlob (3)

  1. Vienna University of Technology, Vienna, Austria
  2. University of Washington, Seattle, USA
  3. Computing Laboratory, Oxford University, Oxford, UK

Section editors and affiliations

  • Georg Gottlob (1)

  1. Computing Lab., Oxford Univ., Oxford, UK