Little Knowledge Rules the Web: Domain-Centric Result Page Extraction

Furche, Tim; Gottlob, Georg; Grasso, Giovanni; Orsi, Giorgio; Schallhart, Christian; Wang, Cheng

doi:10.1007/978-3-642-23580-1_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6902)

Included in the following conference series:

International Conference on Web Reasoning and Rule Systems

Abstract

Web extraction is the task of turning unstructured HTML into structured data. Previous approaches rely exclusively on detecting repeated structures in result pages. These approaches trade intensive user interaction for precision.

In this paper, we introduce the Amber (“Adaptable Model-based Extraction of Result Pages”) system that replaces the human interaction with a domain ontology applicable to all sites of a domain. It models domain knowledge about (1) records and attributes of the domain, (2) low-level (textual) representations of these concepts, and (3) constraints linking representations to records and attributes. Parametrized with these constraints, otherwise domain-independent heuristics exploit the repeated structure of result pages to derive attributes and records. Amber is implemented in logical rules to allow an explicit formulation of the heuristics and easy adaptation to different domains.
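The three kinds of domain knowledge listed above can be illustrated with a minimal Python sketch. It is not Amber's implementation (which, per the abstract, is written as logical rules), and every name, regular expression, and constraint in it is a hypothetical stand-in: `ATTRIBUTE_PATTERNS` plays the role of the textual representations, `MANDATORY` the role of a constraint linking attributes to records, and the input `blocks` is assumed to be the already-detected repeated page segments.

```python
import re

# (2) Hypothetical textual representations of real-estate attributes.
ATTRIBUTE_PATTERNS = {
    "price": re.compile(r"£\s?\d[\d,]*"),
    "bedrooms": re.compile(r"\b(\d+)\s+bed(room)?s?\b", re.IGNORECASE),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}\b"),
}

# (3) Hypothetical constraint: a record must carry every mandatory attribute.
MANDATORY = {"price"}


def annotate(text):
    """Return {attribute: first matching substring} for one text fragment."""
    found = {}
    for name, pattern in ATTRIBUTE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(0)
    return found


def extract_records(blocks):
    """Turn repeated page blocks into records.

    blocks: list of lists of text fragments, one inner list per repeated
    block (a candidate record). Blocks that lack a mandatory attribute
    are discarded as noise (navigation, footers, and so on).
    """
    records = []
    for fragments in blocks:
        attrs = {}
        for fragment in fragments:
            for name, value in annotate(fragment).items():
                attrs.setdefault(name, value)  # keep the first occurrence
        if MANDATORY.issubset(attrs):
            records.append(attrs)
    return records


blocks = [
    ["3 bedroom house", "£250,000", "Oxford OX1 3QD"],
    ["Contact us", "Opening hours"],
    ["2 bed flat", "£180,000"],
]
records = extract_records(blocks)
```

In this toy run the second block is dropped because it contains no price, which is the sense in which a small amount of domain knowledge replaces manual labelling of records.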

We apply Amber to the UK real estate domain, where we achieve near-perfect accuracy on a representative sample of 50 agency websites.

Author information

Authors and Affiliations

Department of Computer Science, University of Oxford, UK
Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart & Cheng Wang

Editor information

Editors and Affiliations

Karlsruhe Institute of Technology, Institute AIFB, Karlsruhe, 76128, Germany
Sebastian Rudolph
Computer Science Department, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile
Claudio Gutierrez

Cite this paper

Furche, T., Gottlob, G., Grasso, G., Orsi, G., Schallhart, C., Wang, C. (2011). Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. In: Rudolph, S., Gutierrez, C. (eds) Web Reasoning and Rule Systems. RR 2011. Lecture Notes in Computer Science, vol 6902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23580-1_6

DOI: https://doi.org/10.1007/978-3-642-23580-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23579-5
Online ISBN: 978-3-642-23580-1
eBook Packages: Computer Science, Computer Science (R0)