Integrating Data from the Web by Machine-Learning Tree-Pattern Queries

Habegger, Benjamin; Debarbieux, Denis

doi:10.1007/11914853_59

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4275)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone. Constant changes in the web also imply that wrappers need to be constantly refactored. Machine learning has proven to be useful, but current techniques are either limited in expressivity, require non-intuitive user interaction, or do not allow for n-ary extraction. We study the use of tree-patterns as an n-ary extraction language and propose an algorithm for learning such queries. It calculates the most information-conservative tree-pattern that generalizes two input trees. A notable aspect is that the approach can learn queries containing both child and descendant relationships between nodes. More importantly, the proposed approach requires no labeling other than the data the user actually wants to extract. The reported experiments show the effectiveness of the approach.
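To make the notion of a tree-pattern used as an n-ary extraction query more concrete, here is a minimal sketch of how such a query could be evaluated on a document tree. It is not the algorithm of the paper and does not cover the learning step; it only illustrates a pattern whose edges are either child or descendant relationships and whose marked nodes yield one tuple per match. All names (`Node`, `Pattern`, `embeddings`, `extract`) and the sample page are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """A labeled document tree node (hypothetical stand-in for a parsed page)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


@dataclass
class Pattern:
    """A tree-pattern node; `axis` relates it to its parent pattern node."""
    label: str                    # label to match, or "*" for any label
    axis: str = "child"           # "child" or "descendant"
    var: Optional[str] = None     # field name if this node is extracted
    children: List["Pattern"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield all proper descendants of `node` in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def embeddings(p: Pattern, n: Node) -> Iterator[Dict[str, str]]:
    """Yield one variable binding per way of embedding pattern `p` at node `n`."""
    if p.label != "*" and p.label != n.label:
        return
    partial = [{p.var: n.text}] if p.var else [{}]
    for sub_pattern in p.children:
        # The axis decides which document nodes may match the sub-pattern.
        if sub_pattern.axis == "child":
            candidates = list(n.children)
        else:
            candidates = list(descendants(n))
        partial = [
            {**binding, **sub}
            for binding in partial
            for cand in candidates
            for sub in embeddings(sub_pattern, cand)
        ]
    yield from partial


def extract(p: Pattern, root: Node) -> List[Dict[str, str]]:
    """Apply the pattern at every node of the tree and collect all tuples."""
    results: List[Dict[str, str]] = []
    for n in [root, *descendants(root)]:
        results.extend(embeddings(p, n))
    return results


# Assumed sample page: a list of (name, value) records.
page = Node("ul", [
    Node("li", [Node("a", text="Alice"), Node("span", [Node("b", text="42")])]),
    Node("li", [Node("a", text="Bob"), Node("span", [Node("b", text="17")])]),
])

# Binary query: an <li> with an <a> child and a <b> descendant.
query = Pattern("li", children=[
    Pattern("a", axis="child", var="name"),
    Pattern("b", axis="descendant", var="value"),
])

print(extract(query, page))
```

On the sample page the query returns one (name, value) pair per list item. The `b` element is reached through a descendant edge, so the pattern does not need to mention the intermediate `span`; with a child edge instead, the same pattern would match nothing.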

An erratum to this chapter can be found at http://dx.doi.org/10.1007/11914853_71.

Author information

Authors and Affiliations

Dipartimento di Informatica e Sistemistica, Università di Roma 1 – “La Sapienza”, 00198, Roma, Italy
Benjamin Habegger
LIFL, UMR 8022 CNRS, Lille University (France), Mostrare project, RU INRIA Futurs
Denis Debarbieux

Editor information

Editors and Affiliations

STARLab, Vrije Universiteit Brussel (VUB), Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, 3001, Melbourne, VIC, Australia
Zahir Tari

About this paper

Cite this paper

Habegger, B., Debarbieux, D. (2006). Integrating Data from the Web by Machine-Learning Tree-Pattern Queries. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE. OTM 2006. Lecture Notes in Computer Science, vol 4275. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11914853_59

DOI: https://doi.org/10.1007/11914853_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48287-1
Online ISBN: 978-3-540-48289-5
eBook Packages: Computer Science, Computer Science (R0)