Integrating Data from the Web by Machine-Learning Tree-Pattern Queries

  • Benjamin Habegger
  • Denis Debarbieux
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4275)


Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone. Constant changes to the web also mean that wrappers need to be constantly refactored. Machine learning has proven useful, but current techniques are either limited in expressivity, require non-intuitive user interaction, or do not allow for n-ary extraction. We study the use of tree patterns as an n-ary extraction language and propose an algorithm for learning such queries. It computes the most information-conservative tree pattern that generalizes two input trees. A notable aspect is that the approach can learn queries containing both child and descendant relationships between nodes. More importantly, the proposed approach requires no labeling other than the data the user actually wants to extract. The reported experiments show the effectiveness of the approach.
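
To make the idea of a tree pattern used as an n-ary extraction query more concrete, here is a minimal Python sketch. It is not the learning algorithm proposed in the paper: it only evaluates a hand-written pattern against a small tree, and it does not learn or generalize anything. The names `Node`, `Pattern`, `match`, and `extract`, the `"child"`/`"desc"` edge tags, and the toy document are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Optional


@dataclass
class Node:
    """A node of the document tree (hypothetical, simplified DOM)."""
    label: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


@dataclass
class Pattern:
    """A tree-pattern node; `var` marks a node whose text is extracted."""
    label: str                       # "*" matches any label
    var: Optional[str] = None
    # Each edge is (axis, sub-pattern) with axis "child" or "desc".
    edges: list[tuple[str, "Pattern"]] = field(default_factory=list)


def descendants(node):
    """Yield all proper descendants of `node` in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def match(pattern, node):
    """Return one binding dict per embedding of `pattern` rooted at `node`."""
    if pattern.label != "*" and pattern.label != node.label:
        return []
    base = {pattern.var: node.text} if pattern.var else {}
    per_edge = []
    for axis, sub in pattern.edges:
        candidates = node.children if axis == "child" else descendants(node)
        found = [b for c in candidates for b in match(sub, c)]
        if not found:                # every edge of the pattern must be satisfied
            return []
        per_edge.append(found)
    results = []
    for combo in product(*per_edge):  # combine one binding per edge
        merged = dict(base)
        for bindings in combo:
            merged.update(bindings)
        results.append(merged)
    return results


def extract(pattern, root, variables):
    """Evaluate `pattern` at every node and return the n-ary result tuples."""
    tuples = []
    for node in [root, *descendants(root)]:
        for bindings in match(pattern, node):
            tuples.append(tuple(bindings[v] for v in variables))
    return tuples


# Toy document: the price is nested in the first item and direct in the second.
doc = Node("list", children=[
    Node("item", children=[
        Node("title", "Book A"),
        Node("info", children=[Node("price", "10")]),
    ]),
    Node("item", children=[Node("title", "Book B"), Node("price", "12")]),
])

# Binary query: title is a child of item, price is any descendant of item.
pattern = Pattern("item", edges=[
    ("child", Pattern("title", var="title")),
    ("desc", Pattern("price", var="price")),
])

print(extract(pattern, doc, ("title", "price")))
# [('Book A', '10'), ('Book B', '12')]
```

In this sketch the descendant edge is what lets a single pattern cover both items even though the price sits at different depths; a pattern using only child edges would miss the first item.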


Information Extraction, Tree Pattern, Tree Automaton, Tree Transducer, Result Tuple
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Benjamin Habegger (1)
  • Denis Debarbieux (2)
  1. Dipartimento di Informatica e Sistemistica, Università di Roma 1 – “La Sapienza”, Roma, Italy
  2. LIFL, UMR 8022 CNRS, Lille University (France), Mostrare project, RU INRIA Futurs
