Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web

  • Luo Xiao
  • Dieter Wissmann
  • Michael Brown
  • Stefan Jablonski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2070)


This paper describes Information Extraction (IE) for applications that automatically fill templates from HTML documents. We developed a complete system to extract information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Luo Xiao (1)
  • Dieter Wissmann (1)
  • Michael Brown (2)
  • Stefan Jablonski (3)

  1. Interprice Technologies GmbH, Berlin, Germany
  2. Dept. of Computer Sciences VI (IMMD VI), University of Erlangen-Nuremberg, Germany
  3. Siemens AG, Erlangen, Germany
