Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web

  • Luo Xiao
  • Dieter Wissmann
  • Michael Brown
  • Stefan Jablonski
Conference paper

DOI: 10.1007/3-540-45517-5_20

Part of the Lecture Notes in Computer Science book series (LNCS, volume 2070)
Cite this paper as:
Xiao L., Wissmann D., Brown M., Jablonski S. (2001) Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web. In: Monostori L., Váncza J., Ali M. (eds) Engineering of Intelligent Systems. IEA/AIE 2001. Lecture Notes in Computer Science, vol 2070. Springer, Berlin, Heidelberg

Abstract

This paper describes information extraction (IE) for applications that automatically fill templates from HTML input documents. We developed a complete system to extract information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords needed to locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Luo Xiao
    • 1
  • Dieter Wissmann
    • 1
  • Michael Brown
    • 2
  • Stefan Jablonski
    • 3
  1. Interprice Technologies GmbH, Berlin, Germany
  2. Dept. of Computer Sciences VI (IMMD VI), University of Erlangen-Nuremberg, Germany
  3. Siemens AG, Erlangen, Germany
