Engineering of Intelligent Systems

Volume 2070 of the series Lecture Notes in Computer Science, pp. 165-174


Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web

  • Luo Xiao (Interprice Technologies GmbH)
  • Dieter Wissmann (Interprice Technologies GmbH)
  • Michael Brown (Dept. of Computer Sciences VI (IMMD VI), University of Erlangen-Nuremberg)
  • Stefan Jablonski (Siemens AG)


This paper describes information extraction (IE) for applications that automatically fill templates from HTML documents. We developed a complete system to extract information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.
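
To make the idea of template filling concrete, the following is a minimal sketch and not the system described in the paper. It assumes that each value sits in the table cell immediately to the right of a cell containing a keyword, which stands in for one simple spatial relation. The keyword lists, the slot names, the sample HTML, and the helper names `CellCollector` and `fill_template` are all hypothetical and hand-written here; the paper's system learns such rules rather than hard-coding them.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell texts
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell is ignored in this sketch.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalise runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def fill_template(html, keywords):
    """Fill each slot with the cell to the right of the first keyword match.

    `keywords` maps a slot name to a list of lowercase keyword strings.
    Slots with no match stay None.
    """
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    template = {slot: None for slot in keywords}
    for row in parser.rows:
        for i, cell in enumerate(row[:-1]):
            label = cell.lower()
            for slot, words in keywords.items():
                if template[slot] is None and any(w in label for w in words):
                    template[slot] = row[i + 1]
    return template


# Hypothetical keyword rules and input page, for illustration only.
HYPOTHETICAL_KEYWORDS = {"price": ["price", "cost"], "title": ["title", "name"]}
SAMPLE_HTML = """
<table>
  <tr><td>Title:</td><td>Pocket Atlas</td></tr>
  <tr><td>Price:</td><td>12.50 EUR</td></tr>
</table>
"""

result = fill_template(SAMPLE_HTML, HYPOTHETICAL_KEYWORDS)
print(result)
```

A fixed "cell to the right" rule breaks as soon as a site lays out its data differently, which is one reason to learn document structure and spatial relations from examples instead of writing them by hand.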