Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web
- Cite this paper as:
- Xiao L., Wissmann D., Brown M., Jablonski S. (2001) Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web. In: Monostori L., Váncza J., Ali M. (eds) Engineering of Intelligent Systems. IEA/AIE 2001. Lecture Notes in Computer Science, vol 2070. Springer, Berlin, Heidelberg
This paper describes Information Extraction for applications that automatically fill templates from HTML documents. We developed a complete system to extract information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.
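To make the idea of keyword-driven template filling concrete, here is a minimal sketch in Python. It is not the authors' system: the rules are written by hand rather than learned, the names `TextCollector`, `fill_template`, and `keyword_rules` are hypothetical, and the only "spatial relation" it models is the assumption that a slot's value is the text fragment immediately following its keyword in document order.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text fragments of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(text)


def fill_template(html, keyword_rules):
    """Fill each slot with the text fragment that follows its keyword.

    keyword_rules maps a slot name to a keyword label. This rule format is
    an assumption made for illustration, not the format used in the paper.
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    fragments = collector.fragments

    template = {slot: None for slot in keyword_rules}
    for slot, keyword in keyword_rules.items():
        # Look at every fragment except the last, since a keyword needs a successor.
        for i, fragment in enumerate(fragments[:-1]):
            if fragment.rstrip(":").lower() == keyword.lower():
                template[slot] = fragments[i + 1]
                break
    return template


# Hypothetical input page and hand-written rules.
page = (
    "<table>"
    "<tr><td>Speaker:</td><td>Jane Doe</td></tr>"
    "<tr><td>Time:</td><td>3:30 pm</td></tr>"
    "</table>"
)
rules = {"speaker": "Speaker", "start_time": "Time"}
result = fill_template(page, rules)
print(result)
```

Slots whose keyword does not appear in the page stay `None`, so a partially filled template is still returned rather than raising an error.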