Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2070)

Abstract

This paper describes information extraction (IE) for applications that automatically fill templates from HTML documents. We developed a complete system to extract information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.
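As a rough illustration of what template filling from HTML can look like, the sketch below uses keywords to find a label cell in a table and takes the cell to its right as the slot value, which is one very simple kind of spatial relation. It is not the authors' system and it learns nothing: the slot names, the `KEYWORDS` table, the `CellCollector` and `fill_template` helpers, and the sample page are all assumptions made for this example.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hypothetical template: slot name -> keywords that signal the slot's label.
KEYWORDS = {
    "speaker": ("speaker", "presenter"),
    "location": ("room", "location", "place"),
}


def fill_template(html, keywords=KEYWORDS):
    """Fill each slot with the cell immediately to the right of a keyword cell."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    template = {slot: None for slot in keywords}
    for row in parser.rows:
        cells = [c.strip() for c in row]
        # The last cell has no right-hand neighbour, so it cannot be a label.
        for i, cell in enumerate(cells[:-1]):
            label = cell.lower()
            for slot, words in keywords.items():
                if template[slot] is None and any(w in label for w in words):
                    template[slot] = cells[i + 1]
    return template


# Made-up sample page, used only to exercise the sketch.
PAGE = """
<table>
  <tr><th>Speaker:</th><td>Ada Lovelace</td></tr>
  <tr><th>Room:</th><td>B 101</td></tr>
</table>
"""

print(fill_template(PAGE))
```

In this sketch the keyword table and the "value is the next cell" rule are written by hand; a system of the kind the abstract describes would presumably learn such rules and relations from examples instead.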

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xiao, L., Wissmann, D., Brown, M., Jablonski, S. (2001). Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web. In: Monostori, L., Váncza, J., Ali, M. (eds) Engineering of Intelligent Systems. IEA/AIE 2001. Lecture Notes in Computer Science, vol 2070. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45517-5_20

  • DOI: https://doi.org/10.1007/3-540-45517-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42219-8

  • Online ISBN: 978-3-540-45517-2

  • eBook Packages: Springer Book Archive
