Extracting Table Information from the Web

  • Yeon-Seok Kim
  • Kyong-Ho Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3163)

Abstract

With the ubiquity of the Web, the volume of Web documents continues to grow rapidly. Because the Web is a vast source of information, extracting useful information from Web documents is important.

HTML (Hypertext Markup Language), a format for the visual rendering of Web documents, defines the <TABLE> tag for representing tables. However, most existing HTML documents use <TABLE> tags to control the layout of a page rather than to present tabular data. As a prerequisite for information extraction from the Web, it is therefore necessary to determine whether a given <TABLE> tag presents a genuine table.

Generally, a table is a facility for presenting relational information structurally and concisely. This paper defines a table as an array of relational data. Specifically, following previous work, we regard a table that relates attributes to their values as a genuine table. The set of attribute cells and the set of value cells are called the attribute area and the value area, respectively.
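To make the attribute area and value area concrete, the following minimal Python sketch splits an HTML table into the two. It is an illustration only and not the method proposed in the paper: the names `TableGrid` and `split_areas` are hypothetical, and the rule that a first row made up entirely of <TH> cells forms the attribute area is a simplifying assumption. Tables nested inside a cell are ignored.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect cells of outermost <table> elements as rows of [tag, text]."""

    def __init__(self):
        super().__init__()
        self.rows = []        # each row is a list of [tag, text] cells
        self.depth = 0        # nesting depth of <table> elements
        self.in_cell = False  # True while inside an outermost-table cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> arrives as "table".
        if tag == "table":
            self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.rows.append([])
        elif self.depth == 1 and tag in ("th", "td") and self.rows:
            self.rows[-1].append([tag, ""])
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif self.depth == 1 and tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tables (depth > 1) is skipped.
        if self.in_cell and self.depth == 1:
            self.rows[-1][-1][1] += data.strip()


def split_areas(html):
    """Return (attribute_area, value_area) for the parsed table cells.

    Assumption (illustrative only): the attribute area is the first row
    when all of its cells are <th>; otherwise no attribute area is found.
    """
    parser = TableGrid()
    parser.feed(html)
    rows = [row for row in parser.rows if row]
    if not rows:
        return [], []
    if all(tag == "th" for tag, _ in rows[0]):
        attribute_area = [text for _, text in rows[0]]
        value_area = [[text for _, text in row] for row in rows[1:]]
        return attribute_area, value_area
    return [], [[text for _, text in row] for row in rows]


sample = (
    "<TABLE><TR><TH>City</TH><TH>Country</TH></TR>"
    "<TR><TD>Seoul</TD><TD>Korea</TD></TR></TABLE>"
)
print(split_areas(sample))  # (['City', 'Country'], [['Seoul', 'Korea']])
```

An empty attribute area under this rule does not prove that a <TABLE> is used for layout; it only shows that this simple heuristic found no header row.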

Most previous work on table identification in HTML documents is tied to a specific domain or requires large amounts of training data and time. This paper presents an efficient method for identifying tables in HTML documents prior to extracting information from the Web.

References

  1. Wang, Y., Hu, J.: Detecting Tables in HTML Documents. In: Lopresti, D.P., Hu, J., Kashi, R.S. (eds.) DAS 2002. LNCS, vol. 2423, pp. 249–260. Springer, Heidelberg (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Yeon-Seok Kim (1)
  • Kyong-Ho Lee (1)
  1. Dept. Computer Science, Yonsei University, Seoul, Korea