Extracting Table Information from the Web
As the Web has become ubiquitous, the volume of Web documents continues to grow rapidly. Because the Web is a vast source of information, extracting useful information from Web documents is important.
HTML (Hypertext Markup Language), a format for the visual rendering of Web documents, defines the <TABLE> tag for representing tables. However, most existing HTML documents use <TABLE> tags to control the formatting layout of a document. As a prerequisite for extracting information from the Web, it is therefore necessary to determine whether a given <TABLE> tag presents a genuine table.
Generally, a table is a device for presenting relational information in a structured and concise form. This paper defines a table as an array of relational data. Specifically, following previous work, we regard a table that relates attributes to their values as a genuine table. In this paper, the set of attribute cells and the set of value cells are called the attribute area and the value area, respectively.
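To make the two terms concrete, the following minimal sketch reads the cells of a <TABLE> element into a grid and splits it into an attribute area and a value area. It is an illustration only, not the identification method of this paper: the names TableGridParser and split_areas are hypothetical, nested tables are not handled, and the rule "the first row is the attribute area" is an assumption made purely for this example.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect the cell text of each <table> as a list of rows.

    Illustrative helper (hypothetical name); nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one grid (list of rows) per <table>
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> arrives as "table".
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def split_areas(grid):
    """Assumed convention for this example: first row = attribute area,
    remaining rows = value area."""
    if not grid:
        return [], []
    return grid[0], grid[1:]


html = """
<TABLE>
  <TR><TH>Name</TH><TH>Price</TH></TR>
  <TR><TD>Pen</TD><TD>1.20</TD></TR>
  <TR><TD>Ink</TD><TD>3.50</TD></TR>
</TABLE>
"""

parser = TableGridParser()
parser.feed(html)
attribute_area, value_area = split_areas(parser.tables[0])
print(attribute_area)  # ['Name', 'Price']
print(value_area)      # [['Pen', '1.20'], ['Ink', '3.50']]
```

In real documents the attribute area may equally be the first column, or may be absent altogether when <TABLE> is used only for layout, which is why a separate identification step is needed before any such split.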
Most previous work on table identification in HTML documents is restricted to a specific domain or requires large amounts of training data and time. This paper presents an efficient method for identifying tables in HTML documents as a step prior to extracting information from the Web.