Rule-Based Table Analysis and Interpretation
Today, a huge number of tables are presented in web pages, word documents, and spreadsheets. Many of them contain unstructured tabular data: they are intended to be understood by humans, not interpreted by machines. At the same time, we often need that information in a structured form, e.g. in relational databases. We propose a rule-based approach to table analysis and interpretation and demonstrate how it can be applied to transform tabular data from an unstructured form (spreadsheets) to a structured form (relational databases). The paper discusses the representation of tabular data as facts in the working memory of a rule engine, a formal language for defining rules of table analysis and interpretation, and its implementation.
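The idea of treating cells as facts and interpreting them with rules can be illustrated with a minimal sketch. It is not the rule language or the rule engine described in the paper: the `Cell` schema, the three rules, and the naive `run` loop are all hypothetical, and the sketch assumes the simplest possible layout, in which the first row holds column labels and every other row holds entries.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A spreadsheet cell asserted as a fact (hypothetical schema)."""
    row: int
    col: int
    text: str
    role: str = "unknown"  # set to "label" or "entry" once a rule fires
    labels: list = field(default_factory=list)  # label texts attached to an entry


# Each rule has a condition ("when") and an action ("then").
# It returns True if it fired, i.e. if it changed the fact.

def rule_first_row_is_label(cell, facts):
    # Assumption: cells in the top row are column labels.
    if cell.row == 0 and cell.role == "unknown":
        cell.role = "label"
        return True
    return False


def rule_other_rows_are_entries(cell, facts):
    if cell.row > 0 and cell.role == "unknown":
        cell.role = "entry"
        return True
    return False


def rule_attach_column_label(cell, facts):
    # Associate an entry with the label found in the same column.
    if cell.role == "entry" and not cell.labels:
        for other in facts:
            if other.role == "label" and other.col == cell.col:
                cell.labels.append(other.text)
                return True
    return False


RULES = [rule_first_row_is_label, rule_other_rows_are_entries, rule_attach_column_label]


def run(facts):
    """Naive forward chaining: apply rules until no rule fires."""
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            for cell in facts:
                if rule(cell, facts):
                    changed = True
    return facts


def to_records(facts):
    """Turn interpreted entries into relational-style rows, one dict per table row."""
    records = {}
    for cell in facts:
        if cell.role == "entry" and cell.labels:
            records.setdefault(cell.row, {})[cell.labels[0]] = cell.text
    return [records[r] for r in sorted(records)]


def facts_from_grid(grid):
    """Assert every cell of a list-of-lists grid as a fact."""
    return [Cell(r, c, text) for r, row in enumerate(grid) for c, text in enumerate(row)]
```

For a grid such as `[["name", "qty"], ["apple", "3"]]`, calling `to_records(run(facts_from_grid(grid)))` yields one record per data row, keyed by the column labels. Real tables with multi-level headers or merged cells would need additional rules; in this sketch only the rule set changes between layouts, while the facts and the loop stay the same.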
Keywords: Table analysis and interpretation; Table understanding; Information extraction from tables; Unstructured tabular data integration
The research work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042) and the Council for grants of the President of the Russian Federation (Grant No. SP-3387.2013.5).