Table Extraction from Text Documents
Tables appear in text documents in almost every form imaginable, from simple lists to nested, hierarchical, and multidimensional layouts. They are primarily designed for human consumption and therefore can require a wide variety of visual cues and interpretive capabilities to be fully understood. This chapter deals with the challenges machines face when attempting to process and understand tables, along with state-of-the-art methods and performance on this task.
Keywords: Text Document Simple List Table Extraction Gold Standard Reference Table Object
The objective of table extraction is to convert human-focused notation into a logical, machine-readable, and machine-understandable form. This task is closely related to, and can be viewed as a subproblem of, document structure extraction. It is generally considered a higher-level natural language processing problem that requires a pipeline of capabilities to address.
Motivation and Background
Tables appear in text documents in almost every form imaginable, from simple lists to nested, hierarchical, and multidimensional layouts. They are primarily designed for human consumption and therefore can require a wide variety of visual cues and interpretive capabilities to be fully understood. In fact, the assumption of human consumption allows for a breadth of content presentation that is practically limitless. Often, critical information that is relevant to the interpretation is assumed, provided in short-hand notation, or inferred from other aspects of the content or layout.
Given that the problem of table extraction is motivated by the presence of electronic documents, it has only been formally studied since the early 1990s, as the prevalence of computer-based document storage, editing, and retrieval increased (Laurentini and Viada 1992; Guthrie et al. 1993). With hundreds of document formats, layout preferences, and established customs for data interchange, the problem has only become worse at web scale, with very few document originators choosing machine-readable syntax over visual layouts (i.e., drawing).
This article explores the genesis of the problem domain, how to formalize and break down the various tasks involved in building a table extraction solution, and the methodologies generally used.
There have been several attempts at formalizing the table model and notation. Some of these were designed independently of automatic table extraction research (Association of American Publishers 1986) and pertain to best practices for tabular data design. Computationally driven table models generally refer to the widely used Augmented Wang Notation (Wang 1996), which specifies a hierarchical schema for describing types, classes, and relations among cells. Common table models are necessary for the interoperation of different stages of the extraction pipeline, as well as for evaluating different approaches against the same gold standard reference data (Govindaraju et al. 2013). As in most machine learning pipelines, it is often convenient to isolate component parts for algorithmic development and testing.
Approaches to the evaluation of table extraction techniques vary widely and can be viewed from multiple perspectives: document level, table level, access level, and cell level. Each stage of the extraction pipeline can be evaluated separately, or one can look at measures of overall goal achievement. Long (2010) adopts a multilevel structural evaluation approach, which can be particularly informative.
The recent literature is comparatively sparse, and research attention has declined steadily since the early 2000s. In spite of this, table extraction has no broadly adopted solution. The field is fragmented, and the problem is often treated as a practitioner’s concern within larger systems. However, demand from certain industries (e.g., finance) and the rise of web-scale information extraction have led to a renewed focus on these technologies in both research and applied settings (Mitchell et al. 2015).
Structure of the Learning System
We will consider each of the logical steps that form part of a complete table extraction system. Hurst (2000) and Fang et al. (2012) both propose pipelines that allow for the evaluation of discrete capabilities. Starting from a raw text document, each subsequent pass adds more and more structure, getting us closer to the final goal – a disambiguated relational table object. Approaches at each stage can consider not only textual features but also layout and other visual cues. In fact, it is often the case that table extraction techniques on text documents will use a variety of methodologies from the computer vision community.
Table Detection
Given a text document, the objective is to identify whether or not it contains a table object. It should be possible to signal when a document contains multiple such distinct objects and their rough contiguous location (Kornfeld and Wattecamps 1998). Often, this step is combined with the next (boundary detection) to perform joint detection and delineation of tabular areas.
In the case where detection is performed in isolation, it may be viewed as a binary classification, sequence labeling, or clustering task over the document. Lopresti et al. (2000) approached this problem from a text density/clustering perspective over single-columned ASCII text documents, though more recent efforts in industrial applications tend to benefit from cascading binary (SVM) classifiers or random forest approaches.
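The classification framing can be illustrated with a minimal sketch, assuming plain ASCII input. The features (`gaps`, `digit_ratio`), the function names, and the hand-set thresholds are illustrative assumptions standing in for a trained classifier such as an SVM or random forest; they are not taken from the cited work.

```python
import re


def line_features(line: str) -> dict:
    """Compute simple layout features for one line of plain text."""
    stripped = line.strip()
    # A run of two or more spaces between non-space characters often
    # separates columns in ASCII tables.
    gaps = len(re.findall(r"\S {2,}(?=\S)", stripped))
    digits = sum(ch.isdigit() for ch in stripped)
    non_space = sum(not ch.isspace() for ch in stripped)
    return {
        "gaps": gaps,
        "digit_ratio": digits / non_space if non_space else 0.0,
    }


def looks_tabular(line: str, min_gaps: int = 2) -> bool:
    """Hypothetical stand-in for a trained binary classifier.

    Only the gap count is used here; a learned model would weigh all
    features (and many more) jointly.
    """
    return line_features(line)["gaps"] >= min_gaps


def contains_table(document: str, min_rows: int = 2) -> bool:
    """Flag a document once enough consecutive lines look tabular."""
    run = 0
    for line in document.splitlines():
        run = run + 1 if looks_tabular(line) else 0
        if run >= min_rows:
            return True
    return False
```

With `min_gaps=2`, this rule only fires on lines with at least three columns, which keeps ordinary prose with occasional double spaces from triggering it. A real system would learn such trade-offs from labeled data.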
Table Boundary Identification
Table boundary identification recognizes the boundaries of detected tables so that they can be isolated from the surrounding information. Laurentini and Viada (1992) make use of the Hough transform to identify connected shapes and components that represent the margins of tables. These must be separated from charts, images, and other visual components, which is the aim of the table detection step mentioned above. The identification of table boundaries can also benefit the table detection task by providing additional structural features on which to predicate the distinction from other visual objects in a document.
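For plain text input, boundary delineation can be sketched as grouping consecutive table-like lines into spans. This is a deliberately simple text analogue, not the Hough transform approach cited above; the `has_columns` predicate and the `min_rows` threshold are hypothetical placeholders for whatever detector is in use.

```python
from typing import Callable, List, Tuple


def table_spans(
    lines: List[str],
    is_tabular: Callable[[str], bool],
    min_rows: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start, end) line indices, end exclusive, of table-like runs."""
    spans = []
    start = None
    for i, line in enumerate(lines):
        if is_tabular(line):
            if start is None:
                start = i  # a new candidate table begins here
        else:
            if start is not None and i - start >= min_rows:
                spans.append((start, i))
            start = None
    # Close a run that extends to the end of the document.
    if start is not None and len(lines) - start >= min_rows:
        spans.append((start, len(lines)))
    return spans


def has_columns(line: str) -> bool:
    """Toy predicate (an assumption): at least three whitespace-separated
    fields and at least one gap of two or more spaces."""
    return "  " in line.strip() and len(line.split()) >= 3
```

Requiring a minimum run length discards isolated lines that merely resemble a table row, at the cost of missing single-row tables.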
For each recognized table, the next step is to identify the column and row structure so that each cell can be uniquely identified. In practice, the methods applied to this task mirror those of table boundary identification. However, there are additional constraints that often make it worthwhile to consider this step separately. For example, tables are structurally constrained to maintain linear relationships among cells – rows and columns must remain broadly coherent. Furthermore, the task may be recursive, where tables contain tables or other structural items as inserts. It is important that this step provide the most accurate microstructure possible. As such, it can often be beneficial to use measures of content coherence when merging or splitting neighboring cells, while at the same time optimizing overall coherence.
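One common way to recover column structure from fixed-width text is a vertical projection profile: character positions that are blank in every row are treated as column separators. The sketch below assumes space-aligned rows with no tabs and no nested tables; the function names are illustrative.

```python
from typing import List, Tuple


def column_spans(rows: List[str]) -> List[Tuple[int, int]]:
    """Find column spans as maximal runs of character positions that are
    non-blank in at least one row (a vertical projection profile)."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    occupied = [any(r[i] != " " for r in padded) for i in range(width)]
    spans, start = [], None
    for i, used in enumerate(occupied):
        if used and start is None:
            start = i
        elif not used and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def split_cells(rows: List[str], spans: List[Tuple[int, int]]) -> List[List[str]]:
    """Cut every row at the detected spans and trim each cell."""
    width = spans[-1][1]
    return [[row.ljust(width)[a:b].strip() for a, b in spans] for row in rows]


rows = [
    "{:<9}{:<6}{}".format(*cells)
    for cells in [("Region", "2014", "2015"),
                  ("North", "10", "12"),
                  ("South", "7", "9")]
]
grid = split_cells(rows, column_spans(rows))
```

This sketch over-splits when every row happens to share a blank position inside a multi-word cell. That is exactly the situation where the content coherence measures mentioned above help decide whether to merge neighboring cells.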
The logical definition of a table is that of a set of associated keys and values. Headers, or groupings of headers, represent keys, which are arranged along the axes of a table. The intersections of these headers locate the values of interest. Headers provide the information necessary to understand the type of data as well as to uniquely pinpoint the location of each value. Functional classification, therefore, identifies for each cell whether it represents a key or a value (Liu 2009).
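The key/value view can be made concrete with a small sketch. The content heuristic used here (numeric cells are values, everything else is a key) and the assumption that keys occupy the first row and first column are simplifications chosen for illustration, not the method of the cited work.

```python
import re
from typing import Dict, List, Tuple

# Matches plain numbers such as "950", "1,200", "-3.5", or "12.5%".
NUMERIC = re.compile(r"^[-+]?[\d,]*\.?\d+%?$")


def is_value(cell: str) -> bool:
    """Heuristic: a cell holding only a number is treated as a value."""
    return bool(NUMERIC.match(cell.strip()))


def classify_cells(grid: List[List[str]]) -> List[List[str]]:
    """Label every cell as 'key' or 'value' from its content alone."""
    return [["value" if is_value(c) else "key" for c in row] for row in grid]


def to_key_value(grid: List[List[str]]) -> Dict[Tuple[str, str], str]:
    """Index each value by its (row key, column key) pair.

    Assumes a simple layout: keys in the first row and first column.
    """
    header = grid[0][1:]
    table = {}
    for row in grid[1:]:
        for col_key, cell in zip(header, row[1:]):
            table[(row[0], col_key)] = cell
    return table


grid = [
    ["Region", "Revenue", "Costs"],
    ["North", "1,200", "800"],
    ["South", "950", "700"],
]
labels = classify_cells(grid)
lookup = to_key_value(grid)
```

A content-only rule breaks down quickly: a header such as a year would be labeled a value. Position, formatting, and neighboring cells are the kinds of additional evidence a learned classifier can draw on.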
For each cell representing a value, the next step is to classify its type (e.g., weight, location, distance, revenue) according to its associated headers. In addition, many tables rely on auxiliary information and interpretation, such as footnotes or implied coherence (e.g., all adjacent cells share the same property, which is never explicitly stated). These additional structures need to be identified and associated with each cell. Furthermore, cell values should be fully normalized according to the available information. If a header states that all values are in $M (millions of dollars), every number under that header should be scaled accordingly.
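The normalization step can be sketched as follows. The unit table (`$K`, `$M`, `$B`) and the treatment of parenthesized numbers as negatives are assumptions made for illustration; only the `$M` case comes from the text above.

```python
from decimal import Decimal

# Hypothetical unit markers that may appear in a header.
UNIT_SCALES = {
    "$K": Decimal(1_000),
    "$M": Decimal(1_000_000),
    "$B": Decimal(1_000_000_000),
}


def header_scale(header: str) -> Decimal:
    """Return the multiplier implied by a unit marker in the header."""
    for unit, scale in UNIT_SCALES.items():
        if unit in header:
            return scale
    return Decimal(1)


def normalize(cell: str, header: str) -> Decimal:
    """Convert a printed cell to an absolute number using its header."""
    cleaned = cell.replace(",", "").replace("$", "").strip()
    # Assumption: accounting-style parentheses denote a negative value.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned) * header_scale(header)
    return -value if negative else value
```

`Decimal` is used instead of `float` so that monetary values survive scaling without binary rounding error.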
In most cases, the reason for reading and extracting a table from text is to be able to work with the information held therein. Comparing values to prior years’ numbers, reasoning about them, and filtering all require that the data fit into some logical representation of the domain of interest, whether implicitly or explicitly defined. Disambiguating the values allows them to be used consistently and stored uniformly for later querying (Liu 2009; Hurst 2000).
Disambiguation requires some desired final representation, whether a formal ontology or a relational database schema. Ideally the representation would cover the entire universe of interest, allowing every possible value type to be logically captured. However, it is often necessary to account for content that has not been encountered before.
Generally, disambiguation can be viewed as a supervised classification problem, whereby explicit or implicit (latent, structural) features are mapped probabilistically to available outcomes, constrained by meta-schemas such as length, primitive type, and relative position. Additionally, structural factors (number of values, etc.) can be used within an iterative framework to further limit the output space. That is, as more of the table is disambiguated, fewer options remain that are consistent with the prior results. As such, disambiguation can be viewed as a constrained optimization problem, provided the schema is sufficiently well defined.
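The constrained optimization view can be illustrated with a small exhaustive search. Everything in this sketch is an assumption made for illustration: the target `SCHEMA`, the keyword-overlap score standing in for a probabilistic classifier, and the hard constraints (primitive type must be compatible, and each schema field is used at most once).

```python
from itertools import permutations

# Hypothetical target schema: a primitive type plus header keyword hints.
SCHEMA = {
    "company_name": {"type": str, "hints": {"company", "name", "issuer"}},
    "fiscal_year": {"type": int, "hints": {"year", "fy"}},
    "revenue_usd": {"type": float, "hints": {"revenue", "sales"}},
}


def primitive_type(values):
    """Infer the narrowest primitive type that fits every value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def compatible(inferred, required) -> bool:
    """Integers may fill a float field; otherwise types must match."""
    return inferred is required or (required is float and inferred is int)


def score(header, values, field):
    """Keyword-overlap score, or None if a hard constraint is violated."""
    spec = SCHEMA[field]
    if not compatible(primitive_type(values), spec["type"]):
        return None
    return len(set(header.lower().split()) & spec["hints"])


def disambiguate(columns):
    """Map each column header to a distinct schema field.

    Tries every one-to-one assignment and keeps the highest-scoring one
    that satisfies all constraints; returns None if none is consistent.
    """
    headers = list(columns)
    best, best_score = None, -1
    for fields in permutations(SCHEMA, len(headers)):
        total = 0
        for header, field in zip(headers, fields):
            s = score(header, columns[header], field)
            if s is None:
                break  # assignment violates a type constraint
            total += s
        else:
            if total > best_score:
                best, best_score = dict(zip(headers, fields)), total
    return best


columns = {
    "Company": ["Acme", "Globex"],
    "Year": ["2014", "2015"],
    "Sales": ["12.5", "9.75"],
}
mapping = disambiguate(columns)
```

Exhaustive search is only feasible for small tables and schemas. It does show how each committed assignment removes options for the remaining columns, and a `None` result corresponds to content the schema cannot yet account for.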
- Association of American Publishers (1986) Markup of tabular material. Technical report. Association of American Publishers, Manuscript Series
- Fang J, Mitra P, Tang Z, Lee GC (2012) Table header detection and classification. In: Proceedings of AAAI, Toronto
- Göbel M, Hassan T, Oro E, Orsi G (2012) A methodology for evaluating algorithms for table understanding in PDF documents. In: Proceedings of the 2012 ACM symposium on document engineering, Atlanta, pp 45–48
- Govindaraju V, Zhang C, Ré C (2013) Understanding tables in context using standard NLP toolkits. In: Proceedings of the ACL, Sofia
- Hurst MF (2000) The interpretation of tables in texts. Ph.D. thesis, University of Edinburgh, Edinburgh
- Kornfeld W, Wattecamps J (1998) Automatically locating, extracting and analyzing tabular data. In: SIGIR’98: Proceedings of the 21st annual international ACM SIGIR conference, Melbourne, pp 347–348
- Laurentini A, Viada P (1992) Identifying and understanding tabular material in compound documents. In: Proceedings of 11th IAPR international conference on pattern recognition. Conference B: pattern recognition methodology and systems, IEEE, The Hague, vol II, pp 405–409
- Liu Y (2009) Tableseer: automatic table extraction, search, and understanding. Ph.D. thesis, The Pennsylvania State University
- Long V (2010) An agent-based approach to table recognition and interpretation. Ph.D. thesis, Macquarie University, Sydney
- Lopresti D, Hu J, Kashi R, Wilfong G (2000) Medium-independent table detection. In: SPIE document recognition and retrieval VII, San Jose, pp 291–302
- Mitchell T, Cohen W, Hruschka E, Talukdar P, Betteridge J, Carlson A, Dalvi B, Gardner M, Kisiel B, Krishnamurthy J, Lao N, Mazaitis K, Mohamed T, Nakashole N, Platanios E, Ritter A, Samadi M, Settles B, Wang R, Wijaya D, Gupta A, Chen X, Saparov A, Greaves M, Welling J (2015) In: Proceedings of the conference on artificial intelligence (AAAI)
- Padmanabhan R, Jandhyala RC, Krishnamoorthy M, Nagy G, Seth S, Silversmith W (2009) Interactive conversion of large web tables. In: Proceedings of eighth international workshop on graphics recognition, GREC 2009. City University of La Rochelle, La Rochelle
- Shafait F, Smith R (2010) Table detection in heterogeneous documents. In: Proceedings of the 9th IAPR international workshop on document analysis systems, Boston, pp 65–72
- Thompson M (1996) A tables manifesto. In: Proceedings of SGML Europe, Munich, pp 151–153
- Wang X (1996) Tabular abstraction, editing, and formatting. Ph.D. thesis, University of Waterloo, Waterloo