Robust Detection of Tables in Documents Using Scores from Table Cell Cores

Table detection is an essential step in many document analysis systems. Tables are a pivotal form of information representation: they organize data in a conventional structure that supports quick retrieval and comparison. Detecting table structures in PDF files or images is challenging because table layouts vary widely and because tables sometimes resemble non-tabular elements such as charts and plots. In this work, we present a table detection method based on a geometric analysis of the table cell cores, which represent the table cell texts. The proposed method analyzes text gap information and can therefore detect the table cell cores irrespective of the presence of table boundary lines and cell-separating rule lines. Experiments were conducted on document images with complex structures from well-known datasets. The detection accuracies obtained corroborate the usefulness of the proposed method.


Introduction
One of the significant challenges of document layout analysis is table understanding in the document image. Tables appear in a wide variety of documents, such as official documents, bills, scientific articles, reports, and archival documents; hence, techniques for table analysis are instrumental in automatically extracting important information kept in tabular form from numerous sources [1]. Tables help readers to easily compare, analyze and understand the facts present in documents [2]. Table detection is therefore an essential task, since accurate table detection improves the extraction of important information during document analysis. Because table styles are so diverse, table detection and extraction is a popular and challenging problem. No general algorithm exists that can detect the tables in a document irrespective of their styles.
A conventional optical character recognition (OCR) system consists of three significant steps: layout analysis, character recognition, and text string generation using a language modeling tool [3]. Since layout analysis is the first step in this process, all subsequent stages rely on it to work correctly. One of the significant difficulties faced by layout analysis is detecting table regions. Tables are formed either by horizontal and vertical lines or by uniform spacing that separates the cells. This variety of styles makes it difficult to provide a generic algorithm for table detection [2]. Our main contribution in this paper is a generalized algorithm for table detection followed by information extraction. The rest of the paper is organized as follows: we discuss related work in "Related Works and Motivation". In "Our Proposed Method", we present our proposed approach, and further details on checking components are given in "Score Computation". Results of the method are shown in "Results and Discussions", and we conclude with "Conclusions".

Related Works and Motivation
Several methods for table detection have been proposed in the literature. In general, these methods can be divided into two main categories: text analysis based and ruling line based. Some approaches use purely geometric features extracted from the ruled lines, pixel distributions, and white gaps, and then use those features to detect tables with machine learning. Our approach is based on a geometric analysis of the table cell centers.
Anh et al. [4] proposed a hybrid approach for the detection of table structures irrespective of the style, whether a ruling-line table or a non-ruling-line table. They reported experimental results on the ICDAR-2013 table competition dataset. Jahan et al. [5] proposed a method in which local thresholds for word gaps and line heights are used to locate and extract all categories of tables. The system shows a 75% overall detection rate, which is not very promising. Bansal et al. [6] presented a learning-based framework that identifies tables in scanned document images. They proposed a scheme for analyzing and labeling different document elements and their contexts, and finally for defining the table boundaries from the context information. Kasar et al. [7] presented a method that works by identifying the column and row line separators. The horizontally and vertically aligned lines are extracted first using run-length thresholds, and those lines are then used for feature generation and subsequent classification into tables and non-tables.
Many recent works on this problem use neural networks or deep learning models. For example, Forczmański et al. [8] presented an object detection approach using a Convolutional Neural Network. They focused on automatic segmentation of elements from documents; the elements considered were stamps, logos, text blocks, tables, and signatures. The authors collected various documents from the internet and created their own dataset. Their method works in two stages: in the first stage, a rough classification of the detected regions of interest is performed, and in the second stage, the found elements are verified. They experimented on public datasets and obtained a table detection accuracy of 97.79%. In another recent work, the authors used a Convolutional Neural Network with 28 layers for the detection of tables [9]. In a study by Shah Rukh Qasim et al. [10], a graph model is used for the structural analysis of documents. In [13], automated table and chart detection is addressed with a combination of deep convolutional neural networks, graphical models and saliency concepts. Tables and charts were localized in documents using a saliency-based fully convolutional neural network followed by a fully connected conditional random field (CRF). Performance was tested on the ICDAR 2013 dataset, with a reported precision of 97.5% and a recall of 98.1%. Arif et al. [14] suggested a data-driven approach for table detection in document images using foreground and background features. Observing that tables normally contain more numeric data, the authors focused on differentiating numerical data from other text. They obtained a precision of 86.33% and a recall of 93.21% on the UNLV dataset. Schreiber et al. [15] presented a deep learning system for table detection that analyzes the cell positions after detecting the rows and columns present in the tables. The accuracy of their method for table detection and structure recognition was 91.44% on the ICDAR 2013 dataset. Li et al. [16] proposed a method based on convolutional neural networks that applies some loose heuristic rules to extract meta-information from PDF documents and uses that meta-information for table detection. The crucial limitation of the method is that it only works for PDF documents.

Our Contributions
• From the currently available works, we can see that the approaches are based only on the geometry of the table lines or on the gaps between contents. The method in [17] is designed only for ruled tables. A document may contain ruled, non-ruled, and partially ruled tables, and our proposed method aims to work for all of them. We do not rely on any horizontal or vertical lines for detecting tables.
• A hybrid method to detect both ruled and non-ruled tables has been proposed in [4]. However, this method is very complicated and time consuming. It categorizes the tables as ruled and non-ruled and processes them differently. We do not classify tables into ruled and non-ruled tables, nor do we classify text and non-text elements in the documents. Hence our method is simpler and yet useful.
• The method in [18] relies on graphic lines, which sometimes leads to false detections of tables when there is a line in a paragraph with sparsely populated text. Our proposed method of score computation for the recognition solves this problem to some extent.

SN Computer Science
• Mandal et al. [19] proposed a method based on the assumption that the gaps between fields are larger than the gaps between words in text lines. However, this may not always be true, as tables can be densely populated. Our proposed method also uses the gaps between elements in a page, but relies on a more accurate assumption: tables can be recognized from the well-structured point set representing the table cell cores. Cores are understood as the centers of the text blocks in the tabular cells and are represented as a set of points.

Our Proposed Method
Our objective is to find tables present in a document image using simple methods. We start with a gray-scale image of the page. It is assumed that the image is already skew corrected. The image is then binarized using adaptive thresholding. Then the average character height is estimated. The next step's goal is to separate the page into regions, each containing a single component, such as a paragraph, image, table, figure, etc. It can be shown that the gap between components is significantly larger than the gap between text lines. Next, we examine the elements inside each component and try to group them into rows and columns. These elements' relative positions are further examined to categorize the components into two categories, tables and non-tables. These steps are described in more detail below. The proposed methodology is shown in Fig. 1.

Pre-processing
We start with a grayscale image of the document. If the image is skewed, it must be skew corrected for this method to work; we assume that the given image is already skew corrected. The input image is binarized using the method proposed by Sauvola et al. [20]. The method uses adaptive thresholding and can produce good-quality binarized images even for input images with illumination changes or noise. A sample output image is shown in Fig. 2. We have slightly modified the method to extract an estimate of the character height in the document image. In the final step of the binarization method, the connected black elements (say, a text character) are plotted on the resulting image (which is initially white). While plotting these elements, we keep track of the height of each of them. We compute the mean and median of these heights and use them to estimate the mode, as given in Eq. (1). We take this mode to be the estimated character height h.
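The height estimate can be illustrated with a minimal sketch. It assumes that the mode is obtained from the mean and median through Pearson's empirical relation, mode ≈ 3 × median − 2 × mean; this choice and the function name `estimate_char_height` are assumptions made for illustration and may differ from Eq. (1).

```python
from statistics import mean, median

def estimate_char_height(heights):
    """Estimate the modal height of connected components.

    Assumption: the mode is approximated with Pearson's empirical relation,
    mode ~= 3 * median - 2 * mean, which may not be the exact form of Eq. (1).
    """
    if not heights:
        raise ValueError("no connected components found")
    return 3 * median(heights) - 2 * mean(heights)

# Hypothetical heights (in pixels) of the connected components on a page.
heights = [11, 12, 12, 12, 12, 12, 13, 13, 14, 19]
h = estimate_char_height(heights)  # median = 12, mean = 13
print(h)
```

The relation is only a rough approximation for moderately skewed data; very large components (figures, rule lines) would distort it and may need to be excluded beforehand.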

Component Extraction
A document consists of a variety of components or regions such as text blocks, paragraphs, images, tables, figures, etc. It is helpful to separate them before further processing. Document structure and layout analysis can be used to decompose these components from a document image. Various such techniques exist and are mentioned in [21]. We use a simple smoothing-based technique.

Component Bounding Box Detection
To detect the bounding boxes, we start by smearing the foreground pixels, that is, coalescing nearby black pixels to form blobs. We use the run-length smoothing algorithm (RLSA) [22] for this. The RLSA can be used for block segmentation and text discrimination. The algorithm converts white pixels in the input image to black if the number of adjacent white pixels is less than or equal to some predefined limit $l$. We set this limit $l$ to be some multiple of the estimated character height $h$. RLSA is applied both horizontally and vertically, with run-length limits $l_h$ and $l_v$, respectively.
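The smearing step can be sketched as follows. This is a minimal pure-Python illustration, assuming a binary image stored as a list of rows with 1 for black and 0 for white, and assuming that only white runs enclosed by black pixels on both sides are filled; the multipliers used for the limits are hypothetical.

```python
def rlsa_row(row, limit):
    """Fill white runs (0) of length <= limit lying between two black pixels (1)."""
    out = list(row)
    last_black = None
    for x, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = x - last_black - 1
                if 0 < gap <= limit:
                    for k in range(last_black + 1, x):
                        out[k] = 1
            last_black = x
    return out

def rlsa_horizontal(image, limit):
    """Apply the smoothing independently to every row."""
    return [rlsa_row(row, limit) for row in image]

def rlsa_vertical(image, limit):
    """Apply the smoothing to every column by transposing the image."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [rlsa_row(col, limit) for col in columns]
    return [list(row) for row in zip(*smoothed)]

h = 2                     # estimated character height (toy value)
l_h, l_v = 1 * h, 1 * h   # hypothetical multiples of h
image = [
    [1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
smeared_h = rlsa_horizontal(image, l_h)
smeared_v = rlsa_vertical(image, l_v)
```

In the first row, the gap of two white pixels is filled while the gap of three is kept, which is the behaviour that lets words merge into blocks without merging separate components.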
We then traverse the edge of the blobs and find the four extreme coordinates of each blob, namely $x_{min}$, $y_{min}$, $x_{max}$ and $y_{max}$. These four values are enough to define a bounding box (see Fig. 5). We call these bounding boxes outer bounding boxes, as they represent the outer boundary of each component. The steps are shown in Fig. 3.
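A minimal sketch of the bounding box extraction is given below. Instead of tracing blob edges, it uses a breadth-first flood fill over 8-connected black pixels, which yields the same extreme coordinates; the image encoding (1 for black) and the function name are assumptions.

```python
from collections import deque

def bounding_boxes(image):
    """Return (x_min, y_min, x_max, y_max) for each 8-connected black blob."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for y0 in range(rows):
        for x0 in range(cols):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Start a new blob and grow it while tracking its extremes.
            x_min = x_max = x0
            y_min = y_max = y0
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < cols and 0 <= ny < rows
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes

smeared = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
boxes = bounding_boxes(smeared)
```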

Inner Elements Detection
To detect the elements inside a component boundary, we use an approach similar to the one used to detect components on a page. We start with a copy of the binarized image and coalesce nearby black pixels into blobs, but this time only horizontally; that is, RLSA is applied only in the horizontal direction. In this way, elements in separate lines do not coalesce into a single blob. We then find the extreme points of each blob and store them as an array of bounding boxes. We call these inner bounding boxes because they are obtained from the elements inside each component. We found that roughly removing the long vertical or horizontal lines (table boundaries or separator lines) before applying RLSA in this step gives better results, since the cell contents in a ruled table are occasionally too close to the table boundary lines. The steps are shown in Fig. 4.

Combining Inner and Outer Bounding Boxes
We now have a list of outer bounding boxes for each component and a list of inner bounding boxes (Fig. 5). These are now combined into a list of components. Each component has an outer bounding box which contains the smaller inner bounding boxes as shown in Fig. 6. These inner components are processed individually for all outer boxes one by one.

Component Representation
We represent each component with the following attributes as shown in the example in Fig. 6.
• Outer bounding box: a rectangular box which contains all inner components.
• An array of inner bounding boxes: an array of bounding boxes of all the elements inside the outer bounding box. These can be text lines for paragraphs, cells for a table, or some arbitrary region from graphics.
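The representation and the combination of the two box lists can be sketched as follows. The class name `Component`, the helper names, and the rule of assigning each inner box to the first outer box that fully contains it are assumptions made for illustration.

```python
from dataclasses import dataclass, field

def contains(outer, inner):
    """True if box inner = (x_min, y_min, x_max, y_max) lies inside box outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

@dataclass
class Component:
    outer: tuple                               # outer bounding box
    inner: list = field(default_factory=list)  # inner bounding boxes

def build_components(outer_boxes, inner_boxes):
    """Attach every inner box to the first outer box that encloses it."""
    components = [Component(outer=o) for o in outer_boxes]
    for box in inner_boxes:
        for comp in components:
            if contains(comp.outer, box):
                comp.inner.append(box)
                break
    return components

outer_boxes = [(0, 0, 100, 50), (0, 60, 100, 120)]
inner_boxes = [(5, 5, 40, 15), (50, 5, 90, 15), (5, 70, 95, 80)]
components = build_components(outer_boxes, inner_boxes)
```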

Table Detection
Once the components have been extracted, we need to identify the ones that could be tables. We examine the relative positions of all the inner elements to see whether their structure resembles that of table cells. To do that, we first need a simpler way to represent the inner elements. We therefore attempt to group all the inner elements into rows and columns by testing whether their X-axis or Y-axis projections overlap. Then, based on their overlapping areas, each inner element that could be successfully grouped into rows and columns is allocated a single 2D point. We call this point the overlapping center. Since a single point now represents each element, comparing their relative positions and examining their layout structure becomes easier (Fig. 7).

Row-Column Grouping
In this step, each component's inner bounding boxes are grouped into rows and columns and marked accordingly. We ignore any inner bounding box whose width is more than 75% of the width of the outer bounding box, as cells are generally not this large. Such a box may be a header, but we focus only on the cells. Two inner bounding boxes A and B can be said to be in the same row if their projections on the Y axis overlap. This can be checked easily with the formula given in Eq. (2).

The result $F_x(A, B)$ indicates whether the two boxes A and B have an X-axis projection intersection or not.
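A minimal sketch of the grouping step follows. The overlap predicate is written as a standard interval intersection test, which is assumed to play the role of Eq. (2); the greedy grouping strategy and all function names are likewise assumptions.

```python
def x_overlap(a, b):
    """True if the X-axis projections of boxes a and b intersect."""
    return a[0] <= b[2] and b[0] <= a[2]

def y_overlap(a, b):
    """True if the Y-axis projections of boxes a and b intersect."""
    return a[1] <= b[3] and b[1] <= a[3]

def filter_cells(outer, inner_boxes, max_ratio=0.75):
    """Drop inner boxes wider than max_ratio of the outer box width."""
    outer_width = outer[2] - outer[0]
    return [b for b in inner_boxes if (b[2] - b[0]) <= max_ratio * outer_width]

def group(boxes, overlaps):
    """Greedy grouping: a box joins the first group where it overlaps a member."""
    groups = []
    for box in boxes:
        for g in groups:
            if any(overlaps(box, other) for other in g):
                g.append(box)
                break
        else:
            groups.append([box])
    return groups

outer = (0, 0, 200, 100)
inner = [(0, 0, 190, 10),                     # wide header, ignored
         (10, 20, 50, 30), (80, 20, 120, 30),
         (10, 40, 55, 50), (85, 40, 118, 50)]
cells = filter_cells(outer, inner)
rows = group(cells, y_overlap)      # same row: Y projections overlap
columns = group(cells, x_overlap)   # same column: X projections overlap
```

The greedy pass does not merge two groups that later turn out to be connected through a new box; a union-find structure would handle that case if it matters.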

Representing Cells by Single Points
Once the groups are formed, for each element we find the minimum area that overlaps with all the elements in the same row group and the same column group. We then assign the center of this overlapped area to the element, as shown in Fig. 8. More formally, suppose that the inner elements $I_1, I_2, I_3, \ldots, I_k$ lie in the same column and the elements $J_1, J_2, J_3, \ldots, J_m$ lie in the same row. Then the coordinates of the center p of the overlapped area are given by Eqs. (3), (4) and (5). The collection of all p values from all overlapping regions is referred to as the core C. Hence, C is the well-structured point set representing the table cell cores, which are the centers of the text blocks within the tabular cells.
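One plausible reading of the overlapped-area center can be sketched as follows: the horizontal extent is the intersection of the X projections of the column group, the vertical extent is the intersection of the Y projections of the row group, and p is the midpoint of the resulting rectangle. This reading and the function name `core_point` are assumptions, not a reproduction of the original equations.

```python
def core_point(column_boxes, row_boxes):
    """Center of the region shared by a column group (in X) and a row group (in Y).

    Boxes are (x_min, y_min, x_max, y_max). Assumption: the overlapped area is
    the intersection of the column's X projections and the row's Y projections.
    """
    x_left = max(b[0] for b in column_boxes)
    x_right = min(b[2] for b in column_boxes)
    y_top = max(b[1] for b in row_boxes)
    y_bottom = min(b[3] for b in row_boxes)
    return ((x_left + x_right) / 2, (y_top + y_bottom) / 2)

# The element (10, 20, 50, 30) belongs to both groups below.
column_group = [(10, 20, 50, 30), (10, 40, 55, 50)]
row_group = [(10, 20, 50, 30), (80, 22, 120, 32)]
p = core_point(column_group, row_group)
```

Because every element of a column shares the same horizontal intersection, all cells in that column receive the same x-coordinate, which is what makes the resulting point set regular.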

Score Computation
We now have a set of points, each representing an inner element of an outer component. In tables, these points represent cells and are therefore arranged as the core structure representing the table as a whole. The cells of a table are generally uniformly spaced within each group (row or column). This helps us to identify a table even in the absence of ruling lines. We define a score-based examination method, using the relative distances between these points, to decide whether a given component should be considered a table.

Vertical and Horizontal Relations and Distances
For two given points p and q in C, we say $p R_h q$ (p is horizontally related to q) if $|p.y - q.y| \le \epsilon$, where $\epsilon$ is some distance threshold; that is, the two points lie in the same row. Similarly, we define $p R_v q$ (p is vertically related to q) if $|p.x - q.x| \le \epsilon$. Next, we do the following using the point set to calculate the tabular structure score:
1. For every point p find a point q, if any, such that $p R_h q$ holds, $p.x > q.x$, and q is closest to p.
2. For every point p find a point q, if any, such that $p R_h q$ holds, $p.x < q.x$, and q is closest to p.
3. For every point p find a point q, if any, such that $p R_v q$ holds, $p.y > q.y$, and q is closest to p.
4. For every point p find a point q, if any, such that $p R_v q$ holds, $p.y < q.y$, and q is closest to p.
Here, "closest" is measured in terms of distance, and these distances are stored. The pattern of these distances reveals the nature of the component: in a table, some distance values repeat (exactly or approximately) many times. We define our score for the recognition of tables using the frequencies of these distances, as shown in Eq. (6).
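The four neighbour searches can be sketched as below. The sketch treats two points as horizontally related when their y-coordinates differ by at most the threshold (same row) and as vertically related when their x-coordinates do (same column), stores each relation as an unordered pair, and treats distances that agree after rounding to the nearest pixel as equal. These choices and all names are assumptions.

```python
from collections import defaultdict

def closest(p, candidates, axis):
    """Candidate nearest to p along the given axis (0 = x, 1 = y), or None."""
    return min(candidates, key=lambda q: abs(p[axis] - q[axis]), default=None)

def relations(points, eps):
    """Unordered nearest-neighbour pairs within rows and within columns."""
    horizontal, vertical = set(), set()
    for p in points:
        row = [q for q in points if q != p and abs(p[1] - q[1]) <= eps]
        col = [q for q in points if q != p and abs(p[0] - q[0]) <= eps]
        picks = [
            (horizontal, closest(p, [q for q in row if q[0] < p[0]], 0)),  # step 1
            (horizontal, closest(p, [q for q in row if q[0] > p[0]], 0)),  # step 2
            (vertical, closest(p, [q for q in col if q[1] < p[1]], 1)),    # step 3
            (vertical, closest(p, [q for q in col if q[1] > p[1]], 1)),    # step 4
        ]
        for target, q in picks:
            if q is not None:
                target.add(frozenset((p, q)))
    return horizontal, vertical

def distance_stats(pairs, axis):
    """Map each distance d to (frequency of d, number of distinct points involved)."""
    count = defaultdict(int)
    involved = defaultdict(set)
    for pair in pairs:
        p, q = tuple(pair)
        d = round(abs(p[axis] - q[axis]))  # assumption: equal after rounding
        count[d] += 1
        involved[d].update(pair)
    return {d: (count[d], len(involved[d])) for d in count}

# A 3 x 3 grid of core points with column gaps 10 and 20 and row gap 15.
xs, ys = (0, 10, 30), (0, 15, 30)
core = [(x, y) for y in ys for x in xs]
horizontal, vertical = relations(core, eps=2)
h_stats = distance_stats(horizontal, axis=0)
v_stats = distance_stats(vertical, axis=1)
```

For this grid there are six horizontal and six vertical relations; each horizontal distance occurs three times over six points, and the single vertical distance occurs six times over all nine points.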
Here, $n_{d_i}$ denotes the frequency of the distance $d_i$, $r_{d_i}$ denotes the number of points involved with distance $d_i$, $|p R_h q|$ and $|p R_v q|$ denote, respectively, the number of horizontal and vertical relations, and $|C|$ denotes the size of the core C (in terms of number of points). For example, for the point set shown in Fig. 9, we have $r_{d_1} = 6$, $n_{d_1} = 3$, $r_{d_2} = 6$, $n_{d_2} = 3$, $r_{d_3} = 9$, $n_{d_3} = 6$, $|p R_h q| = 6$, $|p R_v q| = 6$, and $|C| = 9$. Therefore, for this example, we have Score = 7.5.
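The arithmetic of the example can be checked with a short sketch. The formula used here, the sum of $n_{d_i} \cdot r_{d_i}$ divided by the total number of relations, is a hypothetical form chosen only because it reproduces the value 7.5 from the quantities listed for Fig. 9; it does not use $|C|$, so the actual Eq. (6) may well differ.

```python
def table_score(distance_stats, n_horizontal, n_vertical):
    """Hypothetical score: sum of n_d * r_d, normalised by the relation count.

    distance_stats is a list of (n_d, r_d) pairs, one per distinct distance.
    Assumption: this form merely matches the worked example for Fig. 9.
    """
    total_relations = n_horizontal + n_vertical
    if total_relations == 0:
        return 0.0
    return sum(n * r for n, r in distance_stats) / total_relations

# Values listed for Fig. 9: (n_d1, r_d1), (n_d2, r_d2), (n_d3, r_d3).
fig9 = [(3, 6), (3, 6), (6, 9)]
score = table_score(fig9, n_horizontal=6, n_vertical=6)  # 90 / 12
print(score)
```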

Results and Discussions
Our program was tested on a computer with an Intel Core i3-6098P processor, which has a base frequency of 3.60 GHz. The method was tested on 80 input document images taken from various scholarly articles. The document pages contain various types of tables along with other graphics elements such as plots, equations, and images. Figure 14 shows the CPU times required to detect tables in some sample documents of different sizes. The scores computed from the core points for some documents are shown in Figs. 10, 11, 12 and 13. Based on experimental observations, we classify a component as a table when its score exceeds 5.00 (Fig. 14).
Results are shown for various types of document images in Figs. 15, 16 and 17. Here, Figs. 15 and 16 show pages containing ruled tables, whereas Fig. 17 shows document pages with tables where cell contents are not separated by ruled lines.

Evaluation Metrics
To evaluate the classification accuracy, four metrics have been used in our work: Precision, Recall, F1 score, and Accuracy. Their definitions are given in Eq. (7), where TP, FP, FN, and TN represent true positives, false positives, false negatives, and true negatives, respectively. Here, TP is the count of tables correctly predicted as tables, FP is the number of non-tables (plots, graphs, graphics) predicted as tables, FN is the count of tables not detected as tables, and TN is the count of non-tables predicted as non-tables.
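For reference, the standard definitions of these four metrics in terms of the counts above are as follows; they are assumed to coincide with Eq. (7).

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, &
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}.
\end{align}
```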
Another metric we used for evaluation is Intersection over Union (IoU), which is widely used in object detection benchmarks [11]. It measures the overlap between the covering rectangles or polygons of the predicted and ground truth tables. The value of IoU lies in the range [0, 1]. A higher IoU indicates a closer match between the ground truth and predicted tables.
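For axis-aligned covering rectangles, IoU can be computed as in the following minimal sketch. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples treated as continuous rectangles; the function name is illustrative.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0, inter_w) * max(0, inter_h)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)
predicted = (5, 0, 15, 10)
value = iou(ground_truth, predicted)  # intersection 50, union 150
```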

Our Results
Initially, we tested our method on our own dataset, which contains 80 document images with 99 tables in total. We obtained FP = 8, FN = 6, and TP = 93, giving a precision of 0.921, a recall of 0.939, and an F1 score of 0.93.
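The reported figures can be re-derived from the counts with the standard metric definitions, as in this short sketch (the function name is illustrative).

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=93, fp=8, fn=6)
print(f"{p:.3f} {r:.3f} {f1:.2f}")  # prints "0.921 0.939 0.93"
```

Note that TP + FN = 99, which matches the total number of tables in the dataset.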
We have also tested our method on the ICDAR-2013, Marmot, TableBank, and ICDAR-2019 datasets. ICDAR-2013 [12] is one of the most popular datasets for table detection and structure recognition. It was created from documents obtained from web pages, for a competition concentrated on detecting figures, tables, and mathematical equations in document images. The dataset is composed of PDF files, which we converted to images for use in our research work. It contains 59 PDF files with a total of 117 tables. To give the algorithms ample opportunity to produce false positives, approximately two pages before and after each table are included as excerpts. A comparison of accuracy figures with other methods applied to ICDAR-2013 is given in Table 1, which shows that the proposed method outperforms the other methods. Our proposed method detects tables well irrespective of the presence of table boundary rule lines, which is the major contribution of our work.
The Marmot [23] dataset comprises English and Chinese documents. The dataset consists of 2000 images, with a ratio of almost 1:1 between positive and negative samples. The pages show great variety in language type, page layout, and table styles. Over 1500 conference and journal paper pages were included in the dataset, covering various fields and spanning publications from 1970 to 2011. The e-Book pages are primarily in one-column layout, while the English pages mix one-column and two-column layouts. Our method was tested on the English pages, where it achieved a precision of 0.960, a recall of 0.984, and an F1 score of 0.972.
The TableBank dataset [24] consists of 417,234 high quality labeled tables as well as their original documents in a variety of domains. Our method achieved the precision of  [25]. Our approach achieved the precision  Fig. 18.
For the proposed method, we obtained IoU values of 0.90, 0.65 and 0.60 on our dataset, the Marmot dataset, and TableBank (Part 1), respectively (Tables 2, 3, 4).

Challenges
Sample document pages from the Marmot dataset on which our proposed method failed are shown in Fig. 19. In the image shown in Fig. 19a (Marmot-123), there is no horizontal spacing in the first column, and this lowers the score. In our experimentation, the horizontal ($l_h$) and vertical ($l_v$)  Fig. 19b (Marmot-190), the presence of the vertical separator line and its proximity to the text words are the reasons for failure.
For the sample shown in Fig. 20

Conclusions
This paper presents a novel method for the detection of tables in document images. Table detection is necessary, for instance, to convert tables in a document image into an editable format. Once the tables are detected, the table cells can be localized, and finally the cell contents can be extracted using OCR. While ruled tables are straightforward to detect, often by identifying the horizontal and vertical lines of the table borders, unruled or partially ruled tables are more challenging. We presented a method that can recognize tables irrespective of whether they are ruled, partially ruled or unruled. We do not look for any lines or boundaries; instead, we rely on the fact that the cells of a table are arranged in a well-structured way. We have shown that this structure can be represented as a set of core points if the cell contents are replaced by representative points. In this work, we have shown the score computation method only for the detection of table structures. The work can be extended by designing score formulae for other types of graphics elements such as plots, graphs, and equations. Automated selection of the run-length parameters for smoothing is also worth exploring for fine-tuning.