In this section, we describe in detail the approaches and algorithms used to implement all the individual workflow steps.
Layout analysis
Layout analysis is the initial phase of the entire workflow. Its goal is to create a hierarchical structure of the document preserving the entire text content of the input document and features related to the way the text is displayed in the PDF file.
Layout analysis is composed of the following steps:
1. Character extraction (A1)—extracting individual characters from a PDF document.
2. Page segmentation (A2)—joining characters into words, lines and zones.
3. Reading order determination (A3)—calculating the reading order for all the structure levels.
Character extraction
The purpose of the character extraction step is to extract individual characters from the PDF stream along with their positions on the page, widths and heights. These geometric parameters play an important role in further steps, in particular page segmentation and content classification.
The implementation of character extraction is based on the open-source iText library [30]. We use iText to iterate over the PDF's text-showing operators. During the iteration, we extract text strings along with their size and position on the page. Next, the extracted strings are split into individual characters, and their individual widths and positions are calculated. The result is an initial flat structure of the document, which consists only of pages and characters. The widths and heights computed for individual characters are approximate and can slightly differ from the exact values depending on the font, style and characters used. Fortunately, these approximate values are sufficient for the further steps.
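To make this step more concrete, the following minimal sketch shows how individual characters and their approximate bounding boxes can be pulled out of a PDF, assuming a recent iText 5 parser API. The Char holder class and the bounding-box arithmetic are simplifying assumptions for illustration, not the actual implementation.

```java
// A minimal sketch of character extraction, assuming the iText 5 parser API.
// The Char holder class is hypothetical; bounding boxes are approximated from
// the baseline, ascent and descent lines, as noted in the text above.
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CharacterExtractor {

    public static class Char {
        public String text;
        public float x, y, width, height;
    }

    public List<List<Char>> extract(String path) throws IOException {
        PdfReader reader = new PdfReader(path);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        List<List<Char>> pages = new ArrayList<>();

        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            final List<Char> chars = new ArrayList<>();
            parser.processContent(i, new RenderListener() {
                public void beginTextBlock() { }
                public void endTextBlock() { }
                public void renderImage(ImageRenderInfo info) { }

                public void renderText(TextRenderInfo info) {
                    // Split the extracted string into individual characters and
                    // compute approximate geometric parameters for each of them.
                    for (TextRenderInfo charInfo : info.getCharacterRenderInfos()) {
                        Vector start = charInfo.getBaseline().getStartPoint();
                        Vector end = charInfo.getBaseline().getEndPoint();
                        Vector top = charInfo.getAscentLine().getStartPoint();
                        Vector bottom = charInfo.getDescentLine().getStartPoint();

                        Char c = new Char();
                        c.text = charInfo.getText();
                        c.x = start.get(Vector.I1);
                        c.y = bottom.get(Vector.I2);
                        c.width = end.get(Vector.I1) - start.get(Vector.I1);
                        c.height = top.get(Vector.I2) - bottom.get(Vector.I2);
                        chars.add(c);
                    }
                }
            });
            pages.add(chars);
        }
        reader.close();
        return pages;
    }
}
```

The result corresponds to the initial flat structure described above: a list of pages, each holding its characters with approximate geometry.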
Page segmentation
The goal of the page segmentation step is to create a geometric hierarchical structure storing the document's content. As a result, the document is represented by a list of pages, each page contains a set of zones, each zone contains a set of text lines, each line contains a set of words, and finally each word contains a set of individual characters. Each object in the structure has its content, position and dimensions. The structure is heavily used in further steps, especially zone classification and bibliography extraction.
Page segmentation is implemented with the use of the bottom-up Docstrum algorithm [31]:
1. The algorithm is based to a great extent on the analysis of the nearest-neighbour pairs of individual characters. In the first step, the five nearest components for every character are identified (red lines in Fig. 2).
2. In order to calculate the text orientation (the skew angle), we analyse the histogram of the angles between the elements of all nearest-neighbour pairs. The peak value is assumed to be the angle of the text. In the case of born-digital documents the skew is almost always horizontal, so this step is mostly useful for documents containing scanned pages.
3. Next, within-line spacing is estimated by detecting the peak of the histogram of distances between the nearest neighbours. For this histogram, we use only those pairs in which the angle between components is similar to the estimated text orientation angle (blue lines in Fig. 2). All the histograms used in Docstrum are smoothed to avoid detecting local abnormalities. An example of a smoothed distance histogram is shown in Fig. 3; a code sketch of this estimation is given after this list.
4. Similarly, between-line spacing is also estimated with the use of a histogram of the distances between the nearest-neighbour pairs. In this case, we include only those pairs that are placed approximately in the line perpendicular to the text line orientation (green lines in Fig. 2).
5. Next, line segments are found by performing a transitive closure on within-line nearest-neighbour pairs. To prevent joining line segments belonging to different columns, the components are connected only if the distance between them is sufficiently small.
6. The zones are then constructed by grouping the line segments on the basis of heuristics related to spatial and geometric characteristics: parallelism, distance and overlap.
7. The segments belonging to the same zone and placed in one line horizontally are merged into final text lines.
8. Finally, we divide the content of each text line into words based on within-line spacing.
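The sketch below illustrates the histogram-based spacing estimation from step 3, including the Gaussian smoothing window mentioned among the improvements listed further down. The NeighbourPair type, bin size, angle tolerance and smoothing parameters are illustrative assumptions, not the values used in the actual implementation.

```java
// Illustrative sketch of within-line spacing estimation from nearest-neighbour
// pairs. The NeighbourPair type and all numeric constants are hypothetical.
import java.util.List;

class SpacingEstimator {

    static class NeighbourPair {
        double distance;   // Euclidean distance between the two characters
        double angle;      // angle of the connecting vector, in radians
        NeighbourPair(double distance, double angle) {
            this.distance = distance;
            this.angle = angle;
        }
    }

    /** Returns the most frequent distance among pairs roughly parallel to the text orientation. */
    static double estimateWithinLineSpacing(List<NeighbourPair> pairs, double textOrientation) {
        final double ANGLE_TOLERANCE = Math.toRadians(5);  // assumed tolerance
        final double BIN_SIZE = 0.5;                       // histogram resolution in points
        final int MAX_BINS = 400;

        double[] histogram = new double[MAX_BINS];
        for (NeighbourPair pair : pairs) {
            if (Math.abs(pair.angle - textOrientation) > ANGLE_TOLERANCE) {
                continue;                                  // keep only "horizontal" pairs
            }
            int bin = (int) (pair.distance / BIN_SIZE);
            if (bin < MAX_BINS) {
                histogram[bin]++;
            }
        }

        // Gaussian smoothing window (one of the improvements listed below)
        // instead of the original rectangular window.
        double[] smoothed = new double[MAX_BINS];
        int radius = 3;
        double sigma = 1.5;
        for (int i = 0; i < MAX_BINS; i++) {
            double sum = 0, weightSum = 0;
            for (int j = -radius; j <= radius; j++) {
                int idx = i + j;
                if (idx < 0 || idx >= MAX_BINS) continue;
                double w = Math.exp(-(j * j) / (2 * sigma * sigma));
                sum += w * histogram[idx];
                weightSum += w;
            }
            smoothed[i] = sum / weightSum;
        }

        // The peak of the smoothed histogram is taken as the within-line spacing.
        int peak = 0;
        for (int i = 1; i < MAX_BINS; i++) {
            if (smoothed[i] > smoothed[peak]) peak = i;
        }
        return (peak + 0.5) * BIN_SIZE;
    }
}
```

The between-line spacing of step 4 can be estimated in the same way, only with the angle filter selecting pairs roughly perpendicular to the text orientation.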
A few improvements were added to the Docstrum-based implementation of page segmentation:
- the distance between connected components, which is used for grouping components into lines, has been split into horizontal and vertical distance (based on the estimated text orientation angle),
- the fixed maximum distance between lines that belong to the same zone has been replaced with a value scaled relative to the line height,
- merging of lines belonging to the same zone has been added,
- the rectangular smoothing window has been replaced with a Gaussian smoothing window,
- merging of highly overlapping zones has been added,
- determination of words based on within-line spacing has been added.
Reading order resolving
By design, a PDF file contains a stream of strings that undergoes the extraction and segmentation process. As a result, we obtain pages containing characters grouped into zones, lines and words, all of which have the form of an unsorted bag of items. The aim of setting the reading order is to determine the right sequence in which all the structure elements should be read. This information is used by the zone classifiers and also allows us to extract the full text of the document in the right order. An example document page with the reading order of the zones is shown in Fig. 4.
The reading order resolving algorithm is based on a bottom-up strategy: first, characters are sorted horizontally within words and words within lines, then lines are sorted vertically within zones, and finally we sort the zones. The fundamental principle for sorting zones was taken from [32]. We make use of the observation that the natural reading order in most modern languages descends from top to bottom if successive zones are aligned vertically, and otherwise it traverses from left to right. There are a few exceptions to this rule, for example Arabic script, and such cases would not be handled properly by the algorithm. This observation is reflected in the distances computed for all zone pairs: the distance is calculated using the angle of the slope of the vector connecting the zones. As a result, zones aligned vertically are in general closer than those aligned horizontally. Then, using an algorithm similar to hierarchical clustering methods, we build a binary tree by repeatedly joining the closest zones and groups of zones. After that, for every node its children are swapped, if needed. Finally, an in-order tree traversal gives the desired zone order.
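A simplified sketch of this procedure is given below. The Zone and Node types, the concrete angle-based weighting and the merge-time ordering of children (instead of a separate swapping pass) are illustrative assumptions; the sketch only shows the idea that vertically aligned zones end up closer and that an in-order traversal of the merge tree yields the reading order.

```java
// A simplified sketch of the reading order heuristic. Coordinates are assumed
// to have their origin at the top-left of the page (y grows downwards).
import java.util.ArrayList;
import java.util.List;

class ReadingOrder {

    static class Zone {
        double x, y;              // centre coordinates of the zone on the page
        Zone(double x, double y) { this.x = x; this.y = y; }
    }

    // Binary tree node built by repeatedly merging the two closest groups of zones.
    static class Node {
        Zone zone;                // non-null only for leaves
        Node left, right;
        double x, y;              // representative coordinates of the group
    }

    /** Distance that favours vertical alignment: the more horizontal the
     *  connecting vector, the more the Euclidean distance is inflated. */
    static double distance(double x1, double y1, double x2, double y2) {
        double dx = x2 - x1, dy = y2 - y1;
        double euclidean = Math.sqrt(dx * dx + dy * dy);
        double angle = Math.atan2(Math.abs(dy), Math.abs(dx)); // 0 = horizontal, pi/2 = vertical
        double weight = 1.0 + Math.cos(angle);                 // assumed weighting scheme
        return euclidean * weight;
    }

    static List<Zone> order(List<Zone> zones) {
        // Start with one leaf node per zone.
        List<Node> groups = new ArrayList<>();
        for (Zone z : zones) {
            Node n = new Node();
            n.zone = z; n.x = z.x; n.y = z.y;
            groups.add(n);
        }
        // Repeatedly merge the two closest groups, as in hierarchical clustering.
        while (groups.size() > 1) {
            int bestI = 0, bestJ = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < groups.size(); i++) {
                for (int j = i + 1; j < groups.size(); j++) {
                    double d = distance(groups.get(i).x, groups.get(i).y,
                                        groups.get(j).x, groups.get(j).y);
                    if (d < best) { best = d; bestI = i; bestJ = j; }
                }
            }
            Node merged = new Node();
            Node a = groups.get(bestI), b = groups.get(bestJ);
            // Put the group that should be read first on the left: higher on the
            // page, or further to the left when roughly at the same height.
            boolean aFirst = a.y < b.y || (Math.abs(a.y - b.y) < 1.0 && a.x < b.x);
            merged.left = aFirst ? a : b;
            merged.right = aFirst ? b : a;
            merged.x = (a.x + b.x) / 2;
            merged.y = (a.y + b.y) / 2;
            groups.remove(bestJ);
            groups.remove(bestI);
            groups.add(merged);
        }
        // An in-order traversal of the tree yields the reading order.
        List<Zone> result = new ArrayList<>();
        inOrder(groups.get(0), result);
        return result;
    }

    static void inOrder(Node node, List<Zone> out) {
        if (node == null) return;
        if (node.zone != null) { out.add(node.zone); return; }
        inOrder(node.left, out);
        inOrder(node.right, out);
    }
}
```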
Content classification
The goal of content classification is to determine the role played by every zone in the document. This is done in two steps: initial zone classification (A4) and metadata zone classification (B1).
The goal of initial classification is to label each zone with one of four general classes: metadata (document’s metadata, e.g. title, authors, abstract, keywords, and so on), references (the bibliography section), body (publication’s text, sections, section titles, equations, figures and tables, captions) or other (acknowledgments, conflicts of interests statements, page numbers, etc.).
The goal of metadata zone classification is to classify all metadata zones into specific metadata classes: title (the title of the document), author (the names of the authors), affiliation (authors' affiliations), editor (the names of the editors), correspondence (addresses and emails), type (the type specified in the document, such as "research article", "editorial" or "case study"), abstract (the document's abstract), keywords (keywords listed in the document), bib_info (for zones containing bibliographic information, such as journal name, volume, issue, DOI, etc.), dates (the dates related to the process of publishing the article).
The two classifiers are implemented in a similar way. They both employ support vector machines, and the implementation is based on the LibSVM library [33]. They differ in the target zone labels, the extracted features and the SVM parameters used. The features, as well as the SVM parameters, were selected using the same procedure, described in Sects. 4.2.1 and 4.2.2.
Support vector machines are a powerful classification technique able to handle a large variety of input and work effectively even with training data of a small size. The algorithm is based on finding the optimal separation hyperplane and is not very prone to overfitting. It does not require a lot of parameters and can deal with highly dimensional data. SVM is widely used for content classification and achieves very good results in practice.
The decision to split content classification into two separate classification steps, as opposed to implementing a single zone classification step, was based mostly on aspects related to the workflow architecture and maintenance. In fact, the two tasks have different characteristics and needs. The goal of the initial classifier is to divide the article's content into three general areas of interest, which can then be analysed independently in parallel, while the metadata classifier performs a far more detailed analysis of only a small subset of all zones.
The implementation of the initial classifier is more stable: the target label set does not change, and once trained on a reasonably large and diverse dataset, the classifier performs well on other layouts as well. On the other hand, metadata zones have much more variable characteristics across different layouts, and from time to time there is a need to tune the classifier or retrain it using a wider document set. What is more, sometimes the classifier has to be extended to be able to capture new labels, not considered before (for example a special label for zones containing both author and affiliation, a separate label for categories or general terms).
For these reasons, we decided to implement content classification in two separate steps. As a result, we can maintain them independently, and for example adding another metadata label to the system does not change the performance of recognizing the bibliography sections. It is also possible that in the future the metadata classifier will be reimplemented using a different technique that allows new training cases to be added incrementally, for example a form of online learning.
For completeness, we compared the performance of a single zone classifier assigning all needed labels in one step to our current solution of two separate classifiers executed in sequence. The results can be found in Sect. 5.3.
Feature selection
The features used by the classifiers were selected with the use of the zone validation dataset (all the datasets used for experiments are described in Sect. 5.1). For each classifier, we analysed 97 features in total. The features capture various aspects of the content and surroundings of the zones and can be divided into the following categories:
- geometric—based on geometric attributes; some examples include: the zone's height and width, height to width ratio, the zone's horizontal and vertical position, the distance to the nearest zone, empty space below and above the zone, mean line height, and whether the zone is placed at the top, bottom, left or right side of the page;
- lexical—based upon keywords characteristic for different parts of narration, such as affiliations, acknowledgments, abstract, keywords, dates, references or article type; these features typically check whether the text of the zone contains any of the characteristic keywords;
- sequential—based on sequence-related information; some examples include the label of the previous zone (according to the reading order), the presence of the same text blocks on the surrounding pages, and whether the zone is placed on the first/last page of the document;
- formatting—related to text formatting in the zone; examples include the font size in the current and adjacent zones, the amount of blank space inside the zone, and the mean indentation of text lines in the zone;
- heuristics—based on heuristics of various nature, such as the count and percentage of lines, words, uppercase words, characters, letters, upper/lowercase letters, digits, whitespaces, punctuation, brackets, commas, dots, etc.; also whether each line starts with an enumeration-like token, or whether the zone contains only digits (a small code sketch illustrating a few of these features follows this list).
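The sketch below illustrates a few representative features from these categories. The Zone and TextLine types and the specific keyword list are hypothetical; the real classifiers use 97 such features in total.

```java
// An illustrative sketch of a few zone features. Zone, TextLine and the keyword
// list are assumptions made for the example only.
import java.util.Arrays;
import java.util.List;

class ZoneFeatures {

    static class TextLine { double height; String text; }
    static class Zone {
        double x, y, width, height;
        List<TextLine> lines;
        String text() {
            StringBuilder sb = new StringBuilder();
            for (TextLine l : lines) sb.append(l.text).append(' ');
            return sb.toString();
        }
    }

    // Geometric: height to width ratio of the zone.
    static double heightToWidthRatio(Zone zone) {
        return zone.height / zone.width;
    }

    // Geometric: mean height of the text lines in the zone.
    static double meanLineHeight(Zone zone) {
        double sum = 0;
        for (TextLine line : zone.lines) sum += line.height;
        return sum / zone.lines.size();
    }

    // Lexical: does the zone contain any keyword characteristic for affiliations?
    static double containsAffiliationKeyword(Zone zone) {
        List<String> keywords = Arrays.asList("university", "department", "institute");
        String text = zone.text().toLowerCase();
        for (String keyword : keywords) {
            if (text.contains(keyword)) return 1.0;   // binary features are encoded as 0/1
        }
        return 0.0;
    }

    // Heuristic: percentage of uppercase letters among all letters in the zone.
    static double uppercaseLetterFraction(Zone zone) {
        int letters = 0, uppercase = 0;
        for (char c : zone.text().toCharArray()) {
            if (Character.isLetter(c)) {
                letters++;
                if (Character.isUpperCase(c)) uppercase++;
            }
        }
        return letters == 0 ? 0.0 : (double) uppercase / letters;
    }
}
```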
In general, feature selection was performed by analysing the correlations between the features and between features and expected labels. For simplicity, we treat all the features as numerical variables; the values of binary features are decoded as 0 or 1. The labels, on the other hand, are an unordered categorical variable.
Let L be the set of zone labels for a given classifier, n the number of observations (zones) in the validation dataset and \(k = 97\) the initial number of analysed features. For the ith feature, where \(0 \le i < k\), we define \(f_i \in R^n\), the vector of the values of the ith feature for subsequent observations. Let also \(l \in L^n\) be the corresponding vector of zone labels.
In the first step, we removed redundant features, highly correlated with other features. For each pair of feature vectors, we calculated the Pearson’s correlation score and identified all the pairs \(f_i, f_j \in R^n\), such that
$$\begin{aligned} |{\textit{corr}}(f_i, f_j)| > 0.9 \end{aligned}$$
Next, for every feature from highly correlated pairs, we calculated the mean absolute correlation:
$$\begin{aligned} {\textit{meanCorr}}(f_i) = \frac{1}{k}\sum _{j=0}^{k-1} |{\textit{corr}}(f_i, f_j)| \end{aligned}$$
and from each highly correlated pair, the feature with the higher meanCorr was eliminated. This left us with 78 and 75 features for the initial and metadata classifiers, respectively. Let us denote the number of remaining features as \(k'\).
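A compact sketch of this redundancy elimination is shown below. It assumes the feature values are available as a matrix with one row per feature; the Pearson correlation helper is written out so the example is self-contained.

```java
// A sketch of the correlation-based redundancy elimination: for every pair with
// |corr| above the threshold, the feature with the higher mean absolute
// correlation is dropped.
import java.util.HashSet;
import java.util.Set;

class FeatureElimination {

    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** Returns the indices of the features kept after eliminating redundant ones. */
    static Set<Integer> eliminateRedundant(double[][] features, double threshold) {
        int k = features.length;
        double[][] corr = new double[k][k];
        double[] meanAbsCorr = new double[k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                corr[i][j] = pearson(features[i], features[j]);
                meanAbsCorr[i] += Math.abs(corr[i][j]) / k;
            }
        }
        Set<Integer> removed = new HashSet<>();
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                if (removed.contains(i) || removed.contains(j)) continue;
                if (Math.abs(corr[i][j]) > threshold) {
                    // Drop the feature with the higher mean absolute correlation.
                    removed.add(meanAbsCorr[i] > meanAbsCorr[j] ? i : j);
                }
            }
        }
        Set<Integer> kept = new HashSet<>();
        for (int i = 0; i < k; i++) if (!removed.contains(i)) kept.add(i);
        return kept;
    }
}
```

For the threshold of 0.9 used above, the call would be eliminateRedundant(featureMatrix, 0.9).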
After eliminating features using the correlations between them, we analysed the features using their associations with the expected zone label vector l. To calculate the correlation between a single feature vector \(f_i\) (numeric) and the label vector l (unordered categorical), we employed Goodman and Kruskal's \(\tau \) (tau) measure [34]. Let us denote it as \(\tau (f_i, l)\).
Let \(f_0, f_1, \ldots f_{k'-1}\) be the sequence of the feature vectors ordered by non-decreasing \(\tau \) measure, that is
$$\begin{aligned} \tau (f_0, l) \le \tau (f_1, l) \le \cdots \le \tau (f_{k'-1}, l) \end{aligned}$$
The features were then added to the classifier one by one, starting from the best one (the one most correlated with the label vector, \(f_{k'-1}\)), until the classifier contained the entire feature set. At each step, we performed a fivefold cross-validation on the validation dataset and calculated the overall F score as an average over individual labels. For completeness, we also repeated the same process with the reversed order of the features, starting with the less useful ones. The results for the initial and metadata classifiers are shown in Figs. 5 and 6, respectively.
Using these results, we eliminated a number of the least useful features \(f_0, f_1, \ldots f_{t}\), such that the performance of the classifier with the remaining features was similar to the performance of the classifier trained on the entire feature set. The final feature sets contain 53 and 51 features for the initial and metadata classifiers, respectively.
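The forward-addition part of this procedure can be sketched as follows. The tau scores and the cross-validated evaluation are assumed to be supplied externally; the sketch only shows the ordering and the one-by-one addition of features.

```java
// A sketch of the forward feature-addition procedure: features are ordered by
// their Goodman–Kruskal tau association with the labels and added one by one;
// after each addition a fivefold cross-validated F score is recorded.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ForwardSelection {

    interface Evaluator {
        // Mean F score of a fivefold cross-validation using the given feature subset.
        double crossValidatedFScore(List<Integer> featureIndices);
    }

    /** Returns the F score obtained after adding each feature, best feature first. */
    static double[] addFeaturesOneByOne(double[] tauScores, Evaluator evaluator) {
        int k = tauScores.length;
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < k; i++) order.add(i);
        // Sort feature indices by decreasing tau, i.e. the most associated feature first.
        order.sort(Comparator.comparingDouble((Integer i) -> tauScores[i]).reversed());

        List<Integer> selected = new ArrayList<>();
        double[] scores = new double[k];
        for (int step = 0; step < k; step++) {
            selected.add(order.get(step));
            scores[step] = evaluator.crossValidatedFScore(selected);
        }
        return scores;
    }
}
```

Running the same procedure with the reversed order corresponds to the second curve in Figs. 5 and 6.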
SVM parameters adjustment
The SVM parameters were also estimated using the zone validation dataset. The feature vectors were scaled linearly to the interval [0, 1] according to the bounds found in the learning samples. In order to find the best parameters for the classifiers, we performed a grid search over a three-dimensional space \(\langle K, \varGamma , C \rangle \), where K is a set of kernel function types (linear, fourth-degree polynomial, radial basis and sigmoid), \(\varGamma = \{2^i | i \in [-15,3]\}\) is the set of possible values of the kernel coefficient \(\gamma \), and \(C = \{2^i | i \in [-5,15]\}\) is the set of possible values of the penalty parameter. For every combination of the parameters, we performed a fivefold cross-validation. Finally, we chose the parameters for which we obtained the highest mean F score (calculated as an average over individual classes). We also used class weights based on the number of training samples in each class to set a larger penalty for less represented classes.
The parameters for the best obtained results are presented in Tables 3 and 4. In both cases, we chose the radial basis kernel function, and the chosen values of the C and \(\gamma \) parameters are \(2^5\) and \(2^{-3}\) in the case of the initial classifier and \(2^9\) and \(2^{-3}\) in the case of the metadata classifier.
Table 3 The results of SVM parameters searching for initial classification
Table 4 The results of SVM parameters searching for metadata classification
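For illustration, a minimal sketch of training such a classifier with the LibSVM Java API and the parameters reported for the initial classifier is given below. The feature scaling and the class-weight arrays are assumed to be prepared beforehand; this is a sketch under those assumptions, not the actual implementation.

```java
// A sketch of configuring and training a zone classifier with LibSVM's Java API,
// using the RBF kernel, C = 2^5 and gamma = 2^-3 (the values reported for the
// initial classifier). Class weights penalize errors on less represented classes.
import libsvm.svm;
import libsvm.svm_model;
import libsvm.svm_node;
import libsvm.svm_parameter;
import libsvm.svm_problem;

class ZoneClassifierTraining {

    static svm_model train(double[][] scaledFeatures, double[] labels,
                           int[] classLabels, double[] classWeights) {
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = Math.pow(2, 5);
        param.gamma = Math.pow(2, -3);
        param.cache_size = 100;
        param.eps = 0.001;
        // Larger penalties for less represented classes.
        param.nr_weight = classLabels.length;
        param.weight_label = classLabels;
        param.weight = classWeights;

        svm_problem problem = new svm_problem();
        problem.l = scaledFeatures.length;
        problem.y = labels;
        problem.x = new svm_node[problem.l][];
        for (int i = 0; i < problem.l; i++) {
            problem.x[i] = new svm_node[scaledFeatures[i].length];
            for (int j = 0; j < scaledFeatures[i].length; j++) {
                svm_node node = new svm_node();
                node.index = j + 1;                    // LibSVM indices are 1-based
                node.value = scaledFeatures[i][j];     // already scaled to [0, 1]
                problem.x[i][j] = node;
            }
        }
        return svm.svm_train(problem, param);
    }
}
```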
Metadata extraction
The purpose of this phase is to analyse the zones labelled as metadata and extract a rich set of the document's metadata, including: title, authors, affiliations, author–affiliation relations, email addresses, author–email relations, abstract, keywords, journal, volume, issue, pages range, year and DOI.
The phase contains two steps:
1. Metadata zone classification (B1)—assigning specific metadata classes to metadata zones, described in detail in Sect. 4.2.
2. Metadata extraction (B2)—extracting atomic information from labelled zones.
During the last step (B2), a set of simple heuristic-based rules is used to perform the following operations:
- zones labelled as abstract are concatenated,
- as the type is often specified just above the title, it is removed from the title zone if needed (based on a dictionary of types),
- author, affiliation and keyword lists are split with the use of a list of separators,
- affiliations are associated with authors based on indexes and distances,
- email addresses are extracted from correspondence and affiliation zones using regular expressions,
- email addresses are associated with authors based on author names,
- page ranges placed directly in bib_info zones are parsed using regular expressions,
- if no page range is given explicitly in the document, we also try to retrieve it from the page numbers on the individual pages,
- dates are parsed using regular expressions,
- journal, volume, issue and DOI are extracted from bib_info zones based on regular expressions (an illustrative sketch of such regular expressions follows this list).
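The sketch below gives two illustrative regular-expression rules of this kind, for e-mail addresses and explicitly given page ranges. The patterns are simplified assumptions for the example and are not the exact expressions used in the implementation.

```java
// Illustrative regular expressions for two of the rules above. The patterns are
// simplified; real rules would need to handle more formatting variants.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetadataRegexRules {

    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Matches ranges such as "123-145", "123–145" or "pp. 123-145".
    private static final Pattern PAGE_RANGE =
        Pattern.compile("(?:pp?\\.\\s*)?(\\d{1,5})\\s*[-–]\\s*(\\d{1,5})");

    static List<String> extractEmails(String zoneText) {
        List<String> emails = new ArrayList<>();
        Matcher m = EMAIL.matcher(zoneText);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    /** Returns {firstPage, lastPage} or null if no range is found. */
    static int[] extractPageRange(String bibInfoText) {
        Matcher m = PAGE_RANGE.matcher(bibInfoText);
        if (m.find()) {
            return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
        }
        return null;
    }
}
```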
Bibliography extraction
The goal of bibliography extraction is to extract a list of bibliographic references with their metadata (including author, title, source, volume, issue, pages and year) from zones labelled as references.
Bibliography extraction path contains two steps:
1. Reference strings extraction (C1)—dividing the content of references zones into individual reference strings.
2. Reference parsing (C2)—extracting metadata from reference strings.
Extracting reference strings
References zones contain a list of reference strings, each of which can span one or more text lines. The goal of reference strings extraction is to split the content of those zones into individual reference strings. This step utilizes unsupervised machine learning techniques, which allows us to omit the time-consuming training set preparation and learning phases, while achieving very good extraction results.
Every bibliographic reference is displayed in the PDF document as a sequence of one or more text lines. Each text line in a reference zone belongs to exactly one reference string; some of them are the first lines of their reference, and others are inner or last ones. The sequence of all text lines belonging to the bibliography section can therefore be described by a simple pattern: one or more groups, each consisting of the first line of a reference followed by zero or more continuation (inner or last) lines.
In order to group text lines into consecutive references, we first determine which lines are the first lines of their references. A set of such lines is presented in Fig. 7. To achieve this, we transform all lines to feature vectors and cluster them into two sets (first lines and all the rest). We make use of a simple observation: the first line of the whole references block is also the first line of its reference, and thus the cluster containing this line is assumed to contain all first lines. After recognizing all first lines, it is easy to concatenate the lines to form consecutive reference strings.
For clustering the lines, we use the k-means algorithm with the Euclidean distance metric. In this case \(K = 2\), since the line set is clustered into two subsets. As the initial centroids, we set the first line's feature vector and the vector with the largest distance to the first one. We use five features based on the relative line length, the line indentation, the space between the line and the previous one, and the text content of the line (whether the line starts with an enumeration pattern, whether the previous line ends with a dot).
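The following sketch shows this two-cluster k-means step. The five-element feature vectors are assumed to be computed elsewhere, with the first vector corresponding to the first line of the references block; the number of iterations is an illustrative parameter.

```java
// A sketch of 2-means clustering of reference lines: the initial centroids are
// the first line's vector and the vector farthest from it; the cluster that
// contains the very first line is taken as the "first lines" cluster.
import java.util.ArrayList;
import java.util.List;

class ReferenceLineClustering {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    /** Returns true for lines assigned to the "first line of a reference" cluster. */
    static boolean[] firstLineFlags(double[][] lineFeatures, int iterations) {
        int n = lineFeatures.length;
        // Initial centroids: the first line's vector and the vector farthest from it.
        double[] centroidFirst = lineFeatures[0].clone();
        double[] centroidOther = lineFeatures[0].clone();
        double farthest = -1;
        for (double[] vector : lineFeatures) {
            double d = distance(vector, lineFeatures[0]);
            if (d > farthest) { farthest = d; centroidOther = vector.clone(); }
        }

        boolean[] isFirst = new boolean[n];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: each line goes to the closer of the two centroids.
            List<double[]> firstGroup = new ArrayList<>();
            List<double[]> otherGroup = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                isFirst[i] = distance(lineFeatures[i], centroidFirst)
                          <= distance(lineFeatures[i], centroidOther);
                (isFirst[i] ? firstGroup : otherGroup).add(lineFeatures[i]);
            }
            // Update step: recompute each centroid as the mean of its group.
            centroidFirst = mean(firstGroup, centroidFirst);
            centroidOther = mean(otherGroup, centroidOther);
        }
        // The cluster that contains the very first line holds all first lines.
        if (!isFirst[0]) {
            for (int i = 0; i < n; i++) isFirst[i] = !isFirst[i];
        }
        return isFirst;
    }

    static double[] mean(List<double[]> group, double[] fallback) {
        if (group.isEmpty()) return fallback;
        double[] m = new double[group.get(0).length];
        for (double[] v : group) for (int i = 0; i < m.length; i++) m[i] += v[i];
        for (int i = 0; i < m.length; i++) m[i] /= group.size();
        return m;
    }
}
```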
Reference strings parsing
Reference strings extracted from references zones contain important reference metadata. In this step, the metadata is extracted from the reference strings and the result is the list of the document's parsed bibliographic references. The information we extract from the strings includes: author, title, source, volume, issue, pages and year. An example of a parsed reference is shown in Fig. 8.
First, a reference string is tokenized. The tokens are then transformed into vectors of features and classified by a supervised classifier. Finally, the neighbouring tokens with the same label are concatenated, the labels are mapped into the final metadata classes and the resulting reference metadata record is formed.
The heart of the implementation is a classifier that assigns labels to reference tokens. For better performance, the classifier uses slightly more detailed labels than the target ones: first_name (author's first name or initial), surname (author's surname), title, source (journal or conference name), volume, issue, page_first (the lower bound of the pages range), page_last (the upper bound of the pages range), year and text (for separators and other tokens without a specific label). The token classifier employs conditional random fields and is built on top of the GRMM and MALLET packages [35].
CRF classifiers are a state-of-the-art technique for citation parsing. They achieve very good results for classifying instances that form a sequence, especially when the label of one instance depends on the labels of previous instances.
The basic features are the tokens themselves. We use 42 additional features to describe the tokens:
- Some of them are based on the presence of a particular character class, e.g. digits or lowercase/uppercase letters.
- Others check whether the token is a particular character (e.g. a dot, a square bracket, a comma or a dash) or a particular word.
- Finally, we use features checking whether the token appears in a dictionary built from the dataset, e.g. a dictionary of cities or of words commonly appearing in journal titles.
It is worth noting that a token's label depends not only on its own feature vector, but also on the features of the surrounding tokens. To reflect this in the classifier, a token's feature vector contains not only the features of the token itself, but also the features of the two preceding and the two following tokens.
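A minimal sketch of assembling such a feature window is shown below. The per-token feature extraction and the example feature names are assumptions; the sketch only illustrates how features of the two preceding and two following tokens are attached, tagged with their offsets.

```java
// A sketch of the feature window described above: each token's representation
// includes its own features plus the features of the two preceding and two
// following tokens, prefixed with their relative offsets.
import java.util.ArrayList;
import java.util.List;

class TokenFeatureWindow {

    /** Per-token features, e.g. "ALL_DIGITS", "IS_COMMA", "IN_CITY_DICTIONARY". */
    interface FeatureExtractor {
        List<String> featuresOf(String token);
    }

    static List<List<String>> buildWindows(List<String> tokens, FeatureExtractor extractor) {
        List<List<String>> windows = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<String> window = new ArrayList<>();
            // Offsets -2..2: two preceding tokens, the token itself, two following tokens.
            for (int offset = -2; offset <= 2; offset++) {
                int j = i + offset;
                if (j < 0 || j >= tokens.size()) continue;
                for (String feature : extractor.featuresOf(tokens.get(j))) {
                    window.add("W" + offset + "_" + feature);
                }
            }
            windows.add(window);
        }
        return windows;
    }
}
```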
After token classification, fragments labelled as first_name and surname are joined together based on their order to form consecutive author names, and similarly fragments labelled as page_first and page_last are joined together to form page ranges. Additionally, in the case of title or source labels, the neighbouring tokens with the same label are concatenated.
The result of bibliography extraction is a list of document’s bibliographic references in a structured form, each of which contains the raw text as well as additional metadata.