1 Introduction

Historical documents are precious cultural heritage with important scientific and cultural value. The digitization of ancient manuscripts is an important means of document preservation and cultural heritage protection. However, manually processing such large volumes of documents is time-consuming, labor-intensive, and error-prone, so it is necessary to process historical manuscripts automatically with computers. Document analysis and recognition (DAR) systems have emerged for this purpose. A typical DAR system consists of image enhancement, segmentation, page layout analysis, optical character recognition (OCR), and indexing [1]. Document image binarization (also referred to as document image segmentation) is an important preprocessing step. It aims to segment the input document image into text (foreground) and non-text (background) pixels. The segmentation performance directly affects subsequent tasks in the DAR system.

Thresholding high-quality images is relatively simple, but binarizing historical document images is quite challenging because such images suffer from severe degradation, such as ink bleed-through, page stains, text stroke fading, and artifacts. In addition, variations in text stroke color, width, brightness, and connectivity in degraded handwritten manuscripts further increase the difficulty of binarization. Figure 1 presents several degraded historical document image samples from recent DIBCO (document image binarization competition) and H-DIBCO (handwritten document image binarization competition) benchmark datasets.

Fig. 1

Historical document image samples in recent DIBCO and H-DIBCO benchmark datasets

The DIBCO and H-DIBCO series (DIBCO 2009 [2], 2011 [3], 2013 [4], and 2017 [5] and H-DIBCO 2010 [6], 2012 [7], 2014 [8], 2016 [9], and 2018 [10]) reflect the latest progress in document image binarization. We have participated in these academic competitions since 2017. Our energy-based segmentation method achieved the best performance in the ICFHR 2018 competition on handwritten document image binarization [10], and the 2nd best performance in Challenge A of the ICFHR 2018 competition on document image analysis tasks for Southeast Asian palm leaf manuscripts [11]. Later, our newly developed binarization method based on D-LinkNet [12] achieved the best performance in the ICDAR 2019 time–quality binarization competition on photographed document images captured by Motorola Z1 and Galaxy Note4 with flash off, and the 2nd and 3rd best performances on binarization of photographed document images captured by the same mobile devices with flash on, respectively [13].

This paper presents our winning algorithm in the ICFHR 2018 competition on handwritten document image binarization (H-DIBCO 2018). The proposed method is based on background estimation and energy minimization. The estimation and compensation procedure eliminates most document degradation and helps extract text objects from complex document backgrounds in the subsequent energy-based segmentation stage.

Our contributions are twofold. First, we present a document image binarization method that achieves promising pixel-wise labeling results on various degraded historical document images. Second, we investigate a voting strategy to automatically determine the correct direction for the stroke width transform (SWT). SWT direction determination has so far received little attention, but if done well, it offers many advantages. Our strategy is simple and robust since users do not need to predefine the document type, e.g., dark text on a bright background or vice versa.

The rest of the paper is organized as follows. Section 2 reviews the related work on document image binarization. Section 3 introduces our proposed technique in detail. Section 4 presents the experimental results and discussion, and Section 5 concludes the paper.

2 Related work

A wide variety of document image binarization or segmentation methods have been proposed over the past few decades. They can be roughly divided into global thresholding, local thresholding, and hybrid methods [1, 14].

A global thresholding approach, e.g., Otsu’s method [15], computes a single optimal threshold for the entire image that maximizes the inter-class variance (or, equivalently, minimizes the intra-class variance) between text and non-text pixels. Global thresholding can provide satisfactory results when the image quality is good enough, that is, when the image histogram follows a bimodal distribution, but it generally fails on low-quality images.

A local thresholding approach adapts the threshold of each pixel to its neighborhood image features; for instance, Niblack’s [16], Sauvola’s [17], and Wolf’s [18] methods compute the threshold from the local mean and local standard deviation. In general, locally adaptive thresholding methods perform better than their global counterparts. However, their main disadvantage is that the thresholding performance depends heavily on the sliding window size and hence on the text stroke width.
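For illustration, a minimal Python sketch of Sauvola's thresholding is shown below. The formula T(x, y) = m(x, y) · [1 + k · (s(x, y)/R − 1)] is standard, but the window size and the k and R values here are common defaults rather than the settings used by the methods compared later in this paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_binarize(gray, window=25, k=0.2, R=128.0):
    """Sauvola's locally adaptive thresholding.

    T(x, y) = m(x, y) * (1 + k * (s(x, y) / R - 1)),
    where m and s are the local mean and standard deviation
    over a window x window neighborhood.
    """
    gray = gray.astype(np.float64)
    mean = uniform_filter(gray, size=window)
    mean_sq = uniform_filter(gray * gray, size=window)
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
    threshold = mean * (1.0 + k * (std / R - 1.0))
    # Text (foreground) pixels are assumed darker than the local threshold.
    return gray <= threshold
```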

Ntirogiannis et al. [19] propose a hybrid method. First, Niblack’s method is used for document background estimation via image inpainting, and image normalization is then adopted to compensate for background variations. Otsu’s method is applied to the compensated image to remove background noise. Meanwhile, Niblack’s method is also performed on the normalized image to detect faint characters and estimate the text stroke width. The two binarization results are finally combined at the connected component level.

Su et al. [20] also present a combined framework, which integrates several existing document binarization methods to achieve better segmentation. This method divides image pixels into three groups, i.e., foreground, background, and uncertain pixels. Based on preselected foreground and background pixels, a classifier is then applied to iteratively classify the uncertain pixels as foreground or background.

In the rest of this section, we classify other document image binarization methods into the following categories:

2.1 Contrast or edge-based segmentation

Early studies of document image binarization are often based on edge detection. Lu et al. [21] present a segmentation technique using background estimation and stroke edges (BESE). This method first uses two one-dimensional polynomial smoothing procedures to estimate the document background, and then detects text stroke edges from the compensated document image based on the L1-norm image gradient. Finally, text strokes can be extracted based on the detected stroke edge pixels. Lu and Tan [22] also studied a similar technique for document background estimation via two-dimensional polynomial smoothing.

Su et al. [23] propose a binarization technique for historical document images. It first uses the local maximum and minimum (LMM) to construct a contrast image, and high-contrast pixels are then extracted using Otsu’s method. The document text pixels can then be segmented based on the detected high-contrast pixels, which are located near the text strokes. Later, Su et al. [24] present a degraded document image binarization method based on adaptive image contrast, which is a combination of LMM and the local gradient. The adaptive contrast image is first binarized by Otsu’s method. The resulting binary contrast map is then combined with the Canny edge map to produce text stroke edges. Finally, the text pixels are extracted based on the detected stroke edge pixels.

Jia et al. [25] present a document image binarization method based on structural symmetric pixels (SSPs), which are located along text stroke edges, and can be extracted from those with large gradient values and opposite gradient directions. Finally, a voting framework based on multiple local thresholds is adopted to further determine whether each pixel is text or non-text.

2.2 Energy-based segmentation

Markov random fields (MRFs) [26] and conditional random fields (CRFs) [27] are widely used in degraded document image binarization. Howe [28, 29] presents an energy-based segmentation that uses graph cut optimization [30] to minimize an objective function combining the Laplacian operator and the Canny edge detector. A fast implementation of Howe’s binarization method on a heterogeneous computing platform is presented by Westphal et al. [31].

Mesquita et al. [32] present a document image binarization algorithm based on the perception of objects by distance (POD). k-means clustering and Otsu’s thresholding are combined in the classification process. Later, Mesquita et al. [33] adopt the POD (with parameters tuned by I/F-Race) as a preprocessing stage of Howe’s binarization method.

Kligler et al. [34] propose a document enhancement technique based on visibility detection. The main idea of the algorithm is to convert an image to a 3D point cloud. The classification of visible and invisible points provides guidance for background and foreground segmentation.

Another approach based on energy minimization, the active contour model or snakes, has also been used for document image binarization. Rivest-Hénault et al. [35] propose a local linear level set framework, and Hadjadj et al. [36] present a technique based on active contour evolution. The snakes model generally has a high time complexity and a tendency to fall into the nearest local minimum.

The method proposed in this paper belongs to this category. Like Mesquita’s and Kligler’s approaches, we integrate Howe’s energy-based segmentation technique into our framework, but with several subtle and important differences described in Subsection 3.3. Document background estimation and compensation are performed in the preprocessing stage, while de-noising and text stroke preservation are performed in the post-processing stage.

2.3 Statistical learning-based segmentation

Chen et al. [37] propose a parallel non-parametric binarization framework for degraded document images. This method first uses Sauvola’s method with different parameters to generate many binary images. A support vector machine (SVM) is then used to recognize each pixel of those binarized images. Finally, the resulting binary image is reconstructed based on the recognition result.

After conducting local contrast enhancement, Xiong et al. [38] divide the document image into 5 × 5 blocks, and then adopt an SVM classifier to choose an optimal global threshold for each block. The document image is further segmented by Wolf’s method to eliminate noise near text stroke edges.

Bhowmik et al. [39] present a game theory inspired binarization (GiB) technique for degraded document images. It first extracts pixel-level features using neighbor’s collision, and then classifies each pixel as either text or non-text using the k-means clustering method.

In general, the main drawback of statistical learning-based methods is that only handcrafted features are used to obtain segmentation results. It is therefore difficult to design representative features for different applications; manually designed features that work well for one type of image often fail on another.

2.4 Deep learning-based segmentation

Pastor-Pellicer et al. [40] explore the use of convolutional neural networks (CNNs) for document image binarization. Their network includes several convolutional and subsampling layers followed by multilayer perceptron (MLP) layers, and classifies each pixel as foreground or background from a sliding window.

Tensmeyer and Martinez [41] present a multi-scale fully convolutional network (FCN) that combines F-measure and pseudo F-measure losses for document image binarization tasks. The raw gray scale, Howe’s binarization, and relative darkness (RD) features are concatenated and fed into the networks.

Vo et al. [42] propose a supervised binarization for historical document images based on hierarchical deep supervised networks (DSNs). By extracting both low-level and high-level features, the networks can differentiate text pixels from background noise, and thus can deal with severe degradations occurring in document images.

Calvo-Zaragoza and Gallego [43] choose the residual encoder-decoder network (RED-Net) [44] as the backbone of their selectional auto-encoder (SAE) architecture for document image binarization. The encoder contains 5 convolution layers, while the decoder contains 5 transposed convolution layers, each with a stride value of 2. The RED-Net has an input image patch size of 256 × 256, the number of filters in the first layer is 64, and the kernel size of all layers is 7 × 7.

Bezmaternykh et al. [45] present a historical document image binarization method based on U-Net [46], originally designed for biomedical image segmentation. The U-Net architecture is derived from FCNs, but has been modified and extended to work with fewer training images and produce more accurate segmentations.

Zhao et al. [47] formulate binarization as an image-to-image generation task, using conditional generative adversarial networks (cGANs) to solve multi-scale information combination problems in binarization tasks.

It is worth mentioning that deep learning is a subset of machine learning that combines feature learning and metric learning in a deep network model. Metric learning aims to reduce the distance between features of similar samples while increasing the distance between features of dissimilar samples. The intrinsic difference between deep learning-based methods and other approaches is that the former can be trained to automatically extract both fine-grained, shallow, low-level visual features and coarse-grained, deep, high-level semantic features, whereas the latter rely solely on handcrafted or manually designed features and require no training.

3 Motivation and proposed method

Figure 2 depicts our proposed binarization framework for degraded historical document images based on background estimation and energy minimization. It consists of three main steps. First, we apply morphological operations on gray scale images to estimate and compensate document background. It utilizes a disk-shaped structuring element, whose radius is estimated by a minimum entropy-based stroke width transform (SWT). Second, we perform the Laplacian energy-based segmentation on the compensated document images. Finally, we implement post-processing to preserve text stroke connectivity and eliminate isolated noise.

Fig. 2

Flow chart diagram of the proposed binarization framework

The motivations behind this method are as follows. First, historical document images generally contain severe degradation, such as page stains, ink bleed-through, text stroke fading, and artifacts, which hinders the correct extraction of text pixels. The document background estimation and compensation technique can effectively eliminate the impact of these degradation factors. Second, inspired by image information entropy, the minimum entropy-based SWT can automatically detect the document image type, for instance, dark text on a bright background or bright text on a dark background. Third, graph cut is a family of MRF inference algorithms that use max-flow/min-cut to solve discrete energy minimization problems; it has been widely used in many image analysis tasks, such as image restoration and reconstruction, edge detection, texture segmentation, optical flow, and stereo vision. Last but not least, we combine preprocessing and post-processing, a combination that has proven highly effective for document image binarization.

3.1 Stroke width transform with minimum entropy

Text stroke width is a crucial attribute that can distinguish true text from possible degradation. Most previous approaches perform locally adaptive thresholding (e.g., Sauvola’s or Niblack’s method) on a given document image, and then estimate the text stroke width from the resulting binary image. We take a different approach and use the Canny edge detector to generate text edge maps that capture the main edge features of the input image while suppressing irrelevant details such as the various types of degradation described above.

We first apply a Gaussian filter with a sigma value σ = 1 to the gray scale document image, and then compute the magnitude \( {\left({g}_x^2+{g}_y^2\right)}^{\frac{1}{2}} \) and direction \( {\tan}^{-1}\left(\frac{g_y}{g_x}\right) \) of the local gradient at each pixel. An edge pixel is defined as a local maximum of the gradient magnitude and is determined by non-maximum suppression along the gradient direction. The algorithm uses hysteresis thresholding with two thresholds (\( t_{\mathrm{high}} \) and \( t_{\mathrm{low}} \)) to preserve true edges. Based on the observation that true text edges often have higher contrast than possible degradation, the following parameter settings are used as reasonable defaults in our implementation: \( t_{\mathrm{high}} \) = 0.4 and \( t_{\mathrm{low}} \) = 0.
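As an illustration, the following Python sketch reproduces this edge detection step with scikit-image. Our reference implementation is in Matlab, and threshold conventions differ between libraries (Matlab normalizes the hysteresis thresholds by the maximum gradient magnitude), so the values below may require rescaling.

```python
from skimage import color, feature, io, util

def text_edge_map(image_path, sigma=1.0, t_low=0.0, t_high=0.4):
    """Canny text edge map with the parameter settings described above."""
    img = io.imread(image_path)
    gray = color.rgb2gray(img) if img.ndim == 3 else util.img_as_float(img)
    # Gaussian smoothing (sigma), gradient computation, non-maximum
    # suppression, and hysteresis thresholding are all handled inside
    # skimage's Canny implementation.
    return feature.canny(gray, sigma=sigma,
                         low_threshold=t_low, high_threshold=t_high)
```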

Figure 3 displays the text edge maps of the sample document images in Figs. 1e and d, constructed using the Canny edge detector with and without our parameter settings, respectively. It can be seen from the figures that the text edge maps obtained with the default “canny” options in Matlab, i.e., without parameter tuning, contain a large number of non-text edges, whereas the Canny edge detector with our specified parameter settings produces much cleaner text edge maps.

Fig. 3

Text edge maps constructed using Canny edge detector with and without our parameter settings

After Canny edge detection, we can estimate the text stroke width from the detected text edge pixels and the directed edge gradients. It has been observed that (1) the text stroke width or its mathematical expectation remains nearly constant throughout individual characters, and (2) the gradient direction of each edge pixel is approximately perpendicular to the direction of the text stroke. Therefore, the text stroke width can be estimated along the gradient or counter-gradient directions.

The proposed technique adopts an idea similar to the stroke width transform presented in [48], but deviates from the original in several ways. If text pixels are darker than background pixels, as illustrated in Fig. 4a, the edge gradients point to the exterior of the strokes, as shown in Fig. 4b; the search path is therefore opposite to the gradient direction. However, if text pixels are brighter than background pixels, the edge gradients point to the interior of the strokes, and the search path follows the gradient direction. To detect either dark text on a bright background or vice versa, the algorithm can be executed twice in parallel on the same image, but the original paper does not explain how this should be implemented.
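The sketch below illustrates the ray-casting step of the stroke width transform under simplifying assumptions (a fixed maximum ray length and no median filtering of ray values, which the original SWT paper applies in a second pass); it is meant to show the mechanism rather than reproduce our exact implementation.

```python
import numpy as np

def stroke_width_transform(gray, edges, follow_gradient=False, max_ray=50):
    """Simplified SWT: from every edge pixel, march along the gradient
    (bright text on dark background) or against it (dark text on bright
    background, follow_gradient=False) until an opposing edge is hit,
    then record the ray length as the stroke width of all ray pixels."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy) + 1e-12
    dx, dy = gx / mag, gy / mag
    sign = 1.0 if follow_gradient else -1.0

    h, w = gray.shape
    swt_map = np.full((h, w), np.inf)
    for y0, x0 in zip(*np.nonzero(edges)):
        ray = [(y0, x0)]
        for step in range(1, max_ray):
            x = int(round(x0 + sign * dx[y0, x0] * step))
            y = int(round(y0 + sign * dy[y0, x0] * step))
            if not (0 <= y < h and 0 <= x < w):
                break
            if (y, x) == (y0, x0):
                continue
            ray.append((y, x))
            if edges[y, x]:
                # Accept the ray only if the far gradient is roughly
                # antiparallel (within ~pi/6) to the starting gradient.
                if dx[y0, x0] * dx[y, x] + dy[y0, x0] * dy[y, x] < -np.cos(np.pi / 6):
                    width = float(np.hypot(x - x0, y - y0))
                    for yy, xx in ray:
                        swt_map[yy, xx] = min(swt_map[yy, xx], width)
                break
    swt_map[np.isinf(swt_map)] = 0.0   # background pixels carry width 0
    return swt_map
```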

Fig. 4

Implementation of stroke width transform (SWT) and SWT direction determination based on minimum entropy

Figure 4c and d illustrate the resulting stroke width transform along the gradient direction, referred to as SWTgrad, and along the counter-gradient direction, referred to as SWTcnt-grad, respectively. Each color in the SWTgrad or SWTcnt-grad image corresponds to a specific stroke width (black corresponds to background), so pixels with the same stroke width are shown in the same color. We can see that SWTcnt-grad in this example is more compact and contains fewer colors than SWTgrad. This is reasonable since SWTcnt-grad corresponds to the true text regions with a uniform stroke width distribution, whereas SWTgrad corresponds to non-text regions with randomly distributed “strokes”.

Inspired by the above observations, we propose a minimum entropy-based technique to help determine the correct SWT direction. Specifically, the entropy S is defined as a logarithmic measure of the number of connected components with significant probability of being occupied:

$$ S=-{s}_w\sum \limits_i{p}_i\log {p}_i $$
(1)

where \( p_i \) is the inverse of the number of connected components, and \( s_w \) equals the mathematical expectation of the stroke widths in the corresponding SWT image. The summation is over all connected components of the SWT image. We modify the conventional connected component labeling algorithm [49] by converting the association rule from a binary mask to a predicate that compares the SWT values of neighboring pixels. Two adjacent pixels with similar stroke width values belong to the same connected component; empirically, two neighboring pixels are grouped into the same component if the ratio of their SWT values does not exceed 3. This local rule ensures that strokes with smoothly varying width are grouped into the same component, which makes the method robust to various text sizes, fonts, and orientations.

As mentioned before, we perform the SWT algorithm twice in parallel, once along the gradient direction and once along the counter-gradient direction, and then compute the two entropies \( S_{\mathrm{grad}} \) and \( S_{\mathrm{cnt}\hbox{-}\mathrm{grad}} \). We vote to determine the correct SWT direction as follows:

$$ {\mathrm{SWT}}_{\mathrm{correct}\hbox{-} \mathrm{direction}}=\arg \min \left\{{S}_{\mathrm{grad}},{S}_{\mathrm{cnt}\hbox{-} \operatorname{grad}}\right\} $$
(2)

Consequently, the minimum entropy corresponds to the correct SWT direction, and in our implementation, the text stroke width is computed as the average of the corresponding non-zero stroke widths.
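A minimal sketch of Eqs. (1) and (2) follows, assuming the two SWT maps have already been computed (e.g., with the ray-casting sketch above). Connected components are built with a simple union-find that merges 4-connected pixels whose SWT ratio does not exceed 3, and with \( p_i \) equal to 1/N the entropy reduces to \( s_w \log N \).

```python
import numpy as np

def _find(parent, a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]   # path halving
        a = parent[a]
    return a

def swt_entropy(swt_map, ratio=3.0):
    """Entropy of Eq. (1) with p_i = 1/N, which reduces to S = s_w * log(N),
    where N is the number of stroke components and s_w the mean stroke width."""
    h, w = swt_map.shape
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            a = swt_map[y, x]
            if a <= 0:
                continue
            for dy, dx in ((0, 1), (1, 0)):              # 4-connectivity
                yy, xx = y + dy, x + dx
                if yy < h and xx < w and swt_map[yy, xx] > 0:
                    b = swt_map[yy, xx]
                    if max(a, b) / min(a, b) <= ratio:   # similar widths -> merge
                        ra, rb = _find(parent, y * w + x), _find(parent, yy * w + xx)
                        if ra != rb:
                            parent[rb] = ra
    stroke = swt_map > 0
    if not stroke.any():
        return 0.0
    roots = {_find(parent, y * w + x)
             for y in range(h) for x in range(w) if stroke[y, x]}
    s_w = float(swt_map[stroke].mean())
    return s_w * float(np.log(len(roots)))

def choose_swt_direction(swt_grad, swt_cnt_grad):
    """Eq. (2): keep the SWT image whose entropy is minimal."""
    return swt_grad if swt_entropy(swt_grad) <= swt_entropy(swt_cnt_grad) else swt_cnt_grad
```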

3.2 Document background estimation and compensation

Mathematical morphology has been used for background estimation. Yang et al. [50] compare various segmentation and background estimation methods used on cDNA microarray data. They find that morphological opening provides a more reliable estimation of background than other methods. Su et al. [23] adopt gray scale morphological dilation and erosion, which are referred to as maximum and minimum in the original paper, to eliminate the document background. Mesquita et al. [32, 33] make use of morphological closing and image resizing to estimate the document background based on POD, initially proposed by Mello [51].

We implement mathematical morphology for document background estimation and compensation in a different way. The document background is estimated using gray scale morphological opening or closing according to the previous SWT direction determination. The structuring element is a disk-shaped mask. If text pixels are darker than the document background, as illustrated in Fig. 5a, morphological closing is used, as illustrated in Fig. 5b. If text pixels are brighter than the background pixels, morphological opening is used instead. In this context, the morphological closing operator suppresses dark details that are smaller than the structuring element, while the morphological opening operator suppresses bright details that are smaller than the structuring element. Therefore, the size of the structuring element should be larger than the text stroke width; the radius parameter settings are discussed in Subsection 4.3.

Fig. 5

Document background estimation and compensation

Morphological closing or opening produces a reasonable document background estimate for the entire image, and we then perform a morphological bottom-hat or top-hat transform to emphasize the text area and attenuate the document background. The bottom-hat transform, as illustrated in Fig. 5c, is defined as the difference between the closing and the gray scale image, while the top-hat transform is the difference between the gray scale image and its opening. The two transforms produce equivalent difference images for the two document types, so the subsequent stages no longer need to distinguish between them.

In the difference image, background pixels are those whose intensity values equal 0. We convert the background pixels to white (assigning them the maximum pixel value of 255), as illustrated in Fig. 5d, and then apply the gray scale image complement to the remaining pixels (subtracting them from the maximum pixel value), as illustrated in Fig. 5e. Finally, we adjust the image contrast so that 1% of the image data is saturated at low and high intensities. Figure 5f depicts the result of this preprocessing stage applied to the original image in Fig. 5a. The document background has been compensated, and the contrast between foreground and background pixels has also been enhanced.
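The preprocessing stage described above can be sketched as follows for the dark-text-on-bright-background case (closing and bottom-hat). The 3.5× stroke width radius anticipates Subsection 4.3, and the saturation stretch is approximated by percentile clipping; both are illustrative choices rather than a literal transcription of our Matlab code.

```python
import cv2
import numpy as np

def compensate_background(gray, stroke_width, dark_text=True):
    """Background estimation and compensation for an 8-bit gray image."""
    radius = max(1, int(round(3.5 * stroke_width)))
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                   (2 * radius + 1, 2 * radius + 1))
    if dark_text:
        background = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, se)
        diff = cv2.subtract(background, gray)          # bottom-hat transform
    else:
        background = cv2.morphologyEx(gray, cv2.MORPH_OPEN, se)
        diff = cv2.subtract(gray, background)          # top-hat transform
    # Background pixels (diff == 0) become white; the rest are complemented.
    out = np.where(diff == 0, 255, 255 - diff).astype(np.uint8)
    # Contrast stretch with roughly 1% of pixels saturated at each end.
    lo, hi = np.percentile(out, (1, 99))
    out = np.clip((out.astype(np.float32) - lo) * 255.0 / max(hi - lo, 1), 0, 255)
    return out.astype(np.uint8)
```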

3.3 Laplacian energy-based segmentation

Once the document background estimation and compensation is completed, we perform Markov random field (MRF)-based image segmentation. MRF models have been widely used to solve both low-level and high-level vision problems, including document image binarization [26, 52, 53].

Howe’s methods [28, 29] regard document image binarization as a max-flow/min-cut graph optimization problem [30]. The unary terms are determined by the Laplace operator, and the pairwise terms are given by the Canny edge detector. The exact optimal solution is obtained by finding the maximum flow on a specially constructed graph defined by the energy function.

Due to the superior performance of Howe’s binarization method [28], we decide to integrate it as part of our framework, but there are several subtle and important differences. Given an h × w gray scale image I, the quadratic pseudo-Boolean energy function can be defined as

$$ {\mathrm{\mathcal{E}}}_I(B)=\sum \limits_{i=0}^h\sum \limits_{j=0}^w\left[{L}_{ij}^0{\overline{B}}_{ij}+{L}_{ij}^1{B}_{ij}\right]+\sum \limits_{i=0}^{h-1}\sum \limits_{j=0}^w{C}_{ij}^{\mathrm{H}}\left({B}_{ij}\ne {B}_{i+1,j}\right)+\sum \limits_{i=0}^h\sum \limits_{j=0}^{w-1}{C}_{ij}^{\mathrm{V}}\left({B}_{ij}\ne {B}_{i,j+1}\right) $$
(3)

where \( {\overline{B}}_{ij}=1-{B}_{ij} \) is the negation of \( {B}_{ij}\in \left\{0,1\right\} \) at (i, j), \( {L}_{ij}^0 \) and \( {L}_{ij}^1 \) denote the costs of assigning label 0 or 1 to each pixel, and \( {C}_{ij}^{\mathrm{H}} \) and \( {C}_{ij}^{\mathrm{V}} \) denote the costs of a label mismatch between \( {B}_{ij} \) and its horizontal or vertical neighbors, respectively.

The unary potentials \( {L}_{ij}^0 \) and \( {L}_{ij}^1 \) are given by the Laplace operator \( {\nabla}^2 \):

$$ {L}_{ij}^0={\nabla}^2{I}_{ij} $$
(4)
$$ {L}_{ij}^1=\left\{\begin{array}{ll}\phi, & \mathrm{background}\ \mathrm{with}\ \mathrm{high}\ \mathrm{confidence}\\ {}-{\nabla}^2{I}_{ij},& \mathrm{otherwise}\end{array}\right. $$
(5)

For background pixels identified with high confidence in the previous stage, \( {L}_{ij}^1 \) is set to a negative constant ϕ, whose magnitude is twice the maximum pixel value in our implementation.

The pairwise potentials \( {C}_{ij}^{\mathrm{H}} \) and \( {C}_{ij}^{\mathrm{V}} \) are given by the Canny edge detector:

$$ {C}_{ij}^{\mathrm{H}}=\left\{\begin{array}{ll}0,& {E}_{ij}\wedge \left({I}_{i-1,j}\ge {I}_{ij}\right)\vee {E}_{ij}\wedge \left({I}_{ij}<{I}_{i+1,j}\right)\\ {}\psi, & \mathrm{otherwise}\end{array}\right. $$
(6)
$$ {C}_{ij}^{\mathrm{V}}=\left\{\begin{array}{ll}0,& {E}_{ij}\wedge \left({I}_{i,j-1}\ge {I}_{ij}\right)\vee {E}_{ij}\wedge \left({I}_{ij}<{I}_{i,j+1}\right)\\ {}\psi, & \mathrm{otherwise}\end{array}\right. $$
(7)

where \( E_{ij} \) denotes the Canny edge map, and non-edge pixels with a label mismatch are assigned a positive constant ψ. We have noticed that, among all the parameters, the high threshold \( t_{\mathrm{high}} \) and the neighbor mismatch penalty ψ are the most important; we therefore follow the automatic parameter tuning strategy proposed in [29].
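The sketch below assembles the energy of Eqs. (3)–(7) and minimizes it with the PyMaxflow wrapper of the Boykov–Kolmogorov algorithm. This is an illustrative reconstruction (our implementation is in Matlab): φ and ψ are left as free parameters, and the unary costs are shifted per pixel so that terminal capacities are non-negative, which changes the energy only by a constant.

```python
import numpy as np
import maxflow                         # pip install PyMaxflow
from scipy.ndimage import laplace

def graphcut_binarize(img, edges, high_conf_bg, phi, psi):
    """Laplacian energy-based segmentation, a sketch of Eqs. (3)-(7).

    img          : compensated gray scale image (float array)
    edges        : boolean Canny edge map E
    high_conf_bg : boolean mask of high-confidence background pixels
    phi          : constant label-1 cost on high_conf_bg (Eq. (5))
    psi          : neighbor mismatch penalty (Eqs. (6)-(7))
    """
    img = img.astype(np.float64)
    L0 = laplace(img)                  # cost of label 0, Eq. (4)
    L1 = -laplace(img)                 # cost of label 1, Eq. (5)
    L1[high_conf_bg] = phi

    # Pairwise costs, Eqs. (6)-(7): the penalty psi is dropped across Canny
    # edges when the intensity ordering matches (np.roll wraps at the image
    # border, which is ignored in this sketch).
    up, down = np.roll(img, 1, axis=0), np.roll(img, -1, axis=0)
    left, right = np.roll(img, 1, axis=1), np.roll(img, -1, axis=1)
    CH = np.where(edges & ((up >= img) | (img < down)), 0.0, psi)
    CV = np.where(edges & ((left >= img) | (img < right)), 0.0, psi)

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(img.shape)
    # C^H couples (i, j) with (i+1, j); C^V couples (i, j) with (i, j+1).
    g.add_grid_edges(nodes, weights=CH, symmetric=True,
                     structure=np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]]))
    g.add_grid_edges(nodes, weights=CV, symmetric=True,
                     structure=np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]]))
    # Shift both unary costs by their per-pixel minimum to keep the
    # terminal capacities non-negative.
    shift = np.minimum(L0, L1)
    g.add_grid_tedges(nodes, L1 - shift, L0 - shift)
    g.maxflow()
    return g.get_grid_segments(nodes)  # True where label 1 is assigned
```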

3.4 Post-processing

After obtaining the segmentation based on Laplacian energy minimization, we proceed to the post-processing stage to preserve text stroke connectivity and eliminate isolated noise. Our post-processing algorithm is described in detail below.

3.4.1 Step 1

We perform foreground connected component analysis (CCA) to eliminate isolated noise from the document background. The CCA operator scans the binary image and examines each foreground connected component. When it encounters an unlabeled foreground pixel p, it uses the flood fill algorithm [54] to label all other pixels in the connected component that contains p. After completing the scan, we count the number of pixels in each connected component and generate a binary image \( B_1 \) that retains only the components whose area is greater than \( t_{\mathrm{noise}} \), where \( t_{\mathrm{noise}} \) is the area threshold for isolated noise.

3.4.2 Step 2

We perform background CCA to fill small holes in text strokes, which helps preserve text stroke connectivity. Using the same CCA framework as in Step 1, we first apply the binary image complement to \( B_1 \), and then follow Step 1 to generate a new binary image \( B_2 \) that retains only the background components whose area is less than \( t_{\mathrm{hole}} \), where \( t_{\mathrm{hole}} \) is the area threshold for text stroke holes. The resulting binary image B is

$$ B={B}_1\vee {B}_2 $$
(8)
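Both steps map directly onto standard connected component utilities; a compact sketch using scikit-image is given below, with hypothetical area thresholds t_noise and t_hole.

```python
from skimage.morphology import remove_small_holes, remove_small_objects

def postprocess(binary, t_noise=10, t_hole=10):
    """Step 1: drop foreground components smaller than t_noise pixels.
    Step 2: fill background holes smaller than t_hole pixels, which is
    equivalent to B = B1 OR B2 in Eq. (8)."""
    b1 = remove_small_objects(binary.astype(bool), min_size=t_noise)
    return remove_small_holes(b1, area_threshold=t_hole)
```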

4 Experimental results and discussion

We have conducted extensive experiments to evaluate the performance of our proposed framework. In this section, we first determine the size of the disk-shaped morphological structuring element, and then quantitatively compare our binarization method with other state-of-the-art segmentation techniques on recent DIBCO and H-DIBCO benchmark datasets.

4.1 Datasets

This study uses nine document image binarization competition datasets from 2009 to 2018, namely the DIBCO 2009 [2], 2011 [3], 2013 [4], and 2017 [5] and H-DIBCO 2010 [6], 2012 [7], 2014 [8], 2016 [9], and 2018 [10] benchmark datasets, covering 31 machine-printed and 85 handwritten document images as well as their corresponding ground truth (GT) images. The historical documents in these datasets originate from the Recognition and Enrichment of Archival Documents (READ) project, which contains a variety of collections from the 15th to the 19th century. The datasets contain representative document degradation, such as ink bleed-through, page smudges, text stroke fading, background texture, and artifacts.

4.2 Evaluation metrics

We adopt the evaluation measures used in recent international document image binarization competitions, including FM (F-measure), pFM (pseudo F-measure), PSNR (peak signal-to-noise ratio), NRM (negative rate metric), DRD (distance reciprocal distortion), and MPM (misclassification penalty metric). The first two metrics, FM and pFM, reach their best value at 1 and their worst at 0. The PSNR measures how close a binary image is to the GT image, so a higher PSNR value is better. In contrast to the first three metrics, lower NRM, DRD, and MPM values indicate better binarization performance. Due to space limitations, we omit the definitions of these evaluation measures; readers may refer to [6, 10] for more information.

4.3 Comparison results on the size of morphological structuring elements

This experiment demonstrates how to determine the size of the morphological structuring element, which is an essential part of the morphological operations. In the document background estimation and compensation stage, we perform a morphological closing or opening operation with a disk-shaped structuring element to estimate the document background, and then perform a morphological bottom-hat or top-hat transform to emphasize the text area and attenuate the document background.

As mentioned in Subsection 3.2, the size of the disk-shaped structuring element should be no smaller than the width of text strokes. Therefore, we set different radius values to obtain different sizes of disk-shaped structuring elements, and then evaluate the binarization performance of our proposed method on the DIBCO 2009 and H-DIBCO 2010 benchmark datasets.

Figure 6 compares the FM of our proposed technique as the radius increases from 1 to 5 times the estimated text stroke width. As can be seen from the figure, the FM becomes stable on both datasets once the radius reaches about twice the estimated text stroke width. In our implementation, we therefore set the radius to about 3.5 times the estimated text stroke width, as this gives the best F-measure score on both datasets.

Fig. 6

F-measure vs. radius parameter settings over DIBCO 2009 and H-DIBCO 2010 datasets (SWE denotes the stroke width estimation)

4.4 Comparison results on the DIBCO and H-DIBCO benchmark datasets

In the first experiment, we quantitatively compared the proposed method with the methods that achieved TOP 3 performance in the annual document image binarization competitions during 2009–2018. The evaluation results are provided in Table 1; the results for the TOP 3 winners of each year are copied from the corresponding competition reports [2,3,4,5,6,7,8,9,10]. Readers may also refer to these competition reports for a brief description of each winning method involved in this experiment. From the data in Table 1, we can see that our proposed method achieves the best performance in almost all the evaluation measures, except for DIBCO 2017, in which the TOP 3 winners are all based on deep learning architectures.

Table 1 Performance evaluation results of our proposed method against the TOP 3 winners of the annual DIBCO or H-DIBCO competitions (best results highlighted in bold)

It is worth noting that our proposed method is based on graph cut, an efficient and powerful graph-based segmentation technique that predates the deep learning era. Its energy consists of two main parts: the data part, which measures the consistency of the image data (i.e., image features) within the segmented region, and the regularization part, which smooths the boundaries of the segmented region by exploiting the spatial information of the image. Graph cut segments the image by minimizing the energy of the constructed graph given a set of seeds (e.g., text stroke edges and document background), and no training is required. Deep learning-based network models, in contrast, follow a hierarchical architecture. Images or patches are fed into the network, and features are extracted by different layers: shallow layers extract fine-grained low-level visual features, which capture minor details of the input such as edges and blobs, while deep layers extract coarse-grained high-level semantic features, which are more abstract and built on top of the low-level features to detect or recognize objects. Although the FM, pFM, PSNR, and DRD metrics of the proposed method are slightly worse than or comparable to those of the TOP 3 winners in the DIBCO 2017 competition, these results still illustrate that our proposed method can segment text pixels and preserve text strokes well without any training.

In the second experiment, we also quantitatively compared our proposed method with Otsu’s global thresholding method [15], locally adaptive thresholding methods (e.g., Niblack’s [16], Sauvola’s [17], and Wolf’s [18]), contrast or edge-based segmentation methods (e.g., Lu’s BESE [21], Su’s LMM [23, 24], and Jia’s SSP [25]), energy-based segmentation methods (e.g., Howe’s [28, 29], Mesquita’s [33], and Kligler’s [34]), Bhowmik’s game theory-inspired binarization [39], as well as deep learning-based segmentation methods (e.g., Tensmeyer’s FCN [41], Vo’s DSN [42], Gallego’s SAE [43], and Zhao’s cGAN [47]) on all nine DIBCO and H-DIBCO testing datasets. The code for each method involved in this comparison was provided by the original author(s), and the evaluation results are listed in Table 2. The first, second, and third best results for each evaluation measure are bolded in red, green, and blue, respectively. As can be seen from the table, our proposed method outperforms all other non-deep learning-based approaches, and is even comparable to several state-of-the-art deep learning-based techniques. This also implies that the proposed method is robust to various types and levels of document degradation, and preserves text strokes better.

Table 2 Performance evaluation results of our proposed method against the state-of-the-art techniques on the nine DIBCO and H-DIBCO testing datasets (the first, second, and third best results in bold red, green, and blue font, respectively)

Figures 7 and 8 display two sample images (P15 in the DIBCO 2017 and H06 in the H-DIBCO 2018 datasets) and the binary images generated by selected comparison methods. As can be seen from the figures, Otsu’s [15] global thresholding and Niblack’s [16] local thresholding methods generally fail to produce reasonable results. Sauvola’s [17] and Wolf’s [18] locally adaptive thresholding methods tend to remove too many text strokes. Lu’s BESE [21], Su’s LMM [23, 24], as well as Howe’s [28, 29], Mesquita’s [33], and Kligler’s [34] energy-based methods fail to extract low-contrast text strokes. Compared with Jia’s SSP [25], Bhowmik’s GiB [39], and other state-of-the-art CNN-based segmentation methods (e.g., Tensmeyer’s FCN [41], Vo’s DSN [42], Gallego’s SAE [43], and Zhao’s cGAN [47]), our proposed method better preserves text strokes and produces better visual quality.

Fig. 7

Binarization results of selected methods for P15 in DIBCO 2017

Fig. 8

Binarization results of selected methods for H06 in H-DIBCO 2018

4.5 Comparison results on the time complexity of each binarization method

The proposed method consists of several stages, namely preprocessing (including image reading, standard color-to-gray conversion, and image normalization), stroke width transform, background estimation and compensation, graph cut-based segmentation, and post-processing. Among these, the stroke width transform and graph cut-based segmentation are the two most time-consuming stages, so we analyze their computational complexity theoretically.

To estimate the text stroke width accurately, we apply the stroke width transform operator presented by Epshtein et al. [48]. It is a text region detection algorithm that extracts text strokes, i.e., finite-width shapes bounded by two roughly parallel edges, from noisy images. Starting from a text edge pixel and exploring pixels perpendicular to the text edge direction, we may locate another text edge pixel whose gradient direction is approximately opposite to that of the first one; a stroke cross-section is then formed from this pair of edge pixels. By joining stroke cross-sections of similar widths, a complete text stroke is produced. The theoretical worst-case complexity of the stroke width transform algorithm is \( \mathcal{O}\left(n\left|P\right|\right) \), where n is the number of pixels in the image, and |P| is the length of the stroke cross-sections.

For graph cut-based segmentation, we adopt the max-flow/min-cut algorithm presented by Boykov and Kolmogorov [30] for energy minimization. It belongs to the group of algorithms based on augmenting paths. To detect augmenting paths, two non-overlapping search trees rooted at the source and the sink are built; to detect a new augmenting path (not necessarily the shortest one), these trees are reused rather than rebuilt from scratch. The theoretical worst-case complexity of the Boykov–Kolmogorov algorithm is \( \mathcal{O}\left({mn}^2\left|C\right|\right) \), where |C| is the cost of the minimum cut, and n and m are the numbers of nodes and edges in the graph, respectively.

To give the reader a clearer picture of the execution efficiency of each method, we use the average running time in seconds per megapixel (sec/MP) to evaluate the time complexity of each binarization algorithm. All experiments are conducted on a Dell Alienware 17 R5 laptop with an Intel(R) Core(TM) i7-8750H CPU @ 2.20 GHz, 16 GB RAM, and an NVIDIA GeForce GTX 1080 GPU with 8 GB GDDR5X video RAM.

In terms of the programming language used by each method, Otsu’s [15] global thresholding and Niblack’s [16], Sauvola’s [17], and Wolf’s [18] locally adaptive thresholding methods are reproduced as Matlab scripts. Lu’s BESE [21] is provided in Matlab pcode format, while Su’s LMM [23, 24] and Jia’s SSP [25] methods are written in C++ with OpenCV. Howe’s [28, 29], Mesquita’s [33], and Kligler’s [34] energy-based segmentation methods, as well as our proposed method, are implemented as Matlab scripts. Bhowmik’s GiB [39] also uses Matlab but is compiled into an executable. The deep learning-based methods are all Python-based: Tensmeyer’s FCN [41] and Vo’s DSN [42] use Caffe, Gallego’s SAE [43] adopts TensorFlow, and Zhao’s cGAN [47] uses PyTorch. Deep learning-based methods run on the GPU, while the non-deep learning-based counterparts run on the CPU. Therefore, we can only roughly compare the average running time of each binarization algorithm, as shown in Fig. 9.

Fig. 9

Average running time (sec/MP) of each binarization algorithm

It can be seen from the figure that binarization methods based on simple statistical features, such as Otsu’s, Niblack’s, Sauvola’s, and Wolf’s, are relatively less computationally intensive and faster, but their binarization performance is poor. The processing speed of our proposed method is comparable to that of most other contrast/edge-based or energy-based segmentation algorithms, and is significantly faster than that of Bhowmik’s game theory-inspired binarization.

4.6 Discussion

The superior performance of the proposed method can be explained by the following factors:

First, our proposed method estimates the text stroke width feature based on stroke width transform with minimum entropy. It can detect either bright text on dark background or vice versa, and can distinguish true text in various sizes, fonts, and orientations from possible degradation.

Second, mathematical morphology is used to compensate the document background and then the Laplacian energy-based segmentation is performed on the compensated document images. The estimation and compensation procedure can remove most of the document degradation and help to extract text objects from complex document background in the subsequent energy-based segmentation stage.

Last but not least, the proposed method employs post-processing operations to eliminate possible noise and preserve text stroke connectivity by removing isolated text pixels and filling breaks, gaps, or holes inside text strokes.

Of course, we also need to emphasize that there is no single binarization algorithm that works for all types of historical document images. The method proposed in this paper is no exception, and it also has some limitations:

(1) Manual extraction of text stroke features using traditional feature engineering is somewhat inadequate and subject to bias, especially when dealing with extremely degraded or badly damaged pages of historical antiquities. Therefore, deep learning-based approaches are not only a good alternative, but also the current trend.

(2) The memory usage of graph cuts will increase rapidly as the image size increases; for instance, the well-known Boykov-Kolmogorov’s max-flow and min-cut algorithm [30] that we used for energy minimization allocates 24n + 14m bytes, where n and m are the number of nodes and edges in the graph, respectively.

5 Conclusion

We propose an enhanced historical document image binarization method based on background estimation and energy minimization. It first adopts mathematical morphology to compensate the document background, with the size of the disk-shaped structuring element determined by the stroke width transform with minimum entropy. We then perform Laplacian energy-based segmentation on the compensated document images. Finally, we apply post-processing to preserve text stroke connectivity and eliminate isolated noise. The proposed method is robust to various types and levels of document degradation and leads to high accuracy. Experimental results show that the overall performance of our proposed method is superior to other state-of-the-art non-deep learning segmentation techniques and comparable to several deep learning-based approaches.

For future work, we intend to conduct further research in the following directions. First, we can improve the contrast between text and background by using machine learning or deep learning techniques to enhance degraded document images in the preprocessing stage, and then adopt fully connected CRFs [65] or convolutional CRFs [66] for further segmentation. Second, to address the poor robustness of handcrafted or manually designed text features, we hope to perform multi-scale and adaptive feature extraction and learning through deep network models, so as to improve the discriminative power of text region representations.