
1 Introduction

Document image binarization is an active area of research in computer vision due to the high demands on the robustness of thresholding algorithms. As it is one of the most relevant steps in document recognition applications, for both machine-printed and handwritten text documents, many algorithms have been proposed for this purpose. Many of them were presented at the Document Image Binarization COmpetitions (DIBCO) held during the International Conferences on Document Analysis and Recognition (ICDAR) and at H-DIBCO during the International Conferences on Frontiers in Handwriting Recognition (ICFHR). Due to the presence of challenging image distortions, the DIBCO datasets [20], used for the performance evaluation of the submitted algorithms, have become the most popular ones for the verification of newly proposed binarization methods.

The motivation for research related to document image binarization and recognition is not only the possibility of preserving cultural heritage and discovering historical facts, e.g. by the recognition of ancient manuscripts, but also the potential application of the developed algorithms in other areas of industry. Considering the rapid development of Industry 4.0 solutions, similar algorithms may be useful in the vision-based self-localization and navigation of mobile robots as well as in modern autonomous vehicles. When video data are captured by cameras, similar distortions may be expected in natural images as in degraded document images. Nevertheless, document image datasets containing ground truth binary images are still the best tool for verification purposes, and therefore the method proposed in this paper is evaluated using images from the DIBCO datasets.

During the last several years, numerous approaches to image thresholding have been proposed that outperform the classical Otsu method [18], including the adaptive methods proposed by Niblack [12], Sauvola [22], Feng [3], Wolf [27], and Bradley [1], as well as their modifications [23], which are the most useful for document image binarization purposes. Nonetheless, one of the main issues of such adaptive methods is the necessity of analysing the neighbourhood of each pixel, which increases the computational effort. Recently, applications of local features with the use of Gaussian mixtures [11], as well as deep neural networks [24], have also been proposed. However, to obtain satisfactory results, most of such approaches require multiple processing stages with background removal, median filtering, morphological processing, or a time-consuming training process.

Nevertheless, the aim of this paper is not a direct comparison of the proposed approach with state-of-the-art methods, especially those based on recent advances in deep learning, but rather an increase of the performance of some known methods due to the application of the proposed approach to image preprocessing.

2 The Basics of the Proposed Approach

2.1 Identification and Definition of the Problem

Handwritten and machine-printed documents are usually subject to slow degradation over time, which influences their readability. Characteristic examples of this process are ancient books and old prints; however, digital restoration methods allow even heavily damaged documents to be read. Assuming that the original text was distorted by summing it with a noisy image drawn from a normal distribution, an analysis of the histograms shows that the original information becomes hidden (blurred) and the histogram of the resulting image is a distorted version of the histogram of a "purely" noisy image.

This similarity is preserved also for real scanned images of historical documents. Therefore, it can be assumed that removing the partial information related to noise should improve the quality of the image used as the input for further processing. An illustration of this phenomenon is shown in Fig. 1.

Fig. 1. Illustration of the similarity of histograms for a noisy text image and scanned real documents: (a) ground truth binary image, (b) noisy image, (c) text image combined with Gaussian noise, (d) real scanned document images, and their histograms (e)–(h), respectively.

2.2 The Basic Idea of Text Image Reconstruction

Assuming that a real text image is approximately the combination of a ground truth (GT) binary image with Gaussian noise, which is the most widely present type of noise in nature, the readability of the text may be initially improved by a normalization of pixel intensity levels according to the classical formula

$$\begin{aligned} q(x,y) = \left| \frac{(p(x,y)-I_{min})\cdot 255}{I_{max}-I_{min}}\right| , \end{aligned}$$
(1)

where:

  • \( 0 \le I_{min} < I_{max} \le 255 \),

  • \( I_{max} \) is the maximum pixel intensity level,

  • \( I_{min} \) is the minimum pixel intensity level,

  • \(p(x,y)\) is the input pixel intensity level at \((x,y)\) coordinates,

  • \(q(x,y)\) is the output pixel intensity level at \((x,y)\) coordinates.

Nevertheless, in most typical applications the values of \(I_{min}\) and \(I_{max}\) are the minimum and maximum intensity values over all image pixels, so this normalization may lead only to an increase of image contrast. Assuming the presence of dark text on a brighter background, one may safely remove the detailed data related to the brighter pixels without influencing the text information. Therefore, we propose to set \(I_{max} = \mu _{GGD}\) and \(I_{min} = 0\), where \(\mu _{GGD}\) is the location parameter of the Generalized Gaussian Distribution used for the approximation of the image histogram. Such an operation removes part of the information related to the presence of distortions and is followed by thresholding, which may be conducted using one of the typical methods, e.g. classical global Otsu binarization. The consecutive steps are illustrated in Fig. 2.
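A minimal sketch of this preprocessing step is given below, assuming an 8-bit greyscale image stored as a NumPy array and an already estimated location parameter \(\mu _{GGD}\); the function name and the explicit clipping are illustrative choices rather than details of the original implementation.

```python
import numpy as np

def normalize_with_ggd_location(img: np.ndarray, mu_ggd: float) -> np.ndarray:
    """Clip the brightness to [0, mu_ggd] and stretch it to the full 8-bit range,
    i.e. Eq. (1) with I_min = 0 and I_max = mu_ggd; pixels brighter than mu_ggd
    (assumed to carry no text information) are saturated to white."""
    i_min, i_max = 0.0, float(mu_ggd)
    clipped = np.clip(img.astype(np.float64), i_min, i_max)
    return ((clipped - i_min) * 255.0 / (i_max - i_min)).astype(np.uint8)
```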

Fig. 2. Illustration of the consecutive steps of the document image processing and the obtained GGD distributions, from top: histogram and its GGD approximation for the real document image (a) and the GT image combined with Gaussian noise (b), results of the proposed normalization with the use of \(\mu _{GGD}\) (c) and (d), respectively, and the results of further Otsu thresholding (e) and (f), respectively.

Fig. 3. Density function of the GGD with \(\lambda =1\) for three selected exponents: \(p=1/2\), \(p=1\) and \(p=4\).

3 Generalized Gaussian Distribution

The Generalized Gaussian Distribution (GGD) is a very popular tool in many research areas related to signal and image processing. Its popularity comes from the fact that it covers other widely known distributions: the Gaussian distribution, the Laplacian distribution, the uniform distribution, and an impulse function. Other special cases have also been considered in the literature [5, 6]. Many different methods have been designed to estimate the parameters of this distribution [28].

This distribution has also been extended to cover complex [13] and multidimensional [19] variables. The GGD has been used in many different models, for instance, to model the tangential wavelet coefficients for compressing three-dimensional triangular mesh data [8], in an image segmentation algorithm [25], to generate an augmented quaternion random variable with GGD [7], in the natural scene statistics (NSS) model describing certain regular statistical properties of natural images [29], and to approximate an atmospheric point spread function (APSF) kernel [26].

The probability density function of GGD is defined by the equation [2]

$$\begin{aligned} f(x)=\frac{\lambda \cdot p}{2 \cdot \varGamma \left( \frac{1}{p}\right) }e^{-[\lambda \cdot |x|]^{p}} , \end{aligned}$$
(2)

where p is the shape parameter, \(\varGamma (z)=\int _{0}^{\infty }t^{z-1}e^{-t}dt, z>0\) [17], and \(\lambda \) is connected to the standard deviation \(\sigma \) of the distribution by the equation \(\lambda (p,\sigma )=\frac{1}{\sigma }\left[ \frac{\varGamma (\frac{3}{p})}{\varGamma (\frac{1}{p})}\right] ^{\frac{1}{2}}\). The parameter \(p=1\) corresponds to the Laplacian distribution and \(p=2\) corresponds to the Gaussian distribution. When \(p \rightarrow \infty \), the GGD density function approaches a uniform distribution, and when \(p \rightarrow 0\), f(x) approaches an impulse function. Some examples are shown in Fig. 3.
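As an illustration, the density defined by Eq. (2) can be evaluated directly from the shape parameter p; the short sketch below, assuming only NumPy and the standard library, reproduces the three curves of Fig. 3 for \(\lambda = 1\) and additionally shows how \(\lambda \) can be obtained from \(\sigma \).

```python
import numpy as np
from math import gamma

def ggd_pdf(x: np.ndarray, p: float, lam: float = 1.0, mu: float = 0.0) -> np.ndarray:
    """GGD density of Eq. (2); a location parameter mu is added for generality
    (Eq. (2) itself is centred at zero)."""
    return lam * p / (2.0 * gamma(1.0 / p)) * np.exp(-(lam * np.abs(x - mu)) ** p)

def lam_from_sigma(p: float, sigma: float) -> float:
    """Scale parameter lambda expressed through the shape p and the standard deviation sigma."""
    return (1.0 / sigma) * (gamma(3.0 / p) / gamma(1.0 / p)) ** 0.5

x = np.linspace(-4.0, 4.0, 801)
curves = {p: ggd_pdf(x, p) for p in (0.5, 1.0, 4.0)}  # the three exponents shown in Fig. 3
```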

4 Application of the Monte Carlo Method

4.1 Idea of the Monte Carlo Method

Since the calculation of the GGD parameters for the histogram of the whole image is relatively slow, a significant reduction of the computational burden may be achieved by using a simplified histogram calculated for a limited number of pixels. To preserve the statistical properties of the analysed image, the randomly chosen pixel locations should be evenly distributed on the image plane, and therefore a random number generator with a uniform distribution is applied in the Monte Carlo procedure [15].

The general idea of the statistical Monte Carlo method is based on a random drawing procedure applied to a reshaped one-dimensional vector consisting of all \(M \times N\) pixels of the analysed image. First, n independent numbers, equivalent to positions in the vector, are generated by a pseudo-random generator with a uniform distribution and possibly good statistical properties. Next, the number of randomly chosen pixels (k) at each luminance level is determined and used as an estimate of the simplified histogram, according to:

$$\begin{aligned} \hat{L}_{MC} = \frac{k}{n} \cdot M \cdot N , \end{aligned}$$
(3)

where k is the number of drawn pixels at the specified luminance level among the randomly chosen samples, n denotes the total number of draws, and \(M \times N\) stands for the total number of samples (pixels) in the entire image. In general, the estimator \(\hat{L}_{MC}\) may refer to any defined image feature that may be described by the binary values 0 and 1.
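A compact sketch of this estimator is given below; it assumes an 8-bit greyscale image stored as a NumPy array and uses NumPy's default pseudo-random generator, which are choices of this illustration rather than of the original implementation.

```python
import numpy as np

def monte_carlo_histogram(img: np.ndarray, n: int, rng=None) -> np.ndarray:
    """Estimate the 256-bin histogram of the whole image from n uniformly drawn pixels (Eq. 3)."""
    rng = np.random.default_rng() if rng is None else rng
    pixels = img.reshape(-1)                               # the M*N-element pixel vector
    drawn = pixels[rng.integers(0, pixels.size, size=n)]   # n independent uniform draws
    k = np.bincount(drawn, minlength=256)                  # k per luminance level
    return k / n * pixels.size                             # L_MC = (k / n) * M * N
```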

Fig. 4. Illustration of the convergence of the Monte Carlo method used for the estimation of the GGD parameters \(p, \mu , \lambda \) and \(\sigma \) using n randomly chosen samples for an exemplary representative image.

The estimation error can be determined as:

$$\begin{aligned} \varepsilon _\alpha = \frac{u_\alpha }{\sqrt{n}} \cdot \sqrt{\frac{K}{M \cdot N} \cdot \left( 1 - \frac{K}{M \cdot N}\right) } , \end{aligned}$$
(4)

assuming that K represents the total number of samples with specified luminance level and \(u_\alpha \) denotes the two-sided critical range.
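As a purely hypothetical numeric illustration of Eq. (4), for an image with \(M \cdot N = 10^6\) pixels, a luminance level occupying \(K = 10^5\) of them, \(n = 1000\) draws and \(u_\alpha \approx 1.96\) (a two-sided 95% confidence level), one obtains

$$\begin{aligned} \varepsilon _\alpha = \frac{1.96}{\sqrt{1000}} \cdot \sqrt{0.1 \cdot (1 - 0.1)} \approx 0.0186 , \end{aligned}$$

i.e. the relative frequency \(K/(M \cdot N) = 0.1\) of that level is estimated with an uncertainty of roughly \(\pm 1.9\) percentage points.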

For such an estimated histogram, some classical binarization methods may be applied, leading to results comparable with those obtained for the analysis of full images [9, 16], also in terms of recognition accuracy.

4.2 Experimental Verification of the Proposed Approach for the Estimation of the GGD Parameters

The influence of the number of randomly drawn pixels on the obtained GGD parameters was verified for the images from the DIBCO datasets, with each drawing repeated 30 times for each assumed n. The minimum, average and maximum values of the four GGD parameters, i.e. the shape parameter p, the location parameter \(\mu \), the scale-related parameter \(\lambda \), and the standard deviation \(\sigma \), were then determined according to the method described in the paper [4], without the necessity of using more sophisticated estimators based on maximum likelihood, moments, entropy matching or global convergence [21]. The convergence of the parameters for an exemplary representative image from the DIBCO datasets for different numbers of drawn samples (n) is illustrated in Fig. 4.

Fig. 5. Illustration of the histogram approximation for an exemplary representative image before (a) and after (b) the limitation of the brightness range (full image without normalization, \(x_{min} = 139\), \(x_{max} = 224\)).

Fig. 6. Approximation errors (RMSE) of the GGD parameters \(p, \mu , \lambda \) and \(\sigma \) for various numbers of samples (n) used in the Monte Carlo method for an exemplary representative image.

Nonetheless, it should be noted that for each independent run of the Monte Carlo method the values of the estimated parameters may differ, especially for a low number of randomly chosen samples (n). One possible solution to this issue is the use of predefined numbers obtained from a pseudorandom number generator with a uniform distribution. Therefore, an appropriate choice of n is necessary to obtain stable results. Some local histogram peaks may be related to the presence of larger smears of constant brightness on the image plane (considered as background information). Since the histogram of a natural image should in fact be approximated by a multi-Gaussian model, the analysed brightness range should be limited to obtain a better fit of the GGD model.

Determination of the limited brightness range is conducted as follows:

  • determination of the simplified histogram using the Monte Carlo method for n samples (e.g. \(n=100\)),

  • estimation of the GGD using the simplified histogram,

  • setting the lower boundary as \(x_{min}\), such that \(P(x=x_{min}) = 1/n\),

  • setting the upper boundary as \(x_{max}\), such that \(P(x=x_{max}) = 1 - 1/n\).

Therefore, the brightness values with probabilities lower than the probability of an occurrence of a single pixel \(P(x) = 1/n\) are removed on both sides of the distribution. An example based on the histogram determined for the full image (\(M\cdot N\) samples used instead of n) is shown in Fig. 5.
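A possible realization of this step is sketched below; scipy.stats.gennorm serves as a stand-in for the GGD parameter estimation of [4], the two boundary conditions are read as the 1/n and \(1 - 1/n\) quantiles of the fitted distribution, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import gennorm  # generalized normal (GGD) family

def limited_brightness_range(img: np.ndarray, n: int = 100, rng=None):
    """Fit a GGD to n randomly drawn pixels and return the limited range <x_min; x_max>."""
    rng = np.random.default_rng() if rng is None else rng
    drawn = img.reshape(-1)[rng.integers(0, img.size, size=n)].astype(np.float64)
    p, mu, scale = gennorm.fit(drawn)          # shape, location, scale (MLE stand-in for [4])
    x_min = int(np.clip(gennorm.ppf(1.0 / n, p, loc=mu, scale=scale), 0, 255))
    x_max = int(np.clip(gennorm.ppf(1.0 - 1.0 / n, p, loc=mu, scale=scale), 0, 255))
    return x_min, x_max
```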

Additionally, the Root Mean Squared Error (RMSE) was calculated to verify the influence of the number of samples n on the approximation error. Since the histograms of natural images are usually "rough", an additional median filtering of the histograms with a 5-element mask was examined. Nevertheless, the obtained results were not always satisfactory, as shown in Fig. 6, and therefore this filtering was not used, in order to prevent an additional increase of the computation time.
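For completeness, the examined smoothing amounts to a single call to a standard median filter (a minimal sketch; the helper name is illustrative and, as noted above, the operation was not retained in the final procedure):

```python
import numpy as np
from scipy.signal import medfilt

def smooth_histogram(hist: np.ndarray) -> np.ndarray:
    """Median filtering of a 'rough' 256-bin histogram with a 5-element mask."""
    return medfilt(hist.astype(np.float64), kernel_size=5)
```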

Fig. 7. Illustration of the histograms obtained in the two steps of the proposed method for an exemplary representative image using \(n=100\) randomly drawn samples.

5 Proposed Two-Step Algorithm and Its Experimental Verification

On the basis of the above considerations, the following procedure is proposed:

  • determination of the lower boundary \(x_{min}\) and the upper boundary \(x_{max}\) with the use of the Monte Carlo histogram estimation and GGD approximation,

  • limiting the brightness range to \( \langle x_{min} ; x_{max} \rangle \),

  • repeated GGD approximation of the histogram for the limited range with the use of the Monte Carlo method,

  • estimation of the location parameter \(\mu _{GGD}\) for the histogram with the limited range,

  • limiting the brightness range to \( \langle 0 ; \mu _{GGD} \rangle \) and normalization,

  • binarization using one of the classical thresholding methods.

The histograms and GGD parameters obtained for an exemplary representative image after the two major steps of the proposed algorithm, with \(n=100\) pixels chosen randomly according to the Monte Carlo method, are illustrated in Fig. 7. The noticeably different shapes of the left (a) and right (b) histograms result from the independent random draws in each of the two steps.

In the last step, three different image binarization methods are considered: a fixed threshold at 0.5 of the brightness range, global Otsu thresholding [18], and the locally adaptive thresholding proposed by Bradley [1]. A sketch of the complete procedure is given below.
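The sketch illustrates the order of operations in the proposed two-step procedure, assuming an 8-bit greyscale NumPy image; scipy.stats.gennorm is again used as a stand-in for the GGD estimation of [4] and scikit-image's Otsu threshold for the final step, so it should be read as an outline of the pipeline rather than the reference implementation.

```python
import numpy as np
from scipy.stats import gennorm
from skimage.filters import threshold_otsu

def ggd_preprocess_and_binarize(img: np.ndarray, n: int = 100, rng=None) -> np.ndarray:
    """Two-step GGD/Monte Carlo preprocessing followed by global Otsu binarization."""
    rng = np.random.default_rng() if rng is None else rng
    pixels = img.reshape(-1).astype(np.float64)

    # Step 1: GGD fit on n random pixels -> limited brightness range <x_min; x_max>.
    sample1 = pixels[rng.integers(0, pixels.size, size=n)]
    p1, mu1, s1 = gennorm.fit(sample1)
    x_min = max(0.0, gennorm.ppf(1.0 / n, p1, loc=mu1, scale=s1))
    x_max = min(255.0, gennorm.ppf(1.0 - 1.0 / n, p1, loc=mu1, scale=s1))

    # Step 2: repeated GGD fit on pixels within the limited range -> location mu_GGD.
    limited = pixels[(pixels >= x_min) & (pixels <= x_max)]
    sample2 = limited[rng.integers(0, limited.size, size=n)]
    _, mu_ggd, _ = gennorm.fit(sample2)

    # Limit the brightness to <0; mu_GGD>, normalize to 8 bits and threshold (here: Otsu).
    normalized = np.clip(img.astype(np.float64), 0.0, mu_ggd) * 255.0 / mu_ggd
    return normalized > threshold_otsu(normalized)  # True = background, False = text
```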

To verify the validity and performance of the proposed method, experiments were carried out using 8 available DIBCO datasets (2009, 2010, 2011, 2012, 2013, 2014, 2016 and 2017). For all of these datasets, typical metrics used for the evaluation of binarization algorithms [14] were calculated for five different numbers of samples (n) used in the Monte Carlo method. The executions of the Monte Carlo method were repeated 30 times, and the obtained results were compared with the application of the three classical thresholding methods mentioned above without the proposed image preprocessing based on the GGD histogram approximation. Detailed results obtained for the fixed threshold (0.5), Otsu and Bradley thresholding are presented in Tables 1, 2 and 3, respectively. Better results are indicated by higher accuracy, F-Measure, specificity and PSNR values, whereas lower Distance-Reciprocal Distortion (DRD) values denote better quality [10]. All the metrics marked with (GGD) were calculated for the proposed GGD-based approach with the Monte Carlo method, setting the value of n at 5% of the total number of pixels (about 1000–5000 depending on the image resolution).

As can be observed from the results presented in Tables 1, 2 and 3, the proposed approach, utilising the GGD histogram approximation with the use of the Monte Carlo method for image preprocessing, leads to an enhancement of the binarization results for the Otsu and Bradley thresholding methods, whereas its application to binarization with a fixed threshold is inappropriate. Particularly significant improvements can be observed for the DIBCO2012 dataset with the use of Otsu binarization; however, the advantages of the proposed approach can also be observed in the aggregated results for all datasets (weighted by the number of images they contain). A visual illustration of the obtained improvement is shown in Fig. 8 for an exemplary H10 image from the DIBCO2012 dataset, where it is especially well visible for Otsu thresholding.

Table 1. Results of binarization metrics obtained for DIBCO datasets using the classical binarization with fixed threshold (0.5) without and with proposed GGD preprocessing.
Table 2. Results of binarization metrics obtained for DIBCO datasets using Otsu binarization without and with proposed GGD preprocessing.
Table 3. Results of binarization metrics obtained for DIBCO datasets using Bradley binarization without and with proposed GGD preprocessing.

It is worth noting that the results shown in Fig. 8 were obtained using the proposed method with a random drawing of only \(n=120\) samples in the Monte Carlo method. Due to the proposed preprocessing, the accuracy improved from 0.7765 to 0.9748 for the Otsu method and from 0.9847 to 0.9851 for Bradley thresholding. The respective F-Measure values increased from 0.4618 to 0.8608 for the Otsu method and from 0.9220 to 0.9222 for the Bradley method. Nevertheless, depending on the number of randomly drawn pixels, the values achieved by the proposed method may slightly differ.

The proposed application of the GGD-based preprocessing combined with the Monte Carlo method leads to binarization results which are comparable with those of adaptive thresholding, or better for some images. In most cases its application to adaptive thresholding allows for a further slight increase of binarization accuracy.

Fig. 8. Illustration of the obtained improvement of binarization results for an exemplary image from the DIBCO2012 dataset.

6 Summary and Future Work

Although the obtained results may be outperformed by some more complex state-of-the-art methods, especially those based on deep CNNs [24], they can be considered promising and confirm the usefulness of the GGD histogram approximation with the use of the Monte Carlo method for the preprocessing of degraded document images before binarization and further analysis. Since only one of the GGD parameters (the location parameter \(\mu \)) is used in the proposed approach, a natural direction of our future research is the utilisation of the other parameters for the removal of additional information related to contaminations.

Our future research will concentrate on a further improvement of binarization accuracy, although the computational burden might be an important limitation. However, due to an efficient use of the Monte Carlo method, the overall processing time may be shortened, and therefore the proposed approach may be further combined with other binarization algorithms proposed by various researchers.