1 Introduction

Taking photographs of text documents (printed articles, receipts, newspapers, books, etc.) instead of scanning them has become increasingly common due to the popularity of mobile cameras. However, photos taken by hand-held cameras are likely to suffer from blur caused by camera shake during exposure. This is critical for document images, as even slight blur can prevent existing optical-character-recognition (OCR) techniques from extracting correct text. Removing blur and recovering sharp, legible document images is thus highly desirable. As in much previous work, we assume a simple image formation model for each local text region as

$$\begin{aligned} \begin{aligned} \mathbf{y}= \mathbf{K}\mathbf{x}+ \mathbf{n}, \end{aligned} \end{aligned}$$
(1)

where \(\mathbf{y}\) represents the degraded image, \(\mathbf{x}\) the sharp latent image, matrix \(\mathbf{K}\) the corresponding 2D convolution with blur kernel \({\mathbf{k}}\), and \(\mathbf{n}\) white Gaussian noise. The goal of post-processing is to recover \(\mathbf{x}\) and \({\mathbf{k}}\) from a single input \(\mathbf{y}\), which is known as blind deconvolution or blind deblurring. This problem is highly ill-posed and non-convex. As shown in much previous work, good prior knowledge of both \(\mathbf{x}\) and \({\mathbf{k}}\) is crucial for constraining the solution space and for robust optimization. Specifically, most previous methods focus on designing effective priors for \(\mathbf{x}\), while \({\mathbf{k}}\) is usually only restricted to be smooth.
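To make the formation model concrete, a blurry observation can be synthesized in a few lines. The following is a minimal Python/NumPy sketch; the function name and the noise level are our illustrative choices, not part of the paper's implementation:

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_blurry(x, k, sigma=0.01, seed=0):
    """Simulate Eq. 1, y = Kx + n: convolve the sharp image x with the
    blur kernel k and add white Gaussian noise of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    y = fftconvolve(x, k, mode="same")           # K x (2D convolution)
    y += rng.normal(0.0, sigma, size=y.shape)    # + n (white Gaussian noise)
    return np.clip(y, 0.0, 1.0)                  # keep intensities in [0, 1]
```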

Recent text image deblurring methods use sparse gradient priors (e.g., total variation [3], \(\ell _0\) gradient [5, 14]) and text-specific priors (e.g., text classifier [5], \(\ell _0\) intensity [14]) for sharp latent image estimation. These methods can produce high-quality results in many cases; however, their practical adoption is hampered by several drawbacks. Firstly, their use of sparse gradient priors usually forces the recovered image to be piece-wise constant. Although these priors are effective for images with large-font text (i.e., high pixel-per-inch (PPI)), they do not work well for photographs of common text documents such as printed articles and newspapers, where the font sizes are typically small [10]. Furthermore, these methods employ iterative sparse optimization techniques that are usually time-consuming for high-resolution images taken by modern cameras (e.g., up to a few megapixels).

Fig. 1. Visual comparison between a natural image (left), a large-font text image (middle) and a common text document image at 150 PPI (right) at various scales.

In this paper, we propose a new algorithm for practical document deblurring that achieves both high quality and high efficiency. In contrast to previous works relying on low-order filter statistics, our algorithm aims to capture the domain-specific properties of document images by learning a series of scale- and iteration-wise high-order filters. A motivational example is shown in Fig. 1, where we compare small patches extracted from a natural image, a large-font text image and a common text document image. Since most deblurring methods adopt a multi-scale framework in order to avoid bad local optima, we compare patches extracted at multiple scales. Evidently, the natural image and the large-font text image both contain long, clear edges at all scales, making sparse gradient priors effective. In contrast, patches from the document image with a small font size are mostly composed of small-scale high-order structures, especially at coarse scales, which makes sparse gradient priors inaccurate. This observation motivates us to use high-order filter statistics as an effective regularization for deblurring document images. We adopt a discriminative approach and learn such regularization terms by training a multi-scale, interleaved cascade of shrinkage field models [18], which was recently proposed as an effective tool for image restoration.

Our main contributions include:

  • We demonstrate the importance of using high-order filters in text document image restoration.

  • We propose a new algorithm for fast and high-quality deblurring of document photographs, suitable for processing high resolution images captured by modern mobile devices.

  • Unlike the recent convolutional-neural-network (CNN) based document deblurring method [10], our approach is robust to page orientation, font style and text language, even though such variations are not included in our training data.

2 Related Work

Blind Deblurring of Natural Images. Most deblurring methods solve the non-convex problem by alternately estimating the latent image \(\mathbf{x}\) and the blur kernel \({\mathbf{k}}\), with an emphasis on designing effective priors on \(\mathbf{x}\). Krishnan et al. [11] introduced a scale-invariant \(\ell _1/\ell _2\) prior, which compensates for the attenuation of high frequencies in the blurry image. Xu et al. [24] used the \(\ell _0\) regularizer on the image gradient. Xiao et al. [22] used a color-channel edge-concurrence prior to facilitate chromatic kernel recovery. Goldstein and Fattal [8] estimated the kernel from the power spectrum of the blurred image. Yue et al. [25] improved [8] by fusing it with a sparse gradient prior. Sun et al. [21] imposed patch priors to recover good partial latent images for kernel estimation. Michaeli and Irani [13] exploited the recurrence of small image patches across different scales of single natural images. Anwar et al. [2] learned a class-specific prior on the image frequency spectrum for the restoration of frequencies that cannot be recovered with generic priors. Zuo et al. [26] learned iteration-wise parameters of the \(\ell _p\) regularizer on image gradients. Schelten et al. [16] trained cascaded interleaved regression tree fields (RTFs) [19] to post-improve the results of other blind deblurring methods for natural images.

Another class of methods uses explicit nonlinear filters to extract large-scale image edges, from which kernels can be estimated rapidly. Cho and Lee [6] adopted a combination of shock and bilateral filters to predict sharp edges. Xu and Jia [23] improved [6] by neglecting edges with small spatial support, as they impede kernel estimation. Schuler et al. [20] learned such nonlinear filters with a multi-layer convolutional neural network.

Blind Deblurring of Document Images. Most recent methods of text deblurring use the same sparse gradient assumption developed for natural images, and augment it with additional text-specific regularization. Chen et al. [3] and Cho et al. [5] applied explicit text pixel segmentation and enforced the text pixels to be dark or have similar colors. Pan et al. [14] used \(\ell _0\)-regularized intensity and gradient priors for text deblurring. As discussed in Sect. 1 and as we will show in our experiments in Sect. 4, the use of sparse gradient priors makes such methods work well for large-font text images, but fail on common document images that have smaller fonts.

Hradiš et al. [10] trained a convolutional neural network to directly predict the sharp patch from a small blurry one, without considering the image formation model or explicit blur kernel estimation. With a large enough model and training dataset, this method produces good results on English documents with severe noise, large defocus blur or simple motion blur. However, it fails on more complicated motion trajectories, and is sensitive to page orientation, font style and text language. Furthermore, it often produces "hallucinated" characters or words which appear sharp and natural in the output image but are semantically wrong. This undesirable side effect severely limits its application range, as most users do not expect the text to be changed by the deblurring process.

Discriminative Learning Methods for Image Restoration. Recently, several methods have been proposed that use trainable random field models for image restoration (denoising and non-blind deconvolution, where the blur kernel is known a priori). These methods achieve high-quality results with attractive run-times [4, 18, 19]. One representative technique is the shrinkage fields method [18], which reduces the optimization problem of random field models to cascaded quadratic minimization problems that can be solved efficiently in the Fourier domain. In this paper, we extend this idea to the more challenging blind deconvolution problem, and employ the cascaded shrinkage fields model to capture high-order statistics of text document images.

3 Our Algorithm

The shrinkage fields (SF) model has been recently proposed as an effective and efficient tool for image restoration [18]. It has been successfully applied to both image denoising and non-blind image deconvolution, producing state-of-the-art results while maintaining high computational efficiency. Motivated by this success, we adopt the shrinkage field model for the challenging problem of blind deblurring of document images. In particular, we propose a multi-scale, interleaved cascade of shrinkage fields (CSF) which estimates the unknown blur kernel while progressively refining the estimation of the latent image. This is also partly inspired by [16], which proposes an interleaved cascade of regression tree fields (RTF) to post-improve the results of state-of-the-art natural image deblurring methods. However, in contrast to [16], our method does not depend on an initial kernel estimation from an auxiliary method. Instead, we estimate both the unknown blur kernel and latent sharp image from a single blurry input image.

3.1 Cascade of Shrinkage Fields (CSF)

The shrinkage field model can be derived from the field of experts (FoE) model [15]:

$$\begin{aligned} \begin{aligned} \mathop {\hbox {argmin}}\limits _{\mathbf{x}}\,\mathcal {D}(\mathbf{x}, \mathbf{y}) + \sum \nolimits _{i=1}^{N} \rho _{i}(\mathbf{F}_{i}\mathbf{x}), \end{aligned} \end{aligned}$$
(2)

where \(\mathcal {D}\) represents the data fidelity term given the measurement \(\mathbf{y}\), matrix \(\mathbf{F}_{i}\) represents the corresponding 2D convolution with filter \(\mathbf{f}_i\), and \(\rho _{i}\) is the penalty on the filter response. Half-quadratic optimization [7], a popular approach for the optimization of common random field models, introduces auxiliary variables \(\mathbf{u}_i\) for all filter responses \(\mathbf{F}_{i}\mathbf{x}\) and replaces the energy optimization problem of Eq. 2 with a quadratic relaxation:

$$\begin{aligned} \begin{aligned} \mathop {\hbox {argmin}}\limits _{\mathbf{x},\mathbf{u}}\,\mathcal {D}(\mathbf{x}, \mathbf{y}) + \sum \nolimits _{i=1}^{N} \left( \beta ||\mathbf{F}_{i}\mathbf{x}-\mathbf{u}_i||_2^2 +\rho _{i}(\mathbf{u}_{i}) \right) , \end{aligned} \end{aligned}$$
(3)

which for \(\beta \rightarrow \infty \) converges to the original problem in Eq. 2. The key insight of [18] is that the minimizer of the second term w.r.t. \(\mathbf{u}_i\) can be replaced by a flexible 1D shrinkage function \(\psi _i\) of the filter response \(\mathbf{F}_{i}\mathbf{x}\). Unlike standard random fields, which are parameterized through potential functions, SF directly models the shrinkage functions associated with the potentials. Given the data formation model of Eq. 1, this reduces the original optimization problem of Eq. 2 to a single quadratic minimization problem in each iteration, which can be solved efficiently as

$$\begin{aligned} \begin{aligned} \mathbf{x}^t = \mathcal {F}^{-1}\left[ \frac{\mathcal {F}(\mathbf{K}_{t-1}^\mathsf {T}\mathbf{y}+ {\lambda ^t} \sum _{i=1}^{N} {{\mathbf{F}^t_{i}}}^\mathsf {T}{\psi ^t_{i}}({\mathbf{F}^t_{i}}\mathbf{x}^{t-1}))}{\mathcal {F}(\mathbf{K}_{t-1}^\mathsf {T}) \cdot \mathcal {F}(\mathbf{K}_{t-1}) + {\lambda ^t} \sum _{i=1}^{N}{\mathcal {F}({\mathbf{F}^t_i}^\mathsf {T}) \cdot \mathcal {F}({\mathbf{F}^t_i})}}\right] , \end{aligned} \end{aligned}$$
(4)

where t is the iteration index, \(\mathbf{K}\) is the blur kernel matrix, \(\mathcal {F}\) and \(\mathcal {F}^{-1}\) denote the Fourier transform and its inverse, and \(\psi _{i}\) is the shrinkage function. The model parameters \(\varTheta ^t = ({\mathbf{f}}^t_i, \psi ^t_i, \lambda ^t)\) are trained by loss-minimization, e.g. by minimizing the \(\ell _2\) error between the estimated images \(\mathbf{x}^t\) and the ground truth. Performing multiple predictions of Eq. 4 is known as a cascade of shrinkage fields. For more details on the shrinkage fields model we refer readers to the supplemental material and [18].
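For illustration, the update of Eq. 4 amounts to a single pointwise division in the Fourier domain. Below is a minimal sketch, assuming periodic (circular) boundary conditions so that all convolution matrices are diagonalized by the FFT; the names are ours, and the learned filters and shrinkage functions are taken as given:

```python
import numpy as np

def csf_update(y, x_prev, k, filters, shrinks, lam):
    """One shrinkage-field update (Eq. 4), assuming circular boundaries.
    filters: list of small 2D filter kernels f_i (e.g. 5x5 arrays)
    shrinks: list of elementwise 1D shrinkage functions psi_i
    lam:     the learned weight lambda^t"""
    shape = y.shape
    K = np.fft.fft2(k, s=shape)                       # F(k), zero-padded
    num = np.conj(K) * np.fft.fft2(y)                 # F(K^T y)
    den = np.conj(K) * K                              # F(K^T) . F(K)
    for f, psi in zip(filters, shrinks):
        F = np.fft.fft2(f, s=shape)                   # F(f_i)
        resp = np.real(np.fft.ifft2(F * np.fft.fft2(x_prev)))   # F_i x
        num += lam * np.conj(F) * np.fft.fft2(psi(resp))        # F_i^T psi_i(F_i x)
        den += lam * np.conj(F) * F                   # F(F_i^T) . F(F_i)
    return np.real(np.fft.ifft2(num / den))
```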

3.2 Multi-scale Interleaved CSF for Blind Deconvolution

We do not follow the commonly used two-step deblurring procedure where kernel estimation and final latent image recovery are separated. Instead, we learn an interleaved CSF that directly produces both the estimated blur kernel and the predicted latent image. Our interleaved CSF is obtained by stacking multiple SFs into a cascade that is interleaved with kernel refinement steps. This cascade generates a sequence of iteratively refined blur kernel and latent image estimates, \(\{{\mathbf{k}}^{t}\}_{t=1,..,T}\) and \(\{\mathbf{x}^t\}_{t=1,..,T}\) respectively. At each stage of the cascade, we employ a separately trained SF model for sharp image restoration. In addition, we learn an auxiliary SF model which generates a latent image \(\mathbf{z}^{t}\) that is used to facilitate blur kernel estimation. The reason for including this extra SF model at each stage is to select features that benefit kernel estimation while suppressing features and artifacts that impede it. Note that the idea of introducing such a latent feature image for improving kernel estimation is not new, but rather common practice in recent state-of-the-art blind deconvolution methods [6, 23]. Figure 2 depicts a schematic illustration of a single stage of our interleaved CSF approach.

Fig. 2. Algorithm architecture.

More specifically, given the input image \(\mathbf{y}\), our method recovers \({\mathbf{k}}\) and \(\mathbf{x}\) simultaneously by solving the following optimization problem:

$$\begin{aligned} \begin{aligned} (\mathbf{x}, {\mathbf{k}}) =&\mathop {\hbox {argmin}}\limits _{\mathbf{x}, {\mathbf{k}}} ||\mathbf{y}- {\mathbf{k}}\otimes \mathbf{x}||^2_2 + \sum \nolimits _{i=1}^{N} \rho _{i}(\mathbf{F}_{i}\mathbf{x}) + \tau ||{\mathbf{k}}||^2_2,\\&s.t. \quad {\mathbf{k}}\ge 0, ||{\mathbf{k}}||_1 = 1\\ \end{aligned} \end{aligned}$$
(5)

To this end, our proposed interleaved CSF alternates between the following blur kernel and latent image estimation steps:

 

Update \(\mathbf{x}^t\):

For the sharp image update we train an SF model with parameters \(\varTheta ^t = ({\mathbf{f}}^t_i, \psi ^t_i, \lambda ^t)\). Analogously to Eq. 4, we obtain the following update for \(\mathbf{x}^t\) at iteration t:

$$\begin{aligned} \begin{aligned} \mathbf{x}^t = \mathcal {F}^{-1}\left[ \frac{\mathcal {F}(\mathbf{K}_{t-1}^\mathsf {T}\mathbf{y}+ {\lambda ^t} \sum _{i=1}^{N} {{\mathbf{F}^t_{i}}}^\mathsf {T}{\psi ^t_{i}}({\mathbf{F}^t_{i}}\mathbf{z}^{t-1}))}{\mathcal {F}(\mathbf{K}_{t-1}^\mathsf {T}) \cdot \mathcal {F}(\mathbf{K}_{t-1}) + {\lambda ^t} \sum _{i=1}^{N}{\mathcal {F}({\mathbf{F}^t_i}^\mathsf {T}) \cdot \mathcal {F}({\mathbf{F}^t_i})}}\right] \end{aligned} \end{aligned}$$
(6)
Update \(\mathbf{z}^t\) and \({\mathbf{k}}^t\):

For kernel estimation we first update the latent image \(\mathbf{z}^t\) from \(\mathbf{x}^t\) by learning a separate SF model. Denoting convolution with filter \({\mathbf{g}}^t_i\) by matrix \(\mathbf{G}^t_{i}\), we have:

$$\begin{aligned} \begin{aligned} \mathbf{z}^{t} = \mathcal {F}^{-1}\left[ \frac{\mathcal {F}(\mathbf{K}_{t-1}^\mathsf {T}\mathbf{y}+ {\eta ^t} \sum _{i=1}^{N} {\mathbf{G}^t_{i}}^\mathsf {T}{{\phi }^t_{i}}({\mathbf{G}^t_{i}}\mathbf{x}^t))}{\mathcal {F}(\mathbf{K}_{t-1}^\mathsf {T}) \cdot \mathcal {F}(\mathbf{K}_{t-1}) + {\eta ^t} \sum _{i=1}^{N}{\mathcal {F}({{\mathbf{G}^t_i}}^\mathsf {T}) \cdot \mathcal {F}({{\mathbf{G}^t_i}})}}\right] \end{aligned} \end{aligned}$$
(7)

For kernel estimation we employ a simple Tikhonov prior. Given the estimated latent image \(\mathbf{z}^t\) and the blurry input image \(\mathbf{y}\), the update for \({\mathbf{k}}^{t}\) reads:

$$\begin{aligned} \begin{aligned} {\mathbf{k}}^{t} = \mathcal {F}^{-1}\left[ \frac{{\mathcal {F}(\mathbf{z}^{t})}^{*} \cdot \mathcal {F}(\mathbf{y})}{{\mathcal {F}(\mathbf{z}^t)}^{*} \cdot \mathcal {F}(\mathbf{z}^t) + \tau ^t}\right] , \end{aligned} \end{aligned}$$
(8)

where \(*\) indicates the complex conjugate. The model parameters learned at this step are denoted as \(\varOmega ^t = ({\mathbf{g}}^t_i, \phi ^t_i, \eta ^t, \tau ^t)\). Note that \(\varOmega ^t\) is trained to facilitate the update of both the kernel \({\mathbf{k}}^t\) and the image \(\mathbf{z}^t\).
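Since Eq. 8 is also closed-form, the kernel refinement can be sketched directly; this minimal version (our naming) again assumes circular boundaries and includes the projection onto the constraints of Eq. 5 (crop to the kernel support, clip negatives, normalize):

```python
import numpy as np

def kernel_update(y, z, ksize, tau):
    """Closed-form Tikhonov-regularized kernel update (Eq. 8)."""
    Z = np.fft.fft2(z)
    Y = np.fft.fft2(y)
    k = np.real(np.fft.ifft2(np.conj(Z) * Y / (np.conj(Z) * Z + tau)))
    k = np.fft.fftshift(k)                            # move the kernel to the center
    ci, cj = k.shape[0] // 2, k.shape[1] // 2
    h = ksize // 2
    k = k[ci - h:ci + h + 1, cj - h:cj + h + 1]       # crop to the kernel support
    k = np.maximum(k, 0.0)                            # enforce k >= 0
    return k / k.sum()                                # enforce ||k||_1 = 1
```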

 

The \(\mathbf{x}^t\) update step in Eq. 6 takes \(\mathbf{z}^{t-1}\) rather than \(\mathbf{x}^{t-1}\) as input, since \(\mathbf{z}^{t-1}\) improves upon \(\mathbf{x}^{t-1}\) in terms of blur removal via Eq. 7 at iteration \(t-1\). We observe that \(\mathbf{x}^t\) and \(\mathbf{z}^t\) converge as the latent image and kernel are recovered.

Algorithm 1. Blind deblurring of document images.

Algorithm 1 summarizes the proposed approach for blind deblurring of document images. Note that there is a translation and scaling ambiguity between the sharp image and the blur kernel in blind deconvolution. The estimated kernel is normalized such that all its pixel values sum up to one. In Algorithm 2 for training, \(\mathbf{x}^t\) is shifted to better align with the ground truth image \(\mathbf{\bar{x}}\) before updating \({\mathbf{k}}\). We find that our algorithm usually converges within 5 iterations per scale.
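As the pseudocode figure for Algorithm 1 is not reproduced here, the following sketch outlines one scale of the interleaved cascade as we read it from the text, reusing the hypothetical `csf_update` and `kernel_update` helpers above; the second `csf_update` call plays the role of the Eq. 7 step with the learned parameters \(\varOmega ^t\):

```python
def blind_deblur_scale(y, x0, k0, stage_models, T=5):
    """One scale of the interleaved CSF (Algorithm 1 as described in the
    text): alternate the updates of Eqs. 6, 7 and 8 for T stages."""
    x, z, k = x0, x0, k0                                # z^0 initialized as x^0
    for t in range(T):
        (filters_f, psis, lam), (filters_g, phis, eta, tau) = stage_models[t]
        x = csf_update(y, z, k, filters_f, psis, lam)   # Eq. 6 (input is z^{t-1})
        z = csf_update(y, x, k, filters_g, phis, eta)   # Eq. 7
        k = kernel_update(y, z, k.shape[0], tau)        # Eq. 8
    return x, k
```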

3.3 Learning

Our interleaved CSF has two sets of model parameters at every stage \(t=1,..,5\), one for sharp image restoration, \(\varTheta ^t = ({\mathbf{f}}^t_i, \psi ^t_i, \lambda ^t)\), and the other for blur kernel estimation, \(\varOmega ^t = ({\mathbf{g}}^t_i, \phi ^t_i, \eta ^t, \tau ^t)\). All model parameters are learned through loss-minimization.

Algorithm 2. Learning the interleaved CSF for a single scale.

Note that in addition to the blurry input image, each model also receives the previous image and blur kernel predictions as input, which are progressively refined at each iteration. This is in contrast to the non-blind deconvolution setting of [18], where the blur kernel is known and kept fixed throughout all stages. Our interleaved CSF model is trained in a greedy fashion, i.e. stage by stage, such that the SF models learned at one stage can adapt to the kernel and latent image estimated at the previous stage.

More specifically, at each stage we update our model parameters by iterating between the following two steps:

 

Update \(\mathbf{x}^t\):

To learn the model parameters \(\varTheta ^t\), we minimize the \(\ell _2\) error between the current image estimate and the ground truth image \(\mathbf{\bar{x}}\), i.e. \(\ell = ||\mathbf{x}^t - \mathbf{\bar{x}}||_2^2\). Its gradient w.r.t. the model parameters \(\varTheta ^t = ({\mathbf{f}}^t_i, \psi ^t_i, \lambda ^t)\) can be readily computed as

$$\begin{aligned} \begin{aligned} \frac{\partial \ell }{\partial \varTheta ^t} = \frac{\partial \mathbf{x}^t}{\partial \varTheta ^t}\frac{\partial \ell }{\partial \mathbf{x}^t} \end{aligned} \end{aligned}$$
(9)

The derivatives for specific model parameters are omitted here for brevity, but can be found in the supplemental material.

Update \(\mathbf{z}^t\) and \({\mathbf{k}}^t\):

The model parameters \(\varOmega ^t\) of the SF models for kernel estimation at stage t are learned by minimizing the loss function \(\ell = ||{\mathbf{k}}^t - \mathbf{\bar{k}}||_2^2 + \alpha ||\mathbf{z}^t - \mathbf{\bar{x}}||_2^2\), where \(\mathbf{\bar{k}}\) denotes the ground truth blur kernel and \(\alpha \) is a coupling constant. This loss accounts for errors in the kernel, but also prevents the latent image used in Eq. 8 from diverging. Its gradient w.r.t. the model parameters \(\varOmega ^t = ({\mathbf{g}}^t_i, \phi ^t_i, \eta ^t, \tau ^t)\) reads

$$\begin{aligned} \begin{aligned} \frac{\partial \ell }{\partial \varOmega ^t} = \frac{\partial \mathbf{z}^t}{\partial \varOmega ^t} \frac{\partial {\mathbf{k}}^t}{\partial \mathbf{z}^t} \frac{\partial \ell }{\partial {\mathbf{k}}^t} + \frac{\partial {\mathbf{k}}^t}{\partial \varOmega ^t} \frac{\partial \ell }{\partial {\mathbf{k}}^t} + \frac{\partial \mathbf{z}^t}{\partial \varOmega ^t} \frac{\partial \ell }{\partial \mathbf{z}^t} \end{aligned} \end{aligned}$$
(10)

Again, details for the computation of the derivatives w.r.t. specific model parameters are included in the supplemental material. We want to point out that the kernel estimation error \(||{\mathbf{k}}^t - \mathbf{\bar{k}}||_2^2\) is back-propagated to the model parameters \(({\mathbf{g}}^t_i, \phi ^t_i, \eta ^t)\) of the SF for \(\mathbf{z}^t\). Hence, the latent image \(\mathbf{z}^t\) is tailored for accurate kernel estimation and predicted such that the refinement of \({\mathbf{k}}^t\) in each iteration is optimal. This differs from related work in [16, 26].

 

Multi-scale Approach. Our algorithm uses a multi-scale approach to avoid bad local optima. The kernel widths used at the different scales are 5, 9, 17 and 25 pixels. At each scale s, the blurry image \(\mathbf{y}^s\), the true latent image \(\mathbf{\bar{x}}^s\) and the kernel \(\mathbf{\bar{k}}^s\) are downsampled (and re-normalized in the case of \(\mathbf{\bar{k}}^s\)) from their original resolution. The scale index s is omitted for convenience. At the beginning of each scale \(s>1\), the estimated image \(\mathbf{x}\) is initialized by bicubically upsampling its estimate from the previous scale, while the blur kernel \({\mathbf{k}}\) is initialized by nearest-neighbor upsampling, followed by re-normalization. At the coarsest scale \(s=1\), \(\mathbf{x}\) is initialized as \(\mathbf{y}\) and \({\mathbf{k}}\) is initialized as a delta peak. The coupling constant \(\alpha \) in the kernel estimation loss is defined as \(\alpha =r\cdot \eta \), where r is the ratio between the number of pixels in the kernel \({\mathbf{k}}^t\) and in the image \(\mathbf{z}^t\) at the current scale; \(\eta \) is initialized with 1 at the coarsest scale and multiplied by a factor of 0.25 at each subsequent scale. Algorithm 2 summarizes our learning procedure for a single scale of our CSF model.
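A sketch of this coarse-to-fine wrapper, assuming the hypothetical `blind_deblur_scale` helper above; `scipy.ndimage.zoom` stands in for the bicubic (order 3) and nearest-neighbor (order 0) resamplers, and the exact per-scale resolutions are our approximation:

```python
import numpy as np
from scipy.ndimage import zoom

def blind_deblur(y, models_per_scale, kwidths=(5, 9, 17, 25)):
    """Coarse-to-fine blind deblurring: each scale shrinks the full
    kernel (25 px at full resolution) to the widths listed above."""
    for s, kw in enumerate(kwidths):
        factor = kw / kwidths[-1]
        ys = zoom(y, factor, order=3) if factor < 1 else y       # downsampled input
        if s == 0:
            x = ys.copy()                                        # init x with blurry image
            k = np.zeros((kw, kw)); k[kw // 2, kw // 2] = 1.0    # delta-peak kernel
        else:
            x = zoom(x, (ys.shape[0] / x.shape[0],
                         ys.shape[1] / x.shape[1]), order=3)     # bicubic upsampling
            k = zoom(k, kw / k.shape[0], order=0)                # nearest-neighbor upsampling
            k = np.maximum(k, 0.0); k /= k.sum()                 # re-normalize the kernel
        x, k = blind_deblur_scale(ys, x, k, models_per_scale[s])
    return x, k
```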

Fig. 3. Learned filters and shrinkage functions (at the 3rd scale, 1st iteration) for updating \(\mathbf{x}^t\) (Eq. 6) and \(\mathbf{z}^t\), \({\mathbf{k}}^t\) (Eq. 7), respectively. Other parameters learned at this iteration: \(\lambda ^t\)=0.5757, \(\eta ^t\)=0.0218, \(\tau ^t\)=0.0018.

Model Complexity. In both the model \(\varTheta ^t\) for \(\mathbf{x}^t\) and the model \(\varOmega ^t\) for (\(\mathbf{z}^t\), \({\mathbf{k}}^t\)), we use 24 filters \(\mathbf{f}^t_i\) of size \(5\times 5\) as a trade-off between result quality, model complexity and time efficiency. As in [18], we initialize the filters with a DCT filter bank. Each shrinkage function \(\psi ^t_i\) and \(\phi ^t_i\) is composed of 51 equidistantly positioned radial basis functions (RBFs) and is initialized as the identity function. We further enforce central symmetry on the shrinkage functions, which halves the number of trainable RBFs to 25. Figure 3 visualizes some learned models.
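To illustrate this parameterization, the sketch below builds one shrinkage function from 51 equidistant Gaussian RBFs and initializes it to the identity by a least-squares fit; the grid range and the RBF bandwidth are our assumptions, not values from the paper:

```python
import numpy as np

# 51 equidistant RBF centers; the grid range and bandwidth are assumptions
centers = np.linspace(-1.0, 1.0, 51)
gamma = 1.0 / (2.0 * (centers[1] - centers[0]) ** 2)

def rbf_basis(v):
    """Evaluate all 51 Gaussian RBFs at the filter responses v."""
    v = np.asarray(v, dtype=float)
    return np.exp(-gamma * (v[..., None] - centers) ** 2)

# identity initialization: fit the weights so that psi(v) ~= v on the grid
Phi = rbf_basis(centers)                               # 51 x 51 design matrix
w = np.linalg.lstsq(Phi, centers, rcond=None)[0]

def psi(v):
    """psi(v) = sum_j w_j exp(-gamma (v - mu_j)^2); enforcing central
    symmetry (w_j = -w_{50-j}) would halve the trainable weights to 25."""
    return rbf_basis(v) @ w
```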

Training Datasets. We have found that our method works well with a relatively small training dataset, without over-fitting. We collected 20 motion blur kernels from [18] and randomly rotated them to generate 60 different kernels. We collected 60 sharp patches of 250 \(\times \) 250 pixels cropped from documents rendered at around 175 PPI, and rotated each by a random angle between −4 and 4 degrees. We then generated 60 blurry images by convolving each pair of sharp image and kernel, followed by adding white Gaussian noise and quantizing to 8 bits. We used the L-BFGS solver [17] in Matlab for training, which took about 12 h on a desktop with an Intel Xeon CPU.
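Generating one such training pair can be sketched as follows, reusing the hypothetical `synthesize_blurry` helper from Sect. 1; the interpolation settings are our choices:

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_kernel(k, angle):
    """Randomly rotated copy of a collected blur kernel, re-normalized."""
    kr = np.maximum(rotate(k, angle, reshape=False), 0.0)
    return kr / kr.sum()

def make_training_pair(x_sharp, k, angle, sigma=0.01):
    """Rotate a sharp patch by a small angle, blur it via Eq. 1,
    add noise and quantize to 8 bits, as described above."""
    x = rotate(x_sharp, angle, reshape=False, mode="nearest")
    y = synthesize_blurry(x, k, sigma)           # convolve + Gaussian noise
    y = np.round(y * 255.0) / 255.0              # 8-bit quantization
    return x, y
```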

4 Results

In this section we evaluate the proposed algorithm on both synthetic and real-world images. We compare with Pan et al. [14] and Hradiš et al. [10], the state-of-the-art methods for text image blind deblurring, and the natural image deblurring software produced by Xu [1], which is based on recently proposed state-of-the-art techniques [23, 24]. We used the code and binaries provided by the authors and tuned the parameters to generate the best possible results.

Fig. 4. Comparison on a real image taken from [10]. Rows 1–5 from top to bottom show the blurry image and the results of Xu [1], Pan [14], Hradiš et al. [10] and our method. Two cropped regions are shown here; the full-resolution results, along with more examples, can be found in the supplemental material.

Real-World Images. In Figs. 4 and 5 we show comparisons on real images. The results of Xu [1] and Pan [14] contain obvious artifacts due to ineffective image priors that lead to inaccurate kernel estimation. Hradiš et al. [10] fails to recover many characters and distorts the font type and illumination. Our method produces the best results in these cases; our results are both visually pleasing and highly legible. The full-resolution images and more results are included in the supplemental material.

Fig. 5. Comparison on a real image taken from [10]. Rows 1–4 from top to bottom show the blurry image and the results of Pan [14], Hradiš et al. [10] and our method. Two cropped regions are shown; the full-resolution results, along with more results, can be found in the supplemental material.

Fig. 6. PSNR and OCR comparison on a synthetic test dataset with 8 blur kernels.

Quantitative Comparisons. For quantitative evaluation, we test all methods on a synthetic dataset and compare results in terms of the peak signal-to-noise ratio (PSNR). We collected 8 sharp document images of 250\(\times \)250 pixels, cropped from documents rendered at 150 PPI (a similar PPI as used for training in [10]). Each image is blurred with 8 kernels of size 25\(\times \)25 collected from [12], followed by adding 1 % Gaussian noise and 8-bit quantization. In Fig. 6, we show the average PSNR values of all 8 test images synthesized with the same blur kernel. Our method outperforms the other methods in all cases by 0.5–6.0 dB. Hradiš et al. [10] performs close to ours on kernel #3, which resembles defocus blur. It also performs reasonably well on kernel #6, which features a simple motion path, but fails on the other, more challenging kernels. Some results, along with the estimated kernels, are shown in Fig. 7 for visual comparison.
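For reference, the reported per-kernel scores are plain averages of the standard PSNR; a minimal sketch of the metric, assuming intensities in [0, 1]:

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB between an estimate and the ground truth."""
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# per-kernel score: average over the 8 test images deblurred under one kernel,
# e.g. np.mean([psnr(x_hat, x_gt) for x_hat, x_gt in zip(estimates, originals)])
```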

An interesting question one may ask is whether improved deblurring directly leads to better optical-character-recognition (OCR) accuracy. To answer this question, we evaluate OCR accuracy using the software ABBYY FineReader 12. We collected 8 sharp document images from the OCR test dataset in [10]; each contains a continuous paragraph of text. We synthesized 64 blurry images with the 8 kernels and 1 % Gaussian noise, as in the PSNR comparison. We ran the OCR software and used the script provided by [10] to compute the average character error rate for all 8 test images synthesized with the same kernel. The results are shown in Fig. 6 and are consistent with the PSNR results. Hradiš et al. [10] performs well on kernels #3 and #6 but fails on the other, more challenging kernels, while our method is consistently better than the others. All the test images and results for the PSNR and OCR comparisons are included in the supplemental material.

Fig. 7. Comparison on synthetic images from the PSNR experiments in Fig. 6. Note that the original results of [10] distort the illumination of the images; we clamp the intensity of their results to match the ground-truth image before computing the PSNR values.

Table 1. Run-time comparison (in seconds).

Run-Time Comparison. Table 1 provides a comparison of computational efficiency, using images blurred by a 17\(\times \)17 kernel at three different resolutions. The experiments were done on an Intel i7 CPU with 16 GB RAM and a GeForce GTX TITAN GPU. Assuming the image sensor resolution is known a priori, we pre-compute the FFTs of the trained filters \(\mathbf{f}_i\) and \(\mathbf{g}_i\) for maximal efficiency. We report the timing of our Matlab implementation on the CPU. A GPU implementation should significantly reduce the run-time, as our method only requires FFT, 2D convolution and 1D look-up-table (LUT) operations; we leave this as future work.
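The precomputation mentioned above simply caches the padded filter spectra once per sensor resolution; a minimal sketch (our naming):

```python
import numpy as np

def precompute_filter_ffts(filters, shape):
    """Cache F(f_i) and F(f_i^T).F(f_i) for a fixed sensor resolution, so
    each CSF stage only needs FFTs of the current image and kernel."""
    Fs = [np.fft.fft2(f, s=shape) for f in filters]
    return Fs, [np.conj(F) * F for F in Fs]
```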

Fig. 8. Comparison on non-English text and severely rotated images. Note that such non-English text and large rotations were not included in our training dataset.

Fig. 9. Robustness test on noise level and image PPI (pixel-per-inch).

Fig. 10. Comparison on a real image with large-font text. The reference results are from [10]. Following [10], the inputs of (d) Hradiš et al.'s and (e) our method were downsampled by a factor of 3.

Fig. 11. Results with a spatially-varying blur kernel. The blurry input is synthesized with the EFF model [9] to approximate realistic, pixel-wise varying blur.

Robustness. In Fig. 8, we show results on non-English text and severely rotated images. Although both Hradiš et al. [10] and our method are trained only on English text data, our method can be applied to non-English text as well. This is a great benefit of our method, as we do not need to train on every different language or increase the model complexity to handle them, as [10] would need to do. Our method is also robust against significant changes of page orientation, which cannot be handled well by [10].

In Fig. 9, we show the results of our method when the noise level and PPI of the test data differ from those of the training data. Figure 9(a) shows that the performance of our method remains fairly steady as long as the noise level in the test images is not much higher than that of the training data, meaning that models trained at sparsely sampled noise levels are sufficient for practical use. Figure 9(b) shows that our method works well over a fairly broad range of image PPIs, given that the training data are rendered at around 175 PPI.

In Fig. 10, we show a comparison on a real image with large-font text. Following [10], the inputs of Hradiš et al.'s and our method were downsampled by a factor of 3 in order to apply the trained models without re-training. Although such downsampling breaks the image formation model in Eq. 1, our method still generates a reasonable result.

Non-uniform Blur. Our method can easily be extended to handle non-uniform blur by dividing the image into overlapping tiles, deblurring each tile with our proposed algorithm, and then re-assembling the resulting tiles to generate the final estimated image. An example is shown in Fig. 11.
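A minimal sketch of such a tiling scheme; the tile size, overlap and linear feathering window are our choices, and `deblur_fn` would be, e.g., the hypothetical `blind_deblur` sketched earlier:

```python
import numpy as np

def deblur_tiled(y, deblur_fn, tile=256, overlap=64):
    """Handle non-uniform blur: deblur overlapping tiles independently
    with the uniform algorithm and blend them with a feathering window.
    Assumes y is at least one tile large in each dimension."""
    H, W = y.shape
    out = np.zeros_like(y)
    wsum = np.zeros_like(y)
    ramp = np.minimum(np.arange(1, tile + 1), np.arange(tile, 0, -1))
    win = np.minimum.outer(ramp, ramp).astype(float)   # 2D feathering window
    step = tile - overlap
    for i in range(0, max(H - overlap, 1), step):
        for j in range(0, max(W - overlap, 1), step):
            ti, tj = min(i, H - tile), min(j, W - tile) # clamp tile to image
            x_tile, _ = deblur_fn(y[ti:ti + tile, tj:tj + tile])
            out[ti:ti + tile, tj:tj + tile] += win * x_tile
            wsum[ti:ti + tile, tj:tj + tile] += win
    return out / wsum
```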

5 Conclusion and Discussion

In this paper we present a new algorithm for fast and high-quality blind deconvolution of document photographs. Our key idea is to use high-order filters for document image regularization, and to learn such filters and their influence functions from training data using a multi-scale, interleaved cascade of shrinkage field models. Extensive experiments demonstrate that our approach not only produces higher-quality results than the state-of-the-art methods, but is also computationally efficient and robust against changes in noise level, language and page orientation that are not covered by the training data.

Our method also has some limitations. It cannot fully recover the details of an image degraded by large out-of-focus blur; in such cases, Hradiš et al. [10] may outperform our method thanks to its excellent synthesis ability. As future work, it would be interesting to combine both approaches. Although we only demonstrate our model on document photographs, we believe such a framework can also be applied to other domain-specific images, which we plan to explore in the future. The code, dataset and other supplemental material will be available on the authors' webpage.