
1 Introduction

Image inpainting, or the “hole filling” problem, aims to reconstruct the missing or damaged parts of an input image. Classical inpainting algorithms in computer vision mainly fall into two categories: information diffusion and exemplar-based filling.

Partial differential equations (PDEs) are the foundation of the diffusion algorithms [2, 3, 12, 14], which are also referred to as variational methods. Through iterations, the information outside a hole is continuously propagated into the hole while preserving the continuity of the isophotes. These methods handle cracks and small holes well, but produce blurry artifacts when faced with large or textured regions.

The exemplar-based algorithms, on the other hand, are able to reconstruct large regions and remove large unwanted objects by patch matching and filling. The straightforward exemplar approach applies a carefully designed priority-based filling order [4, 5] or a coherence optimization strategy [1, 15]. The limitations are also obvious: it has difficulty handling curved structures, and when suitable similar patches do not exist, it will not produce a reasonable result.

Recently, convolutional networks and adversarial training have also been introduced into the inpainting problem. Compared with the classical algorithms above, network-based algorithms are inherently capable of understanding high-level semantic context, which allows them to tackle harder problems, such as predictions that require semantics. The context encoder [13], for example, consisting of an encoder and a decoder, can predict the large square missing center of an image.

In this paper, we aim to solve a more challenging semantic inpainting problem: arbitrarily sized images with random holes. We adopt adversarial training to suppress the multi-modal problem and obtain sharper results. Instead of predicting a single likelihood evaluating whether an image is fake, as most other GAN works do, our discriminator provides a pixel-wise evaluation by outputting a single-channel image. If the generator does not work well, the discriminator is expected to point out the original missing region. The output is visually explainable, as one can clearly make out the contour of the hole mask if the inpainted image is unsatisfactory, and vice versa.

In our experiments, we found that the generator is good at capturing structural context and performs well on arbitrarily sized images without complex texture. As for the failure cases, mainly complex textures with tiny variations in intensity, the generator produces reasonable but blurry results.

2 Method

2.1 Adversarial Framework

The adversarial framework in this paper is based on Deep Convolutional Generative Adversarial Networks (DCGAN). A generator G and a discriminator D are trained jointly for two opposing goals. When training stops, G is expected to reconstruct a damaged image with high quality.

Fig. 1. Generator architecture: an hourglass encoder-decoder with 3 added shortcut connections. The damaged image and the selected mask are passed through the encoder. The decoder then reconstructs the image without holes. k (kernel size), s (stride), and p (padding) are parameters of the spatial convolution/deconvolution layers.

Generator. The generator G is an hourglass encoder-decoder consisting of basic convolution blocks (Conv/FullConv-BatchNorm-LeakyReLU/ReLU), but with shortcut connections that propagate detail information from the encoder directly to the corresponding symmetric layers of the decoder (Fig. 1).

Encoder. The encoder performs downsampling using \(4*4\) convolution filters with a stride of 2 and a padding of 1. The encoder drops what it considers useless for reconstruction and squeezes the image information into a “concept”. At the bottleneck layer of the encoder, the feature map size is reduced to \(1*1\). The number of filters in this bottleneck layer (m) decides the channel capacity (\(m*1*1\)) of the whole encoder-decoder pipeline. The activation function for the encoder is LeakyReLU with a negative slope of 0.2.

Decoder. The structure of the decoder is completely symmetric to the encoder except for the output layer. Three added shortcut connections directly join the 3 decoder layers closest to the bottleneck layer with their corresponding symmetric encoder layers. The final layer outputs the desired reconstructed RGB image without holes. The activation functions are Tanh for the final layer and ReLU for the remaining layers.
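For concreteness, the following is a minimal sketch of such an hourglass generator in PyTorch-style Python (the paper's implementation is in Torch with 10 convolution layers and a bottleneck of 4000; the exact layer count, channel widths, and the placement of the three shortcut concatenations below are illustrative assumptions):

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    # encoder block: 4x4 conv, stride 2, padding 1 halves the spatial size
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.LeakyReLU(0.2, inplace=True))

def up(in_ch, out_ch):
    # decoder block: 4x4 transposed conv, stride 2, padding 1 doubles the spatial size
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class Generator(nn.Module):
    """Hourglass encoder-decoder; the three decoder layers nearest the
    bottleneck receive shortcut connections from their symmetric encoder layers."""
    def __init__(self, bottleneck=4000):
        super().__init__()
        # encoder: 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 1
        self.e1 = down(4, 64)          # input: RGB + binary mask
        self.e2 = down(64, 128)
        self.e3 = down(128, 256)
        self.e4 = down(256, 512)
        self.e5 = down(512, 512)
        self.e6 = nn.Sequential(nn.Conv2d(512, bottleneck, 4),   # 4x4 -> 1x1 bottleneck
                                nn.LeakyReLU(0.2, inplace=True))
        # decoder mirrors the encoder; *2 channels where a shortcut is concatenated
        self.d6 = nn.Sequential(nn.ConvTranspose2d(bottleneck, 512, 4),
                                nn.BatchNorm2d(512), nn.ReLU(inplace=True))
        self.d5 = up(512 * 2, 512)
        self.d4 = up(512 * 2, 256)
        self.d3 = up(256 * 2, 128)
        self.d2 = up(128, 64)
        self.d1 = nn.Sequential(nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
                                nn.Tanh())                       # reconstructed RGB output

    def forward(self, x):
        e1 = self.e1(x); e2 = self.e2(e1); e3 = self.e3(e2)
        e4 = self.e4(e3); e5 = self.e5(e4)
        d = self.d6(self.e6(e5))
        d = self.d5(torch.cat([d, e5], 1))   # shortcut 1
        d = self.d4(torch.cat([d, e4], 1))   # shortcut 2
        d = self.d3(torch.cat([d, e3], 1))   # shortcut 3
        return self.d1(self.d2(d))
```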

Discriminator. The discriminator D is a 5-layer stack of convolution blocks (Conv-BatchNorm-LeakyReLU). All the convolution layers have the same number of 3*3 filters with a stride of 1 and a padding of 1, so the shape of the feature maps remains the same as the shape of the input throughout the network. The activation functions are Sigmoid for the final layer and LeakyReLU with a negative slope of 0.2 for the remaining layers.

As the final step of inpainting is to fuse the output with the original damaged image, we specify a high-level goal: make the “hole” indistinguishable. Instead of predicting a single likelihood evaluating whether an image is fake, as most other GAN works do, the discriminator here is trained to find the flaws in the output of G, i.e., the hidden holes.

The discriminator is expected to output all ones when faced with natural images and to output the given hole masks (ones/white for known regions, zeros/black for the holes) otherwise. Compared to a single number, this single-channel image output is visually explainable, and it provides more targeted, pixel-wise guidance as well as a more comprehensive judgment.
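A minimal sketch of such a per-pixel discriminator, again in PyTorch-style Python (the channel width and the single-channel output head are assumptions; the paper only specifies five Conv-BatchNorm-LeakyReLU blocks with 3*3 filters, stride 1, padding 1, and a final Sigmoid):

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Fully convolutional D that preserves the input resolution and emits a
    single-channel map in [0, 1]: 1 = looks natural, 0 = suspected hole."""
    def __init__(self, width=64):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(4):   # four Conv-BatchNorm-LeakyReLU blocks, 3x3 / s1 / p1
            layers += [nn.Conv2d(in_ch, width, 3, stride=1, padding=1),
                       nn.BatchNorm2d(width),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = width
        # fifth block: project to one channel and squash with Sigmoid
        layers += [nn.Conv2d(in_ch, 1, 3, stride=1, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)   # N x 1 x H x W, same H and W as the input
```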

2.2 Objective

The generator is trained to regress the ground-truth content of the input image. It is well known that the L2 loss (Eq. 1), like L1, prefers a blurry solution over clear texture [10]. It is nevertheless an effective way to capture and rebuild low-frequency information.

$$\begin{aligned} L_2(G) = ||x - G(x')||_2^2 \end{aligned}$$
(1)

The adversarial loss is introduced to obtain a sharper result. The objective of the GAN can be expressed as Eq. 2, where \(x'\) is the damaged input image and x is the corresponding ground truth. \(\hat{M}\) is the hole mask, in which the holes are filled with zeros and the known regions with ones; the all-\(\mathbf {1}\)s mask means no holes at all. G is trained to minimize this objective, while D is trained to maximize it.

$$\begin{aligned} L_{GAN}(G, D) = E_x|| \hat{M} - D(G(x'))||_2^2 + E_{x'}||\mathbf {1} - D(x)||_2^2 \end{aligned}$$
(2)

The total loss function is a weighted average of the reconstruction loss and the adversarial loss (Eq. 3). We assign a rather large weight (\(\lambda = 0.999\)) to the reconstruction loss.

$$\begin{aligned} L(G, D) = \lambda L_2(G) + (1 - \lambda ) L_{GAN} (G, D) \end{aligned}$$
(3)
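The three equations translate almost verbatim into code; below is a hedged sketch in PyTorch-style Python (the function name and the way the targets are passed in are assumptions, and the paper updates G and D with opposing goals on the adversarial term):

```python
import torch
import torch.nn.functional as F

def inpainting_losses(G, D, x, x_damaged, mask_hat, lam=0.999):
    """x: ground truth, x_damaged: input with holes, mask_hat: D's target for
    fake inputs (zeros over the holes, ones over the known regions)."""
    y = G(x_damaged)
    l2 = F.mse_loss(y, x)                                  # Eq. 1: reconstruction loss
    l_gan = (F.mse_loss(D(y), mask_hat) +                  # D should expose the holes in G(x')
             F.mse_loss(D(x), torch.ones_like(mask_hat)))  # D should output all ones on real x
    total = lam * l2 + (1.0 - lam) * l_gan                 # Eq. 3: weighted combination
    return total, l2, l_gan
```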

2.3 Masks for Training

As our framework is supposed to support damaged regions of arbitrary shape and position, we need to train the networks with numerous random masks. For efficiency, the inputs within a mini-batch share the same mask, and all masks are sampled from a global pattern pool.

The global pattern pool is generated as follows: (1) create a fixed-size matrix of uniform random values (ranging from 0 to 1); (2) scale it up to a given large size (10000 * 10000); (3) mark the regions with values less than a threshold as “holes” (ones/white) and the rest as “known region” (zeros/black). The scaling ratio and the threshold are two important hyper-parameters of the global pattern pool. The scaling ratio controls the continuity of the holes; a larger scaling ratio generates a more scattered result.
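A small sketch of this mask pipeline (NumPy/SciPy; the base matrix size and the threshold value are assumptions, since the paper only fixes the 10000 * 10000 pool size):

```python
import numpy as np
from scipy.ndimage import zoom

def make_pattern_pool(base_size=100, pool_size=10000, threshold=0.25, seed=0):
    """Global mask pattern pool: upscale a small uniform-random matrix to
    pool_size x pool_size and threshold it (values below the threshold become
    holes). The scaling ratio pool_size / base_size controls hole continuity."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0.0, 1.0, size=(base_size, base_size)).astype(np.float32)
    big = zoom(noise, pool_size / base_size, order=1)      # bilinear upscaling
    return (big < threshold).astype(np.float32)            # 1 = hole, 0 = known region

def sample_mask(pool, size=128, rng=None):
    """Draw one training mask as a random crop of the global pool; within a
    mini-batch every input reuses the same crop."""
    rng = rng or np.random.default_rng()
    y = rng.integers(0, pool.shape[0] - size)
    x = rng.integers(0, pool.shape[1] - size)
    return pool[y:y + size, x:x + size]
```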

3 Experiment

3.1 Training Details

This work is implemented in Torch and trained on an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with a TITAN X (Pascal) GPU. We train G and D jointly for 500 epochs, using the stochastic gradient solver ADAM for optimization. The learning rate is 0.0002.

The training dataset consists of 100 classes of the ILSVRC2010 training dataset (over 1.24M natural images). The natural images are scaled down proportionally so that the maximum of the width and the height is no more than 350px. Then, 64 random \(128 * 128\) crops from different images form a mini-batch. These crops share the same mask. During training, we require the hole area of a mask to be between 20% and 30%.

The G receives an input of size \(128 * 128 * 4\), where the first three channels are the RGB data and the last channel is a binary mask. The ones in the binary mask indicate the holes, while the zeros indicate the known region. The missing region of the RGB data specified by the mask is filled with a constant mean value (R:117, G:104, B:123). We also experimented with the gray value (R:127, G:127, B:127) and found no significant difference between the two in terms of generator performance. The G consists of 10 convolution layers, and the bottleneck size is 4000.
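A sketch of this input construction (PyTorch-style; the [0, 1] value range and the tensor layout are assumptions):

```python
import torch

# mean color used to fill the holes (R:117, G:104, B:123), rescaled to [0, 1]
MEAN_FILL = torch.tensor([117.0, 104.0, 123.0]).view(1, 3, 1, 1) / 255.0

def make_generator_input(rgb, mask):
    """rgb: N x 3 x 128 x 128 batch in [0, 1]; mask: N x 1 x 128 x 128 binary
    mask with 1 = hole and 0 = known region. Returns the N x 4 x 128 x 128
    generator input."""
    damaged = rgb * (1.0 - mask) + MEAN_FILL * mask   # overwrite the holes with the mean color
    return torch.cat([damaged, mask], dim=1)          # append the mask as the 4th channel
```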

3.2 Evaluation

In prior GAN works, D outputs a single probability indicating whether the input is a natural image. In this random inpainting problem, we find that this requires an elaborate design of the weight assignment between the hole regions and the known regions when updating the parameters of both D and G. Moreover, the output of D may not be consistent with human intuition.

Our method, on the contrary, trains D to find the pixel-wise flaws of the G output. It turns out that the output of D is visually explainable and the optimization is easier. One can clearly make out the contour of the hole mask if the inpainted image is unsatisfactory, and vice versa (Fig. 2).

Fig. 2. Comparison of the G output and the D output. The D output is visually explainable, because the blurry or unnatural parts are darker than the normal parts.

We evaluate our inpainting framework using images from the ILSVRC2010 validation dataset (the “barn spider” and “black and gold garden spider” classes). As the generator only receives \(128 * 128\) images, we split the input into a batch of \(128 * 128\) crops if the input image is too large. Afterwards, the results are tiled to recreate the image at its original size. We found that G generates plausible results when faced with linear structures, curved lines, and blurry scenes, but it has difficulty handling complex textures with tiny variations in intensity. The background regions in Fig. 3 are inpainted so well that G successfully fools D, while the spider body regions, full of low-contrast details, are handled poorly.
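A sketch of this crop-and-tile evaluation loop (PyTorch-style; the reflection padding of non-multiple image sizes and the inline 4-channel input construction are assumptions, and the hole pixels of the input are assumed to be pre-filled with the training-time mean color):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inpaint_large(G, rgb, mask, tile=128):
    """rgb: 1 x 3 x H x W image in [0, 1]; mask: 1 x 1 x H x W with 1 = hole.
    Runs G on non-overlapping 128 x 128 tiles and stitches the outputs back."""
    _, _, h, w = rgb.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile          # pad to multiples of the tile size
    rgb = F.pad(rgb, (0, pad_w, 0, pad_h), mode="reflect")
    mask = F.pad(mask, (0, pad_w, 0, pad_h))
    out = torch.zeros_like(rgb)
    for y in range(0, rgb.shape[2], tile):
        for x in range(0, rgb.shape[3], tile):
            crop_rgb = rgb[:, :, y:y + tile, x:x + tile]
            crop_mask = mask[:, :, y:y + tile, x:x + tile]
            inp = torch.cat([crop_rgb, crop_mask], dim=1)   # 4-channel generator input
            out[:, :, y:y + tile, x:x + tile] = G(inp)
    return out[:, :, :h, :w]                          # drop the padding again
```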

Fig. 3. Uniform random inpainting example from ILSVRC2010. The original images are split into \(128 * 128\) crops and each crop is inpainted respectively.

4 Conclusions

In this paper, we aim to provide a unified solution for random image inpainting problems. Unlike prior works, the output of D is visually explainable, and G is modified to adapt to general inpainting tasks. Trained in a completely unsupervised manner, without carefully designed strategies, the GAN learns basic common sense about natural images. The results suggest that our method is a promising approach for many inpainting tasks.