Generator Pyramid for High-Resolution Image Inpainting

Inpainting high-resolution images with large holes remains challenging for existing deep-learning-based methods. We present PyramidFill, a novel framework for high-resolution image inpainting that explicitly disentangles content completion and texture synthesis. PyramidFill completes the content of unknown regions at a lower resolution and then progressively synthesizes the textures of those regions at higher resolutions. Accordingly, our model consists of a pyramid of fully convolutional GANs, in which a content GAN completes contents in the lowest-resolution masked image and each texture GAN synthesizes textures at a higher resolution. Since completing contents and synthesizing textures demand different abilities from generators, we customize different architectures for the content GAN and the texture GANs. Experiments on multiple datasets with different resolutions, including CelebA-HQ, Places2 and a new natural scenery dataset (NSHQ), demonstrate that PyramidFill generates higher-quality inpainting results than state-of-the-art methods. To better assess high-resolution image inpainting methods, we will release NSHQ, a dataset of high-quality natural scenery images at 1920$\times$1080 resolution.


Introduction
Image inpainting, a fundamental low-level vision task, has attracted much attention from both academia and industry. A wide range of vision and graphics applications rely on image inpainting, e.g., object removal [2,1], image restoration [4], manipulation [18], and super-resolution [26]. Image inpainting aims to synthesize alternative contents for the missing regions of an image that are both visually realistic and semantically correct [29].
Existing inpainting methods generally fall into two categories: copying from low-level image features and generating new contents from high-level semantic features. The former attempt to fill holes by matching and copying patches from known regions [1] or external database images [6]. These approaches are effective for easy cases, e.g., inpainting a uniformly textured background. However, they cannot handle challenging cases where missing regions involve complex textures or non-repetitive structures, due to their lack of high-level semantic understanding [29]. The latter approaches learn to capture the context representation of known regions and synthesize the missing contents using deep convolutional neural networks (CNNs) trained end-to-end. Context Encoder [19] is the first CNN-based method for the image inpainting task; it proposed a deep convolutional encoder-decoder architecture and added an adversarial loss [5] to generate visually realistic results. The encoder-decoder architecture was extended by many CNN-based inpainting models to obtain more reasonable and fine-detailed contents [7,29,30,15,9,28]. However, end-to-end trained encoder-decoder architectures still struggle to fill large holes in high-resolution images. As shown in Figure 1, the results of PConv [15], obtained from its online demo, illustrate that PConv fails to generate reasonable contents for these two cases. The guidance for filling the holes is lost when some regions inside the holes are far from the surrounding known regions, which causes the generator to produce semantically ambiguous contents or visual artifacts [14]. In addition, higher-resolution images make it easier for the discriminator to tell fully inpainted images apart from training images, drastically amplifying the gradient problem [11]. An alternative is to use two encoder-decoder architectures to separately fill contents and textures for the unknown regions [29]. Even so, only a few methods can process images with resolution up to 512×512, where the masked area is generally less than 25% [30].
We observe that filling contents and filling textures demand quite different abilities from the generator. Content completion relies more on capturing the high-level semantics and the global structure of the image, whereas low-level features and local texture statistics are more critical for texture synthesis. Furthermore, content completion can be regarded as an image generation task [5], while texture synthesis can be treated as an image-to-image translation task [8].
Motivated by the observation that high-resolution image inpainting can be disentangled into content completion and texture synthesis, we propose a simple yet powerful high-resolution image inpainting framework named PyramidFill, which attempts to fill large holes in images at resolutions up to 1024×1024. Our key insight is that we can fill the contents at an easier low resolution and then synthesize the textures for the higher-resolution details progressively, where the high-resolution image is formed into a pyramid of images at different scales by downsampling. PyramidFill consists of a pyramid of fully convolutional Generative Adversarial Networks (GANs), wherein the content GAN is responsible for generating contents in the lowest-resolution masked image of the pyramid, and each texture GAN is responsible for synthesizing textures at a higher resolution.
Our major contributions can be summarized as follows:
• We provide a new perspective that high-resolution image inpainting can be disentangled into low-resolution content completion and higher-resolution texture synthesis.
• Following this perspective, we design a novel framework consisting of a pyramid of GANs, wherein the content GAN is responsible for generating contents in the lowest-resolution masked image of the image pyramid, and each texture GAN is responsible for synthesizing textures at a higher resolution.
• We introduce a new dataset of natural scenery at 1920 × 1080 resolution for real image inpainting applications.

Deep Image Inpainting
A variety of CNN-based approaches have been proposed for image inpainting. Pathak et al. [19] first introduced an encoder-decoder architecture for the image inpainting task, together with a pixel-wise reconstruction loss and an adversarial loss. Building on Pathak's work, Iizuka et al. [7] proposed a fully convolutional GAN model with an extra local discriminator to ensure local image coherency. Yang et al. [25] proposed a multi-scale neural patch synthesis approach based on the joint optimization of image content and texture constraints. Observing the ineffectiveness of CNNs in modeling long-term correlations between distant contextual information and the hole regions, Yu et al. [29] presented a two-stage encoder-decoder architecture with a novel contextual attention layer integrated into the second stage, which used the features of known patches as convolutional filters to process the generated patches. The aforementioned approaches were based on vanilla convolutions that treat all input pixels as equally valid, which is not reasonable for masked holes. To address this limitation, Liu et al. [15] proposed a partial convolutional layer for inpainting irregular holes, comprising a masked and re-normalized convolution operation followed by a mask-update step. Partial convolutions can be viewed as hard-mask convolutional layers; Yu et al. [30] instead proposed gated convolution, which learns a dynamic feature gating mechanism for each channel and each spatial location across all layers and can be viewed as a soft-mask convolutional layer. Yang et al. [27] proposed a multi-task learning framework that incorporates image structure knowledge to assist inpainting, training a shared generator to simultaneously complete the masked image and the corresponding structures (edges and gradients). Liu et al. [16] proposed a mutual encoder-decoder CNN for jointly recovering structures and textures, which used CNN features from the deep and shallow layers of the encoder to represent the structures and textures of an input image, respectively.

High-Resolution Image Inpainting
Most recently, a few works have been presented for high-resolution image inpainting. Instead of directly filling holes in high-resolution images, Yi et al. [28] proposed a contextual residual aggregation mechanism that produces high-frequency residuals for missing contents by aggregating weighted residuals from contextual patches, thus requiring only a low-resolution prediction from the network. Zeng et al. [32] presented a guided upsampling network to generate high-resolution inpainting results by extending the contextual attention module, which borrows high-resolution feature patches from the input image.

Pyramid in Image Generation
Pyramids have been explored widely in image generation. Denton et al. [3] introduced a cascade of CNNs within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion; at each level of the pyramid, a separate generative model was trained using the GAN approach. Shaham et al. [21] introduced SinGAN to learn from a single natural image, using a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. Inspired by classical image pyramid representations, Shocher et al. [22] proposed a semantic generation pyramid framework to generate diverse image samples, which utilizes the continuum of semantic information encapsulated in deep features, ranging from low-level textural information in fine features to high-level semantic information in deeper features.

Method
In this section, we first introduce the pipeline of the proposed PyramidFill, and then present the generator and discriminator networks in each level, as well as the loss functions.

Pipeline of PyramidFill
Figure 2 illustrates the pipeline of our proposed algorithm PyramidFill, which consists of a pyramid of PatchGANs [13,8]: $\{G_i, D_i\}$, $i = 0, 1, 2, 3, 4$. For training, given a high-resolution image $x$, we sample a binary image mask $m$ at a random location (e.g., the center area in Figure 2). The input image $z$ is obtained by masking the original image as $z = x \odot (1 - m) + m$, where $\odot$ denotes element-wise multiplication. The original image, input image and mask are separately downsampled by a factor of $r^{4-i}$ (we choose $r = 2$) to form pyramids $\{x_i, z_i, m_i\}$, $i = 0, 1, 2, 3, 4$. Each generator $G_i$ is responsible for filling the corresponding-scale masked image $z_i$ and is trained against the image $x_i$, where $G_i$ learns to fool the discriminator $D_i$.
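To make the pipeline concrete, the following minimal sketch (our own illustration, not the authors' released code) builds the image, input and mask pyramids described above; the tensor layout, the bilinear/nearest resampling choices and the helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def build_pyramids(x, m, levels=5, r=2):
    """x: (B, 3, H, W) image in [0, 1]; m: (B, 1, H, W) binary mask, 1 = hole.
    Returns coarsest-first lists (x_i, z_i, m_i), i = 0..levels-1."""
    z = x * (1 - m) + m  # z = x ⊙ (1 - m) + m: holes are filled with ones
    xs, zs, ms = [], [], []
    for i in range(levels):
        f = r ** (levels - 1 - i)  # downsample factor r^{4-i} when levels = 5
        if f == 1:
            xi, zi, mi = x, z, m
        else:
            xi = F.interpolate(x, scale_factor=1.0 / f, mode='bilinear', align_corners=False)
            zi = F.interpolate(z, scale_factor=1.0 / f, mode='bilinear', align_corners=False)
            # keep the mask strictly binary after resampling
            mi = (F.interpolate(m, scale_factor=1.0 / f, mode='nearest') > 0.5).float()
        xs.append(xi); zs.append(zi); ms.append(mi)
    return xs, zs, ms
```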
The filling of a high-resolution image starts at the lowest-scale image $x_0$, where $\{G_0, D_0\}$ are trained to complete the contents, and then progressively synthesizes finer textures at higher scales by training $\{G_i, D_i\}$, $i = 1, 2, 3, 4$. The generator $G_0$ takes the concatenation of $z_0$ and $m_0$ as input and outputs a predicted image $\hat{x}_0$ of the same size as the input. We then replace the masked region of $z_0$ with the prediction to obtain the inpainting result $y_0$:

$y_0 = \hat{x}_0 \odot m_0 + z_0 \odot (1 - m_0).$

After training $G_0$, the inpainting result $y_0$ and the input image $z_1$ are given to the generator $G_1$, which outputs a predicted image $\hat{x}_1$ of the same size as $z_1$; the prediction replaces the masked region of $z_1$ to obtain the inpainting result $y_1$. Training the other PatchGANs proceeds in the same manner as $\{G_1, D_1\}$: given the corresponding input image $z_i$ and the lower-scale inpainting result $y_{i-1}$, the current-scale inpainting result is

$y_i = \hat{x}_i \odot m_i + z_i \odot (1 - m_i), \quad \hat{x}_i = G_i(z_i, y_{i-1}), \quad i = 1, 2, 3, 4.$
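A sketch of the resulting coarse-to-fine inference loop is given below; the exact generator signatures (here $G_0$ takes $(z_0, m_0)$ and each texture generator takes $(z_i, m_i, y_{i-1})$) are our assumption.

```python
import torch

@torch.no_grad()
def pyramid_fill(generators, zs, ms):
    """Progressively fill an image pyramid, coarsest level first.

    generators: [G_0, ..., G_4]; zs, ms: coarsest-first lists of masked
    inputs and binary masks (1 = hole), as built by build_pyramids.
    """
    # Content level: complete the coarsest image, then composite with the input.
    x_hat = generators[0](zs[0], ms[0])
    y = x_hat * ms[0] + zs[0] * (1 - ms[0])      # y_0 = x̂_0 ⊙ m_0 + z_0 ⊙ (1 - m_0)
    # Texture levels: predict at the next scale, compositing at every level.
    for i in range(1, len(generators)):
        x_hat = generators[i](zs[i], ms[i], y)   # y is the lower-scale result y_{i-1}
        y = x_hat * ms[i] + zs[i] * (1 - ms[i])  # y_i = x̂_i ⊙ m_i + z_i ⊙ (1 - m_i)
    return y
```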

Generators and Discriminators
Since completing contents and synthesizing textures demand quite different abilities from generators, we customize different architectures for $\{G_0, D_0\}$ and $\{G_i, D_i\}$, $i = 1, 2, 3, 4$.
Figure 3 shows the architecture of the PatchGAN for completing contents on the lowest-scale input image $z_0$. For the generator network, we maintain the spatial size of the feature maps throughout, since the input images are already of low resolution, instead of using the encoder-decoder design adopted by most image inpainting methods. The input images are first passed through four gated convolutional layers [30], after which the feature maps are split into two branches along the channel dimension. Four dilated gated convolutional layers are adopted in the upper branch to expand the receptive fields and explore the global structures of input images, while four gated convolutional layers are adopted in the lower branch to exploit fine contents. The feature maps from the two branches are then concatenated and passed through the last four gated convolutional layers to predict the contents in the masked region. The discriminator network consists of eight convolutional layers without any downsampling operations, so that the spatial size of the output matches that of the input image; this allows it to better capture fine details in the lowest-scale image.
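The following sketch illustrates the gated convolution [30] building block and the two-branch content generator described above. It is a hypothetical reconstruction from the text: channel widths, dilation rates and activation choices are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution [30]: a soft, learnable mask per channel and location.
    out = φ(conv_f(x)) ⊙ σ(conv_g(x)); one conv here produces both halves."""
    def __init__(self, in_ch, out_ch, k=3, dilation=1):
        super().__init__()
        pad = dilation * (k // 2)                # preserves spatial size for odd k
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, k, padding=pad, dilation=dilation)
        self.act = nn.ELU()

    def forward(self, x):
        feat, gate = self.conv(x).chunk(2, dim=1)
        return self.act(feat) * torch.sigmoid(gate)

class ContentGenerator(nn.Module):
    """Sketch of G_0: gated-conv stem -> dilated/plain two-branch -> fusion."""
    def __init__(self, ch=64):
        super().__init__()
        self.stem = nn.Sequential(*[GatedConv2d(4 if i == 0 else ch, ch) for i in range(4)])
        half = ch // 2
        self.global_branch = nn.Sequential(      # dilated gated convs: large receptive field
            *[GatedConv2d(half, half, dilation=d) for d in (2, 4, 8, 16)])
        self.local_branch = nn.Sequential(       # plain gated convs: fine local content
            *[GatedConv2d(half, half) for _ in range(4)])
        self.fuse = nn.Sequential(
            *[GatedConv2d(ch, ch) for _ in range(4)],
            nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, z0, m0):
        h = self.stem(torch.cat([z0, m0], dim=1))
        a, b = h.chunk(2, dim=1)                 # split along the channel dimension
        h = torch.cat([self.global_branch(a), self.local_branch(b)], dim=1)
        return torch.tanh(self.fuse(h))
```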
The architecture of the PatchGAN for synthesizing textures is presented in Figure 4; $\{G_1, D_1\}$, $\{G_2, D_2\}$, $\{G_3, D_3\}$ and $\{G_4, D_4\}$ share this architecture but are trained progressively. The generator has two stages: we call the first stage the super-resolution network and the second stage the refinement network. The super-resolution network predicts a higher-resolution image from its lower-resolution counterpart, which is taken from the inpainting result of the lower-scale generator. We use the predicted higher-resolution image to replace the masked region of the corresponding-scale input image, which then passes through the refinement network to output a finer result. Inspired by ESRGAN [24] and SRResNet [12], we design a simple yet powerful super-resolution network, in which all vanilla convolutions in the RRDB module [24] are replaced with gated convolutions, and only two RRDB modules are adopted. The upsampling operator uses a sub-pixel layer with factor 2 to increase the resolution of the input image. The refinement network contains only six gated convolutional layers, with a shortcut connection spanning the middle four layers. In the discriminator, we use four strided convolutions with stride 2 to reduce the spatial size of the feature maps due to memory cost.
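Below is a similarly hedged sketch of a texture generator, reusing the GatedConv2d module from the previous sketch. For brevity, the two gated RRDB modules are abbreviated to a small gated residual body; the channel width and the (z, m, y_prev) signature are assumptions.

```python
import torch
import torch.nn as nn
# Reuses GatedConv2d from the content-generator sketch above.

class TextureGenerator(nn.Module):
    """Sketch of a texture generator G_i (i >= 1): super-resolution stage,
    mask compositing, then a refinement stage."""
    def __init__(self, ch=64):
        super().__init__()
        # Stage 1: super-resolve the lower-scale inpainting result y_{i-1}.
        self.sr_head = GatedConv2d(3, ch)
        self.sr_body = nn.Sequential(GatedConv2d(ch, ch), GatedConv2d(ch, ch))
        self.sr_up = nn.Sequential(              # sub-pixel upsampling, factor 2
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2))
        self.sr_out = nn.Conv2d(ch, 3, 3, padding=1)
        # Stage 2: refinement; gated convs with a shortcut over the middle block.
        self.ref_in = GatedConv2d(3, ch)
        self.ref_mid = nn.Sequential(*[GatedConv2d(ch, ch) for _ in range(4)])
        self.ref_out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, z, m, y_prev):
        h = self.sr_head(y_prev)
        h = self.sr_up(h + self.sr_body(h))      # residual body, then upsample ×2
        x_coarse = torch.tanh(self.sr_out(h))    # coarse prediction at scale i
        comp = x_coarse * m + z * (1 - m)        # paste prediction into the hole
        r = self.ref_in(comp)
        x_fine = torch.tanh(self.ref_out(r + self.ref_mid(r)))
        return x_fine                            # x_coarse could also be supervised
```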

Loss Function
To stabilize the training of the PatchGANs, spectral normalization [17] is applied in all discriminators. We use a hinge loss [30] as the objective function for all PatchGANs $\{G_i, D_i\}$, $i = 0, 1, 2, 3, 4$:

$\mathcal{L}_{D_i} = \mathbb{E}_{x_i}\left[\mathrm{ReLU}(1 - D_i(x_i))\right] + \mathbb{E}_{z_i}\left[\mathrm{ReLU}(1 + D_i(G_i(z_i)))\right],$
$\mathcal{L}_{G_i} = -\mathbb{E}_{z_i}\left[D_i(G_i(z_i))\right],$

where $x_i$ and $z_i$ represent the real image and the masked input image, respectively, and $D_i$ and $G_i$ denote the spectral-normalized discriminator and the generator, respectively. Inspired by U-Net GAN [20], we use a per-pixel consistency regularization technique when training $D_0$, encouraging the discriminator to focus more on semantic and structural changes between real and fake images. This technique further enhances the quality of the generated sample $\hat{x}_0$. We train the discriminator to provide consistent per-pixel predictions by introducing a consistency regularization loss term into the discriminator objective:

$\mathcal{L}_{cons} = \left\| D_0(y_0) - \left( m_0 \odot D_0(\hat{x}_0) + (1 - m_0) \odot D_0(x_0) \right) \right\|^2,$

where $\|\cdot\|$ denotes the $L_2$ norm. This consistency loss is taken between the per-pixel output of $D_0$ on the composed image $y_0$ and the mask-weighted composite of the outputs of $D_0$ on the real and fake images, penalizing the discriminator for inconsistent predictions.
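In code, the hinge objective and the per-pixel consistency regularization sketched above look as follows; reducing with the mean and resizing the mask to the discriminator's output grid are our assumptions.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    """Hinge loss for a per-pixel PatchGAN discriminator, as in SN-PatchGAN [30]."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    """Generator side of the hinge objective."""
    return -d_fake.mean()

def consistency_loss(d_composed, d_real, d_fake, m):
    """Per-pixel consistency regularization for D_0, U-Net GAN style [20].

    d_composed: D_0(y_0), per-pixel output on the composed image
    d_real, d_fake: D_0(x_0) and D_0(x̂_0)
    m: binary mask resized to match the discriminator's output grid
    """
    target = m * d_fake + (1 - m) * d_real   # composite of real/fake predictions
    return ((d_composed - target) ** 2).mean()
```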
In addition to the adversarial loss, a pixel-wise reconstruction loss is introduced for training the generators, because the task of the generators is not only to fool the discriminators but also to generate images similar to the real images:
$\mathcal{L}_{rec} = \frac{1}{N_{x_{gen}}} \left\| x_{gen} - x \right\|_1,$

where $N_{x_{gen}}$ is the number of elements in the image $x_{gen}$ predicted by the generators. We introduce two additional losses for training the texture generators: a perceptual loss [10] and a style loss [15,9]. The perceptual loss penalizes results that are not perceptually similar by computing the $L_1$ distance between feature maps of a pretrained network:

$\mathcal{L}_{perc} = \sum_{q} \frac{1}{N_q} \left\| \phi_q(x_{gen}) - \phi_q(x) \right\|_1,$

where $N_q$ indicates the number of elements in the $q$-th layer and $\phi_q$ is the feature map of the $q$-th layer extracted from a VGG-16 network pretrained on ImageNet [23]; we choose the feature maps from layers pool1, pool2 and pool3. The style loss compares the texture statistics of two images using Gram matrices. Given feature maps of size $C_q \times H_q \times W_q$, the style loss is computed as

$\mathcal{L}_{style} = \sum_{q} \frac{1}{C_q \times C_q} \left\| G^{\phi}_q(x_{gen}) - G^{\phi}_q(x) \right\|_1,$

where $G^{\phi}_q$ is a $C_q \times C_q$ Gram matrix constructed from the feature maps $\phi_q$, which are the same as those used in the perceptual loss.
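A sketch of these three generator losses is shown below, using a frozen torchvision VGG-16; the pool1/pool2/pool3 layer indices (4, 9, 16) and the batch reductions are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class VGG16Features(nn.Module):
    """Extracts pool1/pool2/pool3 feature maps from an ImageNet-pretrained
    VGG-16. Inputs are assumed to be ImageNet-normalized; older torchvision
    versions use pretrained=True instead of the weights argument."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights='IMAGENET1K_V1').features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.taps = vgg, {4, 9, 16}    # indices of pool1, pool2, pool3

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.taps:
                feats.append(x)
        return feats

def reconstruction_loss(x_gen, x):
    return (x_gen - x).abs().mean()              # L1, averaged over all elements

def gram(phi):
    b, c, h, w = phi.shape
    f = phi.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # (B, C, C) Gram matrices

def perceptual_and_style_loss(vgg_feats, x_gen, x):
    l_perc = x_gen.new_zeros(())
    l_style = x_gen.new_zeros(())
    for pg, px in zip(vgg_feats(x_gen), vgg_feats(x)):
        l_perc += (pg - px).abs().mean()         # (1/N_q) ||φ_q(x_gen) − φ_q(x)||_1
        c = pg.shape[1]
        l_style += (gram(pg) - gram(px)).abs().sum(dim=(1, 2)).mean() / (c * c)
    return l_perc, l_style
```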

Experiments
We evaluate PyramidFill on three datasets: CelebA-HQ [11], Places2 [33], and our newly collected NSHQ dataset. CelebA-HQ contains 30,000 high-quality images of human faces at 1024×1024 resolution. For Places2, we randomly select 20 of the 365 categories to form a subset of 100,000 images. The NSHQ dataset includes 5,000 high-quality natural scenery images at 1920×1080 or 1920×1280 resolution. For Places2 and NSHQ, images are randomly cropped to 512×512 and 1024×1024, respectively; therefore, there is no $\{G_4, D_4\}$ when training on the Places2 subset. For all datasets, we randomly partition the images into 90% for training and 10% for testing.

Quantitative Evaluation
We conduct quantitative comparisons on the CelebA-HQ dataset and the Places2 subset. For a fair evaluation, all input images are resized to 256×256 and 512×512, respectively.

Qualitative Evaluation
The qualitative comparisons on the three datasets are shown in Figure 5. Global&Local and DeepFillV1 often generate heavy artifacts, even when the holes are not very large. PConv, PEN-Net and DeepFillV2 can generate correct contents when completing faces, yet lack detailed textures. By contrast, our PyramidFill generates more realistic results on the corrupted faces, with finer textures. When dealing with the corrupted images from Places2 and NSHQ, PConv and PEN-Net also produce obvious artifacts. DeepFillV2 and our PyramidFill can both generate reasonable contents and detailed textures, but re-

Real Applications
We study real use cases of image inpainting on high-resolution images using our PyramidFill, e.g., freckle removal, face editing, watermark removal, and general object removal in natural scenery. In the first row of Figure 6, our model successfully removes the freckles on the face and also changes the eyes. In the second row, the watermarks on the original image are removed successfully. In the third row, we remove the giraffes on the grassland and recover the background. In the last row, a boat is removed from the river. All inpainting results are realistic, with fine-grained details. Furthermore, our PyramidFill can also be used for other real applications, e.g., removing wrinkles on faces, changing hairstyles, and restoring old photos.
Refinement network. In the texture generators, a refinement network is designed as a second stage to generate a finer result. We provide ablation experiments on the CelebA-HQ dataset with 128×128 images masked with center holes. For a fair comparison, the two-stage network is merged into a one-stage network, wherein the feature maps from the upsampling operator in the super-resolution network are fed directly into the refinement network, and no coarse result is output. As shown in Figure 7, the two-stage network generates more photo-realistic inpainting results.

Conclusions
We present PyramidFill, a novel framework for the high-resolution image inpainting task. PyramidFill consists of a pyramid of PatchGANs, wherein the content GAN is responsible for generating contents in the lowest-resolution corrupted image, and each texture GAN is responsible for synthesizing textures at a higher resolution, progressively. We customized the generators and discriminators for the content GAN and the texture GANs, respectively. Our model was trained on several datasets to evaluate its ability to fill correct contents and realistic textures for high-resolution image inpainting. Quantitative and qualitative results demonstrate the superiority of PyramidFill compared with several state-of-the-art methods.

Figure 1. Free-form inpainting results on 512×512 images by PConv and our model. Zoom in to see the details.

Figure 2. Pipeline of our high-resolution image inpainting algorithm, PyramidFill. Our model consists of a pyramid of PatchGANs, which are trained progressively.

Figure 3. Architecture of the PatchGAN for completing contents.

Figure 4. Architecture of the PatchGAN for synthesizing textures.

Figure 6. Inpainting results on 1024×1024 images using our PyramidFill. Zoom in to see the details. The original images in the first two rows are from the CelebA-HQ dataset, and those in the last two rows are from the NSHQ dataset.

Figure 7. Ablation study of the refinement network in the texture generators. From left to right: the real image, the masked input, the result with a one-stage network in the texture generators, and our result with the two-stage network.

Table 1. Quantitative comparisons on the CelebA-HQ testing set with input size 256×256, where the inputs have center hole regions. ↓ indicates lower is better; ↑ indicates higher is better.

Table 2. Quantitative comparisons on the CelebA-HQ dataset with input size 512×512, where the inputs have center hole regions. We use L1 loss, PSNR and SSIM as metrics. The results show that our PyramidFill outperforms existing state-of-the-art methods by evident margins. Table 3 and Table 4 present the quantitative comparisons on the Places2 testing set with 256×256 and 512×512 inputs, respectively.

Table 3. Quantitative comparisons on the Places2 subset with input size 256×256, where the inputs have irregular holes.

Table 4. Quantitative comparisons on the Places2 subset with input size 512×512, where the inputs have irregular holes.

Table 5. Ablation study on the CelebA-HQ dataset with input size 64×64, where the inputs have center hole regions. The consistency regularization loss improves the performance of our model.

To demonstrate the effects of the consistency regularization loss used in the discriminator $D_0$, we retrain $\{G_0, D_0\}$ without the consistency regularization loss on the CelebA-HQ dataset masked with center holes. The results are provided in Table 5, which shows that the consistency regularization loss can definitely improve the performance of PyramidFill on completing contents.