Pyramid-VAE-GAN: Transferring hierarchical latent variables for image inpainting

Significant progress has been made in image inpainting methods in recent years. However, existing methods struggle to produce inpainting results that simultaneously exhibit reasonable structures, rich detail, and sharpness. In this paper, we propose the Pyramid-VAE-GAN network for image inpainting to address this limitation. Our network is built on a variational autoencoder (VAE) backbone that encodes high-level latent variables to represent complicated high-dimensional prior distributions of images. The prior assists in reconstructing reasonable structures when inpainting. We also adopt a pyramid structure in our model to maintain rich detail in low-level latent variables. To avoid the usual incompatibility of requiring both reasonable structures and rich detail, we propose a novel cross-layer latent variable transfer module. This transfers information about long-range structures contained in high-level latent variables to low-level latent variables representing more detailed information. We further use adversarial training to select the most reasonable results and to improve the sharpness of the images. Extensive experimental results on multiple datasets demonstrate the superiority of our method. Our code is available at https://github.com/thy960112/Pyramid-VAE-GAN.


Introduction
The image inpainting task, also known as the image restoration task, requires filling missing pixels of a damaged image [1][2][3][4]. It plays a significant role in various applications, including object removal [5,6] and photo editing [7]. However, when the missing regions are large and the images have complex structures, it is challenging to generate sharp output images with reasonable structures and rich detail. Over the last several decades, many studies [8][9][10][11][12][13] have targeted developing effective algorithms for this task.
The most widely used techniques are methods based on distance measurement [7], methods [14,15] based on generative adversarial networks (GANs) [14,16], and pyramid-based methods [17]. Methods based on distance measurement produce clear but unreasonable images because they lack semantic comprehension of the images. GAN-based methods typically generate images with reasonable high-level semantics, but their training is often unstable and prone to mode collapse. Pyramid-based methods are often used to preserve image details, but they typically produce distorted structures because they lack prior information about the images.
In order to address the aforementioned issues, we propose a Pyramid-VAE-GAN (PVG) network. Our model uses a variational autoencoder (VAE) [18] backbone to encode latent variables to represent complicated high-dimensional prior distributions of images. The meaningful latent variables help the decoder to reconstruct images with reasonable and undistorted structures. A pyramid structure is also adopted to model both high-level semantic structures in high-level latent variables as well as rich details in low-level latent variables. To effectively make use of the correlations between hierarchical latent variables, we propose a novel latent variable transfer module that transfers long-range relationships contained in high-level latent variables to low-level latent variables to ensure both visual and semantic plausibility.
In general, the advantages of VAEs primarily lie in their elegant theory and training stability. However, when a basic VAE is directly used for image inpainting, the loss functions in the VAE are insufficient to appraise the decoded images. Thus, we introduce an adversarial training process using GAN discriminators to select the most reasonable results and improve the sharpness of the images. Our hybrid model provides the advantages of both VAEs and GANs, resulting in training stability and sharpness of decoded images.
Our proposed algorithm has been tested on a number of benchmark datasets, including CELEBA-HQ [19], Cars [20], DTD [21], and Facade [22]. The effectiveness of our method is shown by quantitative and qualitative comparisons to multiple state-of-the-art methods. Figure 1 shows some examples of inpainting generated by our model.
Our contributions in this paper can be summarized as:
• the first use of a Pyramid-VAE backbone for image inpainting, in which the pyramid structure augments generated details, and the VAE improves the structure,
• a novel latent variable transfer module, in which long-range correlations in high-level latent variables are transferred to low-level latent variables providing details, thereby generating both visually and semantically convincing images, and
• use of a hybrid VAE-GAN model to improve robustness of the training process and enhance the sharpness of the generated images.

Image inpainting
Before the rise of deep learning, image inpainting historically used diffusion-based technologies and patch-based techniques [7,23]. They relied on similarity of content between patches to repair images. However, lacking high-level semantic comprehension, these approaches had difficulty in producing context-coherent output.
Recently, methods based on deep neural networks, especially convolutional neural networks (CNNs) [24,25] and GANs, have shown superior image inpainting results. CNN-based models [8] typically utilize an encoder to extract semantic features and a decoder for image reconstruction. However, image details belonging to low-level features are blurred by the use of many convolutional and pooling layers, making it difficult to produce visually realistic images. GAN-based models [14,15] use adversarial training to improve the sharpness of images, but they are often troubled by the problem of mode collapse.

Fig. 1 Examples of inpainting using our Pyramid-VAE-GAN (PVG-Net). Each pair shows an image with a removed (white) block (left) and an inpainted image (right). PVG-Net performs well on a wide range of images, including textures, cars, faces, and facades.
Specific networks using pyramid structures [17] have also been designed for the inpainting task. These approaches simultaneously take into account semantic information in the high-level features and texture details contained in the shallow features. Zeng et al. [17], for example, used a pyramid filling mechanism for image inpainting based on a U-Net architecture: it can fill missing regions at both image and feature levels. Nevertheless, these networks generally fail to grasp prior knowledge of images containing structural information, resulting in distorted structures. Thus, there is still much potential for improvement in the meaningfulness of structures.

Variational autoencoders
The image inpainting task is often plagued by mode collapse and training instability, which may be addressed using VAE-based approaches [26]. Kingma and Welling [18] proposed the VAE method to efficiently carry out approximate posterior inference in directed probabilistic models with continuous latent variables. They reparameterized the variational lower bound to obtain a stochastic gradient variational Bayes estimator and proposed an auto-encoding variational Bayesian algorithm to optimize the recognition model using that estimator. VAEs have achieved considerable success in a variety of domains including computer vision [27,28], natural language processing [29], and many others [30]. These approaches are particularly effective at generating high-dimensional complex data such as handwritten digits [18,31] and faces [18,32,33], and at tasks such as facial attribute manipulation [34]. Walker et al. [35] developed a conditional variational autoencoder to predict future events from static images. Sohn et al. [36] achieved excellent pixel-level object segmentation and semantic labeling using Gaussian latent variables. Ramesh et al. [28] proposed a zero-shot technique for text-to-image synthesis; it outperforms earlier domain-specific models. A deep hierarchical VAE [27] employing batch normalization and depth-wise separable convolutions was proposed by Vahdat and Kautz for image generation.
Generally speaking, VAEs have a simpler optimization procedure than the well-known GAN generative models; training is faster and more reliable. Generators for both approaches tend to produce blurred images, but discriminators are utilized in GANs to improve the sharpness of the generated images.
In turn, GAN discriminators can be integrated into VAE-based models to create a VAE-GAN hybrid model [34,37,38] to enhance the sharpness of decoded images.
Gao et al. [37] developed Zero-VAE-GAN, a VAE-GAN hybrid that generates high-quality unseen features for zero-shot learning tasks. For inpainting, Zheng et al. [38] presented a VAE-GAN hybrid model with reconstructive and generative paths, which reconstructs ground truth while ensuring diversity of the repaired results. However, the encoders and decoders of these methods consist of a large stack of convolutional and pooling layers, which typically do not make use of both high-level features and low-level details, as well as their correlations.

Problem statement
Generally, an image inpainting method has two stages: a generation stage and a discrimination stage. The former aims to produce high-quality restored images, while the latter selects the most reasonable image from the various generated images.
In the generation stage, image inpainting methods try to restore a degraded image by utilizing prior knowledge of the degradation phenomenon [39]. These tasks can be formulated as maximizing the prior distribution of datasets generated by a directed probabilistic model. The model assumes that the dataset $X = \{x^{(i)}\}_{i=1}^{N}$ (consisting of $N$ independent identically distributed samples) is generated through two random processes [18]: (i) sampling latent variables $z$ from the prior distribution $p_\theta(z)$ of latent variables, and (ii) generating data $x$ from the conditional distribution $p_\theta(x|z)$.
We use VAEs to find the prior distribution in the probabilistic model. The prior distribution $p_\theta(X)$ of the dataset meets the constraint in Eq. (1) [18]:
$$\log p_\theta(X) \geq -D_{KL}[q_\phi(z|X)\,\|\,p_\theta(z)] + \mathbb{E}_{z\sim q_\phi(z|X)}[\log p_\theta(X|z)] \tag{1}$$
The terms on the right-hand side of Eq. (1) form the variational lower bound of the prior to be optimized. The first term is the Kullback-Leibler (KL) divergence between the variational approximate posterior and the real prior of the latent variables.
Minimizing $D_{KL}[q_\phi(z|X)\,\|\,p_\theta(z)]$ makes the approximate posterior distribution $q_\phi(z|X)$ as close as possible to the prior $p_\theta(z)$ of the latent variables.
In our method, we use a pyramid variational inference module with parameters $\phi$ to model the approximate posterior distribution (see Section 4.2). The second term $\mathbb{E}_{z\sim q_\phi(z|X)}[\log p_\theta(X|z)]$ represents the reconstruction expectation. The log-likelihood $p_\theta(X|z)$ is modeled by a pyramid likelihood decoder module with parameters $\theta$ (see Section 4.3).
In the discrimination stage, we select the most reasonable image among the images $\hat{X}$ generated by the decoder: image inpainting is an ill-posed problem, which we address by incorporating a GAN discriminator to choose the most plausible outcome. It does so by selecting the image with the maximal score given by the discriminator, $\hat{X}^* = \arg\max_{\hat{X}} D(\hat{X})$. In summary, the overall optimization goal has three objectives: minimizing the KL divergence, maximizing the reconstruction expectation, and maximizing the discriminator score on the decoded images. Figure 2 shows our VAE-GAN hybrid model based on a pyramid architecture for achieving the above optimization goals.

Overview
The model takes masked images as input and outputs restored high-quality images. The first component of the model is a pyramid variational inference module $q_\phi(z|X)$ (Fig. 2(a)) intended to optimize the objective $-D_{KL}[q_\phi(z|X)\,\|\,p_\theta(z)]$. Given masked images, this module samples hierarchical latent variables $z$ with approximate posterior distributions $q_\phi(z|X)$ via a pyramid probabilistic encoder and a Monte Carlo (MC) sampling process. The second part of the model is a pyramid likelihood decoder module $p_\theta(X|z)$ (Fig. 2(b)) whose purpose is to optimize the objective $\mathbb{E}_{z\sim q_\phi(z|X)}[\log p_\theta(X|z)]$. This module decodes multilevel latent variables $z$ into restored images with likelihood distributions $p_\theta(X|z)$. Finally, we employ a GAN discriminator (Fig. 2(c)) to choose the most reasonable restored image by optimizing the adversarial objective. Overall, the pyramid variational inference module and pyramid likelihood decoder module together constitute a generator, while the discriminator of geometric-GAN [40] is used to judge the decoded images $\hat{X}$ against the ground truth images $X$.
In Section 4.2, we cover the pyramid variational inference module and KL divergence losses in detail. Section 4.3 describes the pyramid likelihood decoder module and pyramid reconstruction losses, while the adversarial training losses are presented in Section 4.4.

Components
The pyramid variational inference module comprises three sub-modules: the pyramid probabilistic encoder, the reparameterization module, and the latent variable transfer module. The pyramid probabilistic encoder first maps the masked image to the parameters of hierarchical approximate posterior distributions; the reparameterization module then samples latent variables from these distributions in a differentiable manner. Finally, the semantic structural information in high-level latent variables is transferred to low-level latent variables carrying more details via the latent variable transfer module.

Pyramid probabilistic encoder
In order to achieve satisfactory image inpainting, both the semantic information in high-level latent variables and the higher-resolution texture details contained in low-level latent variables must be considered. We use a pyramid probabilistic encoder to encode each level of semantic information of the image into a high-dimensional probability space. Unlike previous pyramid encoders, which directly learn image features, our pyramid probabilistic encoder learns means and covariances of approximate posterior distributions.
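To make this concrete, the sketch below shows one plausible way to realize a pyramid probabilistic encoder in PyTorch, the framework our model is implemented in. The architecture, channel counts, and names here are illustrative assumptions, not our exact configuration; the essential point is that each pyramid level predicts the mean and log-variance of an approximate posterior rather than a deterministic feature map.

```python
# A minimal sketch of a pyramid probabilistic encoder; layer sizes and
# names are illustrative, not the exact architecture of our network.
import torch
import torch.nn as nn

class PyramidProbabilisticEncoder(nn.Module):
    def __init__(self, in_ch=4, base_ch=32, levels=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.heads = nn.ModuleList()
        ch = in_ch
        for l in range(levels):
            out_ch = base_ch * (2 ** l)
            # each block halves the spatial resolution
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True)))
            # each head predicts 2*out_ch channels: mean and log-variance
            self.heads.append(nn.Conv2d(out_ch, 2 * out_ch, 3, padding=1))
            ch = out_ch

    def forward(self, x):
        stats = []  # (mu, logvar) per level, shallow (low-level) to deep
        h = x
        for block, head in zip(self.blocks, self.heads):
            h = block(h)
            mu, logvar = head(h).chunk(2, dim=1)
            stats.append((mu, logvar))
        return stats

# usage: the masked image concatenated with its binary mask (4 channels)
encoder = PyramidProbabilisticEncoder()
stats = encoder(torch.randn(1, 4, 256, 256))
```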

Reparameterization module
If latent variables $z$ are directly sampled from the above posterior distributions $q_\phi(z|X)$, the sampling process introduces discontinuities and inconsistencies into the deep neural network, resulting in failure of the gradient descent process: the network becomes untrainable.
We address this problem using the Law of the Unconscious Statistician [41]:
$$\mathbb{E}_{z\sim q_\phi(z|X)}[f(z)] = \mathbb{E}_{\varepsilon\sim p(\varepsilon)}[f(g_\phi(\varepsilon, X))]$$
which says that latent variables $z \sim q_\phi(z|X)$ can be reparameterized using a differentiable transformation $g_\phi(\varepsilon, X)$ of auxiliary variables $\varepsilon$:
$$z = g_\phi(\varepsilon, X), \quad \varepsilon \sim p(\varepsilon)$$
The resulting expectation can be estimated by MC gradient estimation. The formulation of the transformation $g_\phi(\varepsilon, X)$ is determined by the specific form of $q_\phi(z|X)$.
We assume that the posterior distributions $q_\phi(z|X)$ of the latent variables obey a multivariate Gaussian distribution $\mathcal{N}(\mu(X), \Sigma(X))$. We utilize a reparameterization module to implement the complex transformation [42]. The form of the differentiable transformation is
$$z = g_\phi(\varepsilon, X) = \mu(X) + \Sigma(X)^{1/2} \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I) \tag{4}$$
Figure 3(a) shows the reparameterization module, in which the sampling process is shifted to the input layer. The variables $\varepsilon$ sampled from the standard normal distribution $\mathcal{N}(0, I)$ are transformed to latent variables $z$ obeying the distribution $\mathcal{N}(\mu(X), \Sigma(X))$ through the differentiable transformation in Eq. (4).
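In code, the transformation of Eq. (4) is a one-liner; the sketch below assumes a diagonal covariance parameterized by its log-variance, the usual convention for VAEs [18].

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I) (Eq. (4)); sampling is moved
    # to eps so that gradients flow through mu and logvar
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```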
By optimizing the first term $-D_{KL}[q_\phi(z|X)\,\|\,p_\theta(z)]$ on the right-hand side of Eq. (1), the approximate posteriors $q_\phi(z|X)$ gradually approach the prior distributions $p_\theta(z)$ of the latent variables. For the sake of simplicity, we assume $p_\theta(z) \sim \mathcal{N}(0, I)$. For a reparameterization module with $L$ layers, the KL divergence losses are then available in closed form (see Ref. [43] for further details):
$$L_{kld} = \sum_{l=1}^{L} D_{KL}[q_\phi(z_l|X)\,\|\,\mathcal{N}(0, I)] = \sum_{l=1}^{L} \frac{1}{2} \sum_{j} \left( \mu_{l,j}^2 + \sigma_{l,j}^2 - \log \sigma_{l,j}^2 - 1 \right)$$
where $\mu_{l,j}$ and $\sigma_{l,j}^2$ denote the $j$-th components of the mean and variance at level $l$.
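Under these Gaussian assumptions, the KL term needs no sampling; a minimal sketch of the per-level loss (diagonal covariance, log-variance parameterization) is:

```python
def kl_divergence(mu, logvar):
    # closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over latent
    # dimensions and averaged over the batch
    kld = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kld.flatten(1).sum(dim=1).mean()

# total loss for an L-layer module: sum kl_divergence(mu_l, logvar_l) over l
```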

Latent variable transfer module
After using the pyramid probabilistic encoder and reparameterization module, our model has encoded hierarchical latent variables that contain both high-level semantics and image texture information. To maximize the utility of the connections between multilayer latent variables, we use a latent variable transfer module before feeding the results to the decoder. The module transfers long-range relationships between distant background information and missing regions in high-level latent variables to low-level latent variables. The primary aim is to prevent unreasonable structures arising across neighboring regions: low-level latent variables are incapable of capturing long-range correlations on their own. Figure 3(b) depicts the architecture of this module. We calculate the relationships (similarity) between the hole (foreground) and its periphery (background) in the high-level latent variables using cosine similarity:
$$\cos^h_{f,b} = \left\langle \frac{z^h_f}{\|z^h_f\|}, \frac{z^h_b}{\|z^h_b\|} \right\rangle$$
where $z^h_f$ and $z^h_b$ indicate the foreground and background patches of the high-level latent variables respectively. We implement the calculation as a convolution, in which the foreground and background blocks of the high-level latent variables are represented by convolution kernels and feature maps respectively.
We then calculate the attention scores of foreground patches in the high-level latent variables by applying softmax to these cosine similarities $\cos^h_{f,b}$:
$$s^h_{f,b} = \frac{\exp(\cos^h_{f,b})}{\sum_{b'} \exp(\cos^h_{f,b'})}$$
The scores $s^h_{f,b}$ from higher-level latent variables are then used to guide the reconstruction of the foreground patches in lower-level latent variables:
$$z^l_f = \sum_{b} s^h_{f,b}\, z^l_b$$
where $z^l_f$ and $z^l_b$ represent the foreground and background patches of the lower-level latent variables respectively. This procedure is also performed as a convolution, in which the attention scores from higher-level latent variables and the background patches in lower-level latent variables are formulated as convolution kernels and feature maps. Finally, our model generates the transferred latent variables, which include both high-level semantic information and detailed texture features. Again, we emphasise that all of the operations in this module are computed via convolution, making them differentiable and suitable for end-to-end training [9].
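The sketch below illustrates the transfer mechanism described by the equations above. For readability it uses 1×1 "patches" (single spatial locations) and a dense similarity matrix rather than the convolutional patch implementation used in our model, and the helper names are our own; it conveys the data flow, not the exact module.

```python
import torch
import torch.nn.functional as F

def transfer_latents(z_high, z_low, mask_high, mask_low):
    # mask values: 1 = background (known), 0 = foreground (hole)
    B, C, H, W = z_high.shape
    q = F.normalize(z_high.flatten(2).transpose(1, 2), dim=2)  # (B, N, C), unit norm
    sim = torch.bmm(q, q.transpose(1, 2))                      # cosine similarities
    bg = mask_high.flatten(1)                                  # (B, N)
    sim = sim.masked_fill(bg.unsqueeze(1) < 0.5, -1e9)         # attend to background only
    scores = F.softmax(sim, dim=2)                             # attention scores s_{f,b}

    # mix low-level background content using the high-level scores; the
    # low-level grid is pooled to the high-level resolution first
    v = F.adaptive_avg_pool2d(z_low, (H, W)).flatten(2).transpose(1, 2)
    filled = torch.bmm(scores, v).transpose(1, 2).reshape(B, -1, H, W)
    filled = F.interpolate(filled, size=z_low.shape[2:], mode='nearest')

    # keep known background values; fill the hole with transferred content
    return z_low * mask_low + filled * (1.0 - mask_low)
```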

Pyramid likelihood decoder module
The pyramid variational inference module learns both high-level latent variables and transferred low-level latent variables. The high-level latent variables contain semantic information, while the low-level latent variables contain detailed information and are simultaneously guided by long-range relationships. In order to integrate these latent variables, we use a pyramid likelihood decoder module, shown in Fig. 2(b). Firstly, the high-level latent variables are upsampled to match the shape of the transferred low-level latent variables. Then the reshaped high-level latent variables and transferred low-level latent variables are concatenated and fed into the decoder to generate images at different scales. These images, from deep to shallow, are denoted $\hat{X}_L, \dots, \hat{X}_1$, and obey the likelihood relationships in Eq. (9):
$$\hat{X}_L \sim p_\theta(X_L|z_L), \quad \dots, \quad \hat{X}_1 \sim p_\theta(X_1|z_1) \tag{9}$$
By maximizing the second term on the right-hand side of Eq. (1), $\mathbb{E}_{z\sim q_\phi(z|X)}[\log p_\theta(X|z)]$, the generated images $\hat{X}_L, \dots, \hat{X}_1$ become closer to the corresponding ground truth images $X_L, \dots, X_1$, which are scaled to the same sizes. This process is performed by minimizing a reconstruction loss calculated using the L1 distance. Therefore, for a decoder with $L$ layers, the reconstruction loss is
$$L_{rec} = \sum_{l=1}^{L} \left\| \hat{X}_l - X_l \right\|_1$$
Since the pyramid likelihood decoder module takes as input hierarchical latent variables, in which the high-level latent variables reflect the prior distribution of the ground truth images, and the transferred low-level latent variables after feature fusion take into account both semantic structures and texture characteristics, our decoder can generate reasonable images with rich details.
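A direct sketch of this multi-scale L1 objective, assuming the decoder returns its image pyramid as a list:

```python
import torch.nn.functional as F

def pyramid_reconstruction_loss(pyramid, x_gt):
    # L1 reconstruction loss summed over decoder scales; the ground truth
    # is rescaled to match each decoded image
    loss = 0.0
    for x_hat in pyramid:
        x_l = F.interpolate(x_gt, size=x_hat.shape[2:], mode='bilinear',
                            align_corners=False)
        loss = loss + F.l1_loss(x_hat, x_l)
    return loss
```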

Adversarial discriminator
A GAN discriminator is included in our framework to assist the decoder in generating more plausible images while enhancing image sharpness. Discriminators are widely used in GAN-based models [14,40]; for more details about the structure of the discriminator, readers may refer to Ref. [40]. Our Pyramid-VAE backbone is robust during training and not prone to mode collapse, and the output images from the pyramid likelihood decoder are reasonable. However, the loss functions in the VAE are insufficient to appraise the decoded images, and the image inpainting task is ill-posed. These issues are tackled by our VAE-GAN hybrid model.
In the adversarial training process, the generator, comprising the pyramid variational inference module and pyramid likelihood decoder module, and the discriminator from geometric-GAN [40] are trained together using the two loss functions in Eq. (11):
$$L_g = -\mathbb{E}_{\hat{X}}[D(\hat{X})], \quad L_D = \mathbb{E}_{X}[\max(0,\, 1 - D(X))] + \mathbb{E}_{\hat{X}}[\max(0,\, 1 + D(\hat{X}))] \tag{11}$$
where the target of the adversarial loss $L_g$ is to choose the most sensible outcome by maximizing the score given by the discriminator on the decoded images. $L_D$ is the hinge loss, a concept originating in support vector machines (SVMs) [47]. Its underlying principle is to maximize the margin between positive and negative samples. It was later introduced to GANs and applied in geometric GAN [40].
For $L_D$, only the positive samples with $D(X) < 1$ and the negative samples with $D(\hat{X}) > -1$ have an effect on the results, implying that only samples that are not yet confidently distinguished affect the gradient. This strategy may help to increase stability during training [40].
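The two losses of Eq. (11) translate directly into code. This is the standard hinge formulation of geometric GAN [40], written as a sketch with our own helper names:

```python
import torch.nn.functional as F

def discriminator_hinge_loss(d_real, d_fake):
    # only real samples with D(X) < 1 and fake samples with D(X_hat) > -1
    # contribute to the gradient, as discussed above
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def generator_adversarial_loss(d_fake):
    # maximize the discriminator's score on decoded images
    return -d_fake.mean()
```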
In summary, the discriminator portion of our Pyramid-VAE-GAN model is optimized by minimizing $L_D$. The generator part of the model is optimized by minimizing the KL divergence losses established in Section 4.2, the reconstruction losses defined in Section 4.3, and the adversarial loss given in Section 4.4. The overall objective function is given by Eq. (12):
$$L = \lambda_{kld} L_{kld} + \lambda_{rec} L_{rec} + \lambda_{G} L_g \tag{12}$$
The training procedure is given in Algorithm 1; the core steps of each iteration are:
5: Sample batch images $X$ from the training data and generate random masks $M$ for $X$
6: Construct corrupted images $\tilde{X} = X \odot M$
7: Feed the corrupted image and mask pairs $(\tilde{X}, M)$ into the pyramid variational inference module
8: Compute multilayer transferred latent variables $z$ and KL divergence losses $L_{kld}$
9: Feed these variables into the pyramid likelihood decoder module
10: Get final decoded images $\hat{X} = \tilde{X} + G(\tilde{X}, M) \odot (1 - M)$ and calculate reconstruction losses $L_{rec}$
11: Feed decoded images and ground truth images into the discriminator and calculate the adversarial loss $L_g$
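Putting the pieces together, one training iteration of Algorithm 1 might look as follows. Here G and D stand for the generator (inference module plus decoder) and the discriminator; the G.infer/G.decode interface and other names are hypothetical, and the loss helpers are the sketches given earlier.

```python
def train_step(G, D, opt_g, opt_d, x, m, lam_kld=1.0, lam_rec=1.0, lam_g=1.0):
    x_corrupt = x * m                              # step 6: corrupt with mask M

    # generator update (steps 7-10, Eq. (12))
    stats, z = G.infer(x_corrupt, m)               # per-level (mu, logvar) and latents
    l_kld = sum(kl_divergence(mu, lv) for mu, lv in stats)
    pyramid = G.decode(z)                          # decoded images, finest scale last
    x_hat = x_corrupt + pyramid[-1] * (1 - m)      # step 10: composite final output
    l_rec = pyramid_reconstruction_loss(pyramid, x)
    l_g = generator_adversarial_loss(D(x_hat))
    loss_g = lam_kld * l_kld + lam_rec * l_rec + lam_g * l_g
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # discriminator update (step 11)
    loss_d = discriminator_hinge_loss(D(x), D(x_hat.detach()))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```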

Experiments
We now qualitatively and quantitatively evaluate our PVG network with respect to baselines. Section 5.1 introduces the detailed experimental setup, Section 5.2 describes the results of the experiments, and Section 5.3 provides an ablation study for our model.

Experimental setting
Experiments were carried out on four datasets: CELEBA-HQ [19], Cars [20], DTD [21], and Facade [22]. They respectively contain high-quality human face photos, various classes of cars, textured images, and rectified images of facades having diverse architectural styles. The images were allocated to training and testing sets at random as detailed in Table 1.
Our full model was implemented using PyTorch v1.7 and trained on an NVIDIA RTX 3080 GPU with a batch size of 16. The weights of the different losses were set to $\lambda_{kld} = \lambda_{rec} = \lambda_{G} = 1$ throughout the optimization process. All images used for training and testing were resized to 256×256 pixels. For fair comparison, the training images from CELEBA-HQ, DTD, and Facade were randomly masked with 128×128 squares, matching the experimental settings of the baselines [17]. Training images from Cars were randomly masked with 64×64 squares: the whole car might be removed if a larger hole were used. Testing images were centrally masked with 128×128 or 64×64 squares in the same way. All results shown in this paper are direct outputs from the trained networks, with no post-processing.
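For reproducibility, the sketch below shows one way to generate the random square masks described above (1 marks known pixels, 0 marks the hole, matching Algorithm 1); our exact mask generator may differ in details.

```python
import torch

def random_square_mask(batch, size=256, hole=128):
    m = torch.ones(batch, 1, size, size)
    for i in range(batch):
        top = torch.randint(0, size - hole + 1, (1,)).item()
        left = torch.randint(0, size - hole + 1, (1,)).item()
        m[i, :, top:top + hole, left:left + hole] = 0.0  # cut the hole
    return m
```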

Quantitative assessment
An important feature of VAEs is their ability to generate large amounts of data through the Monte Carlo (MC) sampling procedure. This characteristic makes VAE models particularly well-suited to small datasets, especially in domains like biology [30] where data are expensive to acquire. Thus, we quantitatively evaluate our approach using the smallest of the four datasets mentioned before. We use four evaluation metrics: mean absolute error (MAE), multi-scale structural similarity (MS-SSIM) [48], peak signal-to-noise ratio (PSNR) [49], and Fréchet inception distance (FID) [50]. When evaluating the effectiveness of image inpainting algorithms, it is common to report results as PSNR measurements; higher PSNR values indicate less distortion. Another way to assess the quality of inpainted images is to use a perceptually based similarity metric such as MS-SSIM. Additionally, FID is a widely used numerical measure in the area of image generation; it can also identify known failure modes of GANs such as mode collapse.
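For reference, MAE and PSNR reduce to a few lines (a sketch assuming images scaled to [0, 1]); MS-SSIM and FID rely on the published implementations of Refs. [48] and [50].

```python
import torch

def mae(x, y):
    return (x - y).abs().mean().item()

def psnr(x, y, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher values mean less distortion
    mse = ((x - y) ** 2).mean()
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```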
The results listed in Tables 2-4 use different datasets to quantitatively compare our proposed approach to various baselines including GL [44], CA [9], PConv [45], PEN-Net [17], and GConv [46]. Our method almost always provides the best results for each evaluation metric, particularly MS-SSIM, indicating that our model really learns structural knowledge from images and uses it to generate reasonable structures.

Qualitative assessment
As Fig. 1 demonstrates, our Pyramid-VAE-GAN network (PVG-Net) is capable of high-quality image inpainting. In each pair, the input image is presented on the left, masked by a central white block, and the corresponding output image is shown on the right. It can be seen that the inpainting from our model is visually and semantically realistic across a number of datasets, including images of facades, cars, faces, and textures. Figure 4 visually compares results of using various methods to inpaint central holes for the Facade, DTD, and CELEBA-HQ datasets. GL [44] and CA [9] tend to generate context-coherent but blurred images, as stacking of convolutions smooths details. PConv [45] and PEN-Net [17] provide results with poorer structures than our method as they fail to capture priors from the datasets. In particular, our method generates high-quality, well-structured inpainting for the Facade example. In the DTD example, visual artifacts appear in the interior or at the boundary of the hole for the other methods, but not for ours. In the CELEBA-HQ example, there are distorted structures in the bottom left corner of the images for all other methods, while our method generates more natural face results.

Fig. 4 Visual comparison of inpainting results for GL [44], CA [9], PConv [45], PEN-Net [17], GConv [46], and our model. Experimental results from GL [44], CA [9], PConv [45], and PEN-Net [17] are reproduced with permission from Ref. [17], © IEEE 2019.
In order to show differences more clearly, we have magnified the images generated by PEN-Net and our PVG-Net in Fig. 5. There are some unrealistic details in the image generated by PEN-Net. Our Pyramid-VAE backbone robustly generates reasonable results and is not prone to mode collapse. Our inpainting model also generalizes better through its reparameterization module, because it can generate more diverse data through the MC sampling process. Such a module can be added to any neural network architecture, allowing it to sample sufficient latent variables from limited data to accurately reflect the high-dimensional prior distributions of the datasets.

Figures 6-8 compare original images to the inpainting results obtained using our method, using images from the Facade, DTD, and CELEBA-HQ datasets respectively. In each case, the first row shows original images, the second, masked images, and the last, images generated by our proposed method. The Facade inpainting results are very close to the original images, as is also the case for the DTD dataset. These results show that, for images with obvious structural and textural features, the masked region can be inferred from the prior knowledge learned by our model: our model can extract prior knowledge of datasets and reproduce their distributions using latent variables. However, for the face images (Fig. 8), it is challenging to completely recover the original image using only the unobstructed information. Even in this situation, our model can generate very reasonable, natural-looking images regardless of hair color, gender, hairstyle, or face shape.

Ablation study
We performed an experimental ablation study to demonstrate the effectiveness of the latent variable transfer (LTN) and reparameterization modules in our proposed network. We compare results from pruned models, in which these modules were removed, to results from our full model, using the Facade dataset. Metrics are given in Table 5 and a visual comparison using an example is shown in Fig. 9. Removing either module leads to poorer results on the evaluation metrics.

Fig. 9 Effects of removing components of our method. Left to right: masked image, infilled image without the LTN module, infilled image without the reparameterization module, infilled image without the GAN, and infilled image from our full model.

Figure 9 compares inpainted images after removing these two modules to the image generated by our full model. Without the LTN module, the inpainted image is relatively noisy, contains much meaningless detail, and lacks a semantically reasonable image structure. For example, the windows are not distinctly separate and there are large black spots on the wall. These defects do not occur with the full model, showing that our proposed LTN module helps to transfer high-level semantic information to the low level to generate high-quality images with reasonable detail and structure.
When the reparameterization module is omitted, the network degenerates to a traditional end-to-end encoder-decoder model. The inpainted images are smeared and blurred: for example, the windows are smoothed out in the middle while there are some small black spots on the pillars.
If the GAN discriminator is removed, the generated images are again blurred. Compared to the pruned model, the images generated by our original model are cleaner and have more reasonable structure and essential details.

Conclusions
In this paper, we have presented a Pyramid-VAE-GAN model for image inpainting, which restores high-quality, sharp images with reasonable structures and rich details. In our method, a latent variable transfer module transfers structural information in high-level latent variables to rich details in low-level latent variables. We incorporate a GAN discriminator to improve the sharpness of inpainted images while maintaining training stability. Extensive experiments have shown the effectiveness of our method for image inpainting, from both qualitative and quantitative perspectives. Our ablation study demonstrates that using correlations between multi-level latent variables benefits inpainting significantly. In terms of future work, we hope to expand our approach to a variety of other image processing applications, including image translation, super-resolution, and denoising.