1 Introduction

Image inpainting is an important yet challenging computer vision task whose goal is to predict plausible pixels for missing areas. It serves a wide range of applications, such as photo editing, de-captioning, and removing unwanted objects from photos. Well-repaired areas should have reasonable semantic structures and visually realistic textures.

Earlier traditional algorithms [2, 4, 6] fill holes by copying suitable patches or pixels from known regions. However, these methods cannot capture meaningful semantic priors of the image, so the repaired areas might exhibit incorrect semantic features.

In recent years, with the rapid development of deep learning, some methods [8, 16, 24, 26, 37, 39] have learned image data distributions by adopting convolutional neural networks (CNNs). Although these approaches generate semantically reasonable content in the missing areas, the completed regions often suffer from either distorted structures or texture artifacts, because they do not mimic the human restoration behavior of handling the structure and texture of missing regions separately. To solve this problem, multistage networks [1, 28, 29, 33] first repair edge or structure information and then use it to guide full-image recovery. A single-stage network [35] exploits the properties of the convolutional encoder-decoder itself, extracting and repairing structures at the deep layers of the encoder and textures at the shallow layers. However, as the holes become larger, most separate-repair networks tend to generate discontinuous structures and unsatisfactory textures. This is mainly due to two reasons. One is the failure to depict accurate edges or structures for guiding texture generation, owing to the small amount of known information. The other is the lack of effective strategies tailored to large holes. Progressive repair models [31, 32] fill large holes gradually by repeating the same modules, but because they treat the reconstruction of structures and textures as a whole, they are more likely to produce inferior content that degrades the restoration of the next stage.

To overcome the above limitations, we design multiple tactics to aid in fixing large holes on the basis of separate restoration. Specifically, we propose a novel framework called the parallel adaptive guidance network (PAGN), in which the structure reconstruction branch and the texture enrichment branch restore the structured image (the image after edge-preserving smoothing) and the full image, respectively. As shown in Fig. 1, the structured image consists of large-scale objects such as edges and flat areas and excludes small-scale objects (e.g., details), whereas the complete image includes semantically reasonable structures and rich textures. Unlike unidirectional structure-guided repair [28, 29, 33], our parallel branches guide each other at multiple intermediate layers via skip connections to achieve win-win cooperation: the structure reconstruction branch provides strong structural priors for the repair of the complete image, and the well-designed texture enrichment branch, carrying reasonable structural information, contributes to the recovery of structure maps. In addition, in contrast to previously repaired pixels directly guiding the recovery of the remaining pixels [31, 32], we adopt a guidance filter to prevent useless information from one branch from passing directly into the other. In this way, the model can adaptively exploit contributing information in the mutual guidance process and produce more visually pleasing results, especially on large-hole inpainting tasks. To the best of our knowledge, we are the first to use a mutual guidance structure that allows each branch to take advantage of additional information that is beneficial to its own restoration.

Fig. 1

The difference between a complete image with details and a structured image

To make the recovered content more consistent with the background (undamaged regions), some methods [9, 22, 24, 32] build long-term connections between distant pixels by modifying contextual attention [25]. However, all of these existing attention designs fill missing areas by searching for highly similar patches within the same scale. When the image is badly damaged and only scarce information is available, exploiting the limited known information from multiple perspectives is particularly essential. Thus, we devise a joint-contextual attention mechanism (Joint-CAM), which explores feature dependencies not only at the same scale but also across different scales. This maximally exploits the finite contextual information to infer the missing contents and ensures structural continuity to some extent.

Furthermore, precise feature representation helps the network understand the semantic content of an image, which benefits the inpainting task. Existing methods [3, 30, 34] utilize layerwise multiscale structures to extract features at different scales. However, the learned representations are relatively coarse due to the lack of intralevel feature fusion. To solve this issue, we design a novel multibranch module at the end of the encoder, namely the attention-based multiscale perceptual res2block (AMPR). In addition to multiple parallel convolution layers with different receptive fields, AMPR contains intralayer residual connections and an attention mechanism. Thus, AMPR not only extracts multiscale features of the whole image at a granular level and retains more accurate spatial location information, but also effectively fuses features from the various branches through the attention mechanism.

In summary, the main contributions of our work can be described as follows:

  1.

    We propose a novel framework called the parallel adaptive guidance network (PAGN) for image inpainting, which not only specializes in repairing structures and enriching textures through separate parallel branches but also enables features from one branch to adaptively accept useful information from the other via skip connections with a guidance filter.

  2.

    We introduce strategies for repairing large holes, including the joint-contextual attention mechanism (Joint-CAM), which maximally utilizes the limited known information, and the attention-based multiscale perceptual res2block (AMPR), which effectively recovers missing objects of various sizes. These tactics help to generate results with clear textures and continuous structures.

  3.

    Our method is more effective than state-of-the-art models at dealing with large holes and achieves high-quality results on facial and natural datasets.

2 Related work

2.1 Image inpainting

Currently, image inpainting methods fall into two main categories: traditional methods and deep learning methods. Traditional approaches [2, 4, 6] spread neighboring pixels or patches from the background into the target hole. Although these methods can reproduce good textures, they have several limitations: 1. They assume that the content of the missing regions can be found in the input image or external image libraries. 2. The reconstructed areas often exhibit incorrect structures due to a lack of understanding of the high-level semantics of images. 3. They typically handle only regular-shaped masks.

With the rapid development of deep learning, convolutional neural networks (CNNs) have shown outstanding performance in image inpainting tasks because they effectively capture both local features and high-level abstract features of an image. Pathak et al. [7] proposed the context encoder, the first work to incorporate a generative adversarial loss into an encoder-decoder architecture to repair broken pixels. However, its repaired results often contain texture artifacts due to the restriction of the channelwise fully connected layer. Subsequently, Iizuka et al. [8] improved image quality by employing cascaded dilated convolutions and used local and global discriminators together to ensure the consistency of both the repaired region and the entire image. Chen et al. [39] let the local discriminator identify similar patches in different images of the same type, which improves the discriminative ability of the network. Yu et al. [25] proposed a coarse-to-fine network, which first produces a rough prediction through a coarse network and then refines this intermediate result into a higher-quality image through a refinement network. Chen et al. [27] added a context-awareness loss to make the repaired regions more realistic by constraining the similarity of local features. Liu et al. [16] and Yu et al. [26] replace ordinary convolution with partial convolution and gated convolution, respectively, to avoid color incongruity and edge responses.

These methods often fail to reconstruct continuous structures or fine details because they recover the holes without plausible strong constraints. Nazeri et al. [29] first depicted the lines of the missing areas and then added colors and textures constrained by these lines. Ren et al. [33] split inpainting into two steps: they first repaired the missing structures and then provided the completed structures to the texture generator to direct the synthesis of vivid textures through appearance flow. Shao et al. [1] used fused edge maps and blurred images, which provide color information, as labels to guide the reconstruction of the refined image. Liu et al. [35] captured structure and texture features using the deep and shallow layers of the encoder, respectively, and filled the holes of the two feature types via separate multiscale blocks. Guo et al. [31] and Li et al. [32] adopted a progressive inpainting policy, in which pixels are filled gradually from the known regions toward the hole center by repeating blocks or modules. Zhu et al. [40] utilized multiple decoders to refine the reconstructed results.

2.2 Contextual attention inpainting

To keep the generated textures realistic and consistent with the surrounding features, Yu et al. [25] proposed a contextual attention layer that borrows similar feature patches from the context to fill missing regions. Chen et al. [39] preprocessed the images by using a similar block around the damaged area to update the damaged block. Zeng et al. [24] designed a pyramid-context encoder, which progressively applies a contextual attention mechanism from latent feature maps to the original image, to ensure both semantic and visual coherence in the repaired regions. Liu et al. [9] found that the repaired results show discontinuous pixels if attention focuses only on feature dependencies between the inside and outside of the holes; thus, they also explore relationships between pixels within the holes. Li et al. [32] considered the consistency between attention scores from different recurrences and devised the knowledge attention layer for a recurrent architecture. However, these methods ignore the strong correlations between feature patches at different scales. Exploring these correlations yields more accurate matching patches in missing regions, especially when background information fades considerably.

2.3 Multiscale design

Influenced by the way neurons in the human brain are connected, multiscale structures are adopted to capture features of different sizes in many computer vision tasks. Inception [12] and atrous spatial pyramid pooling (ASPP) [21] are the most common multiscale designs implemented in various networks. In the object detection field, Liu et al. [42] proposed the receptive field block, which absorbs the advantages of ASPP and Inception, to enhance feature robustness and improve detection accuracy. In the super-resolution field, Li et al. [38] combined ASPP and channel attention at the bottleneck. In the 3D reconstruction field, Ding et al. [44] estimated a more accurate depth map by using multiple consecutive ASPP blocks, which is vital for better 3D reconstruction. In the image deraining field, Wang et al. [45] utilized multiscale kernels and multiresolution feature maps to capture rain streaks of different sizes and scales.

In the image inpainting field, multiscale feature representation is essential for understanding the semantic information of images. Wang et al. [30] devised the multicolumn network, which contains three parallel encoder-decoder branches with different filter sizes and spatial resolutions, to extract different levels of features. Chen et al. [34] designed two parallel encoders with different receptive fields to obtain global semantic features and local detail features respectively.

3 Parallel adaptive guidance network

Given a defective image and the corresponding binary mask, our goal is to output a well-repaired image with visually realistic content. Xiong et al. [28], Nazeri et al. [29], and Ren et al. [33] showed that repairing the structures and textures of an image separately reduces texture artifacts and oversmoothed boundaries. Therefore, we design a parallel adaptive guidance network (PAGN) as shown in Fig. 2, where the structure reconstruction branch aims to reconstruct the structures of a damaged image, and the texture enrichment branch simultaneously enriches the textures and recovers the complete image. Features from the two branches guide each other's restoration through skip connections and guidance filters. Moreover, the joint-contextual attention mechanism (Joint-CAM) in the texture enrichment branch helps to output completed images with realistic details and continuous structures, and the attention-based multiscale perceptual Res2Block (AMPR) designed for the bottleneck helps to capture features with different receptive fields in a way closer to human vision.

Fig. 2

The overall pipeline of the PAGN. It consists of a structure reconstruction branch and a texture enrichment branch. Each branch adopts a typical encoder-decoder structure. The two branches guide each other through skip connections and guidance filters. In the encoder, downsample blocks, AMPR, etc. are used to understand the semantic information; in the decoder, upsample blocks, Joint-CAM, etc. are used to reconstruct the image. The downsample block consists of a convolution block and a ResBlock, and the upsample block consists of a ResBlock, nearest-neighbor upsampling and a convolution block with normalization and activation layers. The number k represents the kernel size of the convolutional layer

We first introduce the idea of mutual adaptive guidance in Section 3.1. Then, we describe the Attention-based Multi-scale Perceptual Res2block and Joint Contextual Attention Mechanism in Sections 3.2 and 3.3 respectively. Finally, we provide the corresponding loss functions of our model in Section 3.4.

3.1 Mutual adaptive guidance

Structure and texture, as different elements of an image, are interrelated. In contrast to the traditional idea, which only takes the guidance map (structure map, edge map, etc.) as one of the inputs to provide extra information, we make two improvements. First, one-way guidance is replaced with mutual guidance to favor the restoration of each element. Specifically, in the decoder, recovered structures are integrated into the texture enrichment branch to provide strong priors; once correct structures are completed, the inpainting task can be treated as a detail-enrichment problem. In the encoder, the extracted full-image features, which involve rich texture and reasonable structure information, help in turn to repair accurate structures. Second, each branch incorporates multiple intermediate-level features from the other branch so that the guidance information is considered at multiple layers rather than affecting only the early layers of the deep network.

As the mask area increases, the feature maps used for guidance inevitably contain wrong or invalid information during the restoration process. To prevent less informative features from passing directly from one branch to another via the skip connection, an extra guidance filter is added to adaptively highlight contributing features and suppress poor ones. The guidance filter consists of two convolution layers with different kernel sizes and a sigmoid function: a 1∗1 convolution integrates information and compresses the channels of the input feature maps, and a 3∗3 convolution followed by a sigmoid function yields an attention map. This attention map is then used to recalibrate the feature map to be directed to the other branch. A comparison of the input and output features of the guidance filter is shown in Fig. 3; the processed feature maps highlight important targets such as mask regions and key points of the face (feature maps are upsampled to 256 ∗ 256 for observation).
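For concreteness, the following PyTorch sketch shows one way such a guidance filter could be implemented; the module name and channel sizes are our own placeholders rather than the authors' released code.

```python
import torch
import torch.nn as nn

class GuidanceFilter(nn.Module):
    """Minimal sketch of the guidance filter described above.
    The intermediate channel width is an illustrative assumption."""

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        # 1x1 convolution integrates information and compresses channels
        self.integrate = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        # 3x3 convolution + sigmoid yields an attention map in [0, 1]
        self.attend = nn.Sequential(
            nn.Conv2d(mid_channels, in_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, guidance_feat):
        attn = self.attend(self.integrate(guidance_feat))
        # recalibrate the feature map before it is passed to the other branch
        return guidance_feat * attn
```

In use, the recalibrated output would be fed into the other branch at the corresponding intermediate layer, e.g. by concatenation or addition, so that only the highlighted features participate in the mutual guidance.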

Fig. 3

Visual contrast diagram of the input and output of the guidance filter

3.2 AMPR in the encoder

Robust feature representations help inpainting networks yield semantically accurate content and clear details. We believe that robust feature representations have two aspects: features with multiscale receptive fields and precise spatial locations. The former usually contains global and local information, which helps the network understand the missing semantic content and enrich texture details; the latter is vital for visual systems. For these purposes, we propose a new multiscale structure at the bottleneck called the attention-based multiscale perceptual res2block (AMPR), which consists of a multiscale perceptual block, intralayer residual connections similar to [19], and a convolutional block attention module (CBAM) [36]. The multiscale perceptual block (Section 3.2.1) aims to capture scale-diverse features effectively. The intralayer residual connections (Section 3.2.2) enhance information exchange between different branches, contributing to understanding the contents of missing regions at a granular level and retaining relatively precise spatial location information. The CBAM helps to reduce redundant features when fusing multiscale features. The design of this block is inspired by the receptive field block (RFB) [42] shown in Fig. 4a but includes three special modifications, as shown in Fig. 4b, where d represents the dilation rate and k represents the kernel size. Next, we introduce the corresponding modifications in detail.

Fig. 4

The comparison between (a) RFB and (b) AMPR. The CBAM module of AMPR is shown in (c). The parameters k and d represent the kernel size and the dilated rates of the dilated convolution(or convolution), respectively. In convolution, the default value of d is 1

3.2.1 Multiscale perceptual block

As observed in the receptive field block [42], both the size and eccentricity of receptive fields play important roles in human vision. Accordingly, the multiscale perceptual block is likewise divided into two components: convolution layers with different kernels and dilated convolution layers [18] with individual eccentricities. To reduce network parameters, a 1 ∗ 1 convolution is employed to compress the channels of the feature map before it enters the multibranch structure. As shown in Fig. 4b, across the four parallel branches with various filter groups, the kernel size of the convolution layer is 1, 3, 5 and 7 with the eccentricity fixed at 1, and the eccentricity of the dilated convolution layer is 1, 2, 4 and 8 with the kernel size fixed at 3.

Since replacing a large convolution kernel with multiple small kernels both reduces the parameters and increases the depth of the network while keeping the same receptive field, we use stacks of 3 ∗ 3 convolution kernels instead of the large 5 ∗ 5 and 7 ∗ 7 kernels.

3.2.2 Intralayer residual connection

Intralayer feature fusion has been successfully applied in many computer vision tasks, but it is less explored in image inpainting. Our proposed intralayer residual connection allows adequate information fusion between the features of different branches. This operation not only helps to capture multiscale features at a finer level but also preserves accurate spatial information, which aids key point (nose, eyes, etc.) localization. We denote Ci() as the i-th branch of the multiscale perceptual block and Fi as the output of Ci(). Specifically, the input feature Xi is added to the output of the previous branch Fi−1, and the sum is fed into Ci(). As shown in Fig. 4b, we use elementwise summation instead of concatenation to fuse features inside each branch, aiming to avoid redundant feature information as the convolution layers continuously increase. The output of each branch is computed as follows:

$$ \begin{array}{@{}rcl@{}} F_{i}=\begin{cases} C_{i}(X_{i}) & i=1 \\ C_{i}(X_{i}+F_{i-1})& i=2,3,4 \end{cases} \end{array} $$
(1)
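A minimal PyTorch sketch of the multiscale perceptual block with intralayer residual connections, following Eq. (1), is given below; the channel widths and the exact composition of each branch Ci are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

def conv_stack(ch, n):
    """Replace a large kernel with n stacked 3x3 convolutions (same receptive field)."""
    return nn.Sequential(*[nn.Conv2d(ch, ch, 3, padding=1) for _ in range(n)])

class MultiscalePerceptualBlock(nn.Module):
    """Sketch of the multiscale perceptual block with intralayer residual
    connections (Eq. (1)); branch channel width is an assumption."""

    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.compress = nn.Conv2d(in_ch, branch_ch, 1)  # reduce channels first
        # Branch i: ordinary convolution (kernel 1/3/5/7, large kernels built
        # from stacked 3x3) followed by a dilated 3x3 convolution (dilation 1/2/4/8)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(branch_ch, branch_ch, 1),
                          nn.Conv2d(branch_ch, branch_ch, 3, padding=1, dilation=1)),
            nn.Sequential(nn.Conv2d(branch_ch, branch_ch, 3, padding=1),
                          nn.Conv2d(branch_ch, branch_ch, 3, padding=2, dilation=2)),
            nn.Sequential(conv_stack(branch_ch, 2),   # ~5x5 receptive field
                          nn.Conv2d(branch_ch, branch_ch, 3, padding=4, dilation=4)),
            nn.Sequential(conv_stack(branch_ch, 3),   # ~7x7 receptive field
                          nn.Conv2d(branch_ch, branch_ch, 3, padding=8, dilation=8)),
        ])

    def forward(self, x):
        x = self.compress(x)
        outputs, prev = [], None
        for i, branch in enumerate(self.branches):
            # Eq. (1): F_1 = C_1(X_1); F_i = C_i(X_i + F_{i-1}) for i = 2, 3, 4
            inp = x if i == 0 else x + prev
            prev = branch(inp)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)   # stacked features then go to CBAM
```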

3.2.3 Convolutional block attention module

If features from different sources are simply stacked along the channel dimension, each is treated equally, which is inconsistent with human vision. Thus, an attention mechanism is utilized to indicate which features need special attention and to suppress redundant information. Here, we choose the popular convolutional block attention module (CBAM) [36] to fuse features effectively. As shown in Fig. 4c, the CBAM includes channel attention and spatial attention. Channel attention is concerned with what information is meaningful. The 3-dimensional input (a simple stacking of feature maps from multiple branches) is compressed into two different 2-dimensional descriptors that retain only channel information, through global maximum pooling and global average pooling along the spatial dimension. Maximum pooling searches for distinctive semantic features, and average pooling aggregates the overall semantic information. A shared neural network and a sigmoid function then model the internal correlations between channels to produce a channel attention mask.

Complementary to channel attention, spatial attention is concerned with where the important information is. We perform maximum pooling and average pooling along the channel dimension on the features output by the channel attention, which represent two different kinds of spatial information. The two pooled feature maps are then stacked along the channel dimension, and the spatial attention mask is generated through a convolution and sigmoid activation.
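The following sketch shows a standard CBAM of the kind described above; the reduction ratio and the 7∗7 spatial-attention kernel are common defaults and are assumed here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of the convolutional block attention module used to fuse the
    stacked multiscale features (reduction ratio 16 and a 7x7 kernel are
    assumed defaults)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP for channel attention
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # channel attention: global avg/max pooling -> shared MLP -> sigmoid
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: channel-wise avg/max -> 7x7 conv -> sigmoid
        avg_sp = torch.mean(x, dim=1, keepdim=True)
        max_sp, _ = torch.max(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial(torch.cat([avg_sp, max_sp], dim=1)))
        return x * attn
```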

3.3 Joint-contextual attention mechanism

With the help of contextual attention [25], we can copy distant patches from surrounding areas to synthesize better-quality textures. However, existing attention modules [9, 24, 25] for inpainting reconstruct missing patches using similar patches at the same scale, which fails to make the most of the available information. As the missing areas grow, this scale restriction makes contextual attention prone to generating distorted structures. In super-resolution reconstruction, [43] utilizes a cross-scale non-local module to explore the correlations between low-resolution patches and high-resolution patches, which ensures structural consistency. Inspired by it, we introduce the joint-contextual attention mechanism (Joint-CAM) to enlarge the search scope of similar patches. Joint-CAM contains in-scale contextual attention (is-ca), cross-scale contextual attention (cs-ca) and a residual connection. The is-ca considers the similarity between same-scale feature patches, which helps to generate clearer textures. The cs-ca explores the dependencies between cross-scale feature patches, which facilitates the recovery of reasonable structures and details. The residual connection helps Joint-CAM target the feature patches that need to be filled with patches from known regions. The two contextual attention mechanisms are described in detail below.

3.3.1 In-scale contextual attention

We take a center hole as an example. For in-scale contextual attention (is-ca), we measure the similarity between all patch pairs inside and outside the hole (foreground and background) using the cosine similarity:

$$ sim_{x,y,\bar{x},\bar{y}}=\langle\frac{f_{x,y}}{\|f_{x,y}\|},\frac{f_{\bar{x},\bar{y}}}{\|f_{\bar{x},\bar{y}}\|}\rangle $$
(2)

where \(sim_{x,y,\bar {x},\bar {y}}\) represents the similarity between the foreground patch fx,y at location (x,y) and the background patch \(f_{\bar {x},\bar {y}}\) at location \((\bar {x},\bar {y})\) in the same feature map. The in-scale attention score of each background patch \(f_{\bar {x},\bar {y}}\) is calculated by a scaled softmax function. Finally, each foreground patch is reconstructed by aggregating the weighted background patches. In practice, these steps can be simplified into a convolution, a channelwise softmax, and a transposed convolution; the paper [25] describes this simplified process in detail. The green and black arrows in Fig. 5 also show the process of is-ca clearly.
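A rough PyTorch sketch of this simplified is-ca pipeline (patch extraction as convolution filters, scaled softmax, transposed convolution) is given below; mask handling, batching and the explicit foreground/background split are omitted for brevity, and the softmax temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def in_scale_attention(feat, k=3, scale=10.0):
    """Simplified sketch of in-scale contextual attention on one feature map."""
    B, C, H, W = feat.shape
    assert B == 1, "sketch handles a single feature map"
    # extract k x k patches and use them as convolution filters
    patches = F.unfold(feat, kernel_size=k, padding=k // 2)        # (1, C*k*k, H*W)
    patches = patches.transpose(1, 2).reshape(H * W, C, k, k)      # (N, C, k, k)
    # following the simplified process of [25], only the filters are L2-normalized
    normed = patches / (patches.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)
    sim = F.conv2d(feat, normed, padding=k // 2)                   # (1, N, H, W)
    attn = F.softmax(sim * scale, dim=1)                           # attention scores
    # aggregate weighted patches with a transposed convolution; divide by the
    # k*k overlap factor to keep the magnitude roughly unchanged
    out = F.conv_transpose2d(attn, patches, padding=k // 2) / (k * k)
    return out
```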

Fig. 5

The implementation of in-scale contextual attention(is-ca) and cross-scale contextual attention(cs-ca). The green and red lines represent different parts between two types of contextual attention, and the black lines represent similar parts

3.3.2 Cross-scale contextual attention

Cross-scale feature similarity was proposed for image super-resolution in [43], and we extend this idea to our restoration task. Cross-scale contextual attention (cs-ca) models long-range dependencies without the same-scale restriction. The similarity between cross-scale features is obtained by measuring the correlations between low-resolution patches (k × k) and higher-resolution patches (sk × sk) in the same feature map. However, applying cosine similarity directly is infeasible because the spatial dimensions of the different resolutions differ. Thus, we first downsample the feature map so that the low-resolution patches (k × k) in the downsampled map have the same receptive fields as the higher-resolution (sk × sk) patches in the original map. Then, the cross-scale attention scores are derived by calculating the cosine similarity between patches from the downsampled map and those from the original map, which follows the red and black arrows in Fig. 5. Specifically, we assume that the input feature map is f (H × W). First, f is downsampled to g (H/s × W/s). Then, the cross-scale attention scores between patches (k × k) in f and those in g are calculated by a convolution and a softmax function. Finally, the corresponding patches (sk × sk) in f are used as deconvolution filters to reconstruct the missing patches in f and generate high-frequency details. Notably, unlike the single-image super-resolution task [38], the stride of the transposed convolution (deconvolution) is set to 1 so that the feature maps are not zoomed when reconstructing the missing patches. In this paper, we choose k = 3 and s = 2, and the downsampling operation is bilinear interpolation.

When merging the two independent feature maps generated by is-ca, \(F_{in\_scale}\), and cs-ca, \(F_{cross\_scale}\), into a unified feature map, we use a residual convolution ResConv() to learn the residual features Rattn between the two sources instead of adding or concatenating them directly. This allows the network to focus only on the distinct information while bypassing the shared information, reducing redundant features in the merged map. In addition, a skip connection between the input feature Finput and the output of Joint-CAM, Fattn, lets the network focus more on hole patches that have similar patches in known regions, which improves the discriminative ability of the network. The merging process is shown in (3)-(5).

$$ \begin{array}{@{}rcl@{}} R_{attn}=F_{in\_scale}-F_{cross\_scale} \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} F_{attn}=ResConv(R_{attn})+F_{in\_scale} \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} F =F_{attn}+ F_{input} \end{array} $$
(5)
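A minimal sketch of this merging step, implementing (3)-(5), is shown below; the internal design of ResConv() is not specified in the text, so a plain two-layer residual convolution is assumed.

```python
import torch
import torch.nn as nn

class JointCAMMerge(nn.Module):
    """Sketch of Eqs. (3)-(5): merge the is-ca and cs-ca outputs through a
    residual convolution and add the input skip connection."""

    def __init__(self, channels):
        super().__init__()
        # assumed form of ResConv(): two 3x3 convolutions with a ReLU in between
        self.res_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, f_input, f_in_scale, f_cross_scale):
        r_attn = f_in_scale - f_cross_scale          # Eq. (3): residual between sources
        f_attn = self.res_conv(r_attn) + f_in_scale  # Eq. (4): learn only the distinct part
        return f_attn + f_input                      # Eq. (5): skip connection to the input
```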

3.4 Loss function

For the structure reconstruction branch, we jointly use a generative adversarial loss and an L1 loss. For the texture enrichment branch, we additionally use a perceptual loss and a style loss. In our paper, the weights of the above losses follow those of StructureFlow (SF) [33]. In addition, we apply a depth-supervised perceptual loss at each deconvolution layer of the two branches to refine the predictions at each scale. The refined estimation of the missing regions ensures that the joint-contextual attention mechanism performs well.

3.4.1 Depth-supervised perceptual loss

Compared with pixel-by-pixel losses, the perceptual loss is more consistent with the human visual system and generates more details. Thus, we use a depth-supervised perceptual loss to progressively optimize the predictions at each deconvolution layer of the parallel branches. Taking the texture enrichment branch as an example, we first use the activation layers {relu3_1} and {relu2_1} of VGG19 [17] to extract feature maps of the real images at two resolutions, and then calculate the loss Ldeep between the extracted real features and the features predicted by the corresponding deconvolution layers. The depth-supervised perceptual loss of the structure reconstruction branch \(L_{deep}^{S}\) is obtained in the same way:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} L_{deep}=\sum\limits_{i=1}^{P}\Arrowvert {\Phi}_{i}(I_{gt})-F_{i}\Arrowvert_{1}\\ L_{deep}^{S}=\sum\limits_{i=1}^{P}\Arrowvert {\Phi}_{i}(S_{gt})-S_{i}\Arrowvert_{1} \end{array} \end{array} $$
(6)

Here, Igt and Sgt represent real images and real structured images, respectively. Φi represents the i-th selected activation layer of the VGG19 [17] network. Fi represents the feature maps with the same size as Φi(Igt), predicted by the deconvolution layer of the texture enrichment branch. Si represents the structure feature maps with the same size as Φi(Sgt), predicted by the deconvolution layer of the structure reconstruction branch.
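As an illustration, the depth-supervised perceptual loss of the texture branch could be computed as in the sketch below; the torchvision layer indices for relu2_1 and relu3_1 (6 and 11 in VGG19's `features`) and the interface for passing the decoder features are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class DeepSupervisedPerceptualLoss(nn.Module):
    """Sketch of Eq. (6): L1 distance between VGG19 features of the ground
    truth and the intermediate decoder predictions."""

    def __init__(self, layer_ids=(6, 11)):   # assumed indices of relu2_1, relu3_1
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def extract(self, img):
        feats, x = [], img
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, gt_image, predicted_feats):
        # predicted_feats: decoder feature maps F_i, assumed to already match
        # the shapes of Phi_i(I_gt) and to be given in the same layer order
        loss = 0.0
        for phi, f in zip(self.extract(gt_image), predicted_feats):
            loss = loss + nn.functional.l1_loss(f, phi)
        return loss
```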

3.4.2 Structure reconstruction branch loss

The pixel-to-pixel loss of the structure reconstruction branch \(L_{l1}^{s}\) is defined as the L1 loss between the reconstructed structured image \(\hat S_{re}\) and the real structured image Sgt, which constrains the major content in the missing regions, as shown in (7).

$$ \begin{array}{@{}rcl@{}} L_{l1}^{s}=\Arrowvert S_{gt}-\hat S_{re}\Arrowvert_{1} \end{array} $$
(7)

In fact, the image inpainting task is an ill-posed problem with multiple feasible restoration results in the missing regions. To make the repaired results look more realistic and contain more details, we use the generative adversarial loss \(L_{adv}^{s}\) [13]:

$$ \begin{array}{@{}rcl@{}} L_{adv}^{s}=E[log(1-D_{s}(G_{s}(I_{in},S_{in},M)))]+E[log(D_{s}(S_{gt}))] \end{array} $$
(8)

Here, Iin denotes the broken input image, and Sin denotes the edge-preserving smoothed structure of the corrupted image. Sgt represents the real structured image. M represents the binary mask, in which a pixel value of 0 represents the background and a pixel value of 1 represents the missing region. Gs is our structure reconstruction branch. Ds is the structure discriminator, which judges whether the restored structured image matches the real one: if the input is identified as real, the output is 1; otherwise, the output is 0. The ideal output is 0.5, i.e., the restored image is so realistic that it fools the discriminator. In this paper, we adopt PatchGAN [20] as our discriminator to judge the authenticity of all image patches instead of the whole image.
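Since the text only states that PatchGAN is adopted, the sketch below shows a generic PatchGAN-style discriminator with common default depths and widths; it is an illustration under those assumptions rather than the authors' exact architecture.

```python
import torch.nn as nn

def patchgan_discriminator(in_channels=3, base=64):
    """Generic PatchGAN discriminator in the spirit of [20]; the number of
    downsampling stages and channel widths are assumed defaults."""
    layers, ch = [], base
    layers += [nn.Conv2d(in_channels, ch, 4, stride=2, padding=1),
               nn.LeakyReLU(0.2, inplace=True)]
    for _ in range(2):
        layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    # final 1-channel map: each output value judges one image patch as real or fake
    layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
    return nn.Sequential(*layers)
```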

Eventually, the structure reconstruction branch is trained jointly using (9), where \(\lambda _{l1}^{s}\), \(\lambda _{adv}^{s}\) and \(\lambda _{deep}^{s}\) are set to 4, 1 and 0.01, respectively.

$$ \begin{array}{@{}rcl@{}} \mathop{min}\limits_{G}\mathop{max}\limits_{D}L^{s}(G,D)=\lambda_{l1}^{s}L_{l1}^{s}+\lambda_{adv}^{s}L_{adv}^{s}+\lambda_{deep}^{s}L_{deep}^{s} \end{array} $$
(9)

3.4.3 Texture enrichment branch loss

The pixel-to-pixel loss of the texture enrichment branch \(L_{l1}^{t}\) is defined as the L1 loss between the reconstructed image \(\hat I_{re}\) and the real image Igt, as shown in (10):

$$ \begin{array}{@{}rcl@{}} L_{l1}^{t}=\Arrowvert I_{gt}-\hat I_{re}\Arrowvert_{1} \end{array} $$
(10)

Additionally, a generative adversarial loss is added to the texture enrichment branch, as shown in (11), where Gt is the texture enrichment branch that generates the final restored image containing rich textures. The discriminator Dt has the same structure as Ds and determines whether the given input is real or fake.

$$ \begin{array}{@{}rcl@{}} L_{adv}^{t}=E[log(1-D_{t}(G_{t}(I_{in},M)))]+E[log(D_{t}(I_{gt}))] \end{array} $$
(11)

To ensure that the repaired image matches the human visual system, the perceptual loss \( L_{per}^{t}\) [15] is defined in (12), where Φi() represents the activation layers of the VGG19 [17] network; {relu1_1}, {relu2_1}, {relu3_1}, {relu4_1} and {relu5_1} are selected in this paper. As shown in (13), the style loss \(L_{style}^{t}\) [15] is also included in the texture enrichment branch to reduce the checkerboard artifacts caused by the perceptual loss and to keep the texture style consistent with the real image. \(G_{j}^{\phi }\) is a style (Gram) matrix of size C × C. Because a model trained with a small style loss weight produces many fish-scale artifacts, the style loss weight is set much larger than those of the other losses.

$$ \begin{array}{@{}rcl@{}} L_{per}^{t}&=&E[{\sum\limits_{i}^{P}}\frac{1}{N_{i}}\Arrowvert {\Phi}_{i}(I_{gt})-{\Phi}_{i}(\hat I_{re}) \Arrowvert_{1}] \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} L_{style}^{t}&=&E_{j}[\Arrowvert G_{j}^{\phi}(I_{gt})- G_{j}^{\phi}(\hat I_{re}) \Arrowvert_{1}] \end{array} $$
(13)
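A small sketch of the Gram-matrix style loss in (13) follows; the 1/(CHW) normalization of the Gram matrix is a common convention and an assumption here.

```python
import torch

def gram_matrix(feat):
    """Gram (style) matrix of size C x C from a feature map (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)   # assumed normalization

def style_loss(feats_gt, feats_pred):
    """Sketch of Eq. (13): L1 distance between Gram matrices of VGG features
    of the real and restored images, summed over the selected layers."""
    return sum(torch.nn.functional.l1_loss(gram_matrix(p), gram_matrix(g))
               for g, p in zip(feats_gt, feats_pred))
```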

Finally, the texture enrichment branch is trained jointly using (14), where \(\lambda _{adv}^{t}\), \(\lambda _{l1}^{t}\), \(\lambda _{per}^{t}\), \(\lambda _{sty}^{t}\) and \(\lambda _{deep}^{t}\) are set to 1, 5, 0.01, 180 and 0.1, respectively, in the experiments.

$$ \begin{array}{@{}rcl@{}} \mathop{min}\limits_{G}\mathop{max}\limits_{D}L^{t}(G,D)&= & \lambda_{l1}^{t}L_{l1}^{t}+\lambda_{adv}^{t}L_{adv}^{t}+ \lambda_{deep}^{t}L_{deep}\\ && +\lambda_{sty}^{t}L_{style}^{t}+\lambda_{per}^{t}L_{per}^{t} \end{array} $$
(14)

The total loss is shown in (15) and is inspired by PEPSI [11], which slightly reduces the weight of one path's loss to focus on the other path. At the beginning of training, we want to provide a strong structure prior to the texture enrichment branch, so the penalty on the structure reconstruction branch is strong. As training proceeds, we gradually focus on image detail restoration, and the penalty on the structure reconstruction branch slowly weakens. I represents the current iteration, and Imax represents the maximum number of iterations.

$$ \begin{array}{@{}rcl@{}} L_{total}=\mathop{min}\limits_{G}\mathop{max}\limits_{D}L^{t}(G,D)+(1-\frac{I}{I_{max}})\mathop{min}\limits_{G}\mathop{max}\limits_{D}L^{s}(G,D) \\ \end{array} $$
(15)
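The linear decay of the structure penalty in (15) can be expressed as the small sketch below; the variable names are placeholders.

```python
def total_loss(texture_loss, structure_loss, iteration, max_iterations):
    """Sketch of Eq. (15): the structure-branch penalty decays linearly from
    1 to 0 over training, so the network gradually focuses on texture details."""
    structure_weight = 1.0 - iteration / max_iterations
    return texture_loss + structure_weight * structure_loss
```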

4 Experiment

4.1 Implementation details

We train our method on the Places2 [10], CelebA-HQ [46] and Paris [23] datasets. Places2 contains more than 10 million images covering 365 natural scenes, and we follow the original training and validation splits. The CelebA-HQ dataset consists of 30,000 highly structured face images at a resolution of 1024∗1024, of which 3,000 are randomly selected as the test set. The Paris dataset consists of 6,000 images of Paris street buildings, of which 50 belong to the test set. For masks, we use the challenging irregular mask dataset provided by [16]. All irregular masks and images used for training and testing are resized to 256 ∗ 256.

During training, we use the Adam optimizer [41] with β1 = 0 and β2 = 0.999. The batch size and learning rate are set to 8 and 1 × 10^-4, respectively. Our method is implemented in PyTorch and runs on a single NVIDIA 3090 GPU. In addition, an end-to-end training strategy is adopted, so the structure reconstruction branch and the texture enrichment branch are trained simultaneously. Training the CelebA-HQ, Paris and Places2 models took 3 days, 2.5 days and 10 days, respectively.
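A sketch of the corresponding optimizer setup is shown below; structure_branch and texture_branch are placeholder module names, and we assume the reported learning rate means 1e-4.

```python
import torch

# structure_branch and texture_branch are placeholders for the two generator branches.
params = list(structure_branch.parameters()) + list(texture_branch.parameters())
# Adam with beta1 = 0, beta2 = 0.999 and an assumed learning rate of 1e-4
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.0, 0.999))
```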

To obtain a smoothed structure of the ground truth, a rolling guidance filter (RGF) [14] is utilized to process the real images; it preserves critical structures and edges while removing texture details. As a real-time edge-preserving method, its parameters σs and σr control the spatial and range scales of the smoothing window: the larger σs and σr are, the more details are smoothed. Here, we set σs to 3 and σr to 0.05.

4.2 Quantitative comparisons

4.2.1 Objective evaluation

The image inpainting task lacks objective evaluation metrics that accurately reflect perceived image quality. However, when comparing our method with others, objective metrics are essential because they are relatively fair and independent of human preference. Thus, the commonly used metrics PSNR, SSIM and LPIPS are adopted for quantitative comparison. PSNR measures the difference between the reference image and the predicted result based on pixel-level errors; the larger the value, the less distorted the image. SSIM evaluates the image in terms of luminance, contrast and structure, with values in [0,1]; the closer the value is to 1, the closer the image is to the reference. LPIPS [5], based on AlexNet, measures the distance between the deep features of the restored image and those of the real image. The calculation of LPIPS is shown in (16).

$$ \begin{array}{@{}rcl@{}} D(X_{gt},X_{re})=&\sum\limits_{i}\frac{1}{H_{i} \times W_{i}}\sum\limits_{h,w}\Arrowvert W_{i}\odot(f_{gt}^{i}-f_{re}^{i}) \Arrowvert_{2} \end{array} $$
(16)

The specific procedure is as follows. First, we use the \(relu{1} \sim relu{5}\) layers of AlexNet to extract the feature stacks \(f_{gt}^{i}\) and \(f_{re}^{i}\) from the ground truth and the generated image, respectively, with \(f_{gt}^{i}\), \(f_{re}^{i} \in R^{H_{i} \times W_{i} \times C_{i}}\) for activation layer i (i = 1\(\sim \)5). Second, we scale the channels of each feature by the vector Wi and compute the l2 distance to obtain a difference map; Wi is implemented as a 1 ∗ 1 convolution layer whose output channel is set to 1. Finally, all stacked difference maps are averaged over the spatial dimension and accumulated over the channel dimension.
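The sketch below illustrates this LPIPS-style computation with AlexNet features; the torchvision layer indices and the randomly initialized 1∗1 weights are assumptions (the official LPIPS implementation uses learned weights), so it is a simplified variant of Eq. (16) rather than the exact metric.

```python
import torch
import torch.nn as nn
import torchvision

class LPIPSLike(nn.Module):
    """Simplified sketch of the LPIPS computation in Eq. (16) with AlexNet."""

    def __init__(self, layer_ids=(1, 4, 7, 9, 11)):  # assumed relu1~relu5 indices
        super().__init__()
        net = torchvision.models.alexnet(pretrained=True).features.eval()
        for p in net.parameters():
            p.requires_grad_(False)
        self.net, self.layer_ids = net, list(layer_ids)
        # one 1x1 convolution with output channel 1 per selected layer (W_i);
        # here randomly initialized, whereas LPIPS learns these weights
        channels = [64, 192, 384, 256, 256]
        self.weights = nn.ModuleList(nn.Conv2d(c, 1, 1, bias=False) for c in channels)

    def extract(self, x):
        feats = []
        for i, layer in enumerate(self.net):
            x = layer(x)
            if i in self.layer_ids:
                # unit-normalize along the channel dimension
                feats.append(x / (x.norm(dim=1, keepdim=True) + 1e-8))
        return feats

    def forward(self, x_gt, x_re):
        dist = 0.0
        for f_gt, f_re, w in zip(self.extract(x_gt), self.extract(x_re), self.weights):
            diff = w((f_gt - f_re) ** 2)                      # weighted squared difference map
            dist = dist + diff.mean(dim=(2, 3)).sum(dim=1)    # spatial average, channel sum
        return dist
```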

We compare our approach with several state-of-the-art models on the CelebA-HQ [46], Places2 [10] and Paris [23] datasets: EdgeConnect (EC, ICCVW 2019) [29], CSA (ICCV 2019) [9], SF (ICCV 2019) [33], RFR (CVPR 2020) [32] and MEDEF (ECCV 2020) [35]. For the Paris dataset, we adopt the pretrained models of EC [29], MEDEF [35] and RFR [32] from their official websites, and we train the CSA model ourselves using the official code and default parameters. For the Places2 dataset, the pretrained models of EC [29], SF [33] and MEDEF [35] are taken from their official websites. For the CelebA-HQ dataset, since the official pretrained CelebA models of the comparison methods [9, 29, 32] do not generalize to CelebA-HQ, we retrained them on CelebA-HQ using the default parameters and official code, and we report their best results. Additionally, for a fair evaluation, we conduct experiments on the same irregular holes provided by Liu [16]; these masks are classified by hole-to-image area ratio (e.g., 0-10%, 10-20%, etc.). During testing, the comparison methods and our method are evaluated on the same test sets and irregular masks. The results are shown in Tables 1 and 2.

Table 1 Quantitative evaluation of the CelebA-HQ dataset and Places2 dataset
Table 2 Quantitative evaluation of the Paris dataset under large holes, and the mask ratio is 40%-60%

4.2.2 Subjective evaluation

The objective evaluation may be inconsistent with the subjective perception of people. In this section, subjective evaluation, as an essential complement to objective evaluation indicators, is utilized to assess our proposed method.

Depending on whether the ground truth is shown, we divide the user study into two experiments: no reference (NR) and full reference (FR). No reference means that users do not know what the real image looks like or where the mask is; full reference is the opposite. Specifically, we invited 20 participants, some working in image processing and some not. In the first experiment, participants are asked to judge whether each image in a random set containing repaired images and ground truth is real. The results are summarized in Table 3, where the numbers represent the probability that completed images are regarded as real. We can see that the images inpainted by our method are considered the most realistic under all mask ratios. In the second experiment, participants rate the quality of displayed images that are repaired from one broken image by our method and the comparison methods. Table 4 shows the ranking results, where each number represents the probability that the image restored by a given model receives that rank (No. 1, No. 2, ...). From Table 4, we conclude that the images completed by our method have the highest probability of being judged the best, indicating that our model generates more natural images. Notably, the images used for subjective evaluation come from subsets of CelebA-HQ and Paris, and all images are shown to participants without a time limit.

Table 3 NR scores for various mask ratios on CelebA-HQ and Paris
Table 4 FR scores for various mask ratios on CelebA-HQ and Paris

4.3 Qualitative comparisons

Qualitative comparison assesses image quality in an intuitive way. Figures 6, 7 and 8 show the results of our method compared with other state-of-the-art methods on the CelebA-HQ, Places2 and Paris datasets, respectively. Compared with the other methods, the images repaired by our method show fewer noticeable discontinuous structures and blurry textures and appear the most genuine to the human visual system. In Fig. 6, we find that EC easily generates irrational structures and CSA fails to generate clear textures. We conjecture that when the missing hole is large, it is difficult for EC to repair the edges of the missing areas accurately, and CSA has insufficient information to understand the connections between pixels in the missing region. For RFR, although recurrent feature reasoning is suitable for repairing large holes, the restored images still contain some unnatural content.

Fig. 6

Qualitative results on the CelebA-HQ dataset

Fig. 7

Qualitative results on the Places2 dataset

Fig. 8

Qualitative results on the Paris dataset

Combining Fig. 6 with Table 1, we make two observations. First, a higher PSNR value does not necessarily mean better visual quality for the human eye. For example, the images repaired by CSA look visually poor, but their PSNR values are not low; thus, there is a gap between objective indicators and subjective perception. Second, as the missing area increases, the advantages of our model on the objective metrics gradually emerge. As the mask ratio increases and the information in known areas is reduced, other methods generate more apparently distorted pixels than our method, so they are inferior to our model in both objective indicators and subjective perception.

4.4 Ablation study

In this study, the quantitative metrics are calculated on the validation images of the CelebA-HQ dataset [46].

4.4.1 The effectiveness of mutual adaptive guidance

To demonstrate the effectiveness of mutual adaptive guidance, we compare PAGN with several variants: a two-stage network, a network with one-way guidance, and a network without the guidance filter. The two-stage network divides image inpainting into two stages without end-to-end training: the first stage recovers the damaged structures, and the second stage recovers the textures based on the recovered structures. The network with one-way guidance means that the guidance direction in the encoder is the same as that in the decoder. The ablation results in Table 5 and Fig. 9 show the inpainting performance of the full network and the corresponding variants.

Table 5 The effectiveness of PAGN
Fig. 9

Effect of the guidance filter on inpainting results, and the mask ratio is 50%-60%

The objective results are given in Table 5 and demonstrate that the proposed PAGN performs better than the other variants in terms of PSNR, SSIM and LPIPS. Figure 9 shows the visual results. Observing the mouth in the first row, the network without the guidance filter generates wrong semantic structures, and the network with one-way guidance produces inconsistent colors. The results in the second row show that our PAGN produces less blurring around the hair and ear.

4.4.2 Effect of joint-contextual attention mechanism

To investigate the effectiveness of cross-scale contextual attention (cs-ca), we perform two groups of comparison experiments. The first group uses only in-scale contextual attention (is-ca) [25] in the texture enrichment branch. The visual comparison is shown in Fig. 10. Observing the left eye and mouth of the first man in the top row and the ear of the second man in the bottom row, we can see that the model with Joint-CAM recovers more accurate structures and realistic textures with fewer artifacts. In the second group, we adopt the previous high-level feature map instead of the downsampled map to explore the relationship between different-scale features. The objective metrics are compared in Table 6.

Fig. 10

Results of different contextual attention

Table 6 The effectiveness of Joint-CAM, and the mask ratio is 50%-60%

In addition, we further demonstrate the effectiveness of the proposed Joint-CAM in feature space. Since grayscale feature maps are easier to observe, we display the input and output feature maps of the Joint-CAM module in grayscale. As shown in Fig. 11, the feature maps processed by Joint-CAM have more reasonable structures, clearer textures and brighter colors.

Fig. 11

The visual contrast diagram of the Joint-CAM

4.4.3 Effectiveness of the attention-based multiscale perceptual Res2Block

AMPR is used to capture features at different scales and improve model generalization. We conducted experiments to evaluate the importance of the different components of AMPR. Table 7 shows that both PSNR and SSIM are highest when all components are included. As shown in Fig. 12, as the mask ratio increases, the areas repaired by the network without AMPR show more obvious texture artifacts. Notably, in the case without AMPR, a convolutional layer is applied at the end of the encoder instead.

Table 7 The effectiveness of AMPR and the mask ratio is 50%–60%
Fig. 12

Visual comparison results with or without AMPR, and the mask ratio is 30%-40%. Among the images without AMPR, blurry textures exist in the neck and eyes

To further validate AMPR, we show a feature visualization comparison in Fig. 13. Specifically, we upsample the feature maps from 32 × 32 to 64 × 64 and display them in color (COLORMAP_JET) for better observation. One set of maps is generated by an encoder whose last layer is a convolutional layer, and the other by an encoder whose last layer is AMPR. We find that the feature maps generated by AMPR have larger receptive fields than those of the convolutional layer. Moreover, the key points and missing regions are localized more accurately when the AMPR module is applied.

Fig. 13

The visual contrast diagram of the AMPR

5 Conclusion and future work

In this paper, we introduce the parallel adaptive guidance network (PAGN), which repairs broken structures and textures separately in a parallel manner within one stage. Structural and texture features guide each other through skip connections with guidance filters, which allows each branch to focus on communicating useful features. Furthermore, in the texture enrichment branch, we apply the joint-contextual attention mechanism (Joint-CAM), which leverages limited context from multiple perspectives, making it easier to produce details and accurate structures in missing regions. Finally, to give our inpainting network robust feature representations, a novel multiscale structure called the attention-based multiscale perceptual res2block (AMPR) is adopted in the bottleneck to extract features at different levels and at a finer granularity. Experiments on public datasets verify the effectiveness of the proposed model, which is especially suitable for repairing large holes.

In future work, we aim to combine inpainting tasks with super resolution to efficiently repair high-resolution images with complex textures and rich colors.