1 Introduction

Classical Chinese landscape paintings are invaluable treasures of Chinese culture, but many of the surviving works have been damaged or defaced due to historical events and storage conditions. Although artists have tried to restore them through imitation, they have struggled to replicate the original painters' unique styles because of their own individual painting habits. With recent advances in machine learning and image processing, leveraging such technologies to restore ancient Chinese landscape paintings has emerged as a promising solution, as shown in Fig. 1.

Figure 1

The final restored Dwelling in the Fuchun Mountains. A portion of the original painting was lost in a fire, dividing it into two sections, i.e., the WuYongShi Scroll and the ShengShan Scroll

Image inpainting, which aims to fill in missing parts of images, is a typical method used to restore paintings. However, restoring Chinese landscape paintings presents three significant challenges that current inpainting approaches [1–10] cannot directly address. First, Chinese landscape paintings often comprise intricate natural scenes, such as trees and rocks, so generating missing regions based solely on the remaining parts of an image may lead to nonsensical results or disrupt the overall coherence of the painting. Second, Chinese landscape paintings are typically created using only black ink, with brushstrokes determined by changes in ink intensity, which can be challenging for models to learn. Finally, there are usually very few surviving works of a given artist; therefore, learning an artist's unique style from limited data is a formidable challenge.

To address these challenges, we propose an inpainting model for Chinese landscape paintings, concentrating especially on the restoration of the masterpiece Dwelling in the Fuchun Mountains, one of the most well-known classical Chinese landscape paintings, by the early 14th-century master Huang Gongwang (1269-1354). Our main idea is to hierarchically restore the missing parts, from object structure to stroke details. We therefore name our framework Hierarchical Painter; it not only restores the missing objects but also generates vivid strokes that represent the painter's style. Specifically, we integrate the structure of a 17th-century imitation of the original painting, the ZiMing Scroll, rather than generating content from scratch, to avoid nonsensical and inconsistent results. Our approach also utilizes well-designed edge detection and detail generation methods to accurately reflect the intensity of brushstroke ink. To enhance details and generate content with fine-grained styles, we segment trees, a key feature of Chinese landscape paintings, and generate them separately. In this way, we successfully generate the missing portion of Dwelling in the Fuchun Mountains and seamlessly merge the generated results with the existing portions to complete the restoration, as shown in Fig. 1. Notably, our framework needs only a single painting for training, unlike many data-hungry image inpainting methods. This also makes our framework easily scalable to other damaged Chinese landscape paintings.

In summary, our contributions are threefold:

  1)

    To the best of our knowledge, apart from commercial tools, ours is the first image inpainting model specially designed for restoring Chinese landscape paintings. This broadens the application of the traditional image inpainting task and sets a strong baseline for future research.

  2)

    We designed an image processing algorithm that adapts edge detection for better extraction of structural information from Chinese landscape paintings. We further exploit the estimated edges in a hierarchy-aware image-to-image translation algorithm, with which we restore fine-grained brushstrokes and painting styles without large-scale training.

  3)

    We conducted extensive experiments on Dwelling in the Fuchun Mountains, which demonstrated the effectiveness of the proposed method, especially in detail restoration and style preservation.

2 Related work

2.1 Image generation

Image generation in our case specifically means unsupervised learning [11] and seed-based image synthesis [12, 13]. Generative adversarial networks [14], or GANs, are typical tools used for this purpose [15–17]. Previous works have already shown the feasibility of this method [18–23] and have proposed improvements and adaptations. Deep convolutional generative adversarial networks [24], or DCGAN, introduce convolutional neural networks and batch normalization [25] into unsupervised learning. DCGAN has been adapted in a wide variety of implementations [26–32], and their generative results show promising quality and diversity [33, 34].

StyleGAN [35] enhances control over specific details of the synthesized image. Its style-based generator architecture [35] allows for better preservation of semantic details [13, 36, 37]. Its variant, StyleGAN3, fixes the unhealthy dependence on absolute pixel coordinates [38–40], solving aliasing in the signal processing of image synthesis. StyleGAN-XL [41] utilizes super-resolution [42] and stem-leaf staged training [13], producing meticulous images at up to \(1024\times 1024\) resolution.

Despite the outstanding performance of current state-of-the-art image generation models, they are not suitable for our Chinese landscape painting restoration task due to their uncontrollable results. Moreover, such models are often built upon extra-large datasets and intense training resources [24, 43], and are not suitable for our task with limited data.

2.2 Edge detection

Edge detection is used to determine the contours of objects in an image. Currently, there are two popular types of edge detection algorithms: gradient-based [44–48] and CNN-based [49–57]. The Canny edge detector [46], a popular gradient-based algorithm, adopts a Gaussian filter to smooth the image, thresholds the intensity gradient magnitude to suppress spurious responses, and applies a double threshold to find potential edges. The rich convolutional features (RCF) method [50], a CNN-based approach, extracts features from both the final layer and the intermediate layers. However, neither type is suitable for direct use in our Chinese landscape painting restoration task: gradient-based methods fail to capture the complex brushstrokes, while CNN-based approaches need to be trained on massive datasets.

2.3 Image inpainting

Image inpainting aims to fill in missing or corrupted parts of an image with plausible content. Current methods can be classified into three categories: patch-based, diffusion-based, and GAN-based. Criminisi et al. [58] first proposed a patch-based inpainting method that samples patches from known parts of the image and finds the best match for the missing parts. This idea has been extended by several subsequent approaches [1–3]. Diffusion-based approaches [4–6] render missing regions based on the appearance of neighboring regions. Neither type is suitable for our Chinese landscape painting restoration task, since predicting an unknown portion of a painting from the existing ones is an ill-posed problem.

Some later GAN-based methods [7, 8] reached state-of-the-art performance by leveraging the generative adversarial network [14]. A generator is utilized to synthesize the missing parts directly, while a discriminator is adopted during adversarial training to assess the realism of the generated results. However, these GAN-based methods cannot be used directly to restore Chinese landscape paintings because they all need to be trained on massive datasets.

2.4 Image-to-image translation

Image-to-image translation refers to transforming an arbitrary given image from one domain to another while the two images share a certain style or structure. Pix2Pix [59] introduces the conditional GAN [60] into image translation and was upgraded by Wang et al. [61] with multi-scale and coarse-to-fine discriminators. CycleGAN [62] treats unaligned training sets as style domains and reduces the difficulty of constructing datasets. StarGAN [63] allows multi-domain training at the same time. Apart from these GAN-based approaches, VAE [64] uses a deep generative deconvolutional network as a decoder and a convolutional neural network as an encoder to realize both supervised and unsupervised image translation. Our method leverages Pix2Pix [59] to generate the missing region, together with several designs of ours, including the choice of proper image domains and normalization.

3 Methods

3.1 Overview

To address the challenges posed by the unique characteristics of Chinese landscape paintings, we propose the Hierarchical Painter, a multi-hierarchical inpainting model for Chinese landscape painting restoration. Figure 2 demonstrates the overall pipeline of the proposed method. First, we design a normalization algorithm for Chinese landscape paintings that purifies backgrounds and eliminates noise, seals, and inscriptions (Sect. 3.2). Then, we generate the background and details separately to ensure restoration performance in both overall structure and detailed texture. Specifically, we offer distinct designs to efficiently extract the structure and obtain the segmentation of details (Sect. 3.3 and Sect. 3.4). Finally, we integrate the superposition of the background and details into the original image with inpainting models to maintain the overall consistency of the final restored image (Sect. 3.5).

Figure 2

The pipeline of the proposed hierarchical painter. Both the imitation image and the original image are pre-processed. The background and details of the painting are generated separately before the generated image is connected with the original portion to get the final restored image

3.2 Image pre-processing

Chinese landscape paintings share a common attribute in that their contents are grayscale, and the background is typically blank. However, due to technological limitations in the conservation of paintings during ancient times, many surviving works have backgrounds that contain unwanted noise. These noise artifacts can have a detrimental effect on the performance of our model. To address this challenge, we pre-processed the input images by reducing or eliminating irrelevant noise and unifying the background color. This allowed us to achieve more accurate and precise inpainting results, even in the presence of challenging noise artifacts.

We first calculated the mean value μ of the gray images. Then, we set a threshold value \(\epsilon =\mu +3\), with 3 being a hyperparameter of compensation. ϵ is used in the filter \(F_{\epsilon}\) to eliminate the noise in the background and unify its color purity. For each pixel \(a_{ij}\), we apply

$$ F_{\epsilon}(a_{ij})=\max \{a_{ij},\epsilon \}. $$
(1)

To normalize each input image, the background should be unified to white, while the color-depth relationship of the content within each image should also be preserved. Thus, we design a linear function \(\Phi _{\gamma ,\epsilon}\) to set the background to be pure white:

$$ \Phi _{\gamma ,\epsilon}(a_{ij})=a_{ij}\cdot \gamma /\epsilon , $$
(2)

where \(\gamma =255\), the value of white in RGB. To enlarge the contrast among lines, a mapping is designed so that pixel values greater than μ are mapped close to γ, while values smaller than μ are mapped close to 0. Inspired by [65], we design a mapping to enlarge the image color contrast:

$$ E_{\varphi}(a_{ij})=\bigl\lceil \mu \arctan \bigl(\varphi (a_{ij}-\gamma /2)\bigr)+ \gamma /2\bigr\rceil , $$
(3)

where φ is a parameter that controls the mapping range. The effect of such transformation is shown in Fig. 3 from (a) to (c).
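For concreteness, the three transformations above can be composed as in the following sketch. This is a minimal NumPy implementation assuming an 8-bit grayscale input; the contrast parameter value and the strictly sequential application of Eqs. (1)-(3) reflect our reading of Fig. 3(a)-(c), and the function name is illustrative.

```python
import numpy as np

def preprocess(gray: np.ndarray, phi: float = 0.02) -> np.ndarray:
    """Background purification and contrast enlargement, a sketch of Eqs. (1)-(3)."""
    gamma = 255.0                       # gray value of white
    mu = gray.mean()                    # mean gray value of the image
    eps = mu + 3                        # threshold with the compensation term 3

    # Eq. (1): lift every pixel below epsilon up to epsilon (noise removal)
    f = np.maximum(gray.astype(np.float64), eps)

    # Eq. (2): linear rescaling so the background becomes pure white
    normed = f * gamma / eps

    # Eq. (3): arctan mapping that enlarges contrast around gamma / 2
    enhanced = np.ceil(mu * np.arctan(phi * (normed - gamma / 2)) + gamma / 2)

    return np.clip(enhanced, 0, 255).astype(np.uint8)
```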

Figure 3

Demonstration of the image pre-processing process

Figure 4 compares images generated through the method given in Sect. 3.5 using datasets with and without pre-processing. We find that the output image without preprocessing displays evident inconsistencies in colors, which provides evidence for the necessity of pre-processing to keep colors consistent.

Figure 4

Comparison between the results using datasets with and without pre-processing

3.3 Background generation

Background generation is a key component of our proposed model, as it aims to transfer the style of objects in the background of the imitation copy to those in the original painting. In our specific case, it is critical to preserve the content’s structure in the background of the ZiMing Scroll, including the trend of mountains and rivers, while also translating the style of the content, such as shading, ink techniques, and stroke texture, to match the original painting’s style. By incorporating these factors into our background generation algorithm, we were able to achieve highly realistic and faithful restorations of Chinese landscape paintings.

Conventional methods for style translation often produce blurred outputs and require large training datasets that are difficult to construct [61].

To address this, we split this process into two stages: structure extraction and style translation. This method ensures a smooth and consistent style translation, preserving the integrity and continuity of the artwork. To be more specific, we first transform the imitation image to its edge map (structure extraction), then we convert this map to the style of the original image (style translation).

Selecting the appropriate edge maps is crucial. The algorithm to generate the edge maps should be designed to have the following characteristics:

  1)

    Be able to extract the structure of the main content in the painting, such as the trends of lines composing mountains and rivers.

  2)

    Be able to abstract the content and filter style information, including shading, ink techniques, and stroke texture.

  3)

    Be able to create a reversible mapping model between the edge map and the original painting based on existing research.

  4)

    Be able to keep the entire consistency between images after combination.

In the following subsections, we will present a comprehensive approach for selecting an appropriate edge map and will explain its functionality in detail. Additionally, we describe the methodology used for achieving style translation based on previous research.

3.3.1 Structure extraction

Chinese landscape paintings are renowned for their unique composition, which consists solely of lines that form complex shapes and structures. However, extracting the structure of these paintings is a challenging task that cannot be accomplished easily using common edge detection methods, such as Canny, or CNN-based methods, like RCF. The reason for this difficulty is that these methods typically detect the contours of objects in an image, rather than the lines themselves. In the case of Chinese landscape paintings, the lines represent the structure of the painting, making it challenging to apply traditional edge detection techniques. To overcome this challenge, we investigated the XDoG algorithm [65], which can directly extract the lines from the painting, resulting in a more accurate representation of its structure. By incorporating this algorithm into our proposed model, we were able to achieve highly faithful and detailed restorations of Chinese landscape paintings.

XDoG [65] applies two Gaussian-blur functions \(G_{\sigma}\) and \(G_{k\sigma}\) with different kernel sizes σ and kσ to determine the edges of an image, where k is a preset constant. Intuitively, regardless of how large σ is, the blurred image always preserves the rough location and structure information of the main body within an image, and the difference between the two blurred images with kernel sizes σ and kσ, i.e. \(\Delta G_{\sigma , k} = G_{k\sigma} - G_{\sigma}\), provides information about the main body while clearing details and noise [65]. With a larger kernel size difference, the output of the main body becomes more abstract, so more structure information is preserved while details are removed [66]. With a suitable (σ, k), the shape of the lines is modified, losing details of the style features but making the outline of the structure simpler and more abstract, as demonstrated in Fig. 3(d).

XDoG can extract the content’s structure of the painting clearly with a controllable level of abstraction. In addition, one advantage of XDoG is that it can work on an entire high-resolution image, and the abstraction process takes the whole painting into consideration. This attribute ensures the entire structure and abstraction consistency. Therefore, we choose XDoG as the algorithm for structure extraction in our model. In Sect. 5.4, we will describe in detail its advantages over other common edge detection methods in structure extraction.
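A minimal sketch of the difference-of-Gaussians step underlying this choice is given below. The parameter values are illustrative rather than the ones used in our experiments, and the hard threshold at the end simplifies the soft thresholding of the full XDoG formulation [65].

```python
import cv2
import numpy as np

def dog_structure(gray: np.ndarray, sigma: float = 2.0, k: float = 4.0,
                  thresh: float = 4.0) -> np.ndarray:
    """Difference-of-Gaussians structure map following
    Delta G_{sigma,k} = G_{k*sigma} - G_{sigma}; a larger k gives a more
    abstract map. The hard binarization is a simplification of XDoG [65]."""
    img = gray.astype(np.float64)
    g_small = cv2.GaussianBlur(img, (0, 0), sigma)        # G_sigma
    g_large = cv2.GaussianBlur(img, (0, 0), k * sigma)    # G_{k*sigma}
    delta = g_large - g_small                             # Delta G_{sigma,k}
    # keep pixels where the two blurred images differ noticeably (line structure)
    edges = np.where(np.abs(delta) > thresh, 0, 255)      # dark lines on white
    return edges.astype(np.uint8)
```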

3.3.2 Style translation

In this step, we translate the edge map to painting images and choose Pix2Pix [59] as our prototype.

Technically, an encoder-decoder with skip connections between mirrored layers is applied to the generator and the Markovian discriminator (PatchGAN) is used for the discriminator. In our task, the inputs are the edge-mapped imitation images, and the outputs are the fake images in the original images’ style, as displayed in Fig. 5.

Figure 5

Style translation through XDoG and Pix2Pix-based models

Notably, we do not choose more advanced models such as Pix2Pix HD [61] and CycleGAN [62]. While these models have made significant improvements, such as supporting non-aligned training sets, reducing the need for paired data, and optimizing image resolution with coarse-to-fine generators, they suffer from excessive freedom in the output content. In our pipeline, the original images are cropped into patches for training due to the large resolution of the original painting; in the test phase, the generated patches then need to be composed into the final output image. Consequently, if the output has too much freedom, it can easily result in discontinuity of content, as shown in Fig. 6. In addition, as the background images in our dataset contain fine details, models that require strong constraints on detailed information are not suitable. In summary, we adopt a Pix2Pix-based model because it strikes a balance between structure conservation and style translation.

Figure 6

Examples of the stitching effect tested on (a) ZiMing Scroll (left) and (b) WuYongShi Scroll (right) using CycleGAN

3.4 Details generation

The image-to-image translation model is effective for natural images but less effective in restoring Chinese landscape paintings because of the complex structure of foreground trees, which appear as dense, block-like entities. The current model is unable to identify each stroke separately within these blocks. Given the unique characteristics of trees, additional image processing and model learning are required to learn the color transitions within them.

To address this issue, we propose a solution that includes creating a segmentation mapping to classify each pixel within the images and provide more information on how the color transitions within each stroke. Then, we employ the SPADE [67] model to learn the color style within each class and how the color transitions between classes.

3.4.1 Segmentation mapping

To learn how color transitions from light to dark within each stroke, and especially to determine the stroke style of trees inside the painting, stroke information should be extracted and classified from the perspective of color depth. Segmentation mapping is designed to map each pixel \(a_{ij}\) to a color class: shallow class \(c_{s}\), transition class \(c_{t}\), dark class \(c_{d}\), and background class \(c_{b}\), i.e. \(a_{ij}\mapsto \{c_{s}, c_{t}, c_{d}, c_{b}\}\). With the mean value μ of the image and two preset thresholds \(\epsilon _{1}\) and \(\epsilon _{2}\), the model maps each pixel \(a_{ij}\) to different color-depth features with the mapping \(\rho _{\epsilon _{1},\epsilon _{2},\mu}\):

$$ \rho _{\epsilon _{1},\epsilon _{2},\mu}(a_{ij})= \begin{cases} c_{d}, & a_{ij}< \epsilon _{1}, \\ c_{t}, & \epsilon _{1}\leq a_{ij}\leq \epsilon _{2}, \\ c_{s}, & \epsilon _{2} < a_{ij} < \mu , \\ c_{b}, & \text{otherwise}. \end{cases} $$
(4)

If \(\rho _{\epsilon _{1},\epsilon _{2},\mu}\) is applied to the normalized images, we obtain the result shown in Fig. 7(b). There are many isolated points due to the discontinuity of pixel values. Such points should be considered noise and reclassified. Based on Fig. 7(b), we apply a Gaussian blur \(G_{\sigma}\) and then calculate \(\rho _{\epsilon _{1},\epsilon _{2},\mu}\) again, obtaining the result displayed in Fig. 7(c). Most of the noise in both the orange and blue squares is removed. This process reflects the strokes more clearly, indicating which parts of the painting should have darker strokes and allowing each stroke's color to transition more fluently. Using this process, the normalized input is mapped to a segmentation image as illustrated in Fig. 3(e). We call this process “double classification”.
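A sketch of Eq. (4) and the double classification of Fig. 7 is given below. The threshold values, the blur scale, and the re-encoding of the first-pass classes to representative gray levels before the second pass are illustrative choices on our part; ε1 and ε2 are preset in practice.

```python
import cv2
import numpy as np

# class indexes (illustrative): 0 = background, 1 = shallow, 2 = transition, 3 = dark
C_B, C_S, C_T, C_D = 0, 1, 2, 3

def classify(gray: np.ndarray, eps1: float, eps2: float, mu: float) -> np.ndarray:
    """Eq. (4): map each pixel to a color-depth class."""
    seg = np.full(gray.shape, C_B, dtype=np.uint8)
    seg[(eps2 < gray) & (gray < mu)] = C_S
    seg[(eps1 <= gray) & (gray <= eps2)] = C_T
    seg[gray < eps1] = C_D
    return seg

def double_classification(gray, eps1=90, eps2=150, sigma=3.0):
    """Classify, blur, and classify again to merge isolated noisy pixels (Fig. 7)."""
    mu = gray.mean()
    first = classify(gray, eps1, eps2, mu)                   # first pass (Fig. 7b)
    # re-encode classes as representative gray levels so they can be blurred
    levels = np.array([255.0, (eps2 + mu) / 2, (eps1 + eps2) / 2, eps1 / 2])
    blurred = cv2.GaussianBlur(levels[first], (0, 0), sigma)  # Gaussian blur G_sigma
    return classify(blurred, eps1, eps2, mu)                  # second pass (Fig. 7c)
```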

Figure 7

Demonstration of double classification using Gaussian blur

The advantage of this algorithm is that it ignores the noise within each stroke and generates a continuous boundary between dark parts and light parts. With this algorithm, each stroke can have a more complete structure in color transition, giving more precise guidance to the downstream generation model.

3.4.2 Mapping guided generation

Chinese landscape paintings are known for their marvelous details, conveyed by the traces of ink and the varying pressure of each brushstroke. Inspired by SPADE [67], we design a structure-aware generation framework that preserves and recovers the integrity and style of the original painter. The model takes in segmentation mappings and trains each segmentation class in parallel with its class index. Similar to Pix2Pix, SPADE is trained with an initial learning rate of 0.0002 for 50 epochs and then a linearly decreasing learning rate for another 1000 epochs. It normally takes a day for the loss curve to saturate.

As depicted in Fig. 8, our model takes in the segmentation maps instead of plain RGB pictures. Each patch stands for a particular segmentation class and is colorized for visualization purposes. To ensure that the delicate details of the brush traces are preserved and translated accurately, we further design the following steps for the input segmentation maps.

Figure 8

Input segmentation mask with class labels

To provide better structure guidance for the generation model, a three-step segmentation mapping based on the XDoG edge map is introduced. First, the sketches are split into two classes based on the brightness of each pixel, and these classes are trained in parallel. Second, an extra transition layer is added between the two existing layers to create a smooth and natural transition between the hard and shallow strokes. Finally, Gaussian blur is applied to the image to merge discrete pixels into large continuous patches. This pipeline aims to provide more accurate and solid structure guidance for the generation model.

3.5 Inpainting integration

To improve the integration of the generated part with the original image, we intentionally inserted a mask between the joined parts and applied image inpainting to fill it. This approach allows the generated result to consider the overall consistency with the original image rather than solely relying on the structure from the imitation image.

We apply a model similar to CTSDG [68], which focuses on regenerating defective regions within an image while preserving its overall consistency. The model first transforms the input image into an edge map and then fills the corrupted part based on the edge map to ensure the structural correctness of the filled content. Finally, the filled edge map is combined with the texture of the original image. After investigating several edge detection methods, we change the edge detection model in CTSDG from Canny to RCF to further emphasize the structure and preserve some space for random generation.

To integrate the generated part seamlessly into the original image, we begin by placing the generated image into the missing part of the original image. Next, we add rectangle masks to the edge of the missing part with a fixed width. These masks are later filled with textures and structures generated by the inpainting model. The process is illustrated in Fig. 9. This allows the generated result to blend well with the original image and maintains overall consistency.
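This integration step can be sketched as follows, assuming the scroll is laid out horizontally and the generated section spans the column range [x0, x1). The band width and all names are illustrative; the returned mask is the region our CTSDG-style inpainting model is then asked to fill.

```python
import numpy as np

def insert_with_seam_masks(original: np.ndarray, generated: np.ndarray,
                           x0: int, x1: int, band: int = 40):
    """Place the generated part into the missing span [x0, x1) of the scroll
    and mark a fixed-width band around each junction for inpainting.
    The 40-pixel band width is illustrative, not the value used in our tests."""
    canvas = original.copy()
    canvas[:, x0:x1] = generated                       # drop in the generated region
    mask = np.zeros(canvas.shape[:2], dtype=np.uint8)
    for seam in (x0, x1):                              # left and right junctions
        mask[:, max(seam - band, 0):seam + band] = 255
    return canvas, mask
```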

Figure 9

Added masks and generated results

4 Results

Our method of Chinese landscape painting restoration (Fig. 1) displays outstanding performance. The restored image reveals fine-grained details at a full resolution of 2363 pixels in height. Moreover, the layout of the restored figures is consistent with the imitation painting, providing fundamental evidence of rationality and authenticity. Thanks to our customized inpainting method, the restoration reconciles seamlessly with the original painting. The fine-tuned tree styles are also coherent with the original, in both sketches and colorization.

To further validate the effectiveness of our method, we compare our result with the restoration result of WENXIN, a commercial tool created by Baidu.

To combat the inconsistencies between the generated image and the original ones, WENXIN leaves a space much wider than the actual missing part from the original painting, as shown in the arrow below the overall figure in Fig. 10. Its model is based purely on unsupervised image generation and pays little attention to the shape or style of the original image. Our research finds that WENXIN’s generation results are similar to a refracted copy of another sector of the painting from ZiMing Scroll. When interacting with users through sketch inputs, WENXIN also achieves less satisfactory results with respect to the coherence and consistency of objects.

Figure 10

Comparison of the overall effect and detail between WENXIN’s restoration result and ours

The style of the trees is also prone to be less consistent. In many cases, WENXIN generates merely a chaotic cloud of black ink where there should be trees, and the trees it does generate depict a vastly different style from the original. Our results, on the other hand, are built directly upon the imitation ZiMing Scroll, which guarantees that the general layout and object structure are coherent with the original image.

We also conduct a group t-test on the average scores. The t-statistic results are presented in Table 1. On every dimension except style consistency, our model outperformed WENXIN with statistical significance (p < 0.001).

Table 1 Comparison of average scores between WENXIN and ours. S.C. is the style consistency, Tra is the transition, and Dtl is the details (tree)

To further evaluate our restoration result, we conducted a parallel comparison survey against WENXIN's result. We distributed the survey to a diverse group of subjects and invited them to rate our model and WENXIN on four dimensions: structural integrity, style consistency, naturalness of transition, and detail restoration such as tree strokes. We used a scale of 1 to 10 for each dimension and collected confidence intervals for each rating.

A total of 233 subjects (excluding our group members) were involved in the survey, including 27% who are specialists in the field of arts, as shown in Fig. 11. As art specialists have more professional knowledge of Chinese landscape painting, as reflected in Fig. 12, their scores in the comparison survey are more convincing and credible. To balance objectivity and professionalism, we weighted the ratings from these specialists by a factor of 1.5.

Figure 11

Statistics of the survey on the comparison between WENXIN’s restoration result and ours. A total of 233 participants completed the survey, including 67 experts, students, practitioners of art, and 170 other non-art-related individuals

Figure 12

Score distributions of participants’ knowledge of Chinese landscape painting. Both sample groups are normally distributed in terms of scores. Generally, experts, students, and practitioners of art have higher scores than others, thus are able to offer more convincing scores in the comparison survey

According to the statistics in Fig. 11, both groups generally give higher scores to our result in terms of overall consistency, style consistency, transition interpretation, and detail restoration (in trees). In particular, art specialists give much higher scores to our results than to WENXIN's. Such ratings indicate that our model performs better on image restoration than WENXIN.

It is worth noting that WENXIN relies on the traditional approach of training a large model with a vast amount of data, which is a luxury for those with limited datasets. In contrast, our research is based only on the original painting and the imitation copy. The survey results (Fig. 11) confirm that, through careful design of our model, and despite a smaller dataset and limited computing power, our results are superior to those of WENXIN in all four aspects: overall effect, style consistency, transition, and details. It is also worth mentioning that our style is not only consistent within the generated image but also coherent with the style of the original Dwelling in the Fuchun Mountains, as illustrated by the gray-scale histogram analysis shown in Fig. 13.

Figure 13

Gray-scale histogram analysis

5 Experiments

5.1 Experimental setup

We conduct our experiments on the Chinese landscape painting Dwelling in the Fuchun Mountains, which contains two original parts and an imitation work.

WuYongShi Scroll and ShengShan Scroll are two sections from the original painting, with the middle part lost. These two parts act as the ground truth in our experiments. Our work mainly focuses on mimicking the style and gestures of these two paintings. Our electronic copy of WuYongShi Scroll has a resolution of \(89{,}911\times 4854\) pixels, while ShengShan Scroll has a resolution of \(8253\times 5197\) pixels.

ZiMing Scroll is a complete copy of the original painting, which was mimicked by an unknown artist in the 17th century or earlier. This painting is intact in image form, but it compromises the style of the original by mixing in this artist’s own painting style. This image has a resolution of \(41{,}588\times 2363\) pixels.

We crop the large paintings into \(256\times 256\) or \(512\times 512\) pixel images with a 75% overlap with each other. Smaller images are designed for rapid learning of calligraphic details and larger images are designed for further enhancements of structural integrity, as a tree can typically be covered completely in a \(512\times 512\) patch. We also apply basic data-enrichment methods such as flipping, rotating, and multi-scaling to further enrich the training sets. A combination of the OpenCV detection algorithm and OCR engine is utilized to remove the stamps and calligraphy on the painting to prevent irrelevant information from being fed into training. After image cropping and data augmentation, we created a dataset of 19,074 patches. Self-contained training on our own datasets guarantees the consistency and authenticity of the restoration, as Chinese landscape paintings vary greatly in style, even in the scope of the same author.
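A minimal sketch of the patch extraction is given below; flipping, rotation, multi-scaling, and seal removal are omitted. The 75% overlap corresponds to a stride of one quarter of the patch size, and the function name is illustrative.

```python
import numpy as np

def crop_patches(img: np.ndarray, size: int = 256, overlap: float = 0.75):
    """Slide a size-by-size window over the painting with the given overlap
    (75% overlap means a stride of size / 4), a sketch of our dataset construction."""
    stride = max(1, int(size * (1.0 - overlap)))
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(img[y:y + size, x:x + size])
    return patches
```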

5.2 Implementation detail

We split the pipeline into image translation, image inpainting, and detail enhancement. All the experiments are conducted with 4 NVIDIA GeForce RTX 3090 GPUs.

The style translation model is used to transform the style of the imitation copy into that of the original image. This model requires an aligned dataset. We feed the \(256\times 256\) image pairs into the model and set the starting learning rate to 0.0002. After training for 100 epochs, we linearly decrease the learning rate to 0 in approximately 1000 epochs. This process takes approximately 18 hours.
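This schedule, constant at 0.0002 for the first 100 epochs and then decaying linearly to zero, can be expressed with a standard PyTorch LambdaLR as sketched below; the helper name is ours and the epoch counts are parameters.

```python
import torch

def make_scheduler(optimizer: torch.optim.Optimizer,
                   n_keep: int = 100, n_decay: int = 1000):
    """Keep the initial learning rate for n_keep epochs, then decay it
    linearly to zero over n_decay further epochs."""
    def lr_lambda(epoch: int) -> float:
        if epoch < n_keep:
            return 1.0
        return max(0.0, 1.0 - (epoch - n_keep) / float(n_decay))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```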

The mapping guided generation model is used for detail enhancements. The number of base filters for the generator and discriminator is reduced to 48 to accommodate our small dataset.

The image inpainting model is introduced for the connection of the generated image and the original ones. This model first connects the edge map and then colorizes the sketches to restore the original style. This model is trained with an initial learning rate of 0.0002 and a fine-tuned learning rate of 0.00005, and requires approximately 40,000 epochs to saturate.

5.3 Evaluation metrics

Our main objective is to ensure that the generated imitation reconciles with the originals. Although commonly used quantitative evaluation metrics such as FID [69] and SSIM [70–72] exist, they are not appropriate for our goal because the patch we seek to generate lacks a ground truth against which it can be compared. Therefore, a subjective evaluation approach is adopted, in which we assess the degree to which the generated part is consistent with the originals based on human perception. When evaluating our results, we emphasize overall integrity, style consistency, transition coherency, and the details of the trees.

5.4 Ablation study

As our painter is hierarchical, we conduct ablation studies in each step to verify their effectiveness.

5.4.1 Structure extraction

The XDoG algorithm is chosen as our structure extraction tool. Among the common edge detection methods [52, 73–78], we perform a comparative analysis with two of the most typical models, i.e., Canny and RCF. All the datasets processed by each of the edge detection methods are fed into Pix2Pix and trained with the same parameters.

Figure 14 depicts the comparative results. The leftmost column shows the patches from the original copy and those from similar locations in the imitation copy. We select three representative objects for comparison: stones, houses, and mudflats. The results reveal that XDoG outperforms the other two methods in terms of style translation. This is because XDoG gives better control over detail selection from the perspective of structure extraction. It enhances the contours of the objects while omitting some surrounding decorative strokes. Such selection enables Pix2Pix to better interpret the style translation of the objects and leaves the model with enough degrees of freedom to imitate and generate decorative strokes. We can see that the style translation of the stones succeeds in learning the style of strokes and generating the surroundings. In contrast, the Canny algorithm fails to determine the surrounding edges of small objects; the continuity and integrity of its edges are poor, and it neglects abstraction, resulting in inadequate translation and color displacement. The RCF model, on the other hand, preserves too much structure information, which limits the model's interpretive freedom, resulting in noisy outputs that contain information irrelevant to the original copy.

Figure 14

Comparison of style translation in background details (stone, house, and mudflat) with different edge detection algorithms. Row A is the original content in ZiMing Scroll. Row B is the similar content in WuYongShi Scroll, showing the original style. The Edge row shows the edge detection results using Canny, RCF, and XDoG, and the Output row shows the corresponding results through Pix2Pix

In terms of the efficiency of preprocessing the datasets, XDoG also expresses superiority by extracting the structure in its entirety. In the preprocessing phase, XDoG can directly process the entire high-resolution painting, while RCF is limited by its excessive RAM occupation. Figure 15 depicts an example of inconsistencies in RCF edges on the boundaries of patches.

Figure 15

Comparison of the consistency of the boundaries. There is an evident trace at the center of the RCF’s result

5.4.2 Style translation

To quantitatively evaluate the effectiveness of our style translation model, we visualize the image features with the gray-scale histogram. The x-axis of the histogram represents the gray value ranging from 0 to 255, while the y-axis represents the number of pixels. As depicted in Fig. 13, the gray-scale histogram of the original image has a smoother overall contour than that of the imitation copy. The concave-convex positions are also different, with the original and generated images demonstrating convexity at approximately 130 gray levels, while the imitation image demonstrates concavity. Additionally, the pixel counts in specific gray values reach unexpected peaks, which are critical indicators of the painting style, including ink characteristics and brushstrokes, according to experts in the field of Chinese landscape painting.

Our ablation study reveals that our generated image demonstrates a smoother contour compared to the imitation copy. Moreover, the gray values of the unexpected pixel peaks in the generated image become closer to those of the original one. The distributions of the gray histograms between the generated and original images are similar. As a result, our style translation model has produced promising pixel value distribution results.
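The gray-scale histogram comparison can be reproduced with a few lines of NumPy, as sketched below. The L1 distance used as a summary statistic is an illustrative choice of ours, not a metric we report; the histograms themselves correspond to Fig. 13.

```python
import numpy as np

def gray_histogram(img: np.ndarray) -> np.ndarray:
    """256-bin gray-scale histogram, normalized to a distribution."""
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    return hist / hist.sum()

def histogram_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L1 distance between two gray-level distributions; a smaller value means
    the generated image matches the original's ink distribution more closely."""
    return float(np.abs(gray_histogram(a) - gray_histogram(b)).sum())
```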

5.4.3 Detail generation model

Our prior steps have already generated promising results. However, limited by the efficacy of their edge detection techniques, their capacity is too restricted to generate tree representations, which should exhibit diverse forms and styles. More specifically, they fail to discern and isolate the density and concentration of ink strokes in individual sections of the trees. Considering this, we incorporate another generative layer to pursue more detailed results.

To enhance the performance of our model, we design a 3-layer segmentation map accompanied by a layer of Gaussian blur. Figure 16 illustrates the comparison of different segmentation mapping processes, where (A) is the benchmark testing by inputting the segmentation map directly into Pix2Pix; (B) introduces our model with one segmentation map for the trees; and (C) differentiates stems from the leaves. From (A) and (B), it can be observed that our model outperforms its predecessor Pix2Pix. However, with the horizontal shadings’ information eliminated, this configuration fails to recover the horizontal brushstrokes of the trees. Segregating into two classes allows for discrepancies between shallow and dark brushstrokes, yet the transition is too stiff to be considered as a natural transition. A third layer is thus added to smoothen the transition between classes. Finally, the Gaussian layer merges the discrete pixels into continuous patches, which is crucial since the traces left by the brushstrokes should be continuous. The generated results also demonstrate significantly improved consistency and resistance against large chunks of trees, as demonstrated in (D4) and (E4) in Fig. 16. Through the three classes and the Gaussian layer, we manage to make the segmentation mapping remarkably close to the actual looks of brushstrokes.

Figure 16

Different stages of using the detail generation model: (A) is the benchmark testing by inputting the segmentation map directly into Pix2Pix; (B) introduces our model with one segmentation map for the trees; (C) differentiates stems from the leaves; (D) surrounds a transition layer around the border of the layers; (E) adds a Gaussian layer to introduce a more continuous result

The ablation test reveals the superiority of our segmentation layering design, which ensures smooth connections between hard and soft calligraphical strokes. Moreover, the extra Gaussian blur layer significantly enhances the structural integrity and fluency, resulting in high-quality synthesized images with impeccable details. Designing segmentation mapping in this way guarantees a good restoration patch that is faithful to the original.

5.4.4 Image inpainting

The image inpainting model is adopted to ease the transition between the original painting and the translated painting after stitching them together. The transition parts are cropped separately into \(256\times 256\) images with a 75% overlap. Masks that take up 15%, 30%, and 45% of the width of the cropped images are generated and tested. The results are demonstrated in Fig. 17. The leftmost column displays our input images cropped directly from the joint parts of the stitched image, where the seams are easy to spot. The three columns on the right show the results generated by the model with masks of different widths. These generated images illustrate our inpainting model's ability to coherently connect the content on both ends of the mask, eliminating the rigidity at the joints caused by direct stitching. However, when we scrutinize the detailed images, we find that while the results with wider masks are more coherent, they have difficulty generating detailed objects such as trees, stones, and houses that can match both sides of the mask. Thus, we conclude that the inpainting model achieves the best effect when the mask width is 15%.

Figure 17

Comparison of inpainting results with masks of different widths (15%, 30% and 45%)

5.5 Model comparison

To demonstrate the superiority of our model, we fed the same set of images into other state-of-the-art inpainting models for comparison. Figure 18 is the result of the stable diffusion inpainting model [9]. Figure 19 is the result of the LAMA-Fourier model [10]. Both models fail to achieve our goal of restoring Chinese landscape paintings because they are unable to achieve a high degree of consistency in the trend of mountains and the intricate shapes of trees. Traditional inpainting models either directly copy another patch from other parts of the painting, or merely stretch and extend existing resources. These actions will inevitably bring inconsistencies to the overall style and coherence of the painting. In comparison, our model, with results displayed in Fig. 20, achieves an outstanding performance from both perspectives.

Figure 18

The result generated using the Stable Diffusion Inpainting model: (A) is WuYongShi Scroll; (B) is generated; (C) is ShengShan Scroll

Figure 19

The result generated using the LAMA-Fourier model: (A) is WuYongShi Scroll; (B) is generated; (C) is ShengShan Scroll

Figure 20

The result generated using our model: (A) is WuYongShi Scroll; (B) is generated; (C) is ShengShan Scroll

6 Conclusion

In this paper, we present a novel hierarchical painter that aims to restore Chinese landscape paintings. Our proposed framework is capable of generating high-quality inpainting results and fine-grained details that closely resemble the original painting. By separating the background and details of the image, our model ensures that the overall structure is consistent while the details can exhibit stroke effects that mimic the brushwork of the original artist. Moreover, we discover an effective method for generating structures and translating styles that are specific to Chinese landscape painting. Our experimental results demonstrate that our hierarchical painter can successfully restore damaged Chinese landscape paintings to their original glory.

7 Discussion

Although our proposed method achieves satisfying results in restoring Chinese landscape paintings, there are still limitations that need to be addressed. First, it is impossible to generate the missing part completely based on the original image without referring to the information provided by the imitation copy. Therefore, the style of the generated part may be affected by the imitation copy. Additionally, the in-painted part lacks details, which may result in failure to fill in concrete objects that can truly connect the generated part and the original painting. Future research can explore and solve these issues to better restore paintings using artificial intelligence.

Additionally, during our experiments, we encountered a challenge when generating detailed trees using the SPADE model. The model struggled to automatically optimize the edges of the selected area, so the edges of the generated trees were restricted by the mask and contained many small irregular shapes. To address this issue, we explored recent developments in the field of visual intelligence, specifically the stable diffusion model (SDM). This text-guided image synthesis model builds upon the previous findings of the latent diffusion model [9]. With the correct parameters and datasets to fine-tune an existing generator, the SDM can generate remarkably natural and delicate images. Our experiments showed that the SDM can generate tree styles similar to those in the original painting, with smooth and streamlined edges (Fig. 21).

Figure 21

Generated image of the stable diffusion model

However, a drawback of the SDM is that it is challenging to construct appropriate constraints for the model, and it may be necessary to manually select results from thousands of output images that match the shape and overlap of the generated terrain features. This workload significantly exceeds the limits of our team, so we were unable to integrate this method into our pipeline. Nonetheless, recent research on diffusion models [79] shows the feasibility of constructing such constraints. Thus, the SDM may hold the key to a smooth and natural image synthesis that can faithfully restore any painting, given an accurate and informative network guide.