1 Introduction

Comic inpainting is essential for language localization and for transforming comics into animations. It involves removing speech balloons, textboxes, and sound effects while filling the occluded areas in a way that preserves the artwork's integrity. An example of this inpainting can be seen in Fig. 1. This paper introduces an automated approach to this task, leveraging the manga inpainting method introduced by Xie et al. [1] and adding automatic mask generation. The aim of this work is to extend the technique to western comics, enabling professionals in the comics and animation industries to convert western comics into animated stories while preserving their original style and narrative, and to facilitate language localization. For this study we selected Tex Willer comics, which combine Italian art with American Western imagery and are characterized by artist Aurelio Galleppini’s bold compositions and attention to detail. Chiaroscuro shading adds depth to scenes set in rugged landscapes, while character designs balance realism and idealization, conveying distinct personalities and moral integrity. Known for their enduring cultural significance, Tex Willer comics represent a visually captivating fusion of Western iconography and Italian comic art.

Fig. 1

Original Tex Willer comic and inpainted Tex Willer comic with automatic masking

Adapting inpainting techniques to western comics like Tex Willer poses challenges due to their distinctive style and structural differences from manga, and specialized methods are required to address these variations. This work aims to develop a model tailored to these intricacies that produces accurate and visually coherent results. Our contributions can be summarized as follows:

  • Customization of the Manga Restoration code [2] and the Manga Structural Line Extraction code [3] specifically for Tex Willer comics to ensure effective restoration and precise line extraction.

  • Building upon the technique introduced in Seamless Manga Inpainting with Semantics Awareness [1] to inpaint Tex Willer comics.

  • Automation of mask generation using deep CNN-based speech balloon detection and segmentation [4] to enhance efficiency.

  • Creation of a fully automated system for western comics inpainting, streamlining the process and eliminating manual preprocessing.

This paper is organized as follows: We start with a review of related studies in the State-of-the-Art section. Then, we explain our methodology, present our findings in the Results section, and conclude by highlighting the main outcomes of our work.

2 State of the art

2.1 Comic analysis and processing

Significant advances in comic processing have been made, particularly in the area of automated colorization and background generation while preserving the artistic style. Sykora et al. [5] introduced an automated technique for colorizing black-and-white cartoons, which significantly reduces the need for manual input. Similarly, Qu et al. [6] offered methods not only for coloring manga while retaining its intricate details and styles but also for creating bitonal manga backgrounds from color photographs to ensure stylistic consistency [7]. Ito et al. [8] have developed a technique to automatically generate binary masks for separating line drawings and screentones, using advanced filtering methods. Additionally, Wu et al. [9] introduced a system for automating manga screening, which aids in the production process by adding properly selected and shaded screentones to line drawings.

2.2 Restoration

Restoration technologies play a critical role in the comic inpainting process, aimed at converting scanned images into high-fidelity digital formats and enhancing the visual quality of legacy comics. Kopf and Lischinski presented an automated method that transforms scanned color comic books into high-quality digital versions [10]. Xie et al. [11] described a screentone variational autoencoder (ScreenVAE) for the bidirectional translation between screened and color-filled manga, improving restoration outcomes. Additionally, they developed a two-stage method focusing on the restoration of low-quality legacy manga images, utilizing a Scale Estimation Network (SE-Net) and a Manga Restoration Network (MR-Net) to address the challenges of degraded screentones [2].

2.3 Segmentation and extraction

Effective segmentation and extraction techniques are crucial for comic analysis. Zhang et al. [12] introduced the trapped-ball segmentation method for stable frame segmentation in cartoon vectorization. Ho et al. [13] utilized region growing and mathematical morphology to extract panels and detect text in speech balloons. Xu et al. [14] proposed a method that preserves meaningful content and textural edges while eliminating unnecessary textures. Matsui et al. [15] developed a content-based manga retrieval system that features margin area labeling and objectness-based edge orientation histogram (EOH) feature description, alongside screentone removal. Yao et al. [16] introduced a system for decomposing manga elements into screentone regions, borders, and shading. Liu et al. and Xie et al. [17, 18] have further advanced the field by combining texture feature analysis and local smoothness for precise boundary extraction and by developing automated methods for screentone manipulation, respectively.

2.4 Speech balloon extraction

Speech balloon extraction is integral to text extraction from comics for translation and automation in inpainting. Rigaud et al. [19] developed a method using an active contour model to detect speech balloons, adapting it to various shapes and conditions, including challenging scenarios with missing outlines. Dubray and Laubrock [4] enhanced the automatic analysis, detection, and segmentation of comic book page components, utilizing a modified U-Net architecture originally designed for medical image segmentation, demonstrating its effectiveness in identifying speech balloons and their tails.

2.5 Structural line detection and extraction

Structural line extraction is essential for colorization and inpainting tasks in comics studies. Mao et al. [20] introduced a novel method for detecting structural lines in low-quality cartoons. Additionally, Li et al. [3] tackled the challenge of extracting structural lines from intricate manga patterns using convolutional neural networks (CNNs), which also helps in reducing manual labeling efforts and effectively suppressing various screen patterns to yield clear and smooth lines.

2.6 Object detection and recognition

Object detection and recognition play a crucial role in inpainting while preserving the integrity of comics. Nguyen et al. [21] introduced a deep neural network approach for object detection in comics. Chu et al. [22] developed Manga FaceNet for detecting faces in manga, addressing the challenge of identifying character faces in comics. Ogawa et al. [23] introduced Manga109-annotations, a detailed dataset with over half a million annotations for the Manga109 image collection, and proposed SSD300-fork, a convolutional neural network (CNN) model tailored for detecting overlapped objects in comics.

2.7 Comics inpainting

Inpainting is crucial for transitioning comics into animations and for correcting artwork imperfections; our focus is on western comics. Sasaki et al. [24] introduced a CNN-based method for detecting and filling missing regions in line drawings, showing significant improvements over previous methods. Various approaches address different aspects of image inpainting, including techniques by Yu et al., Zeng et al., and Nazeri et al. [25,26,27]. Liu et al. [28] introduced a method using partial convolutions, effectively handling irregular masks and reducing artifacts without the need for additional blending. Tsubota et al. [29] explored screentone synthesis to improve manga image generation from line drawings, while Ren et al. [30] developed a two-stage network focusing on generating accurate structures and realistic textures. Ono, Aizawa, and Matsui [31] proposed techniques for handling high-frequency components in line drawings, enhancing inpainting quality. The PatchMatch algorithm by Barnes et al. [32] efficiently computes approximate nearest-neighbor matches between image patches. Xie et al. [1] introduced the Seamless Manga Inpainting with Semantics Awareness (SMISA) technique, which uses a semantic network to predict structural lines and screentones and an appearance synthesis network for region synthesis. This dual-task approach ensures detailed understanding and application, showing superior visual quality over traditional one-step methods.

3 Methodology

To restore Tex Willer comics, we leveraged a modified Manga Restoration code, adapted from Exploiting Aliasing for Manga Restoration (EAMR) [2]. This modification enabled effective restoration of Tex Willer comic images. Subsequently, we utilized a modified Manga Structural Line Extraction code derived from Deep Extraction of Manga Structural Lines (DEMSL) [3] to extract structural lines specifically from the Tex Willer comics.

Creating datasets for mask generation was a crucial step in our process. We employed two distinct methods: one dataset was generated using the speech balloon detection from Deep CNN-based Speech Balloon Detection and Segmentation for Comic Books (DC-SBDS) [4], while the other was created manually. These datasets played a pivotal role in assessing the capabilities of the manga inpainting code and in evaluating the success of the inpainting process.

For the inpainting phase, we employed a modified version of the Inpainting code from SMISA [1]. This adapted code was specifically designed to generate inpainted Tex Willer comics. Given an input comic image, its corresponding structural lines, and a mask, the code seamlessly produced an inpainted version, effectively removing speech balloons and textboxes while ensuring visually cohesive results.

The overall process, starting with raw data and ending with inpainted results, can be seen in Fig. 2.

Fig. 2

Schematic overview of the Tex Willer comics inpainting process: the diagram illustrates the step-by-step process involved in inpainting Tex Willer comics. Initially, three preprocessing steps are applied to the same dataset to generate three distinct datasets: the first restores the comic to high quality, the second extracts the structural lines from the comic, and the third automatically creates masks for the speech balloons in the comic. Subsequently, these three datasets are used to automatically generate an inpainted version of the comic, in which speech balloons are removed and the disoccluded areas are inpainted

3.1 Modification of manga restoration code

We modified the two-stage restoration framework of EAMR [2] to enhance its applicability to Tex Willer comics. The restoration process is divided into two stages: restorative scale estimation and discriminative restoration.

In the first stage, the problem is formulated in terms of an original manga image \(I_{gt}\) and a degraded one \(I_x\); the degradation is modeled as convolution with a blur kernel, downsampling by a scale factor, and the addition of noise. Restorative scale estimation is performed by the Scale Estimation Network (SE-Net), which takes \(I_x\) as input and employs the Convolutional Block Attention Module (CBAM). SE-Net is guided by a loss function \(\mathcal{L}_{SE}\) comprising a scale loss, which encourages the estimated scale factor \(s_y\) to align closely with the ground truth \(s_{gt}\), and a consistency loss, which ensures coherent estimates for patches from the same manga image.
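Schematically, and under our reading of EAMR [2] (the exact kernels, norms, and term weights are those of the original paper), the degradation model and scale-estimation loss can be written as:

```latex
% Degradation: blur with kernel k, downsample by scale s, add noise n
I_x = (I_{gt} \otimes k)\downarrow_s + n
% Scale estimation: scale term plus patch-consistency term,
% balanced by a weight \lambda treated here as a hyperparameter
\mathcal{L}_{SE} = \lVert s_y - s_{gt} \rVert_1 + \lambda \, \mathcal{L}_{consist}
```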

In the second stage, discriminative restoration is performed by the Manga Restoration Network (MR-Net). MR-Net takes the degraded manga image \(I_x\) and the scale factor (the ground truth \(s_{gt}\) during training, the estimated \(s_y\) at inference) as input and produces a confidence map \(M_c\) and the restored manga image \(I_y\). The Residual Attention Module (RAM) serves as the backbone unit of MR-Net. The loss function for this stage, \(\mathcal{L}_{MR}\), encompasses a pixel loss, a confidence loss, a binarization loss, an intensity loss, and a homogeneity loss. Together, these components ensure similarity with the ground truth, instill confidence in pattern-identifiable regions, generate bitonal pixels, maintain visual conformity, and uphold homogeneity within screentone regions.
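The composite restoration objective then takes the familiar weighted-sum form, sketched below with per-term weights \(\lambda_i\) that we treat as hyperparameters (the exact formulation of each term is given in EAMR [2]):

```latex
\mathcal{L}_{MR} = \lambda_{pix}\,\mathcal{L}_{pix}
                 + \lambda_{conf}\,\mathcal{L}_{conf}
                 + \lambda_{bin}\,\mathcal{L}_{bin}
                 + \lambda_{int}\,\mathcal{L}_{int}
                 + \lambda_{hom}\,\mathcal{L}_{hom}
```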

Overall, SE-Net estimates the scale factor, which MR-Net subsequently employs for discriminative screentone restoration. SE-Net focuses on downscaling estimation, accommodating variations in degradation across screentone regions, while MR-Net capitalizes on attention mechanisms and a comprehensive set of loss functions to achieve effective manga image restoration tailored to Tex Willer comics.
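At inference time the data flow reduces to a few lines; `se_net` and `mr_net` below are hypothetical callables standing in for the trained SE-Net and MR-Net:

```python
import numpy as np

def restore_page(page: np.ndarray, se_net, mr_net):
    """Two-stage restoration: estimate the downscaling factor, then restore."""
    s_y = se_net(page)                        # restorative scale estimation
    restored, confidence = mr_net(page, s_y)  # discriminative restoration
    return restored, confidence               # I_y and the confidence map M_c
```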

3.2 Adaptation of manga structural line extraction code

The Manga Structural Line Extraction code, adapted from DEMSL [3], was employed to extract structural lines from Tex Willer comics. This adaptation ensures an extraction process tailored to the distinctive features of western comic art. A pixel-wise CNN architecture was selected for its capability to align structural lines at the pixel level and preserve pixel intensity. The model, consisting of an encoding network and a decoding network in a downscaling-upscaling structure, was applied without further architectural changes. The downscaling network features three levels, each with a downscaling block followed by regular blocks, and the upscaling network mirrors this structure. Residual blocks enhance the depth of the model and aid information propagation. To preserve spatial relations and image details, convolution and deconvolution layers with strides are used instead of traditional max-pooling, and skip connections inspired by U-Net minimize information loss during dimension reduction; a sketch of such an architecture is given below. This adaptation highlights the efficient use of the existing CNN model for the specific requirements of extracting structural lines from Tex Willer comics.
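The following PyTorch sketch illustrates this downscaling-upscaling structure; the three-level layout, strided (de)convolutions, residual blocks, and skip connections follow the description above, but the channel widths are illustrative rather than the configuration of DEMSL [3], and we use additive skips where the classic U-Net concatenates:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Regular block with a residual connection to aid information propagation."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class LineExtractor(nn.Module):
    """Three-level encoder-decoder; strided (de)convolutions replace max-pooling."""
    def __init__(self):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(True), ResidualBlock(32))
        self.down2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(True), ResidualBlock(64))
        self.down3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(True), ResidualBlock(128))
        self.up3 = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.up2 = nn.ConvTranspose2d(64, 32, 4, 2, 1)
        self.up1 = nn.ConvTranspose2d(32, 1, 4, 2, 1)

    def forward(self, x):
        d1 = self.down1(x)                  # H/2
        d2 = self.down2(d1)                 # H/4
        d3 = self.down3(d2)                 # H/8
        u3 = torch.relu(self.up3(d3)) + d2  # skip connection
        u2 = torch.relu(self.up2(u3)) + d1  # skip connection
        return torch.sigmoid(self.up1(u2))  # pixel-wise line map in [0, 1]
```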

3.3 Inpainting approach for Tex Willer comics

The inpainting code, adapted from the manga inpainting framework [1], comprises two interconnected networks: a semantic inpainting network \(G_{inp}\) and an appearance synthesis network \(G_{syn}\). The process begins with the decomposition of the input manga image into a structural line map \(L\) and a screentone image \(I_s\) through a manga-tailored method. Simultaneously, the masked manga is decomposed into a masked structural line map \(L_M\) and a masked ScreenVAE map \(S_M\). The semantic inpainting network uses a recurrent updating approach, iteratively generating inpainted structural lines \(\tilde{L}\) and ScreenVAE maps \(\tilde{S}\) from the outermost contours to the interior. These inpainted maps serve as inputs to the appearance synthesis network, which incorporates a contextual attention design to enhance visual coherence with the surrounding screentones. The overall process comprises two major steps: semantic inpainting and appearance synthesis.

In the semantic inpainting phase, the masked manga is decomposed into structural lines \(L_M\) and screentone components \(S_M\). The semantic inpainting network \(G_{inp}\) predicts the disoccluded structures \(\tilde{L}\) and screentones \(\tilde{S}\) using a dual-branch structure that produces consistent semantic maps. A recurrent updating approach handles large disocclusions, performing iterative inpainting from the outer ring of the hole toward its center, as sketched below.
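A minimal NumPy sketch of this recurrent scheme follows; `g_inp` is a hypothetical callable standing in for \(G_{inp}\), `lines` is an H×W array, `screens` an H×W×C ScreenVAE map, and the ring width is an assumption of ours rather than a value from SMISA [1]:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def recurrent_inpaint(g_inp, lines, screens, mask, ring: int = 8):
    """Fill the hole ring by ring, from the outermost contour inward."""
    mask = mask.astype(bool)                  # True inside the hole
    while mask.any():
        inner = binary_erosion(mask, iterations=ring)
        ring_mask = mask & ~inner             # outermost ring of the hole
        pred_l, pred_s = g_inp(lines, screens, mask)
        lines = np.where(ring_mask, pred_l, lines)
        screens = np.where(ring_mask[..., None], pred_s, screens)
        mask = inner                          # shrink the hole and repeat
    return lines, screens
```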

The appearance synthesis step inpaints the manga based on the semantic guidance maps \(\tilde{L}\) and \(\tilde{S}\) using the appearance synthesis network \(G_{syn}\). Direct ScreenVAE reconstruction leads to visual inconsistencies, which prompted the adoption of a contextual attention design. The appearance synthesis is trained with four loss terms: a manga reconstruction loss, a ScreenVAE map loss, an adversarial loss, and a binarization loss.
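As with the restoration stage, the synthesis objective can be sketched as a weighted sum, with weights \(\lambda_i\) again treated as hyperparameters and the exact terms as defined in SMISA [1]:

```latex
\mathcal{L}_{syn} = \lambda_{rec}\,\mathcal{L}_{rec}
                  + \lambda_{svae}\,\mathcal{L}_{svae}
                  + \lambda_{adv}\,\mathcal{L}_{adv}
                  + \lambda_{bin}\,\mathcal{L}_{bin}
```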

We built upon the technique outlined in SMISA [1], implementing it specifically for Tex Willer comics to seamlessly remove speech balloons and textboxes while ensuring visual coherence in the inpainted results.

Fig. 3

Original Tex Willer comic, preprocessing outputs, and result comparison

3.4 Automation of mask generation

A notable enhancement was the automation of the mask generation process. In contrast to the manual approach used in SMISA [1] for manga inpainting, we utilized DC-SBDS [4] to automatically detect speech balloons and generate masks, streamlining the inpainting pipeline and improving efficiency. The speech balloon detection model is a fully convolutional network based on the U-Net architecture, with encoding and decoding branches. The encoding branch employs the convolutional part of the VGG-16 model, while the decoding branch performs upsampling with skip connections, L2 kernel regularization, and batch normalization; the implementation is carried out using the Keras library.
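A hedged Keras sketch of such a network is given below: the VGG-16 convolutional base serves as the encoder and the decoder upsamples with skip connections, L2 kernel regularization, and batch normalization, as described above; the input size, filter counts, and regularization strength are our own illustrative choices, not those of DC-SBDS [4]:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def balloon_unet(input_shape=(512, 512, 3), l2=1e-4):
    """U-Net-style speech balloon segmenter with a VGG-16 encoder."""
    base = keras.applications.VGG16(include_top=False, input_shape=input_shape)
    # One skip connection per VGG-16 stage, from fine to coarse.
    skip_names = ["block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3"]
    skips = [base.get_layer(n).output for n in skip_names]
    x = base.get_layer("block5_conv3").output
    for skip, filters in zip(reversed(skips), [512, 256, 128, 64]):
        x = layers.UpSampling2D()(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.BatchNormalization()(x)
    mask = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel balloon probability
    return keras.Model(base.input, mask)
```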

3.5 Development of a fully automated system

Through these modifications and adaptations, we have developed a fully automated system for western comics inpainting. With the enhancements we have implemented, the system operates with just a comic image as input, eliminating the need for any manual preprocessing: all data preprocessing tasks, including mask creation and structural line extraction, are performed automatically. The system integrates advanced inpainting techniques, automated mask generation, and customized restoration processes into a single pipeline.
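The resulting pipeline can be summarized in a few lines; the four callables below are hypothetical wrappers around our adapted EAMR [2], DEMSL [3], DC-SBDS [4], and SMISA [1] components, shown only to illustrate the data flow:

```python
import numpy as np

def inpaint_comic(page: np.ndarray, restore, extract_lines, detect_balloons, inpaint):
    """Fully automated inpainting: restoration, line extraction, masking, inpainting."""
    restored = restore(page)                # high-quality restored page
    lines = extract_lines(restored)         # structural line map
    mask = detect_balloons(restored)        # automatic speech-balloon mask
    return inpaint(restored, lines, mask)   # balloon-free, inpainted page
```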

4 Results

For our dataset, we selected five Tex Willer comic pages from the Internet Archive [33], which are available as open-source material. We generated restored versions of these pages using our adapted version of EAMR [2], as illustrated in Fig. 3a. Additionally, we extracted their structural lines using our modified implementation of DEMSL [3], as depicted in Fig. 3b. Automatic masks were created using our modified version of DC-SBDS [4], shown in Fig. 3c. For testing purposes, we also manually created masks, presented in Fig. 3d.

Table 1 Comparison of average PSNR, SSIM, and LPIPS between results inpainted with automatic and manual masks

We have generated two distinct result sets using our modified version of the SMISA method proposed by Xie et al. [1]. One set, showcased in Fig. 3e, was created with automatic masks, while the other, demonstrated in Fig. 3f, employed manual masks. Notably, the manual masks yielded nearly perfect results, whereas the automatic masks, while exhibiting some limitations, generally produced satisfactory outcomes.

For performance evaluation, we compare the results inpainted with automatic masks to those inpainted with manual masks using three metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). A higher PSNR value indicates better reconstruction fidelity; as shown in Table 1, inpainting with automatic masks is slightly superior, though the difference is marginal. Higher SSIM values imply better structural similarity between the inpainted and original images; here too, the results inpainted with automatic masks perform slightly better. Lower LPIPS scores indicate better perceptual similarity to the ground-truth images, as the metric evaluates the perceptual difference between two images using deep neural networks; again, the automatic masks score better than the manual ones. The higher scores of the automatic-mask results can be partly attributed to the automatic masking algorithm leaving some textboxes unmasked, whereas the manual masks do cover these textboxes, leading to more substantial changes to the pages and thus lower similarity scores. Despite this difference, the overall scores are very similar, demonstrating the effectiveness of automatic masking.
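The three metrics can be computed as in the sketch below, assuming 8-bit grayscale pages of identical size loaded as NumPy arrays; `lpips` is the reference implementation of the LPIPS metric:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def to_lpips_tensor(img: np.ndarray) -> torch.Tensor:
    """8-bit grayscale H×W array -> 1×3×H×W float tensor scaled to [-1, 1]."""
    t = torch.from_numpy(img).float().div(127.5).sub(1.0)
    return t.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)

def evaluate(ref: np.ndarray, out: np.ndarray) -> dict:
    return {
        "PSNR": peak_signal_noise_ratio(ref, out, data_range=255),
        "SSIM": structural_similarity(ref, out, data_range=255),
        "LPIPS": lpips.LPIPS(net="alex")(to_lpips_tensor(ref),
                                         to_lpips_tensor(out)).item(),
    }
```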

Our modified automatic masking code, based on DC-SBDS [4], encounters challenges in detecting textboxes positioned at the top of panels, and incomplete speech balloons may lead to minor inpainting errors. Despite these limitations, however, the code does not disrupt the overall integrity of the comics.

5 Conclusion

In conclusion, our work produced two sets of results using our adapted SMISA method [1]: one set uses automatic masks (Fig. 3c) and the other manual masks (Fig. 3d). The manual masking method gives nearly flawless results, showing its effectiveness. Our automatic masking, based on Dubray et al.’s work [4] on speech balloon detection, has trouble finding some textboxes and speech balloons, resulting in minor errors when filling incomplete speech balloons, but it maintains the comic’s overall integrity. Our evaluation metrics demonstrate the reliability of the approach: comparing the results obtained with automatic and manual masks shows only slight differences in PSNR, SSIM, and LPIPS scores, with the automatic masking method performing slightly better, indicating its effectiveness in preserving image fidelity and perceptual similarity. This highlights the robustness and potential of our approach for inpainting comics and drawings.