
1 Introduction

Despite the proliferation of 3D animations and artworks, 2D drawings and hand-drawn animations remain important art communication media. This is mainly because drawing in 2D is not tied to any constraining tools and gives artists the greatest freedom of expression. Artists usually work in three steps: first, they create the raw animation or drawing, which includes finding the right scene composition, character posture, and expression. Second, they refine the artwork, digitise it and clean the line-art. Finally, they add color or decorative textures, lights and shades. When working on numerous drawings, some of these steps can become quite tedious. To help with these time-consuming and repetitive tasks, researchers have tried to automate parts of the pipeline, for example by cleaning the line-art [36, 37], scanning [24], coloring [42, 52], and by developing image registration and inbetweening techniques [41, 49, 50].

Fig. 1. Our method takes as input a drawing of any resolution and estimates a plausible normal map suitable for creating shading effects. From left to right: input drawing and flat colors, normal estimation and two renderings with different lighting configurations.

In this paper, we consider the shading task. Besides bringing appeal and style to animations, shades and shadows provide important visual cues about depth, shape, movement and lighting [19, 29, 48]. Manual shading can be challenging as it requires not only a strong comprehension of the physics behind the shades but also, in the case of an animation, spatial and temporal consistency within and between the different frames. The two basic components required to calculate the correct illumination at a certain point are the light position with respect to the point and the surface normal. These normals are unknown in hand-drawn artwork. Several state-of-the-art approaches have tried to reconstruct normals and/or depth information directly from line-drawing [9, 12, 14, 17, 26, 29, 40, 43], however, most of these works seem to be under-constrained or require too many user inputs to really be usable in a real-world artistic pipeline.

We propose a method to estimate high-quality and high-resolution normal maps suitable for adding plausible and consistent shading effects to sketches and animations (Fig. 1). Unlike state-of-the-art methods, our technique does not rely on geometric assumptions or additional user inputs but works directly, without any user interaction, on input line-drawings. To achieve this, we have built a rich dataset containing a large number of training pairs. This dataset includes characters in different styles, varying from cartoon to anime/manga. To avoid tedious and labor-intensive manual labelling, we also propose a pipeline for efficiently creating a rich training database. We introduce a deep Convolutional Neural Network (CNN) inspired by Lun et al. [26] and borrow ideas from recent advances such as symmetric skipping networks [32]. The system is able to efficiently predict accurate normal maps from input line drawings of any resolution. To validate the effectiveness of our system, we show qualitative results on a rich variety of challenging cases borrowed from real-world artists. We also compare our results with recent state-of-the-art methods and present quantitative validations. Our contributions can be summarized as follows:

  • We propose a novel CNN pipeline tailored for predicting high-resolution normal maps from line-drawings.

  • We propose a novel tiled and multi-scale representation of input data for efficient, high-quality predictions.

  • We propose a sampling strategy to generate high-quality, high-resolution normal maps, and compare our approach with recent CNNs, including a fully convolutional network.

2 Related Work

Before introducing our work, we first review existing methods on shape from sketches. They can be classified into two categories: geometry-based methods and learning-based methods.

2.1 Inferring 3D Reconstruction from Line Drawings

Works like Teddy [15] provide interactive tools for building 3D models from 2D data by “inflating” a drawing into a 3D model. Petrovic’s work [29] applies this idea to create shades and shadows for cel animation. While this work reduces the labor of creating shadow mattes compared to traditional manual drawing, it also demonstrates that a simple approximation of the 3D model is sufficient for generating appealing shades and shadows for cel animation. However, it still requires extensive manual interaction to obtain the desired results. Instead of reconstructing a 3D model, Lumo [17] assumes that normals at the drawing outline are coplanar with the drawing plane and estimates surface normals by interpolating from the line boundaries to render convincing illumination. Olsen et al. [27] presented a very interesting survey on the reconstruction of 3D shapes from drawings representing smooth surfaces. Later, further improvements were made, such as handling T-junctions and cusps [18], or using user-drawn hatching/wrinkle strokes [6, 16] or cross-section curves [35, 46] to guide the reconstruction process. Recent works exploit geometric constraints present in specific types of line drawings [28, 34, 51]; however, these sketches are too specific to be generalized to 2D hand-drawn animation. In TexToons [40], depth layering is used to enhance textured images with ambient occlusion, shading, and texture rounding effects. Recently, Sýkora et al. [43] applied user annotations to recover a bas-relief with approximate depth from a single sketch, which they use to illuminate 2D drawings. This method clearly produces the best results, though it is not fully automatic and still requires some user input. While these state-of-the-art methods are very interesting, we feel that the bas-relief ambiguity has not yet been solved: high-quality reconstructions require considerable user effort and time, whereas efficient methods rely on too many assumptions to be generalized to our problem. Although the human brain is able to infer depth and shapes from drawings [5, 7, 21], this ability remains unmatched in computer graphics/vision using geometry-based methods.

2.2 Learning Based Methods

As pure geometric methods fail to reconstruct high-quality 3D from sketches or images without a large number of constraints or additional user input, we are not the first to consider shape synthesis as fundamentally a learning problem. Recent works approach the shape synthesis problem by predicting surface depth and normals from real pictures using CNNs [8, 31, 47]. While these works show very promising results, their inputs provide much more information about the scene than drawn sketches, such as textures, natural shades and colors. Another considerable line of work employs parametric models (such as existing or deformed cars, bikes, containers, jewellery, trees, etc.) to guide shape reconstruction through CNNs [3, 13, 30]. Recently, Han et al. [11] presented a deep learning based sketching system relying on labor-efficient user inputs to easily model 3D faces and caricatures. The work of Lun et al. [26], inspired by that of Tatarchenko et al. [44], is the closest to ours. They were the first to use CNNs to predict shape from line-drawings. While our method and network were inspired by their work, it differs in several ways: their approach uses multi-view input line-drawings, whereas ours operates on a single input drawing, which allows our method to also be used for hand-drawn animations. Moreover, we present a way of predicting high-resolution normal maps directly, while they only show results for predicting \(256\times 256\) depth and normal maps which are then fused into a 3D model. They also provide a detailed comparison between view-based and voxel-based reconstruction. More recently, Su et al. [39] proposed an interactive system for generating normal maps with the help of deep learning. Their method produces relatively high-quality normal maps from sketch input by combining a Generative Adversarial Network framework with user inputs. However, the reconstructed models are still low resolution and lack details. The high quality and high resolution of our results allow us to compete qualitatively with recent work, including animation and sketch inflation, for high-quality shading (see Sect. 2.1).

Fig. 2. System overview.

3 Proposed Technique

Figure 2 illustrates our proposed pipeline. The input to our system is a single arbitrary digital sketch in the form of a line drawing of a character. The output is an estimated normal map suitable for rendering effects, with the same resolution as the input sketch. The normals are represented as 3D vectors with values in the range \([-1, 1]\). The normal estimation relies on a CNN model trained only once offline.

The high-resolution input image is split into smaller tiles/patches of size \(256\times 256 \times 3\), which are then passed through the CNN for normal prediction and finally combined into a full-resolution normal map. The training phase of the CNN requires a large dataset of input drawings and corresponding ground truth normal maps. As normal maps are not readily available for 2D sketches or animations, such a dataset cannot be obtained manually, and we therefore make use of 3D models and sketch-like renderings (the Freestyle plugin in Blender©) to generate the training dataset.

Fig. 3. Structure of the input data. The target normal reconstruction scale is the blue channel; the two other channels provide an additional multi-scale representation of the local area.

3.1 Input Preparation

One key factor in the success of a CNN-based method is the preparation of the input data. With this in mind, we propose a new data representation to feed to our network. Instead of simply feeding tiles of the input line drawing to the encoder, we feed multi-scale tile representations to the CNN, each tile capturing a local area of interest of the sketch. This multi-scale representation prevents our network from missing important higher-scale details. A multi-scale tile example is shown in Fig. 3. Each channel of the tile represents the lines of the sketch in a local area at one of three different scales. In each channel, pixels on a line are represented by a value of 1. The first channel in each tile (blue channel in Fig. 3) represents the scale at which the normal vectors will be estimated.
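
As an illustration, such a multi-scale tile could be assembled as in the NumPy sketch below. The scale factors (1x, 2x and 4x the prediction scale), the zero padding at image borders and the max-pooling down-sampling are our own assumptions; the paper only specifies that the three channels cover the same local area at three different scales, with line pixels set to 1.

```python
import numpy as np

TILE = 256

def crop_padded(sketch, cy, cx, half):
    """Crop a (2*half x 2*half) window centred at (cy, cx), zero-padding
    where the window leaves the image (background pixels are 0)."""
    h, w = sketch.shape
    out = np.zeros((2 * half, 2 * half), dtype=sketch.dtype)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    out[y0 - (cy - half):y1 - (cy - half),
        x0 - (cx - half):x1 - (cx - half)] = sketch[y0:y1, x0:x1]
    return out

def block_max(img, s):
    """Down-sample by block max-pooling so one-pixel-wide lines survive."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).max(axis=(1, 3))

def multiscale_tile(sketch, cy, cx, scales=(1, 2, 4)):
    """Return a TILE x TILE x 3 tile; channel 0 (the 'blue' channel in
    Fig. 3) is the scale at which the normals are predicted."""
    channels = [block_max(crop_padded(sketch, cy, cx, half=s * TILE // 2), s)
                for s in scales]
    return np.stack(channels, axis=-1)
```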

Fig. 4. Overall structure of the proposed network.

3.2 CNN Model

Our main requirements for the CNN are that the output must be the same size as the input and that the extracted normals must be aligned pixel-wise with the input line drawing. This is why a U-Net based CNN is an appropriate choice. A typical pixel-wise CNN model is composed of two parts: an encoding branch and a decoding branch. Our proposed network is represented in Fig. 4, the top branch being the encoding network and the bottom branch the decoding network.

Encoder. The encoder, inspired by Lun et al. [26], compresses the input data into feature vectors which are then passed to the decoder. In the proposed network, this feature extraction is composed of a series of 2D convolution, down-scaling and activation layers. The size of every filter is shown in Fig. 4. Rather than using max-pooling for down-scaling, we use convolution layers with a stride of 2, as presented by Springenberg et al. [38]. As our aim is to predict normals, which take values in the range \([-1,1]\), our activation functions also need to let negative values pass through the network. Therefore, each group of operations (convolution/down-scale) is followed by a leaky ReLU (slope 0.3), avoiding the dying ReLU issue during training.
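
A minimal Keras sketch of such an encoder is given below. The filter counts and kernel sizes are placeholder assumptions (the actual values are those of Fig. 4 in the paper); what the sketch illustrates is the use of stride-2 convolutions instead of max-pooling and leaky ReLU activations with slope 0.3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters):
    x = layers.Conv2D(filters, 4, strides=1, padding='same')(x)
    x = layers.LeakyReLU(0.3)(x)
    x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)  # down-scale by 2
    x = layers.LeakyReLU(0.3)(x)
    return x

def build_encoder(input_shape=(256, 256, 3), widths=(32, 64, 128, 256, 512)):
    inp = tf.keras.Input(shape=input_shape)
    x, skips = inp, []
    for w in widths:
        x = encoder_block(x, w)
        skips.append(x)   # kept for the symmetric skip connections (Sect. 3.2)
    return inp, x, skips
```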

Decoder. The decoder up-samples the feature vector back to the original input resolution. Except for the last block, the decoder network is a series of blocks consisting of up-sampling layers (factor of 2), 2D convolutions (stride 1, kernel size 4) and activation layers (leaky ReLUs with slope 0.3). Similarly to U-Net [32], we use symmetric skip connections between layers of the same scale (between the encoder and the decoder) to reduce the information loss caused by successive dimension reductions. To do so, each level of the decoder is merged (channel-wise) with the block of the corresponding encoder level. Input details can then be preserved in the up-scaling blocks. The last block of operations only receives the first channel (blue channel in Fig. 3) for merging, as the other two channels would be irrelevant at this final level of reconstruction. Our output layer uses a hyperbolic tangent activation function since normals lie in the range \([-1,1]\). The last layer is an \(L_2\) normalization ensuring unit length of each normal.
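
The following sketch continues the encoder above and assembles a complete model: upsample-by-2 blocks with kernel-4 convolutions, channel-wise merging with the matching encoder level, merging of the first input channel at the last level, a tanh output and \(L_2\) normalization. Filter counts are again placeholder assumptions.

```python
def build_decoder(x, skips, inp, widths=(256, 128, 64, 32)):
    for w, skip in zip(widths, reversed(skips[:-1])):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])          # channel-wise skip merge
        x = layers.Conv2D(w, 4, strides=1, padding='same')(x)
        x = layers.LeakyReLU(0.3)(x)
    x = layers.UpSampling2D(2)(x)                    # back to 256 x 256
    blue = layers.Lambda(lambda t: t[..., :1])(inp)  # first input channel only
    x = layers.Concatenate()([x, blue])
    x = layers.Conv2D(3, 4, strides=1, padding='same', activation='tanh')(x)
    # L2 normalization so every predicted normal has unit length.
    return layers.Lambda(lambda n: tf.math.l2_normalize(n, axis=-1))(x)

inp, bottleneck, skips = build_encoder()
model = tf.keras.Model(inp, build_decoder(bottleneck, skips, inp))
```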

3.3 Learning

Training Data. Training a CNN requires a very large dataset. In our case, the network requires a dataset of line drawing sketches with corresponding normal maps. Asking human subjects to provide 2D line drawings from 3D models would be too labour-intensive and time-consuming. Instead, we apply an approach similar to Lun et al. [26], automatically generating line drawings from 3D models using different properties such as silhouette, border, contour, ridge and valley or material boundary. We use the Freestyle software implemented for Blender© [10] for this process. Using Blender© scripts we were able to automatically create high-resolution drawings from 3D models (42 viewpoints per 3D model), along with their corresponding normal maps. As we generate our normal maps from 3D models alone, all background pixels have the same background normal value of \([-1,0,0]\). This is used in the loss function computation (see Sect. 3.3). Tiling the full-resolution input images into \(256 \times 256\) tiles also provides substantial data augmentation. We extract 200 randomly chosen tiles from every drawing, making sure that every chosen tile contains sufficient drawing information. In this way, our relatively small dataset of 420 images (extracted from ten 3D models) is augmented to a training dataset of 84,000 elements.
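
A possible implementation of this tile-based augmentation is sketched below. The acceptance threshold (`MIN_LINE_PIXELS`) and the rejection-sampling loop are our own assumptions; the paper only requires that every chosen tile contains sufficient drawing information. The multi-scale channels of Sect. 3.1 would then be built around each accepted crop.

```python
import numpy as np

TILE, TILES_PER_IMAGE, MIN_LINE_PIXELS = 256, 200, 200

def sample_tile_pairs(sketch, normal_map, rng=None):
    """Return up to TILES_PER_IMAGE (sketch tile, normal tile) pairs."""
    rng = rng or np.random.default_rng()
    h, w = sketch.shape[:2]
    pairs, attempts = [], 0
    while len(pairs) < TILES_PER_IMAGE and attempts < 50 * TILES_PER_IMAGE:
        attempts += 1
        y = int(rng.integers(0, h - TILE))
        x = int(rng.integers(0, w - TILE))
        s_tile = sketch[y:y + TILE, x:x + TILE]
        if np.count_nonzero(s_tile) < MIN_LINE_PIXELS:  # too little drawing
            continue
        pairs.append((s_tile, normal_map[y:y + TILE, x:x + TILE]))
    return pairs
```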

Loss Function. During the training, as we are only interested in measuring the similarity of the foreground pixels, we use a specific loss function:

$$\begin{aligned} L = \sum _{p} (1 - N_e (p) \cdot N_t (p)) \times \delta _{p}, \end{aligned}$$
(1)

where \(N_e(p)\) and \(N_t(p)\) are the estimated and ground truth normals at pixel p respectively, and \(\delta _{p}\) ensures that only foreground pixels are taken into account in the loss computation, being 0 whenever p is a background pixel (i.e. whenever \(N_t = [-1,0,0]\)) and 1 otherwise.

We train our model with the ADAM solver [20] (learning rate 0.001) against the loss function defined above. The learning process ends when the loss converges.
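
For reference, Eq. (1) and the training setup translate to TensorFlow roughly as follows, assuming the background ground-truth normals are encoded as \([-1,0,0]\) as described above and reusing the hypothetical `model` assembled in the decoder sketch.

```python
import tensorflow as tf

BACKGROUND = tf.constant([-1.0, 0.0, 0.0])

def masked_normal_loss(n_true, n_pred):
    # delta_p in Eq. (1): 0 for background pixels (ground truth == [-1, 0, 0]).
    is_background = tf.reduce_all(tf.equal(n_true, BACKGROUND), axis=-1)
    delta = 1.0 - tf.cast(is_background, tf.float32)
    cosine = tf.reduce_sum(n_true * n_pred, axis=-1)   # N_t(p) . N_e(p)
    return tf.reduce_sum((1.0 - cosine) * delta)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=masked_normal_loss)
```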

Fig. 5. Normal map reconstruction of the input sketch (a) using direct naive sampling (b) and our multi-grid diagonal sampling (c). Sampling grids used in direct naive sampling (d) and multi-grid diagonal sampling (e).

3.4 Tile Reconstruction

As the network is designed to accept \(256\times 256\times 3\) input elements, high-resolution drawings have to be sampled into \(256\times 256\times 3\) tiles. These tiles are then passed through the network and the outputs are combined to form the expected high-resolution normal map. However, the deep reconstruction of normal maps is not always consistent across tile borders, and may be inaccurate when a tile misses important local strokes, as can be seen in Fig. 5(b). Direct naive tile reconstruction can therefore lead to inconsistent, blocky normal maps, which are not suitable for adding high-quality shading effects to sketches.

To overcome this issue, we propose a multi-grid diagonal sampling strategy: rather than processing only one grid of tiles, we process multiple overlapping grids, as shown in Fig. 5(e). Each new grid is created by shifting the original grid (Fig. 5(d)) diagonally, each time by a different offset. At every pixel location, the predicted normals are then averaged together to form the final normal map, as shown in Fig. 5(c). Diagonal shifting effectively removes the blocky (mainly horizontal and vertical) sampling artifacts seen in Fig. 5(b) when computing the final normal map. Increasing the number of grids also improves the accuracy of the normal estimation; however, the computational cost increases with the number of grids. We therefore measured the root mean squared error (RMSE) of the output normal map versus the ground truth as a function of the number of grids, and found that 10 grids offer a good compromise between estimation quality and efficiency (<1 s for a \(1K\times 2K\) px input image).
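
A sketch of this multi-grid reconstruction is shown below. The per-grid offsets, the handling of partial border tiles and the `predict_tile` wrapper (assumed to build the multi-scale tile around a region, run the CNN and return the normals for the valid pixels) are illustrative assumptions.

```python
import numpy as np

def reconstruct_normal_map(sketch, predict_tile, tile=256, n_grids=10):
    """Average per-pixel predictions from n_grids diagonally shifted grids."""
    h, w = sketch.shape[:2]
    acc = np.zeros((h, w, 3), dtype=np.float32)
    cnt = np.zeros((h, w, 1), dtype=np.float32)
    for g in range(n_grids):
        off = g * tile // n_grids                    # diagonal shift of grid g
        for y in range(-off, h, tile):
            for x in range(-off, w, tile):
                y0, x0 = max(y, 0), max(x, 0)
                y1, x1 = min(y + tile, h), min(x + tile, w)
                acc[y0:y1, x0:x1] += predict_tile(sketch, y0, x0, y1, x1)
                cnt[y0:y1, x0:x1] += 1.0
    normals = acc / np.maximum(cnt, 1.0)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.maximum(norm, 1e-8)          # re-normalise to unit length
```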

Fig. 6. Different types of shading: one can employ a stylized toon shader such as [2, 4, 23, 45], or a classic non-photorealistic technique imitating global illumination, such as diffuse and/or specular lighting, a Fresnel effect, etc.

3.5 Texturing and Rendering

Since only the normal maps are predicted, and the viewing angle remains the same, properties from the input image such as flat-colors or textures can be directly used for rendering. Once textured or colorized, any image based rendering can be applied to the sketch, as in Fig. 6(c), such as diffuse lighting, specular lighting or Fresnel effect (as presented in [33]). For a more stylized effect one can also apply a toon shader, as in Fig. 6(b).
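
As an example of such image-based rendering, a basic diffuse-plus-specular shader driven by the predicted normal map could look as follows; the light direction, view direction and shininess are arbitrary choices, not values from the paper.

```python
import numpy as np

def shade(flat_rgb, normals, light_dir=(0.5, 0.5, 0.7), shininess=32.0):
    """flat_rgb: HxWx3 colours in [0,1]; normals: HxWx3 unit vectors."""
    l = np.asarray(light_dir, dtype=np.float32)
    l /= np.linalg.norm(l)
    v = np.array([0.0, 0.0, 1.0], dtype=np.float32)   # viewer along +z
    h = (l + v) / np.linalg.norm(l + v)                # Blinn-Phong half vector
    diffuse = np.clip(normals @ l, 0.0, 1.0)[..., None]
    specular = np.clip(normals @ h, 0.0, 1.0)[..., None] ** shininess
    return np.clip(flat_rgb * (0.2 + 0.8 * diffuse) + 0.3 * specular, 0.0, 1.0)
```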

Fig. 7. Comparison of our normal estimation versus the ground truth normals of a 3D model. From left to right: the ground truth normal map, the sketch generated from the 3D model, our estimated normal map and the error map. This 3D model was not part of the training database.

4 Results and Discussion

In this section, we evaluate the full system using both qualitative and quantitative analysis. All the experiments are performed using line-drawings that are not included in our training database. Furthermore, as the training database only contains drawings with a line thickness of one pixel, we pre-process our input drawings using the thinning method proposed by Kwot [22].

We have implemented our deep normal estimation network using Python and Tensorflow [1]. The whole process takes less than 2.5 s for a \(1220\times 2048\) image running on an Intel Core i7 with 32 GB RAM and an Nvidia TITAN Xp. See Table 1 for a timing breakdown of each individual stage.

Table 1. Timings
Table 2. Test results: average error with L1, L2 and angular metrics using our test database of 126 1K \(\times \) 2K images. (FullyConv) fully convolutional network, (S2N) sketch to normals [39] with no user input on 256 \(\times \) 256 images, (OursNoMS) our method without multi-scale input, (OursMS) our method.

4.1 Quantitative Evaluation

We measure the accuracy of our method using a test database composed of 126 1K \(\times \) 2K images, created from 3D models converted into sketches with the same non-photorealistic rendering technique used to create our training database (see Sect. 3.3). This way, we can compare the result of our technique with the ground truth normals of the 3D models, which, of course, were not included in the training database.

Table 2 shows the numerical results of our quantitative tests, including our network with (OursMS) and without (OursNoMS) the multi-scale input, and a fully convolutional network (FullyConv) trained on our database with patch-wise training [25]. We also trained the network from [39] on our database; however, with the code provided we were only able to process images at low resolution. The results shown in Table 2 for this method (S2N) are for \(256 \times 256\) images, hence input images and ground truth normal maps had to be resized beforehand. For every metric, our method with multi-scale input was the most accurate. Also note that using our network in a fully convolutional way to process high-resolution images is not as accurate as using our tiling technique.
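
For reference, the three error measures can be computed as in the sketch below; restricting them to foreground pixels (ground-truth normal different from \([-1,0,0]\)) is our assumption, consistent with the masked loss of Sect. 3.3.

```python
import numpy as np

def normal_errors(n_est, n_gt):
    """Mean L1, L2 and angular (degrees) errors over foreground pixels."""
    foreground = ~np.all(n_gt == np.array([-1.0, 0.0, 0.0]), axis=-1)
    e, g = n_est[foreground], n_gt[foreground]
    l1 = np.abs(e - g).sum(axis=-1).mean()
    l2 = np.sqrt(((e - g) ** 2).sum(axis=-1)).mean()
    cosine = np.clip((e * g).sum(axis=-1), -1.0, 1.0)
    angular = np.degrees(np.arccos(cosine)).mean()
    return l1, l2, angular
```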

Figure 7 shows a test sketch along with the ground truth normal map, our reconstructed normal map with 10 grids and a colored error map. While the overall root mean square error measured on the normal map is relatively low, the main inaccuracies are located on the feet and on details of the face and the bag (possibly because there are no objects similar to the bag in our database). Also, as the reconstruction depends highly on the sketch strokes, minor variations and/or artistic choices in the input drawings can lead to different levels of accuracy. For example, artists commonly draw very simplified face details with very few strokes rather than realistic ones, which is especially visible in the nose, ears and eyes. In fact, the non-photorealistic method used to generate the sketches is already an estimation of what a sketch based on a 3D model could look like. We see this effect, for example, in the difference between the ground truth and predicted normals at the bottom of the jacket in Fig. 7, as some lines are missing in the sketch estimation; an artist might make different drawing choices.

Fig. 8. Shading effects with different light configurations.

4.2 Visual Results

To show the versatility of our method we test it with a set of sketches with very different styles. Figure 12 presents results for four of these sketches: column (a) shows the original sketch, (b) the estimated normal map, (c) the flat-shaded sketch, and (d) shows the sketch after applying shading effects. As shown in column (b), normal maps present extremely fine details, which are correctly predicted in difficult areas such as fingers, cloth folds or faces; even the wrench, in the third row, is highly detailed. Column (d) shows the quality of shading that can be obtained using our system: final renders produce believable illumination and shading despite the absence of depth information. Once the normals are predicted, the user is able to render plausible shading effects consistently from any direction. This can be seen in Fig. 8, where we show several shading results using the same input drawing, by moving a virtual light around the cat.

Fig. 9. Normals obtained with our method (top) and possible shading (bottom).

Fig. 10. Comparison with state-of-the-art methods. Input drawing (a), normals from the 3D reconstruction by [43] (b), our predicted normals (c), normal-map-based shading using Lumo [17] (d), 3D-like shading used in TexToons [40] (e), 3D reconstruction with global illumination effects [43] (f), and our approach (g). (3D model, line drawing and shading results from [43] kindly provided by the authors. Source drawing ©Anifilm. All rights reserved.)

Fig. 11. Line-drawing of the Utah teapot (left), normals obtained with our method (middle), and a shading effect (right).

Fig. 12. Our algorithm takes as input the artist's line-art (a) and estimates a high-resolution, high-quality normal map (b). Flat colours (c) can then be augmented with shading effects (d).

Moreover, our method can be applied to single drawings as well as animations, without the need for any additional tools, as it can directly and automatically generate normal maps for each individual frame of an animation. Our normal prediction remains consistent across all frames without the need to explicitly address temporal consistency or add any spatiotemporal filters. Figure 9 shows how shading is consistent and convincing along the animated sequence.

As our normal vectors are estimated using training data, errors can occur in areas of sketches that are not similar to any object found in the database, such as the folded manual in the third row of Fig. 12. Such artifacts can be minimised by increasing the variety of objects captured by the training data. Furthermore, when boundary conditions are only suggested rather than drawn in the input drawing, this can result in unwanted smooth surfaces linking elements. An example of this effect can be seen between the cat’s head and neck (Fig. 12, third row). The Ink-and-Ray approach presented in [43] handles such \(C^0\) and \(C^1\) boundary conditions (sparse depth inequalities, grafting, etc.) at the cost of additional user input. Finally, while most strokes in the sketch enhance the normal vector estimation, such as those around cloth folds, others only represent texture, and texture copying artifacts may appear when these are considered for normal estimation. However, an artist using our tool could easily avoid such texture copying artifacts by drawing texture on a separate layer.

In Fig. 10, we compare our method to other state-of-the-art geometry based approaches: Lumo [17], TexToons [40] and the Ink-and-Ray pipeline [43]. We created shading effects using our own image-based render engine and tried to match our result as closely as possible to the lighting from [43] for an accurate qualitative comparison. While both Lumo [17] and our technique create 2D normal maps to generate shades, Lumo requires significant user interaction to generate its results. As neither our technique nor Lumo creates a full 3D model, neither can add effects such as self-shadowing or inter-reflections. TexToons and Ink-and-Ray are capable of producing such complex lighting effects, however, again at the cost of significant user interaction. Relative depth ordering, added in TexToons via user input, allows for the simulation of ambient occlusion, while Ink-and-Ray requires considerable user input to reconstruct a 3D model sufficient for adding global illumination effects. However, as Fig. 10 shows, even without a 3D model our technique creates high-quality shading results without any user interaction, while also being very fast. Furthermore, the normal map estimated by Ink-and-Ray misses many of the finer details of the sketch. Our normal estimation, in comparison, is more accurate in areas such as folds, facial features, fingers, and hair.

The tiling method used to train the CNN has the effect that we learn normals of primitives rather than of full character shapes. Therefore we can also estimate normals for more generalised input data. Examples of this are shown in Fig. 12 (third row, wrench) and in Fig. 11: even though these objects are not represented in the training database, normals are correctly estimated, allowing us to render high-quality shading effects.

5 Conclusion and Future Work

In this paper we presented a CNN-based method for predicting high-quality and high-resolution normal maps from single line drawing images of characters. We demonstrated the effectiveness of our method for creating plausible shading effects for hand-drawn characters and animations with different types of rendering techniques. As opposed to recent state-of-the-art works, our method does not require any user annotation or interaction, which drastically reduces the labor of drawing shades by hand. Our tool could easily be incorporated into current animation pipelines to increase the efficiency of high-quality production. We also showed that using a network in a fully convolutional way does not necessarily produce the most accurate results, even with patch-wise training. We believe, and have demonstrated, that CNNs can further push the boundaries of 3D reconstruction and remove the need for laborious human interaction in the reconstruction process. While this work focuses on reconstructing high-fidelity normal maps, the CNN could be further extended to also reconstruct depth, as in [26], and therefore full 3D models, as in [43]. While we created a substantial training database, we believe that the predictions could be further improved, and the method's applicability extended, simply by enlarging the training database, for example with other types of drawings such as everyday objects. Dedicated CNNs could also be pre-trained and made available for different types of objects (such as the characters in our example here) if better customization is required.