Facial-sketch Synthesis: A New Challenge

This paper aims to conduct a comprehensive study on facial-sketch synthesis (FSS). However, due to the high cost of obtaining hand-drawn sketch datasets, there is a lack of a complete benchmark for assessing the development of FSS algorithms over the last decade. We first introduce a high-quality dataset for FSS, named FS2K, which consists of 2,104 image-sketch pairs spanning three types of sketch styles, image backgrounds, lighting conditions, skin colors, and facial attributes. FS2K differs from previous FSS datasets in difficulty, diversity, and scalability and should thus facilitate the progress of FSS research. Second, we present the largest-scale FSS investigation by reviewing 89 classic methods, including 25 handcrafted feature-based facial-sketch synthesis approaches, 29 general translation methods, and 35 image-to-sketch approaches. In addition, we conduct comprehensive experiments on 19 existing cutting-edge models. Third, we present a simple baseline for FSS, named FSGAN. With only two straightforward components, i.e., facial-aware masking and style-vector expansion, our FSGAN surpasses the performance of all previous state-of-the-art models on the proposed FS2K dataset by a large margin. Finally, we conclude with lessons learned over the past years and point out several unsolved challenges. Our code is available at https://github.com/DengPingFan/FSGAN.


Introduction
Facial-sketch synthesis (FSS) aims to generate grayscale sketches from RGB images of human faces (image-to-sketch, I2S) or the other way around (sketch-to-image, S2I) [1,2]. FSS is commonly used in law enforcement and surveillance to assist face recognition and retrieval based on a sketch drawn from an eyewitness account [1]. In addition, it is an attractive topic in digital entertainment [3] and is used in mobile apps such as TikTok and Facebook. Research into FSS has achieved significant progress over the past decade.
Unlike other face-related datasets, such as those for face recognition [4-6], face detection [7], facial key-point detection [8], face alignment [9], and face synthesis [10], which can be manually labelled by annotators with limited training, face sketch datasets are much more difficult to obtain because only professional artists can produce high-quality references. Due to the high cost of obtaining professional sketches, existing image-sketch datasets [1,2,17] are relatively small and of limited diversity. This dataset shortage has limited development, especially of data-hungry deep learning models.
Notes to Table 1: Att. = Attributes. In [18] and [19], CUFS is divided into 268 and 338 images for training and testing, respectively. For image resolution, we provide the width and height as W_avg ± W_std and H_avg ± H_std, respectively, where the subscripts avg and std denote the mean value and standard deviation.
In addition, how to evaluate FSS remains an open question. Structural similarity (SSIM) [20] is one of the most widely used metrics for evaluating image quality, so it is also typically used to assess the performance of S2I models. Nevertheless, the characteristics of facial sketches are very different from RGB-based facial images, which makes it challenging to apply the current evaluation metrics to I2S tasks. Therefore, a new objective and quantitative metric, which is also highly consistent with human assessment, is needed for benchmarking the FSS task.
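For reference, SSIM [20] compares two images through their means, variances, and covariance. The following minimal sketch computes a single global SSIM value over whole grayscale images. It is a simplification: the original metric averages the same statistic over local windows, and the function name `global_ssim`, the flat-list image representation, and the toy inputs are illustrative assumptions rather than part of any benchmark code.

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x, y: flat lists of pixel intensities (assumed representation).
    The whole image is treated as one window, unlike the windowed
    average used by the original metric.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    # Stabilizing constants with the commonly used K1 = 0.01, K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 2x2 "images", flattened row by row.
photo = [0.0, 50.0, 100.0, 200.0]
inverted = [255.0 - v for v in photo]
same_score = global_ssim(photo, photo)
inverted_score = global_ssim(photo, inverted)
```

An image compared with itself scores 1, while the inverted copy scores below zero because its covariance with the original is negative. This sensitivity to pixel-level intensity statistics is one reason such metrics transfer poorly to sparse line drawings.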
Moreover, due to the lack of high-quality datasets and proper evaluation metrics, different FSS models (e.g., [1,2]) are usually built and tested on different training datasets (because they aim to learn different sketch styles) and with different evaluation methods. Hence, it is not easy to provide fair and comprehensive comparisons. Furthermore, many cutting-edge transformation models (e.g., CycleGAN [21], UNIT [22], Pix2pixHD [23], SPADE [24], DSMAP [25], NICE-GAN [26], and DRIT++ [27]) designed for related image-to-image transfer tasks could potentially be employed in FSS tasks. However, as mentioned above, these models lack performance evaluations for the FSS task because of the shortage of datasets and evaluation metrics. Therefore, thorough comparisons and assessments of FSS-related models on a standard FSS dataset with unified evaluation metrics are long overdue. To this end, we have introduced and maintain an online paper list (https://github.com/DengPingFan/FaceSketch-Awesome-List) to track the progress of this fast-developing field.

Contributions
Our goal is to address the issues discussed above (i.e., limited datasets, metrics, and benchmarks) and to contribute a new challenge to the FSS community. The main contributions are as follows:
1) FSS Dataset. We build a new high-quality FSS dataset, termed FS2K. It is the largest publicly released FSS dataset (see Table 1), consisting of 2,104 image-sketch pairs with a wide range of image backgrounds, skin patches, sketch styles, and lighting conditions. In addition, we provide extra attributes, e.g., gender, smile, and hair style, to enable deep learning models to learn more details.
2) FSS Review and Benchmark. We conduct the largest-scale FSS study, reviewing 89 representative approaches, including 25 methods using handcrafted features, 29 models for the general transfer task, and 35 I2S transfer algorithms. Based on our FS2K, we adopt the SCOOT metric [29] and conduct a rigorous evaluation of 19 state-of-the-art models from the perspective of content and style.
3) FSS Baseline. We design an efficient GAN-based baseline, termed FSGAN, which consists of two simple core components, i.e., facial-aware masking and style-vector expansion. The former is utilized to restore details of the facial components, while the latter is adopted to learn different face styles. FSGAN serves as a unified baseline model for both I2S and S2I tasks (Fig. 1) on our newly built FS2K dataset. Our project is available at https://github.com/DengPingFan/FSGAN.
4) Discussions and Future Directions. In addition to an overall performance assessment, we also conduct an attribute-level evaluation, present detailed discussions, and explore some promising future directions.

Fig. 1 Left: Our FSGAN (I2S) learns from artist drawings and intelligently turns an input photo into a vivid face sketch. In contrast, the five cutting-edge style transfer approaches cannot obtain visually appealing results. Only UPDG [16] and Pix2pixHD [23] perform relatively well, but they generate worse content and style than FSGAN. Right: Given a sketch, our FSGAN (S2I) can also transform the input into a vivid facial photo. Meanwhile, the results from the five representative deep learning models are either structurally damaged (i.e., CycleGAN [21], NICE-GAN [26], and UGATIT [30]) or blurry (i.e., Pix2pix [31]). More results can be found in Figs. 8-11.

Related Works
This section first conducts a complete literature review of the existing FSS datasets. Then, in the second part, we discuss the taxonomy of facial-sketch synthesis and highlight particularly innovative and successful approaches for this task, including traditional facial synthesis, image-to-image translation, neural style transfer, and deep photo-sketch synthesis. The taxonomy of facial-sketch synthesis is shown in Fig. 3. A summary of the models, including their key innovations, datasets, code links, and citation information, can be found in Table 2 and Table 3.
CUFS [1] is one of the earliest and most commonly used datasets. It contains 606 photo-sketch pairs, which include 123 samples from the AR face database [33], 188 samples from the CUHK student database, and 295 samples from the XM2VTS database [34]. A sketch drawn by an artist and a corresponding photo are provided for each sample. Each photo is taken in a frontal pose under normal lighting conditions and maintains a neutral expression. All three sub-databases use solid backgrounds, e.g., cyan, white, or blue. However, real-world scenes are complex and diverse, and it is difficult to guarantee that photos will be captured in such a fixed environment. Besides, the sketches in this dataset were created by the same artist, so they are of limited style.
CUFSF [12] is a commonly used database for assessing the performance of FSS models. It contains 1,194 photo-sketch pairs, collected from the FERET database [35]. An artist drew all sketches after viewing the corresponding photo. CUFSF has a similar photo collection environment to CUFS but is more challenging because the photos undergo illumination changes, each face has low contrast with the background, and each sketch contains exaggerated shapes.
VIPSL [14] contains 200 face photos collected from the FRAV2D [36], FERET [35], and Indian face databases [14]. Unlike CUFS and CUFSF, VIPSL has five sketches for each face, drawn by five artists with different styles, while viewing the same photo under the same conditions as CUFS.
IIIT-Delhi [11, 37] consists of three types of sketch databases, including a viewed sketch database, a semi-forensic sketch database, and a forensic sketch database. All photos are derived from the CUHK student database and IIIT-Delhi Sketch database [11]. The first, the viewed sketch database, contains 238 sketch-digital image pairs, with all sketches drawn by a professional artist based on a given photo. The second sub-database has 140 sketch-face image pairs, where all the sketches are drawn from memory after the artist has observed the corresponding photo. The third, the forensic sketch database, consists of 190 sketches that a sketch artist draws according to the description of an eyewitness based on their recollection of a crime scene. IIIT-Delhi contains multiple styles of sketch portraits, making it more challenging. However, obtaining forensic sketches is tricky since they are usually derived from law enforcement.
Portrait Sketching Dataset. Yi et al. [16,17] provided two datasets that simulate artistic portrait drawing (APDrawing). The first dataset [17] contains 140 pairs of face photos and corresponding sketch portraits drawn by a single portrait artist. This was later extended to a larger dataset in [16], with 952 face photos and 625 portrait sketches. Of the collected photos, 220 are from three famous painters, and the remaining 212 photos are from a photography website. It is worth noting that the photos and portraits in this dataset are not paired. Disney Research published a portrait dataset [15] composed of 24 faces from the face database [38] and 672 sketches from seven artists under four levels of abstraction. Besides, they also provided each stroke as a transparent bitmap to be used later to create new sketches.
Unlike existing datasets, we provide a more challenging, high-quality, and attribute-annotated dataset, which is currently the largest FSS dataset. The new dataset contains 2,104 pairs of photos and sketches, of which 1,058 are used for model training and the remaining 1,046 for evaluation. The strengths of our FS2K include multiple drawing styles, highly accurate alignment between sketches and photos, multiple attribute annotations, and complex backgrounds. Detailed comparisons of the datasets are shown in Table 1.


Traditional Facial Synthesis
In the early years, researchers used heuristic image transformations to interactively or automatically synthesize facial sketches [3, 39-43]. However, these methods tend to generate artificial and inexpressive sketches that lack artistic style. Therefore, in recent years, more attention has been focused on learning-based facial synthesis schemes, whose taxonomy is shown in Fig. 3. These can be categorized into Bayesian inference models, representation learning models, subspace learning models, etc.

Bayesian Inference Models
Bayesian inference exploits evidence to update the states of the sketch components over probability models, which has been widely used in FSS [44]. In [45], Chen et al. first introduced an example-based facial-sketch synthesis system that uses a non-parametric sampling algorithm to learn subtle sketch styles. Later, the embedded hidden Markov model [46] was used to model the non-linear relationships in photo-sketch pairs, followed by a selective ensemble strategy to generate facial sketches [47]. Wang and Tang [1] followed a similar idea but considered face structures across different scales, using a multi-scale Markov Random Field (MRF) to build the relationships between photo-sketch pairs. Xu et al. [48] proposed a hierarchical compositional model that considers the regularity and structural variation of faces. These methods have made significant progress in generating sketches, but they only consider simple controlled conditions, ignoring variations in lighting and pose. Zhang et al. [49] addressed this issue by simultaneously considering patch matching, intensity compatibility, gradient compatibility, and shape priors, resulting in better visual effects. However, MRF-based models have two main drawbacks: (1) they struggle to synthesize unseen facial information and (2) their optimization is NP-hard. Zhou et al. [50] used Markov weight fields and cascaded decomposition to build a robust facial synthesis system, using a linear combination of candidate patches to approximate new sketch patches. Wang et al. [51] built a non-parametric model to transform a photograph into a portrait painting, where an MRF is used to enhance the spatial coherence of the style parameters, and an active shape model and a graph-cut model are used to learn the local information of facial features. Wang et al. [52] presented a transductive learning method to synthesize facial sketches, which employs an on-the-fly optimization process to minimize the loss of the given test samples. Peng et al. [53] designed a superpixel method built on the Markov model to improve the flexibility without dividing the photo into regular rectangular patches. Then, they not only used the Markov network to model the relationships between image patches but also retained many visual aspects of the cues (such as edges) through multiple visual features [54].

Subspace Learning Models
Subspace learning, which learns a low-dimensional manifold embedded in a high-dimensional space [55], has been widely studied in the FSS task [44]. Tang and Wang [56-58] proposed a series of example-based approaches based on the linear eigen-transformation method. These methods are global linear systems, and they cannot fully explain the relationships between photo-sketch pairs because the underlying transformation is not a simple linear one. Liu et al. [59] used local linear embedding (LLE) to handle this problem, assuming that photo and sketch patches lie on manifolds with similar local geometry in the two image spaces. However, pseudo-image generation and representation learning are divided into two independent processes, leading to sub-optimal results. Huang and Wang [60] proposed a joint learning framework, which contains domain-specific dictionary learning and subspace learning.
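The locally linear idea behind patch-based synthesis such as [59] can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: patches are assumed to be flattened into equal-length vectors, and the helper names (`solve`, `lle_weights`, `synthesize_patch`), the regularizer `reg`, and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def lle_weights(query, neighbors, reg=1e-3):
    """Weights (summing to 1) that best reconstruct `query` from `neighbors`."""
    diffs = [[q - v for q, v in zip(query, nb)] for nb in neighbors]
    k = len(neighbors)
    gram = [[sum(a * b for a, b in zip(diffs[i], diffs[j])) for j in range(k)]
            for i in range(k)]
    trace = sum(gram[i][i] for i in range(k))
    eps = reg * trace if trace > 0 else reg  # keeps the system well conditioned
    for i in range(k):
        gram[i][i] += eps
    w = solve(gram, [1.0] * k)
    total = sum(w)
    return [wi / total for wi in w]


def synthesize_patch(query_photo, photo_patches, sketch_patches, k=2):
    """Transfer the photo-space weights to the paired sketch patches."""
    def sq_dist(i):
        return sum((q - v) ** 2 for q, v in zip(query_photo, photo_patches[i]))
    nearest = sorted(range(len(photo_patches)), key=sq_dist)[:k]
    w = lle_weights(query_photo, [photo_patches[i] for i in nearest])
    dim = len(sketch_patches[0])
    return [sum(wi * sketch_patches[i][d] for wi, i in zip(w, nearest))
            for d in range(dim)]


# Toy training pairs: 2-D "photo patches" with paired 1-D "sketch patches".
photos = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
sketches = [[0.0], [1.0], [5.0]]
weights = lle_weights([1.0, 0.0], photos[:2])
result = synthesize_patch([1.0, 0.0], photos, sketches, k=2)
```

Here the query lies midway between its two nearest photo patches, so both receive a weight of about 0.5 and the synthesized sketch value is about 0.5. The key assumption is that weights estimated in photo space remain valid in sketch space.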

Representation Learning Models
Sparse coding and dictionary learning, a.k.a. representation learning, are used for the FSS task [44]. Ji et al. [61] demonstrated that personalized features are not effectively captured through the synthesis process. As such, several works [61-63] use different regression models, such as k-NN [61], Lasso [61], multivariate output regression [62], and support vector regression [63], to build the transformation between photos and sketches. To improve the quality of the generated facial sketches, Wang et al. [13,14] used local linear embedding (LLE) [64] to estimate an initial sketch or photo and then introduced a sparse multi-dictionary representation model that can focus on high-frequency and detailed information. However, most representation-based models assume that the same representations are shared by the source input and the target output, limiting a particular style's local structures in the synthesis process. To relax this constraint, Wang et al. [65] introduced a semi-coupled dictionary learning method, in which a linear transformation is used to bridge the gap between two different domain-specific representations. Gao et al. [14] also took a two-step algorithm [63] into consideration, presenting a selection scheme to generate the initial pseudo-images and introducing a sparse-representation-based enhancement (SRE) to synthesize sketches.

Combination Models
Recently, some works have explored combination models, which combine different machine learning models, e.g., combining Bayesian inference and subspace learning methods. Berger et al. [15] proposed a model to simulate the styles of different artists and the process of abstraction, which can be used for facial-sketch synthesis. Song et al. [66] introduced a real-time FSS method, which first uses a k-NN algorithm to find the top-k similar local patches. Then a linear combination is used to compute the corresponding sketch image, and image denoising is adopted to enhance the visual quality. However, the model [66] is still time-consuming due to the k-NN process, so Wang et al. [67] addressed this problem by replacing offline random sampling with an online scheme that is further combined with a recognition weight representation. Most existing traditional methods are entirely dependent on the scale of the training data, so Zhang et al. [68] presented a robust model trained on a template stylistic sketch. The model includes representation learning, MRF, and a cascaded model. Li et al. [69] proposed a free-hand sketch synthesis method, combining a perceptual grouping model with a deformable stroke model. The work in [70] introduces an adaptive learning method that combines representation learning and a Markov network. Men et al. [71] proposed a common framework for interactive texture transfer with structure guidance. Their model implements the synthesis process dynamically using multiple channels, including structure extraction, structure propagation, and guided texture transfer.

General Image Synthesis
Deep facial-sketch synthesis belongs to the task of image generation. Therefore, general image synthesis methods, such as image-to-image translation and neural style transfer, can also be used to generate facial sketches. Below, we give an overview of various cutting-edge transformation models.

Image-to-Image Translation
Image-to-image translation (I2I) [72] is a hot topic in computer vision and machine learning. The goal is to transform the input image from a source domain to a different target domain while retaining the intrinsic source content and transferring the extrinsic target style. Current I2I models are typically built on a generative adversarial network (GAN) [73]. They can be generally categorized into supervised and unsupervised I2Is.
Supervised I2I. Supervised I2I uses aligned image pairs as the source and target domains to learn a transformation model that can convert the source image into the target image. One representative I2I method is Pix2pix [31], which applies a conditional GAN (cGAN) [74] to the task. The main difference from the original cGAN is that the generator in Pix2pix is a U-Net [75]. However, Wang et al. [23] observed that the adversarial training in Pix2pix is unstable, preventing the model from generating high-resolution images. Therefore, they extended the original Pix2pix with a new feature matching loss, yielding a model that can generate high-resolution images of size 2048 × 1024. Zhu et al. [76] proposed the BicycleGAN, which includes a conditional VAE and a conditional latent regression GAN, to resolve the mode collapse problem and achieve improved performance. Furthermore, to reduce the loss of semantic information in the Pix2pixHD model [23], Park et al. [24] introduced a SPADE-based generator, which adds spatially-adaptive normalization into the generator of Pix2pixHD so as to enhance the semantic information throughout the network.
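As a rough illustration of how a supervised I2I generator can be trained on aligned pairs, the sketch below combines a conditional adversarial term with a pixel-wise L1 term, in the spirit of Pix2pix [31]. It is a simplified, framework-free example: the function name `generator_objective`, the flat-list image representation, the non-saturating form of the adversarial term, and the default weight `lam=100.0` are assumptions made for illustration.

```python
import math


def generator_objective(d_fake, fake, target, lam=100.0):
    """Adversarial term plus weighted L1 reconstruction term.

    d_fake: discriminator probabilities in (0, 1) for generated outputs.
    fake, target: flat lists of pixel values for the generated image and
                  its aligned ground truth (assumed representation).
    lam: weight of the L1 term (assumed default).
    Returns (total, adversarial, l1).
    """
    eps = 1e-12  # avoids log(0)
    adversarial = -sum(math.log(p + eps) for p in d_fake) / len(d_fake)
    l1 = sum(abs(f - t) for f, t in zip(fake, target)) / len(fake)
    return adversarial + lam * l1, adversarial, l1


# A generated sketch that matches the target exactly incurs only the
# adversarial penalty; a mismatched one is dominated by the L1 term.
perfect = generator_objective([0.5], [0.2, 0.4], [0.2, 0.4])
mismatch = generator_objective([0.5], [0.0, 1.0], [1.0, 1.0])
```

The L1 term is what exploits the pixel-level alignment of paired data. Unpaired settings cannot use it directly, which motivates the cycle-based constraints of unsupervised I2I.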
Unsupervised I2I. Collecting paired data is often impractical because it is labor-intensive. Therefore, several unsupervised I2I models have been proposed to train two different generative networks under the constraint of a cycle-consistency loss. For example, if we convert a zebra image to a horse image and then back to a zebra image, we should get the same input image back. Examples include CycleGAN [21], DiscoGAN [77], and DualGAN [78]. Later, Liu et al. [22] proposed an unsupervised I2I model (UNIT), in which the same latent code in a shared latent feature space can represent image pairs in different domains. Kim et al. [30] later proposed a novel attention module with a new normalization function, which they integrated into a GAN model to supervise texture and shape variations flexibly. By rethinking the standard GAN model, Chen et al. [26] proposed NICE-GAN with the key idea of coupling discriminators and encoders, i.e., reusing the discriminator parameters for encoding the input. Zhao et al. [79] proposed ACL-GAN, which utilizes a new adversarial consistency loss instead of a cyclic loss to emphasize the commonality between the source and target domains. To improve the content representation ability, Chang et al. [25] proposed DSMAP to leverage the relationship between content and style. Specifically, the model maps content features from a shared domain-invariant feature space into two separate domain-specific feature spaces. Furthermore, DRIT++ [27] uses two image generators, two content encoders, a content discriminator, two attribute encoders, and two domain discriminators to embed an image into a domain-invariant content space and a domain-specific attribute space. Besides, Jiang et al. [80] proposed two-stream I2I translation (TSIT) to learn both semantic structural features and stylistic features and then fuse the feature maps of the content and style in a coarse-to-fine manner. More recently, Zhang et al. [81] proposed CoCosNet for exemplar-based image translation, which contains two sub-networks. The first embeds the inputs from different domains into a feature domain that depends on the semantic correspondence. Meanwhile, the second uses a series of denormalization blocks to progressively synthesize the target images. Zhou et al. further extended CoCosNet with full-resolution semantic correspondence learning [82], with the main difference being the use of a regular and GRU-based propagation applied iteratively at each semantic level. More recently, Chen et al. [83] proposed SofGAN, which decouples the portrait feature into a geometric feature and a texture feature. These two features are then fed into two network branches. The first branch is a hyper network that decodes the geometric feature into the weights of the SOF net, which represents the semantic occupancy field (SOF) in 3D space. Then, a segmentation map is rendered via a ray-casting-marching scheme using the output features of the SOF net. The second branch is a texture transformation of each semantic region using a GAN generator with a style code sampled from the texture space. Finally, a novel Semantic Instance Wise (SIW) StyleGAN module is used to stylize the generated segmaps and output a photorealistic portrait regionally.
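The cycle-consistency constraint used by models such as CycleGAN [21] can be written down compactly. The following is a minimal sketch with toy stand-ins for the two generators. The function names, the flat-list image representation, and the toy "generators" are illustrative assumptions; real models use neural networks and add adversarial terms.

```python
def mean_abs_error(a, b):
    """Mean absolute (L1) difference between two flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b):
    """L1 cycle loss: |G_BA(G_AB(a)) - a| + |G_AB(G_BA(b)) - b|.

    g_ab, g_ba: callables mapping a flat image to the other domain.
    batch_a, batch_b: unpaired lists of flat images from each domain.
    """
    loss_a = sum(mean_abs_error(g_ba(g_ab(a)), a) for a in batch_a) / len(batch_a)
    loss_b = sum(mean_abs_error(g_ab(g_ba(b)), b) for b in batch_b) / len(batch_b)
    return loss_a + loss_b


def invert(img):
    """Toy generator: intensity inversion, which is its own inverse."""
    return [1.0 - v for v in img]


def collapse(img):
    """Toy degenerate generator: ignores its input entirely."""
    return [0.5 for _ in img]


photos = [[0.25, 0.75], [0.0, 1.0]]
sketches = [[0.5, 0.25]]
consistent = cycle_consistency_loss(invert, invert, photos, sketches)
degenerate = cycle_consistency_loss(invert, collapse, photos, sketches)
```

With two mutually inverse mappings the loss vanishes, whereas a generator that discards its input is penalized. Note that no photo in `photos` needs a matching sketch in `sketches`, which is what makes the constraint usable on unpaired data.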

Neural Style Transfer
Neural style transfer (NST), which aims at generating visually appealing images via neural networks, has been introduced into the FSS task [84]. Specifically, NST is used to render a content image in different styles. NST methods can be categorized into optimization-based methods and model-based methods.
Optimization-based methods. The online NST algorithm iteratively updates a given input image to match the desired CNN features, including the photo's content and artistic style information. Gatys et al. [87,88] made the first contribution to this field, using a classical CNN (i.e., VGG [89]) to render an image with famous painting styles. Besides, StyleGAN [90] uses a latent space to maintain consistent results for image synthesis. However, it is challenging to achieve promising results under the given conditions. Recently, Abdal et al. [91] integrated the classical NST [87,88] into the StyleGAN model, using NST to project the input image into the latent space defined in StyleGAN. Then, Kotovenko et al. [92] further enhanced the classical NST [87,88] by optimizing parameterized brushstrokes, an approach built on a simple differentiable rendering mechanism.
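In the classical formulation of Gatys et al. [87,88], style is summarized by correlations between feature channels (a Gram matrix), while content is compared position by position. The sketch below illustrates this distinction on toy feature maps. The function names, the list-of-channels representation, and the normalization constants are assumptions for illustration and differ from the constants used in the original work.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of a feature map.

    features: list of C channels, each a flat list of N activations.
    Normalized by C * N (an illustrative choice).
    """
    c, n = len(features), len(features[0])
    return [[sum(a * b for a, b in zip(features[i], features[j])) / (c * n)
             for j in range(c)] for i in range(c)]


def style_loss(feat_generated, feat_style):
    """Mean squared difference between the two Gram matrices."""
    g, s = gram_matrix(feat_generated), gram_matrix(feat_style)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)


def content_loss(feat_generated, feat_content):
    """Mean squared difference between activations at matching positions."""
    total = sum((a - b) ** 2
                for ch_g, ch_c in zip(feat_generated, feat_content)
                for a, b in zip(ch_g, ch_c))
    return total / (len(feat_generated) * len(feat_generated[0]))


# Two toy feature maps with identical activations at shuffled positions.
feat_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
feat_b = [[3.0, 1.0, 2.0], [6.0, 4.0, 5.0]]
style_gap = style_loss(feat_a, feat_b)
content_gap = content_loss(feat_a, feat_b)
```

Because the Gram matrix sums over spatial positions, shuffling positions consistently across channels leaves the style loss at zero while the content loss becomes positive. Optimization-based NST minimizes a weighted sum of both terms with respect to the image pixels.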
Model-based methods. Optimization-based online methods achieve satisfactory results, but there are still some limitations. One major drawback is the slow computational speed and high cost of online iterative optimization. To address this issue, several works introduce a feed-forward network to mimic the optimization objective of style transfer [84].
End-to-end models can be divided into those that design a basic deep neural architecture and those that introduce a new loss function. For basic architectures, Johnson et al. [93] took advantage of the benefits of the neural network and optimization-based NST model and proposed a method for training a feed-forward network using a new perceptual loss. TextureNet [94] follows a similar idea but with a different neural network architecture. Both [93] and [94] are real-time style transfer methods. Chen and Schmidt [95] introduced a style swap operation to exchange the patches with visual context and those with style, further formulating a new optimization objective that aims to learn an inverse neural network for arbitrary style transfer. In terms of methods based on the loss function, CartoonGAN [85] was presented to transfer real-world photos into cartoon-style images. It consists of two novel loss functions designed to preserve clear edge information and cope with the stylistic difference between photos and cartoons.

Deep Photo-Sketch Synthesis
Deep photo-sketch synthesis is a recent branch of the FSS task, in which deep learning is used to improve performance and quality. The related works can be divided into three categories. The first aims to translate any sketch images into their corresponding RGB images. The second tries to convert any RGB images into sketch images. The last mainly focuses on facial-sketch synthesis.
General S2I. Xian et al. [130] proposed the TextureGAN model to synthesize an image under the supervision of a sketch, color, and texture. TextureGAN consists of a ground-truth [...]
The work in [131] proposed the first deep stroke-level photo-to-sketch synthesis method, which is a hybrid model with a shortcut cycle consistency constrained by a VAE-style reconstruction loss. Under their default settings, both I2I and NST models can synthesize artistic portrait drawing (APD) images. However, they do not meet practical requirements because APD images usually have a highly abstract style and graphic elements. Therefore, Yi et al. [2] proposed APDrawing to transform an input face image into its corresponding APD image, in which a hierarchical GAN model is built by combining both a global and a local network. Then, they further proposed APDrawing++ [17], in which they used an auto-encoder to refine subtle facial features and presented a novel line continuity loss to enhance the line continuity of APDrawing. However, both of these APDrawing methods require pair-wise data for training. To handle this problem, Yi et al. proposed an asymmetric cycle-structure GAN [16], which contains a relaxed forward cycle consistency loss (a.k.a. truncation loss) to prevent the reconstructed photo from being noisy, and a strict cycle consistency loss to enhance the performance. This method also uses multiple local discriminators to ensure the quality of the facial portrait drawings. Different from portrait drawing, Wang et al. [146] observed the behavior and properties of cartoon paintings and proposed three different representations considering surface, texture, and shape information, respectively. In addition, they also released the new SketchyCOCO dataset to better train and evaluate the performance of their model. Based on Pix2pix, Li et al. [144] designed a two-branch network (called im2Pencil) to implement photo-pencil translation, which can simulate sketch outlines and shadows. Wang et al. [155] presented a GAN sketching method to rewrite a GAN with one or more sketches. This new method uses regularizations to preserve the original GAN's diversity and image quality while matching the generated sketch images with users' needs through a cross-domain adversarial loss. Bhunia et al. [156] introduced a new transformer architecture, consisting of two networks, to generate diverse yet realistic creative sketches. The first, a locator network, aims to capture the coarse structure by observing the relationship between local patterns. The second, a sketcher network, follows the standard GAN and aims to synthesize high-quality sketches.
Photo-Sketch Synthesis. Zhang et al. [126] were the first to use a fully convolutional neural network (FCNN) to build a deep photo-to-sketch synthesis model. Then, the works [108,129,134] integrated deep features into probabilistic graph model learning, achieving better performance than traditional models [1,50]. To make the network more flexible, Zhang et al. [133] took the key idea of CycleGAN and proposed a novel pGAN, which uses a special parametric Sigmoid activation function to reduce the effects of photo priors and illumination variations. To improve the quality of generated photos and sketches, Wang et al. [135] introduced a synthesis method using multi-adversarial networks (PS2MAN). Their model uses two U-Nets to generate high-quality images from low to high resolution. To achieve the same goal, Zhang et al. [18] further proposed facial-sketch synthesis by multi-domain adversarial learning (MDAL), which overcomes the defects of blur and deformation. The basic idea behind MDAL is the concept of "interpretation through synthesis", which is built upon two diverse generators. Kazemi et al. [137,138] proposed an improved version of CycleGAN, which focuses on the facial attributes during the portrait synthesis process. Zhang et al. [140,142] introduced two methods that combine an auto-encoder with traditional subspace learning and are more effective than the traditional FSS methods. Besides, Zhu et al. [141] proposed a collaborative framework that exploits the interaction information of two opposite generators by introducing a collaborative loss. However, it is difficult to train a good model due to the lack of large-scale training data. Therefore, Zhu et al. [143] proposed using classical knowledge distillation to learn two well-defined student mapping networks via two strong teacher networks. More recently, the works in [151, 152] introduced identity-aware models, which use a new perceptual loss to train a better image generative model, and thus consider the downstream task, e.g., face recognition, as the final goal. Yu et al. [150] proposed a new composition-assisted generative adversarial network, which helps synthesize realistic facial sketches/photos by using facial composition information. By leveraging the relationships between features, [154] implemented a multi-scale self-attention residual learning framework for face photo-sketch conversions. Finally, the method proposed in [153] does not need any images from the source domain for training, enabling it to leverage both deep features (extracted from the CNN) and handcrafted features flexibly.

Proposed FS2K Dataset
In this section, we introduce the proposed FS2K. Some example images are shown in Fig. 2. We describe FS2K in terms of two key aspects, namely data collection and data annotation. Overall, FS2K includes 2,104 photo-sketch pairs, which are split into 1,058 for training and 1,046 for testing. The complete dataset is available at https://github.com/DengPingFan/FS2K.

Data Collection
To establish a long-lasting benchmark, the data should be carefully selected to cover diverse scenes from different views, such as lighting conditions, skin colors, sketch styles, and image backgrounds. To this end, we introduce FS2K, a new high-quality dataset for the FSS task.
Our FS2K includes 2,104 photos from real scenes, the Internet, and other datasets.The majority, however, come from CASIA-WebFace [158], which is a large-scale (i.e., 500K images) labelled dataset of faces in the wild.CASIA-WebFace was collected from the IMDb6 website and contained well-organized information, such as name, gender, and birthday.Thanks to the rich and clean open-source data from CASIA-WebFace, it could be used to build our high-quality and representative benchmark.We manually selected 1, 529 images to cover a large span of major challenges faced in realistic scenes, such as varying background, hairstyle (e.g., long, short), accessories (e.g., glasses, earrings), and skin information (e.g., patch image on a given face).Because the photos selected in CASIA-WebFace are taken from a single angle, multi-angle face images for the same person are missing.To this end, we invited eight actors to take 98 photos under different settings (e.g., lighting conditions, face angles).In addition, to further increase the diversity, we also collected some children's photos and some faces with smaller face-to-image ratios.The remaining 477 face photos come from other free stock photos websites, including Unsplash,7 Pexels,8 Pngimg, 9and Google.

Data Annotation
There are four types of annotations in our FS2K, including sketch drawing, sketch style, color, and contour feature annotations.

Sketch Drawing
Participants. Three senior artists (two male and one female) from the Sichuan Fine Arts Institute were hired to participate in the study. All three participants had normal or corrected-to-normal vision. None of the participants suffered from color-blindness or color-weakness. The participants ranged in age from 20 to 23 years, with an average of five years of professional experience in sketch drawing.
Apparatus. The three artists drew all sketch images with the assistance of a Copy Table LED Board. Fig. 5 shows the copy table we used and an example (Fig. 5-d) of a face sketch drawn by our artists. The touch switch region in our device supports three levels of adjustable brightness, so the artists can use the button to set the brightness they desire. This helped them locate the contours of facial features according to the photo information from the bottom of the LED board. Moreover, this equipment also helped to ensure content similarity and face alignment between sketches and corresponding photos. At the same time, the drawings retain the artist's sketch style.

Sketch Style Annotation
Our FS2K contains three different styles, which enrich the diversity of sketches, as shown in Fig. 6. This enables different artists' skills to be captured while making FS2K more challenging than previous FSS datasets.
We created a balanced training set to facilitate the comparison of different methods, i.e., the training images are distributed almost equally across the three styles. Specifically, in the training set, the numbers of samples with style 1, style 2, and style 3 are 357, 351, and 350, respectively. In the test set, they are 619, 381, and 46, respectively.
(Footnote 11) Fig. 5-a presents the copy table, which has an LCD backlight. It requires a high voltage input of 100 ∼ 240 V and 0.6 A working current. Its size is A4 (i.e., 300 × 200 × 3.5 mm) in Fig. 5-b, and the luminous intensity is 300 ∼ 350 LM. Therefore, it has become the most popular copy table product, after the aluminum alloy copy table, for animators (see Fig. 5-c).

Facial Feature Annotation
Sketches are rapidly executed freehand drawings, which have less attribute information than the original images, e.g., facial texture, facial expressions [159], and facial posture. Therefore, it is challenging to restore real images (i.e., the S2I task) based on a single sketch image. Meanwhile, in real-world applications, we can use auxiliary facial information (such as gender, accessories, and hairstyle) to narrow down a suspect in a database. Following [160], we added some additional facial feature annotations, including gender, smile, face pose, hair condition, hair color, earring, and skin texture. We hired two data annotators to label all photos and performed cross-checking to ensure the accuracy of the final annotations. Overall labels can be found in Table 4, while the details of each are described below.
Gender. Gender is a high-level human attribute commonly used in traditional face databases such as CelebA [28] and LFW [161]. It has been extensively studied in face detection and recognition [162-164]. Therefore, we carefully labelled all photos in FS2K with gender attributes. Specifically, there were 574 male photos and 484 female photos in the training set, and 632 male photos and 414 female photos in the test set.
Smile. Smiling is a primary human activity that represents a positive emotional state. As such, many studies have focused on smile detection [165,166] or used smile as an attribute for recognition [167]. Therefore, we also treat smiling as a key attribute in our dataset. Specifically, the training set contains 645 smiling people and 413 with no obvious expression, while the test set contains 670 smiling people and 376 with no expression. We ensured that the proportion of smiling people in the training and test sets was as close as possible.
Face Pose. The facial attributes may cover only a small part of the image, but the photo is usually dominated by the effects of pose [168]. Moreover, pose will affect the performance of face recognition [169], tracking [170], and synthesis [171]. Therefore, the facial pose is useful auxiliary information. We define a portrait with the head rotated within 30 degrees as a frontal face pose. According to this definition, the training set has 917 frontal photos, while the test set has 872. The remaining photos have side face poses.
Hair Status and Color. Hair is a salient feature of the head that may change in different situations. Even if there is sufficient information in the internal features of the face for recognition, manipulating the hair can harm the performance [172,173]. Moreover, facial synthesis and retrieval systems often use hair as an important cue [174,175] to improve the quality of generated images. For FSS, although the sketches contain the hair contour, the corresponding color information and hair status (with or without hair) are missing. Therefore, in FS2K, we provide annotations of the hair status, which includes four available colors (i.e., black, brown, red, and blond) and another status (i.e., bald or wearing a hat), as shown in Fig. 4. In other words, for faces with hair, we mark the color information directly, while cases of thinning hair or wearing a hat are marked as separate attributes. The statistical results of this annotation can be found in Table 4.
Earrings. The simplified characteristics of sketch drawings lead to unclear earring contours. Meanwhile, as shown in Fig. 4, earrings in real photos are visible. Therefore, in FS2K, we provide annotations for whether earrings are present, which can help the model training. Specifically, the training set has 209 people with earrings, and the test set has 187.
Skin Texture. Skin texture provides a large amount of detailed local information and is used as a vital feature for face recognition [176,177]. However, this critical information is completely lost in sketch images. Therefore, we clip a small patch from the real photo and use it as the skin texture, as shown in Fig. 6. We also include the average RGB value for the corresponding lip and eyeball region to provide more information for future research.
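The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers quoted in the text; the dictionary and function names (`SPLIT_SIZES`, `check_counts`, `share`) are hypothetical and are not part of the FS2K release.

```python
# Split sizes and counts copied from the text; all names here are illustrative.
SPLIT_SIZES = {"train": 1058, "test": 1046}

# Photo sources: CASIA-WebFace, the eight invited actors, and stock-photo sites.
SOURCE_COUNTS = {"casia_webface": 1529, "actors": 98, "stock_sites": 477}

# Annotations whose categories partition each split.
PARTITIONS = {
    "style": {"train": (357, 351, 350), "test": (619, 381, 46)},
    "gender": {"train": (574, 484), "test": (632, 414)},
    "smile": {"train": (645, 413), "test": (670, 376)},
}


def check_counts():
    """Raise AssertionError if any reported count is inconsistent."""
    assert sum(SOURCE_COUNTS.values()) == sum(SPLIT_SIZES.values()) == 2104
    for name, per_split in PARTITIONS.items():
        for split, counts in per_split.items():
            assert sum(counts) == SPLIT_SIZES[split], (name, split)


def share(count, split):
    """Fraction of a split that carries a given attribute."""
    return count / SPLIT_SIZES[split]


check_counts()
print("smiling share:", share(645, "train"), share(670, "test"))
print("frontal share:", share(917, "train"), share(872, "test"))
```

The same helper makes it easy to see that the three styles are nearly uniform in the training split but not in the test split.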

Problem Definition
Facial synthesis (FS) aims to generate target representations of human faces based on the given inputs. This process can be formulated as $X_o = F(X_i)$, where $X_i$ and $X_o$ denote the input and output (e.g., RGB images and sketches) of facial representations, and $F$ indicates the synthesis function. In this paper, based on the overall architecture of [2,17], we design the baseline, FSGAN, for both the I2S task and the S2I task, inspired by pix2pixHD [23]. Instead of focusing on direct image-level facial synthesis, we propose a two-stage "bottom-up" facial synthesis architecture, as shown in Fig. 7. Hence, our FSGAN consists of two cascaded stages built upon multiple generative models (i.e., GANs).
The first stage comprises five parallel GANs, which are designed to synthesize the local facial components separately. Given an input, four facial regions (i.e., left eye, right eye, nose, and mouth) and the rest of the input are cropped and fed into their corresponding GANs in the first stage to synthesize key facial features. These synthesized facial component patches are then stitched together to obtain the intact facial representation. Since the local facial patches are synthesized independently, the connecting regions of the stitching, as well as the appearances of the patches, are inconsistent with each other. Therefore, the second stage is introduced to further refine the results by considering the global structure and texture. In this stage, the style vectors of the facial sketches are utilized to assist the synthesis.

Facial Components Synthesis
Almost all human faces have the same global structure. The differences lie in the details of the local facial components, such as eyes, eyebrows, nose, and mouth. To capture more details of different facial components, the first stage of our model synthesizes them separately. Specifically, given a facial input, the four key patterns, including the left eye, right eye, nose, and mouth, are first detected by MTCNN [178]. The input $X_i$ is then divided into five parts, $X_{parts} = \{X_{leye}, X_{reye}, X_{nose}, X_{mouth}, X_{rest}\}$, based on the detection results. These include the left eye, right eye, nose, mouth, and remaining components. Five parallel GANs are utilized to synthesize the corresponding patches for these parts. Therefore, the problem can be formulated as one synthesis sub-problem per part.
First, the four GANs synthesizing the left eye, right eye, nose, and mouth have the same architecture. Each GAN consists of a generator and a discriminator. The generator is designed as an encoder-decoder, consisting of an encoder, a bottom connection, and a decoder. The encoder is composed of three convolutional blocks, each of which is a combination of a convolutional layer (with a kernel size of 3 and stride of 2), a batch normalization layer, and a ReLU activation layer. Meanwhile, the bottom connection consists of nine bottleneck residual blocks that are similar to those in [179]. Finally, the decoder is built upon three deconvolutional blocks, each consisting of a deconvolutional layer, a batch normalization layer, and a ReLU activation layer. Note that the GAN used for synthesizing $X_{rest}$ is similar to the previously described ones. However, its encoder contains four convolutional blocks, and its decoder comprises four deconvolutional blocks to achieve larger receptive fields.
The discriminators of the above five GANs are the same. Each consists of three cascaded convolutional layers (with a kernel size of 3 and stride of 2) followed by global average pooling.
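The generator described above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the released FSGAN code: the channel widths (`base=32`), the padding choices, the internals of the `Bottleneck` block, and the final 1 × 1 convolution with tanh are all assumptions, since the text only fixes the number of blocks, the kernel size, and the stride. The class and function names are hypothetical.

```python
import torch
from torch import nn


def conv_block(c_in, c_out):
    # Convolution (kernel 3, stride 2) + batch norm + ReLU, as described in the text.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


def deconv_block(c_in, c_out):
    # Transposed convolution that doubles the resolution (hyperparameters assumed).
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class Bottleneck(nn.Module):
    """ResNet-style bottleneck residual block (1x1, 3x3, 1x1); details assumed."""

    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))


class PartGenerator(nn.Module):
    """Encoder, nine bottleneck blocks, decoder; depth=3 for eyes/nose/mouth, 4 for the rest."""

    def __init__(self, in_ch=3, out_ch=1, base=32, depth=3, n_res=9):
        super().__init__()
        widths = [base * 2 ** i for i in range(depth)]
        encoder, c = [], in_ch
        for w in widths:
            encoder.append(conv_block(c, w))
            c = w
        self.encoder = nn.Sequential(*encoder)
        self.bottom = nn.Sequential(*[Bottleneck(c) for _ in range(n_res)])
        decoder = []
        for w in reversed([base] + widths[:-1]):
            decoder.append(deconv_block(c, w))
            c = w
        self.decoder = nn.Sequential(*decoder)
        # Output head (1x1 conv + tanh) is an assumption; the text does not specify it.
        self.head = nn.Sequential(nn.Conv2d(c, out_ch, 1), nn.Tanh())

    def forward(self, x):
        return self.head(self.decoder(self.bottom(self.encoder(x))))
```

With these padding choices, the output resolution equals the input resolution whenever the input height and width are divisible by `2 ** depth`, e.g. `PartGenerator()(torch.zeros(1, 3, 64, 64))` has shape `(1, 1, 64, 64)`.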
Then, a 1 × 1 convolutional layer and a sigmoid function are used to predict the probability of the generated results being real or fake.
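A discriminator of this shape might look like the sketch below. The channel widths and the LeakyReLU activations between the convolutions are assumptions (the text specifies only three stride-2 convolutions, global average pooling, a 1 × 1 convolution, and a sigmoid), and `PartDiscriminator` is a hypothetical name.

```python
import torch
from torch import nn


class PartDiscriminator(nn.Module):
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        # Three cascaded convolutions with kernel size 3 and stride 2.
        # The LeakyReLU activations are an assumption, not stated in the text.
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.head = nn.Conv2d(base * 4, 1, kernel_size=1)  # 1x1 convolution

    def forward(self, x):
        score = self.head(self.pool(self.features(x)))
        # One probability per sample that the input is real.
        return torch.sigmoid(score).flatten(1)
```

Because of the global pooling, the same module accepts patches of different sizes, which fits the use of one design for eyes, nose, mouth, and the remaining region. For a conditional discriminator that sees the input and the output together, `in_ch` would be the sum of their channel counts.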
Based on the above design, the first stage of FSGAN can restore details of the facial components in both the I2S and S2I tasks. At the end of this stage, the synthesized patches are stitched together to restore the intact facial synthesis result $X_{intact}$. Since different generators synthesize the patches, their overall appearances are inconsistent, which becomes even more obvious in the stitched result. To address this, the stitched result is then fed to the next stage to adjust and refine the global structure and appearance.
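The crop-and-stitch bookkeeping of the first stage can be illustrated as follows. This is a sketch under assumptions: the boxes would come from a landmark detector such as MTCNN in the actual pipeline, but here they are hypothetical `(top, left, height, width)` tuples, and zeroing the component regions in the "rest" image is an assumption rather than something the text states.

```python
import torch


def split_parts(image, boxes):
    """Split a (C, H, W) image into component patches plus a 'rest' image.

    boxes maps a part name to (top, left, height, width); values are illustrative.
    """
    parts = {
        name: image[:, t:t + h, l:l + w].clone()
        for name, (t, l, h, w) in boxes.items()
    }
    rest = image.clone()
    for t, l, h, w in boxes.values():
        rest[:, t:t + h, l:l + w] = 0  # mask the component regions (assumption)
    parts["rest"] = rest
    return parts


def stitch_parts(parts, boxes):
    """Paste the (synthesized) patches back onto the 'rest' image."""
    canvas = parts["rest"].clone()
    for name, (t, l, h, w) in boxes.items():
        canvas[:, t:t + h, l:l + w] = parts[name]
    return canvas
```

In the real model each entry of `parts` would first pass through its own generator; with identity "generators", `stitch_parts(split_parts(image, boxes), boxes)` simply reproduces `image`, which is a convenient sanity check for the box handling.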

Facial-Sketch Synthesis
To address the inconsistency issue of the output from the first stage, we introduce the second stage, which is designed as another GAN model inspired by Pix2pixHD [23], for local detail refinement and global structure adjustment.
In this stage, we use the multi-scale discriminators $D_{fs}$ and the coarse-to-fine generator $G_{fs}$ following Pix2pixHD [23]. Specifically, the generator $G_{fs}$ consists of two sub-networks $G_1$ and $G_2$, both of which follow an encoder-decoder architecture, as shown on the right part of Fig. 7. We sample the output of the first stage using a downsampling operation with a sampling rate of 50%. This newly sampled image $X^{1/2}_{intact}$ (height/2, width/2) is then fed into the first sub-network $G_1$, which is designed to capture global features. The other sub-network $G_2$, which takes the output of the first stage as input, is employed to capture the local details. We use both concatenation and element-wise addition operations to fuse the style, local, and global information. Specifically, the concatenation combines the style feature map and the output of $G_1$ and generates a new fused feature map. Then, the element-wise addition is utilized to combine this new feature map with the latent feature of the encoder part of $G_2$. Finally, we use the decoder part of $G_2$ to generate the final output $X_o$. It is worth noting that the style vector can control the style of the generated sketches, which helps improve their quality and diversity. Besides, the style of the real photo is often fixed and independent of the artists' style. Therefore, we introduce the style information in the I2S task but exclude it in the S2I task.
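The fusion step (concatenate the style feature map with the output of $G_1$, then add the result to the latent feature of the encoder of $G_2$) could be sketched as below. Several details are assumptions: the style vector is taken to be a one-hot code expanded over the spatial grid, a 1 × 1 convolution is used to bring the concatenated map back to the channel count of the $G_2$ latent so that the addition is well defined, and average pooling stands in for the 50% downsampling. `StyleFusion` and `downsample_half` are hypothetical names.

```python
import torch
from torch import nn
import torch.nn.functional as F


def downsample_half(x):
    # 50% downsampling of the first-stage output (average pooling is an assumption).
    return F.avg_pool2d(x, kernel_size=2)


class StyleFusion(nn.Module):
    def __init__(self, feat_ch=64, n_styles=3):
        super().__init__()
        # Projects the concatenated map back to feat_ch channels (assumption),
        # so that it can be added element-wise to the G2 latent feature.
        self.project = nn.Conv2d(feat_ch + n_styles, feat_ch, kernel_size=1)

    def forward(self, g1_out, g2_latent, style):
        # style: (N, n_styles) vector, expanded to an (N, n_styles, H, W) map.
        n, _, h, w = g1_out.shape
        style_map = style[:, :, None, None].expand(n, style.shape[1], h, w)
        fused = self.project(torch.cat([g1_out, style_map], dim=1))  # concatenation
        return fused + g2_latent                                      # element-wise addition
```

For the S2I task, where no style information is used, the same addition would apply with the `style_map` concatenation left out.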

Loss Function
We use a combination of several loss functions to train our model. We denote $X$ and $Y$ as the input and its corresponding reference, respectively. For simplicity, we define $G(X)$ as the generated output of the given input $X$ and $D_k(X, Y)$ as the corresponding predicted probabilities of the k-th discriminator. Then, we denote the i-th layer feature extractor of discriminator $D_k$ as $D_k^i$, where $k$ is the index of the discriminator.
Adversarial Loss. We use the adversarial loss [73], computed for each discriminator $D_k$, to make the generated image more visually appealing.
Feature Matching Loss. Similar to [23], we use the feature matching loss to improve the adversarial loss based on the k-th discriminator. It is computed over the intermediate features $D_k^i$, where $T$ denotes the total number of layers in each discriminator and $N_i$ is the number of feature maps in the i-th layer. This loss is used to match the intermediate feature maps of the real and synthesized images, making the generator produce multi-scale statistical information. Besides, it stabilizes the training process and restores highly realistic outputs.
Perceptual Loss. To maintain perceptual and semantic consistency, we use a perceptual loss [93] to measure the difference between the original image and the corresponding synthesized image. We extract the perceptual features from the i-th layer activations of a pre-trained VGGNet [89], which is denoted as $\phi_i(\cdot)$, and compute the loss from the differences between these features.
Pixel-Wise Loss. The $L_1$ distance between a generated image $G(X)$ and reference $Y$ is regarded as the pixel-wise loss. It is computed over all pixels, where $(i, j)$ and $(h, w)$ are the pixel coordinates and the (height, width) of the output, respectively.
Style Classification Loss. Similar to [180,181], we define an auxiliary classifier to predict the sketch style of the generated image. For any generated image $G(X)$, the style classification loss is the cross-entropy between $S(G(X))$ and $c$, where $l_{ce}(\cdot, \cdot)$ is the cross-entropy loss, $S(\cdot)$ is a CNN that outputs the probability over different styles, and $c$ is the label of a given artist's style. Note that we only use the style classification loss in the second stage for the I2S task.
Overall Loss. Finally, the overall loss function for the multi-scale discriminators is built from the adversarial terms, and the overall loss function for the generator combines the adversarial loss with the weighted remaining terms, where $\lambda_{fm}$, $\lambda_1$, $\lambda_{per}$, and $\lambda_{sty}$ are hyperparameters that control the importance of the feature matching loss, pixel-wise loss, perceptual loss, and style classification loss, respectively.
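The loss terms described in this section can be written out in the notation defined above. The following is a hedged reconstruction rather than a verbatim copy of the paper's equations: the adversarial and feature matching terms are assumed to follow the standard forms of the cited works [73] and [23], the norm in the perceptual term and the unit weight on the adversarial term are assumptions, and only the symbols $G$, $D_k$, $D_k^i$, $T$, $N_i$, $\phi_i$, $S$, $c$, $l_{ce}$, and the $\lambda$ weights come from the text.

```latex
% Adversarial loss for the k-th discriminator (standard conditional GAN form, assumed)
\mathcal{L}_{adv}(G, D_k) = \mathbb{E}_{(X,Y)}\big[\log D_k(X, Y)\big]
  + \mathbb{E}_{X}\big[\log\big(1 - D_k(X, G(X))\big)\big]

% Feature matching loss over the T layers of D_k (form of [23], assumed)
\mathcal{L}_{fm}(G, D_k) = \mathbb{E}_{(X,Y)} \sum_{i=1}^{T} \frac{1}{N_i}
  \big\lVert D_k^{i}(X, Y) - D_k^{i}(X, G(X)) \big\rVert_1

% Perceptual loss on VGG features (choice of norm assumed)
\mathcal{L}_{per} = \sum_{i} \big\lVert \phi_i(G(X)) - \phi_i(Y) \big\rVert_1

% Pixel-wise L1 loss over an h x w output
\mathcal{L}_{1} = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w}
  \big\lvert G(X)_{i,j} - Y_{i,j} \big\rvert

% Style classification loss (I2S, second stage only)
\mathcal{L}_{sty} = l_{ce}\big(S(G(X)), c\big)

% Overall objectives (signs and the sum over discriminators assumed)
\mathcal{L}_{D} = -\sum_{k} \mathcal{L}_{adv}(G, D_k)
\qquad
\mathcal{L}_{G} = \sum_{k} \Big[ \mathcal{L}_{adv}(G, D_k)
  + \lambda_{fm}\, \mathcal{L}_{fm}(G, D_k) \Big]
  + \lambda_{1}\, \mathcal{L}_{1} + \lambda_{per}\, \mathcal{L}_{per}
  + \lambda_{sty}\, \mathcal{L}_{sty}
```

Under this reading, the discriminators minimize $\mathcal{L}_D$ while the generator minimizes $\mathcal{L}_G$; for the S2I task and for the first stage, the $\lambda_{sty}$ term is dropped.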

Implementation Details
We use PyTorch [182] to implement the baseline FSGAN.The experiments are conducted on an NVIDIA V100S.
For the I2S task, we set $\lambda_{fm} = 25.0$, $\lambda_1 = 25.0$, and $\lambda_{per} = 12.5$ to train the model in the facial components synthesis stage, and set $\lambda_{fm} = 100.0$, $\lambda_1 = 100.0$, $\lambda_{per} = 50.0$, and $\lambda_{sty} = 100.0$ for the facial synthesis stage. The Adam optimizer [183] is used for training the whole network. The initial learning rates for the generator and discriminator are 2e-4 and 1e-5, respectively. The other hyperparameters of the optimizer are set to the default values as recommended in PyTorch. We set the number of epochs to 50. All generators and discriminators are trained iteratively.
For the S2I task, we set $\lambda_{fm} = 50.0$, $\lambda_1 = 50.0$, and $\lambda_{per} = 0.2$ to train the neural network for the facial components synthesis stage, and set $\lambda_{fm} = 100.0$, $\lambda_1 = 100.0$, and $\lambda_{per} = 0.2$ for the facial synthesis stage. We again use the Adam optimizer, with initial learning rates of 2e-4 for both the generators and discriminators. The training strategy is almost the same as that for the I2S task. However, we set the number of epochs to 400, freezing the weights of the facial components synthesis module after 250 epochs and further training the facial synthesis module for the remaining epochs.
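The settings above can be collected into a small configuration, together with a helper that forms the weighted generator objective. This is an illustrative sketch: the numeric values come from the text, while the names (`LOSS_WEIGHTS`, `generator_loss`, `components_trainable`), the unit weight on the adversarial term, and the zero-based epoch indexing are assumptions.

```python
# Loss weights per (task, stage), copied from the text.
LOSS_WEIGHTS = {
    ("I2S", "components"): {"fm": 25.0, "l1": 25.0, "per": 12.5},
    ("I2S", "full"): {"fm": 100.0, "l1": 100.0, "per": 50.0, "sty": 100.0},
    ("S2I", "components"): {"fm": 50.0, "l1": 50.0, "per": 0.2},
    ("S2I", "full"): {"fm": 100.0, "l1": 100.0, "per": 0.2},
}

# Initial Adam learning rates and epoch budgets, copied from the text.
LEARNING_RATES = {
    "I2S": {"generator": 2e-4, "discriminator": 1e-5},
    "S2I": {"generator": 2e-4, "discriminator": 2e-4},
}
EPOCHS = {"I2S": 50, "S2I": 400}
S2I_FREEZE_EPOCH = 250


def generator_loss(terms, task, stage):
    """Weighted sum of loss terms; the adversarial term has weight 1 (assumption)."""
    total = terms["adv"]
    for name, weight in LOSS_WEIGHTS[(task, stage)].items():
        total = total + weight * terms[name]
    return total


def components_trainable(task, epoch):
    """False once the S2I component module is frozen (epochs counted from 0)."""
    return not (task == "S2I" and epoch >= S2I_FREEZE_EPOCH)
```

With PyTorch, the learning rates would be passed as `torch.optim.Adam(params, lr=LEARNING_RATES[task]["generator"])`, leaving the remaining optimizer arguments at their defaults, and `terms` may hold tensors as well as plain floats.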

Benchmark
This section provides comprehensive comparisons and analyses of the existing models on FS2K, in terms of both the I2S and S2I tasks.

Evaluation Metrics
For the I2S task, the most popular facial sketch metric is the structural similarity index metric (SSIM) [20,44]. However, it ignores the perceptual similarity between a prediction and the reference. Therefore, we further adopt the recently proposed structure co-occurrence texture (SCOOT) metric [29], which provides a unified evaluation for both structure and texture. For the S2I task, we still adopt the widely used SSIM metric to evaluate the synthesized faces. Our evaluation toolbox is available at https://github.com/DengPingFan/FS2KToolbox.
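To make the SSIM comparison concrete, the sketch below evaluates the SSIM formula once over a whole image, using global means, variances, and covariance. This is a deliberate simplification and an assumption on our part: standard SSIM implementations average the same expression over local windows, and the evaluation toolbox linked above may differ in its settings. The constants use the conventional values $K_1 = 0.01$ and $K_2 = 0.03$; `global_ssim` is a hypothetical name.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x and y are flat sequences of pixel intensities. Real SSIM averages this
    expression over local windows; this global version is only illustrative.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    def cov(a, mean_a, b, mean_b):
        return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n

    var_x = cov(x, mean_x, x, mean_x)
    var_y = cov(y, mean_y, y, mean_y)
    cov_xy = cov(x, mean_x, y, mean_y)

    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

An image compared with itself scores 1, while structural changes such as inverting the intensities lower the score, which is why SSIM rewards structure preservation but says little about texture or artistic style.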

Comparison of the Models
To evaluate the performance on the I2S task and S2I task, we present the empirical results of 19 representative approaches and the FSGAN baseline.

Training/Testing Protocols
All compared methods are selected on three criteria: a) widely regarded technology, b) open-source code, and c) state-of-the-art performance. The models are trained and tested on our FS2K with the image sizes specified in their papers. If the size setting is not provided in their paper, 512 × 512 is utilized as the default.

I2S Task
We first provide a performance summary of the I2S task regarding both SCOOT and SSIM scores. Quantitative results and qualitative comparisons are shown in Table 5 and Fig. 8-10, respectively. The experimental observations indicate that the FSGAN baseline achieves better results. For further analysis, we divide all compared methods into three categories based on their SCOOT score.
Analysis. Methods in the first group achieve a SCOOT below 0.3. These include DualGAN [78], FPST [95], NST [87,88], Pix2pix [31], ACL-GAN [79], and WCT [99]. As shown in Fig. 8, DualGAN, NST, and WCT suffer from structural distortion, where many local facial details are lost. The images produced by DualGAN are poor, and it is challenging to detect facial components in them. This explains why it has lower SSIM and SCOOT scores. In addition, compared with other results, Pix2pix and FPST generate blurred results. ACL-GAN seems to achieve satisfactory results in visual appeal, yielding a higher SSIM score. However, ACL-GAN reproduces the original facial structure almost exactly, lacking artistic style.
The second group includes AdaIN [98], UNIT [22], TSIT [80], DRIT++ [27], CartoonGAN [85], UGATIT [30], NICE-GAN [26], and CycleGAN [21], whose SCOOT scores range from 0.3 to 0.35. As shown in Fig. 9, the synthesized sketch images are better in terms of structure preservation compared to the first group. However, except for AdaIN, all models are thrown off by the complex backgrounds (see the hair region in the second row). Besides, the results of CartoonGAN seem to alter the color of the input images, leading to lower SSIM scores.
backgrounds. Pix2pixHD generates relatively good sketches with global structure and clean background, but it does not generate the best facial components. For example, in Fig. 10-e, the region around the eyes is unclear, and many details are lost. Take the third row, for instance: the eyeglasses are partially lost, while the eyeball is entirely black. We further observe that DSMAP and MDAL tend to achieve better sketch images but with distortions in local facial information. Finally, the baseline can synthesize high-quality sketches that focus on the global structure and local details while considering diverse styles. Moreover, as shown in the highlighted boxes (in green, blue, and red), we find that the outputs of FSGAN are more similar to the reference than those of other state-of-the-art methods.

S2I Task
We report our experimental results in Table 6 and Fig. 11. We find that FSGAN achieves the best results on our challenging FS2K compared to the existing state-of-the-art models.
The results presented in Fig. 11 show that FNS and FPST fail to transfer the sketches into colored images. SPADE and Pix2pix generate poor results with facial outlines (e.g., Pix2pix) or black background (e.g., SPADE). Five models (i.e., NST, WCT, DeepPS, DSMAP, and UNIT) produce noise patches in salient regions, which corrupt the global facial structure. Meanwhile, AdaIN, ACL-GAN, DualGAN, and UGATIT perform better than the models mentioned above but still produce unrealistic cartoon-style images. Only CycleGAN, NICE-GAN, TSIT, pSp, and Pix2pixHD overcome various challenges and achieve good results in terms of facial completeness. In particular, the eye regions from Pix2pixHD [23] and pSp [86] are better than those from the other models. However, compared with the results of FSGAN, the facial features of Pix2pixHD are relatively inferior because it learns with a pixel-wise rather than block-wise strategy.
Although pSp [86] can generate high-quality results, its results lack diversity compared with the FSGAN baseline. For example, pSp generates similar facial expressions under two different sketch styles, while the baseline can synthesize diverse contents, as shown in Fig. 12.

SCOOT Metric Results
To provide a deeper understanding of the models, we present an attribute-based performance evaluation in Table 7.
Analysis. Hair is one of the dominant features of the head. In Table 7, we find that most models achieve slightly better or comparable performance on images without hair than on those with hair, except for three models, namely AdaIN, CartoonGAN, and CycleGAN. Meanwhile, we find that red and black hair are the most challenging and the easiest to detect/reconstruct, respectively. We argue that this is because images with red and black hair make up the lowest and largest (>40%) proportion of all data, respectively. Thus, the models are unfamiliar/familiar with these attributes.
In addition, we also notice that females (F) are more challenging than males (M) for almost all models since women usually have various accessories and hairstyles. For example, the models perform worse on images with earrings (w/ E) than on those without earrings. Additionally, facial images with smiles are more challenging than those without smiles. Interestingly, existing models achieve diverse performance irrespective of the color of hair (e.g., H(b), H(bl), H(r), and H(g)). Finally, compared to style 1 (simple lines) and style 3 (i.e., repeated wispy details), we see that style 2 (long strokes) is the most challenging for all models.

SSIM Metric Results
In addition to the SCOOT metric, we also provide the SSIM metric for the I2S task in Table 8.
Analysis. We find that the overall performance tends to be similar to the SCOOT metric results in several key attributes, such as hair, gender, accessories, and style. We note that the performance on "w/ F" is lower than on "w/o F", as shown in Table 8. One possible reason is that frontal faces preserve more structural features than non-frontal faces. Therefore, in the I2S task, images with attributes such as "w/ F" are more challenging than "w/o F".

Ablation Study
This section provides a detailed analysis of FSGAN on the proposed FS2K dataset. Unlike most existing facial synthesis models [23], our model has a two-stage GAN architecture for both I2S and S2I tasks. Besides, a sketch style vector is introduced to enable diversified style synthesis in the second stage of the I2S task. Therefore, the ablation studies on the I2S task are conducted on the following two key components: (1) the facial components synthesis stage and (2) the style vector assisted generation. Note that we adopt the same hyperparameters described in Sec. 4.5 during our ablation experiments.
Table 9 shows the ablation results for the I2S task. We find that the facial components synthesis stage increases the SCOOT and SSIM scores by 1.31% (relative) and 2.67%, respectively, while the style vector increases them by 6.30% and 4.72%. As illustrated in Fig. 13, without the multi-patch strategy, the lines in the synthesized lips are often missing structural details. Meanwhile, with the multi-patch stage, the lines become smoother. Moreover, the synthesized drawings are messier
For the S2I task, an ablation study is conducted to validate the effectiveness of the facial component synthesis stage, as shown in Table 10. Similar to the I2S task, the multi-patch component achieves a significant performance gain (i.e., 3.3%) over the baseline model. Fig. 14 provides examples of the results produced by our model and the model without the facial components synthesis stage. Our model with facial component synthesis captures more details and ensures a more realistic overall appearance (see Fig. 14-c).

Discussion
Although FSS has achieved significant progress, there is still considerable room for improvement. This section summarizes the possible future research directions related to FSS.
(1) Datasets. Due to the relative shortage of professional sketch artists, achieving large numbers of images remains an open problem, impeding the development of FSS. Furthermore, more diversified sketch (or drawing) styles are needed to build more attractive models and achieve better synthesis results. To address these issues, we believe novel data augmentation techniques [116,184,185] and transfer learning strategies [186-188] designed for FSS are promising directions of study.
(2) Models. Currently, most state-of-the-art models are trained with a large number of paired images and sketches [16,23] to overcome data shortages. However, more attention could be paid to techniques such as few-shot [189], semi-supervised [190], weakly-supervised [191], self-supervised [192], and non-pairwise unsupervised [83] learning to achieve style transfer with limited datasets. Besides, developing novel human-in-the-loop [193] models is another promising direction that would provide more interactive options to users for generating and editing personalized styles. Interactive models that utilize the attributes in our FS2K could also serve as drawing tools provided to professional artists for facilitating the creation of sketches and other styles of drawing. Furthermore, FSS in the wild is still challenging because the image quality, including resolution, noise, and background, varies drastically. In addition to the techniques mentioned above, basic model units could also be a focus for developing new strategies. For example, most current models are built upon CNN [194] units. Therefore, more exploration of other frameworks, such as MLPs [195] and Transformers [196,197], could also be conducted.
(3) Evaluation. Evaluation metrics are essential for the development of new models and the benchmarking of existing models. Currently, several quantitative evaluation metrics [20,198] and human visual ranking methods [66] are used. However, as these aim to provide relatively objective and fair comparisons between all models, the different applications of FSS are not considered. This may lead to biased or unreliable evaluation of specific tasks. Therefore, more task-specific evaluation metrics and methods could be another important direction for future research.
(4) Applications. Currently, the only direct applications of FSS (I2S and S2I) are entertainment and law enforcement [1,44]. With the development of FSS techniques, many other promising applications could also be implicitly or explicitly facilitated by FSS research, such as art design and animation production. In addition to these industrial applications, we believe that FSS methods and ideas could also benefit other research fields. For example, sketches could be used to assist image resizing [199], super-resolution [200], etc. Further, the sketches usually contain the most conspicuous information of an image and can therefore be considered compressed versions of RGB images [201]. This characteristic makes sketches useful for the image compression task. Besides, the S2I task can be considered a specific case of image super-resolution in a broad sense because both tasks aim to reconstruct detailed RGB images from the given inputs. The difference is that the input of S2I is high-frequency information, while that of the standard super-resolution task is the low-frequency information of the original image.

Conclusion
We have presented a complete review of the facial-sketch synthesis problem. To the best of our knowledge, this is the first systematic study on deep FSS in sketch-to-image and image-to-sketch tasks. To achieve this, we established a new challenging dataset, named FS2K. We also introduced a copy table for the proposed FS2K to address the alignment issue between the sketches drawn by artists and the original images. The proposed simple baseline, FSGAN, achieves new state-of-the-art performance with a two-stage architecture. Finally, as the most extensive survey (i.e., 89 literature methods) and benchmark (i.e., 19 cutting-edge models), we have revealed that the development of this field is still in its infancy. Therefore, the main goal of this paper is to spark novel ideas rather than rank all existing models. It is not easy to benchmark all of the existing models due to the prosperity of the field. We hope this investigation will attract the community's attention and yield exciting follow-up directions, such as generating vivid sketches with music, developing cartoons from sketches, synthesizing sketch videos, and fake faces [202].

Fig. 4 Statistics and examples from the FS2K dataset.Please refer to Sec. 3 for details.

Fig. 5 Use of the copy table and an example. Zoomed-in for the best view. See Sec. 3.2 for more details.

Fig. 6 Three sketch styles in our FS2K.As shown in the cheek region, the styles include simple lines (style 1), long strokes (style 2) and repeated wispy details (style 3).

Fig. 7 Pipeline of our FSGAN baseline for the I2S task. It consists of two stages: 1) facial components synthesis and 2) facial-sketch synthesis. Please refer to Sec. 4.2 and Sec. 4.3 for more details.
and $D_{parts} = \{D_{leye}, D_{reye}, D_{nose}, D_{mouth}, D_{rest}\}$, where $G$ and $D$ indicate the generator and discriminator, respectively.

Fig. 12 Visual diversity of the data generated for the S2I task.
(1) Datasets. Due to the relative shortage of professional sketch artists, achieving large numbers of images remains an open problem, impeding the development of FSS. Furthermore,
[Figure panels: (a) Input, (b) w/o multi-patch, (c) FSGAN]

Table 1
Comparison with other FSS datasets.

Table 2
Summary of popular related works.These can be categorized into three types: Traditional Facial Synthesis, General Image Synthesis, and Deep Image-to-Sketch Synthesis.
Publ.: Publication information.Year: Publication year.Code: The link of the corresponding open resources.Components: The key components of each model.Dataset: A = TU-Berlin Sketch Dataset [101], B = Disney Portrait Dataset

Table 3
Summary of popular related works. Please refer to Table 2 for more detailed descriptions.
[149] mulate the coarse-to-fine painting process of human artists. Chen et al. [149] proposed a local-to-global framework to allow any user to produce high-quality face images. Their model consists of three modules: component embedding, feature mapping, and image synthesis. General I2S. Song et al.

Table 4
Number of images for each attribute in the training and test datasets.

Table 5
Quantitative results of popular models on the I2S task. "↑" means the higher, the better. Publ.: Publication information.

Table 6
Quantitative results of popular models on the S2I task. "↑" means the higher, the better.

Table 7
Comparison of 19 state-of-the-art models in terms of attribute-based performance on the I2S task.

Table 8
Comparison of 19 top models in terms of attribute-based performance on the I2S task.

Table 9
Ablation study of FSGAN on the I2S task.

Table 10
Ablation study of our model on the S2I task.