1 Introduction

Fonts are important visual designs that often convey additional information, such as whether the current text is formal or casual. Some artistic fonts can even create a frightening or playful atmosphere. However, designing a new font is time-consuming because there are many factors to consider, such as strokes, decoration, and effects. In addition, all characters within the same font must be designed in a consistent style and at an appropriate size. Some font libraries contain thousands of characters from multiple languages (e.g., Microsoft YaHei contains over 20,000 characters, including Chinese, Korean, Japanese, Latin, and Greek). Artists usually spend considerable time maintaining a consistent style across these characters to ensure visual compatibility. This labor-intensive process causes practical problems: artists often design a font in only one language, and extending its style to other languages later takes considerable time.

With the development of deep neural networks [1,2,3], automatic font synthesis without human intervention has become possible. Deep neural networks take both shape and texture transfer into account in an end-to-end manner. Early approaches [4, 5] were proposed to synthesize an entire font library by observing a subset of it. These methods treat font style transfer as an image-to-image translation [6] or cGAN [7] task. While these image-to-image translation approaches have shown remarkable generative ability, several obvious drawbacks remain. First, they divide the training process into two phases: models are pre-trained on a large dataset and then fine-tuned for the specific task. The fine-tuning process makes these methods less practical when computational resources are limited. Second, hundreds of training samples are needed during the fine-tuning phase, and creating these samples is itself labor-intensive. To reduce the training time and manual work of the fine-tuning phase, several few-shot learning methods have been proposed [8,9,10,11,12,13]. These models can synthesize a high-quality font library by observing only a few samples.

However, all the methods described above only transfer styles within the same language, and cross-language font style transfer is more challenging. Artists often design a font in only one language (e.g., for movie posters), and it is time-consuming to extend the font style to other languages. Therefore, a model that can learn a font style from another language is necessary. However, characters can vary greatly from language to language; for example, some components of Chinese characters are complex and do not appear in English letters. Furthermore, unlike artistic [14,15,16] or text [17] style transfer tasks that require consideration of only global style, font style often consists of both local features (e.g., stroke, decoration, and thickness) and global features (e.g., effect and shape). Therefore, it is difficult to learn font styles from one language and apply them to another, especially for font generation methods [12, 13] that only consider global features.

To solve these issues, we propose FTransGAN (Font Translator GAN), which can synthesize a high-quality font library by observing only a small number of samples from other languages, without fine-tuning. We use two encoders to extract the style and content representations separately, then concatenate them and input them into the decoder. Two discriminators check the degree of matching from the style and content perspectives, respectively. Our style encoder contains two modules, the Context-aware Attention Network and the Layer Attention Network, which work together to capture local and global style features. Experimental results on a collected multi-language dataset show high visual quality for both handwritten and printed fonts. We illustrate some application examples in Fig. 1. The main contributions are summarized as follows:

  • We developed a novel model, FTransGAN, which provides the first end-to-end solution to cross-language font style transfer. Two novel modules, the Context-aware Attention Network and the Layer Attention Network, were introduced to capture local and global style features simultaneously.

  • We constructed a new multi-language font dataset consisting of 847 fonts, each with 52 English letters and more than 1000 Chinese characters.

  • We demonstrate that the generative ability of the proposed model can be easily improved by a transfer learning technique.

Fig. 1

A few application examples. The English letters on the red background are the style images, and the Chinese characters on the green background are the content images. The rest are images synthesized by our FTransGAN, which extracts the strokes, thickness, decoration, or effect of the font from several observed English letters and automatically applies the extracted style to the given Chinese character. The Chinese meaning in the figure is “Computer Vision”

The remaining sections of this paper are organized as follows. In Section 2, we introduce existing works related to the topic of this paper. In Section 3, we describe the proposed FTransGAN model. Section 4 reports the experimental results to show the effectiveness of the model. We conclude the paper and discuss some limitations in Section 5.

This paper is an extension of [18]. The differences between the journal and conference versions are as follows: (1) A detailed ablation study, including loss and hyper-parameter analysis, was added to the journal version. (2) In the journal version, we compare our study with two recent works [19, 20]. (3) The conference version only included the “English2Chinese” experiment; here, we show that the proposed model is also capable of “Chinese2English” transfer. (4) A transfer learning technique was adopted to improve the performance of the proposed FTransGAN.

2 Related work

Font generation is a long-standing challenge that many studies have attempted to address. In this section, we selectively review papers that are closely related to our work. In Sections 2.1 and 2.2, we introduce two main topics linked to our study: style transfer and image-to-image translation. We also present related works on self-attention in Section 2.3, as self-attention is a key technique in the proposed model. In Section 2.4, we introduce some deep learning-based font generation methods.

2.1 Style transfer

Style transfer [14, 15] is a long-standing problem in computer vision, in which styles are extracted from one or several images and applied to another image. The input usually consists of a content image and a style image, and the output is obtained by optimizing a content loss and a style loss. The content loss is the distance between two feature maps at each position, and the style loss is calculated by comparing the summary statistics of each layer. The limitation of these works is that they only support one-to-one style transfer. StyleBank [21] solved the problem of transferring multiple styles by storing multiple style layers, and [22] proposed adaptive instance normalization, which can adapt to arbitrary new styles. In addition, [23] introduced an arbitrary style transfer method by reshuffling deep features of style images. Recently, StyleGAN [24,25,26] proposed a novel generator architecture that automatically divides attributes into different levels and provides intuitive, scale-specific control of the synthesis. However, these methods are mainly designed for artworks, and they usually define style as a set of colors and textures. As already noted, font styles are relatively abstract and consist of local and global features. Therefore, it is difficult to directly apply these methods to the font style transfer problem.

2.2 Image-to-image translation

Pix2Pix [6] and CycleGAN [27] proposed generalized image-to-image translation frameworks that aim to learn the mapping between two domains; the former uses paired data, while the latter is unpaired. Similarly, [28, 29] proposed the coupled GAN (CoGAN), which learns a joint distribution of two domains by weight sharing. However, these models have a few problems. First, they can only translate images between two domains. Second, they usually require a large amount of training data, which is less practical. StarGAN [30] and MUNIT [31] solved the first problem by adding domain information to the generator. However, the generative ability of such models is limited to a few domains; they cannot synthesize images of unknown domains. Recently, FUNIT [11] was proposed as a few-shot unsupervised image generation method. It solved the second problem by using an additional encoder to extract domain information from several images. Inspired by these works, our model also adopts the few-shot generation structure and adversarial loss.

2.3 Attention mechanism

The attention mechanism [32, 33] was first proposed in the field of natural language processing [34, 35] to mitigate the damage caused by a fixed-length vector in the encoder-decoder architecture. This mechanism enables the machine to focus on certain words. Later, Xu et al. [36] applied an attention network to the computer vision field to solve the image captioning problem. Yu et al. [37] improved the attention network into a multi-level architecture to obtain both spatial and semantic information from a single image. Recently, [38] made several breakthroughs in the natural language field by exploiting the self-attention module, and SAGAN [39] also successfully applied the self-attention layer to the image generation task. Here, we directly use the self-attention block proposed by SAGAN [39] for our Context-aware Attention Network.

2.4 Font generation

Font generation can be considered a special case of image style transfer if we consider each character as an image. Existing works can be roughly divided into two groups - many-shot learning and few-shot learning.

Many-shot learning methods [4, 5, 40, 41] mainly employ the image-to-image translation architecture. The condition is a character in the source font, and the output should match the corresponding character in the target font. However, these methods usually need many reference images (e.g., 700 glyph images) to learn the style.

Several recent models [8,9,10, 12, 13, 42, 43] have been proposed for few-shot font generation. These models can generate an entire font library with only a few samples (e.g., 6 glyph images). MC-GAN [8] was the first end-to-end approach to synthesize artistic fonts. However, its numbers of input and output images are fixed (26 English letters), so the approach cannot handle a large font library (e.g., Chinese) because of the limitations of its architecture. AGIS-Net [10] and EMD [12] later solved this problem by combining content and style features. They can also be regarded as image-to-image translation methods; the difference is that these models require two conditions, a style image and a content image, and the output should be a combination of the two. To improve the quality of the generated images, [13] proposed the Deep Feature Similarity (DFS) architecture, which leverages the feature similarity between the input content and style images to synthesize target images. Recently, researchers [9, 19, 20, 44,45,46] have made significant progress by exploiting the compositionality of characters, i.e., decomposing them into components. However, our experimental results indicate that these methods perform poorly on the constructed multi-language dataset.

3 FTransGAN

In this section, we describe the proposed model, FTransGAN. We introduce the problem setting and give a model overview in Section 3.1. We then describe the proposed Context-aware Attention Network and Layer Attention Network in Sections 3.2 and 3.3, respectively. Finally, we describe the loss function in Section 3.4.

3.1 Problem setting and model overview

Like other few-shot font generation models, our goal is to generate font images given two conditions: style and content images. We consider the generation of font images as the process of estimating the conditional probability p(x|s,c), where x is the target image, c is one content image in a standard style (e.g., Microsoft YaHei for Chinese or Times New Roman for English), and s is a set of style images {s1,s2,…,sK} that share the same style but have different content. The content and style images should come from different languages. For example, if the content image is a Chinese character, the style images should consist of English letters, and vice versa.
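For concreteness, the sketch below shows how one training triplet could be assembled under this setting. It is a minimal illustration, not the paper's data pipeline: the dictionary-based glyph storage and the helper name `sample_pair` are our own assumptions.

```python
import random
import torch

def sample_pair(style_font_glyphs, content_font_glyphs, style_chars, target_char, K=6):
    """Assemble one training triplet (s, c, x_hat) for cross-language transfer.

    style_font_glyphs / content_font_glyphs: dicts mapping characters to
    1x64x64 tensors rendered in the style font and in the standard content font.
    style_chars: characters of the *other* language available in the style font.
    """
    # K style references: same font, different characters, different language
    s = torch.stack([style_font_glyphs[ch] for ch in random.sample(style_chars, K)])
    # one content image: the target character rendered in a standard font
    c = content_font_glyphs[target_char]
    # ground truth: the target character rendered in the style font (paired data)
    x_hat = style_font_glyphs[target_char]
    return s, c, x_hat
```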

As shown in Fig. 2, our model has a Generator G and two discriminators: a Content Discriminator Dcontent and a Style Discriminator Dstyle. The two discriminators follow the design of PatchGAN [6] to check real and fake patches locally. The Generator G takes one content image c and K style images s as inputs and generates the target image x:

$$ x = G (s, c). $$
(1)

The detailed structure of our generator can be found in Fig. 3. It consists of a style encoder fstyle, a content encoder fcontent, and a decoder fdecoder:

$$ z_{s} = f_{\text{style}}(s), $$
(2)
$$ z_{c} = f_{\text{content}}(c), $$
(3)
$$ x = f_{\text{decoder}}(z_{s}, z_{c}). $$
(4)

where zs and zc are the extracted feature codes for style and content, respectively. We specifically designed the style encoder fstyle as a multi-level attention structure [37, 47], using two attention modules, the Context-aware Attention Network and the Layer Attention Network, to capture both local and global style features. More details of these two networks are given in the following sections.
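Before turning to those modules, a minimal structural sketch of (2)-(4) is given below; the sub-module names are placeholders for the networks shown in Fig. 3, and their internal layer configurations are not specified here.

```python
import torch.nn as nn

class FTransGANGenerator(nn.Module):
    """Structural sketch only: f_style, f_content, and f_decoder are assumed
    to be nn.Modules implementing the encoders and decoder of Fig. 3."""
    def __init__(self, f_style, f_content, f_decoder):
        super().__init__()
        self.f_style, self.f_content, self.f_decoder = f_style, f_content, f_decoder

    def forward(self, s, c):
        z_s = self.f_style(s)                # (2): style code from K style images
        z_c = self.f_content(c)              # (3): content code from one content image
        return self.f_decoder(z_s, z_c)      # (4): synthesized target image x
```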

Fig. 2

An overview of our proposed FTransGAN. Each time, we randomly select K style images from the dataset as the style input and one content image from the source font as the content input; the content and style images are from different languages. We input them into the generator, and two discriminators check the matching degree from the style and content perspectives, respectively

Fig. 3

Overview of the proposed Generator G. It consists of a style encoder, a content encoder, and a decoder. The two encoders extract the style representation and content representation, respectively. The decoder then takes the extracted information and generates the target image. We specifically designed the style encoder to capture both local and global style features using three parallel Context-aware Attention Networks and a Layer Attention Network. The three Context-aware Attention Networks have the same structure, except for the size of receptive fields

3.2 Context-aware attention network

As shown in Fig. 3, the style encoder has three parallel Context-aware Attention Blocks. After multiple convolutions with 3×3 kernels, they have 13×13, 21×21, and 37×37 receptive fields, respectively. Thus, the shallower layer can only perceive local features, while the deeper layer can perceive almost the whole image. Figure 4 shows the details of the Context-aware Attention Block. The input is a feature map of size C×H×W given by the last convolution layer, where C, H, and W denote the number of channels, height, and width, respectively. Here, we denote each region of the feature map as \(\{v_{r}\}_{r=1}^{H\times W}\). Unlike previous works [37, 47], which use an LSTM [48] or GRU [49] block to obtain contextual information recurrently, we incorporate contextual information into the feature map of each region through a self-attention [39] layer to improve computational efficiency. It is given by:

$$ h_{r}=f_{a}(v_{r}), $$
(5)

where fa denotes the self-attention layer, and the new feature vectors hr contain both the information limited to their receptive fields and contextual information from other regions.

Fig. 4

The architecture of the proposed Context-aware Attention Network. It takes a feature map given by the last convolution layer and performs a weighted sum over the feature map to produce a feature vector. The weight map is given by a self-attention block and a randomly initialized context vector

We introduced the attention mechanism to give each region a score because not all regions have the same contribution. Specifically,

$$ u_{r}= f_{c}(h_{r}), $$
(6)
$$ a_{r}=\text{softmax}({u_{r}^{T}} u_{c}), $$
(7)
$$ y=\sum\limits_{r=1}^{H\times W} a_{r} v_{r}. $$
(8)

That is, we input each contextual vector hr into a single-layer neural network fc to obtain ur as a latent representation of hr. Then, a context vector uc is used to measure the importance of the current region; uc is randomly initialized and trained jointly with the entire model. After that, we obtain the normalized attention score ar through a softmax layer. Finally, we compute a feature vector y as the weighted sum over all regions. Note that we have three parallel Context-aware Attention Networks; thus, we obtain three feature vectors y1, y2, and y3.
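A compact sketch of (5)-(8) is given below. It assumes `self_attn` is a SAGAN-style self-attention layer standing in for fa; the single linear layer for fc and the randomly initialized context vector follow the description above, while the remaining details (e.g., no extra nonlinearity) are our own simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareAttention(nn.Module):
    """Sketch of the region-scoring part of the Context-aware Attention Network."""
    def __init__(self, channels, self_attn):
        super().__init__()
        self.self_attn = self_attn                       # f_a in (5)
        self.fc = nn.Linear(channels, channels)          # single-layer network f_c in (6)
        self.u_c = nn.Parameter(torch.randn(channels))   # context vector, trained jointly

    def forward(self, feat):                             # feat: (B, C, H, W)
        v = feat.flatten(2).transpose(1, 2)              # region features v_r: (B, H*W, C)
        h = self.self_attn(feat).flatten(2).transpose(1, 2)  # (5): contextual features h_r
        u = self.fc(h)                                   # (6): latent representation u_r
        a = F.softmax(u @ self.u_c, dim=1)               # (7): attention score a_r per region
        y = (a.unsqueeze(-1) * v).sum(dim=1)             # (8): weighted sum -> (B, C)
        return y

# e.g., ContextAwareAttention(channels=256, self_attn=nn.Identity()) for a quick shape check
```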

3.3 Layer attention network

Given a style image, should the machine focus on local or global features? We believe that depends on the image itself. Based on this assumption, we designed the Layer Attention Network.

As shown in Fig. 5, the Layer Attention Network receives four inputs: a feature map given by the last convolution layer, which we flatten to obtain a feature vector ym, and three feature vectors y1, y2, and y3 given by the three Context-aware Attention Networks. We use a single-layer neural network fl to give each feature vector a score. These scores explicitly indicate the level of features that the model should focus on. Specifically,

$$ w_{1},w_{2},w_{3} = f_{l}(y_{m}), $$
(9)
$$ z = \sum\limits_{i=1}^{3} w_{i} y_{i}, $$
(10)

where wi, i ∈ {1,2,3}, are three normalized scores given by the neural network, and z is the weighted sum of the three feature vectors. Note that the style encoder accepts K images at a time. Thus, the final latent code zs is the mean of the K per-image codes zk:

$$ z_{s} = \frac{1}{K} \sum\limits_{k=1}^{K} z^{k}. $$
(11)

In addition, the size of the content code is C×H×W, while the style code zs is a C-dimensional vector. We therefore replicate zs across the spatial dimensions to match the size of the content code. After obtaining the content code zc and the expanded style code zs, we simply concatenate them and feed them into the decoder, which generates images based on the style code zs and the content code zc.
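The following sketch combines (9)-(11) with the expansion and concatenation step just described. Using softmax to normalize the three scores is our assumption based on the phrase "normalized scores"; the helper names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerAttention(nn.Module):
    """Sketch of (9)-(10): weight the three Context-aware Attention outputs."""
    def __init__(self, in_features):
        super().__init__()
        self.fl = nn.Linear(in_features, 3)              # f_l in (9)

    def forward(self, y_m, y1, y2, y3):
        w = F.softmax(self.fl(y_m), dim=-1)              # normalized weights w_1..w_3
        y = torch.stack([y1, y2, y3], dim=1)             # (B, 3, C)
        return (w.unsqueeze(-1) * y).sum(dim=1)          # (10): weighted sum z -> (B, C)

def fuse_style_and_content(z_per_image, z_c):
    """(11) plus the expansion step: average the K per-image codes, tile the
    result over the spatial grid of the content code, and concatenate."""
    z_s = torch.stack(z_per_image, dim=0).mean(dim=0)    # (B, C): mean over K style images
    B, C, H, W = z_c.shape
    z_s = z_s.view(B, -1, 1, 1).expand(-1, -1, H, W)     # replicate to match H x W
    return torch.cat([z_c, z_s], dim=1)                  # decoder input
```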

Fig. 5

The architecture of the proposed Layer Attention Network. Its task is to compute a weighted sum of the three feature vectors. We input the feature map given by the last convolution layer into a neural network, which generates three weights, and the feature vectors are then summed according to these weights

3.4 End-to-end training

Our model has two discriminators, Dcontent and Dstyle, which have almost the same architecture and consist of several convolution layers. Dcontent receives the generated image and the content image and checks whether they depict the same character. Dstyle receives the generated image and the style images and checks whether they share the same style. We directly concatenate these images along the channel dimension before feeding them to the discriminators, and the entire model is jointly trained in an end-to-end manner.

The loss function of our model consists of three terms: L1 loss, style loss Lstyle, and content loss Lcontent,

$$ L = \lambda_{1} L_{1} + \lambda_{s} L_{\text{style}} + \lambda_{c} L_{\text{content}}, $$
(12)

where λ1, λs, and λc are three weights for balancing these terms. For higher quality results and to stabilize GAN training, both Lcontent and Lstyle employ hinge loss [50] functions:

$$ L_{\text{content}}=L_{\text{contentD}}+L_{\text{contentG}}, $$
(13)
$$ L_{\text{contentG}}=-E_{x,c\sim P(x,c)}[D_{\text{content}}(x,c)], $$
(14)
$$ \begin{array}{@{}rcl@{}} L_{\text{contentD}}& = &-E_{\hat{x},c\sim P(\hat{x},c)}[\min(0,D_{\text{content}}(\hat{x},c) - 1)]\\ & & -E_{x,c\sim P(x,c)}[\min(0,-D_{\text{content}}(x,c) - 1)], \end{array} $$
(15)
$$ L_{\text{style}}=L_{\text{styleD}}+L_{\text{styleG}}, $$
(16)
$$ L_{\text{styleG}}=-E_{x,s\sim P(x,s)}[D_{\text{style}}(x,s)], $$
(17)
$$ \begin{array}{@{}rcl@{}} L_{\text{styleD}}&=&-E_{\hat{x},s\sim P(\hat{x},s)}[\min(0,D_{\text{style}}(\hat{x},s)-1)]\\ & & -E_{x,s\sim P(x,s)}[\min(0,-D_{\text{style}}(x,s)-1)], \end{array} $$
(18)

where \(\hat {x}\) is the ground truth image, x is the generated image, and c and s denote the content and style images, respectively. To stabilize our training, we also adopted an L1 loss in our loss function to calculate the pixel-wise error between generated images and the ground truth images:

$$ L_{1}=E_{\hat{x},x\sim P(\hat{x},x)}[\lVert x-\hat{x} \rVert_{1}]. $$
(19)
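These loss terms translate directly into a few lines of code. The sketch below assumes PyTorch and discriminators that output unbounded real-valued patch scores; the helper names and weight defaults (λ1 = 100, λc = λs = 1, as used in Section 4.2) are stated for illustration.

```python
import torch.nn.functional as F

def hinge_d_loss(d_real, d_fake):
    """(15)/(18): hinge loss for a discriminator, where d_real = D(x_hat, cond)
    on ground truth images and d_fake = D(x, cond) on generated images."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_fake):
    """(14)/(17): the generator's adversarial term."""
    return -d_fake.mean()

def total_generator_loss(x, x_hat, d_content_on_fake, d_style_on_fake,
                         lam_1=100.0, lam_c=1.0, lam_s=1.0):
    """(12): weighted sum of the L1 term (19) and the two adversarial terms."""
    return (lam_1 * F.l1_loss(x, x_hat)
            + lam_c * hinge_g_loss(d_content_on_fake)
            + lam_s * hinge_g_loss(d_style_on_fake))
```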

4 Experiments

In this section, we demonstrate the generative ability of the proposed FTransGAN from multiple perspectives. In Section 4.1, we introduce the proposed dataset and describe how we split it. Section 4.2 lists the hyperparameters that we used to train and test the models. In Section 4.3, we introduce four state-of-the-art models, EMD [12], DFS [13], LF-Font [19], and MX-Font [20], which we compared with our model. After that, we report the quantitative and qualitative results in Section 4.4. In the remaining sections, we comprehensively analyze the proposed model. Specifically, in Section 4.5, we conduct an ablation study to show the contribution of each component of the proposed model. In Section 4.6, we demonstrate that a simple knowledge transfer method can improve the generative quality. Finally, in Section 4.7, we show that the proposed self-attention architecture also increases the model’s explainability.

4.1 Font dataset and experiment settings

To evaluate the generative ability of our model, we constructed a dataset of 847 gray-scale fonts, each with approximately 1000 commonly used Chinese characters and 52 English letters of the same style. In addition, we used a common font, Microsoft YaHei, for the content images; this font was only used to index the character categories. We processed the dataset by finding a bounding box around each glyph, resizing it so that the larger dimension reached 64 pixels, and then creating 64×64 font images by padding. All pixel values were normalized to the range of -1 to 1 before being fed into the model.
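The preprocessing just described can be sketched as follows. This is a minimal illustration, assuming dark glyphs on a white background; the bounding-box threshold and centering are our own choices.

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size: int = 64) -> np.ndarray:
    """Crop to the glyph's bounding box, resize so the larger side is 64 px,
    pad to 64x64, and normalize pixel values to [-1, 1]."""
    g = np.array(img.convert("L"))
    ys, xs = np.where(g < 255)                          # bounding box of non-white pixels
    g = g[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = g.shape
    scale = size / max(h, w)
    g = np.array(Image.fromarray(g).resize((max(1, round(w * scale)),
                                            max(1, round(h * scale)))))
    canvas = np.full((size, size), 255, dtype=np.uint8)  # white 64x64 canvas
    top, left = (size - g.shape[0]) // 2, (size - g.shape[1]) // 2
    canvas[top:top + g.shape[0], left:left + g.shape[1]] = g
    return canvas.astype(np.float32) / 127.5 - 1.0       # values in [-1, 1]
```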

Next, we divided the experiments into two parts: “Chinese2English” and “English2Chinese”. In the “English2Chinese” part, the model needs to transfer the style of English letters to Chinese characters. We randomly chose 29 Chinese characters as unknown content and 29 fonts as unknown styles, and the rest were used as training data. The whole dataset was therefore divided into three parts: C1, images for training; C2, images with known content but unknown styles, used for testing; and C3, images with known styles but unknown content, also used for testing. In the “Chinese2English” part, the model needs to transfer the style of Chinese characters to English letters. For this part, we randomly chose 29 fonts as unknown styles and 6 English letters as unknown content, and left the rest as training data. This time, E1 was used for training, E2 to test unknown styles, and E3 to test unknown content. Figure 6 shows several examples of the font dataset and the partition rules. Before the experiments, we made sure that there was no overlap between the training and testing sets by computing the nearest neighbor for all fonts.

Fig. 6

Examples of the font dataset that we constructed for our experiments and the partition rule. We conducted two experiments: “Chinese2English” and “English2Chinese”. In the “Chinese2English” experiment, images in E1 are the labels used to train our network, and E2 and E3 were used to evaluate unseen styles and content. In the “English2Chinese” experiment, images in C1 are the labels used to train our network, and images in C2 and C3 were used to evaluate unseen styles and content

4.2 Hyperparameter settings

We provide the hyperparameter settings of our model in this section. Our basic setup follows Pix2Pix [6]. In the following experiments, we set λ1 = 100, λc = λs = 1, and K = 6. The Generator G, the Content Discriminator Dcontent, and the Style Discriminator Dstyle were all initialized with normal initialization. In the “English2Chinese” experiment, we trained both our model and the competitors for 20 epochs using the Adam optimizer [51] with β1 = 0.5, β2 = 0.999, and a learning rate lr = 0.0002 for the first 10 epochs, followed by a linear decay over the remaining 10 epochs. In the “Chinese2English” experiment, we trained them for 200 epochs. Empirically, we set the batch size to 256. Unlike previous work [8], which applied dropout in the generator to obtain randomness, we did not use dropout because we observed in experiments that it reduces the generative ability of the model. Instead, we added slight random noise to the style code zs.
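A minimal sketch of this optimization setup is shown below, assuming PyTorch; the helper name and the use of LambdaLR for the linear decay are our own choices rather than details from the paper.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(params, total_epochs=20, decay_start=10, lr=2e-4):
    """Adam with beta1=0.5, beta2=0.999; constant lr for the first `decay_start`
    epochs, then a linear decay toward zero over the remaining epochs."""
    opt = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        if epoch < decay_start:
            return 1.0
        return max(0.0, 1.0 - (epoch - decay_start) / float(total_epochs - decay_start))

    return opt, LambdaLR(opt, lr_lambda)
```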

4.3 Competitors

To our knowledge, cross-language font style transfer has not been attempted before. Therefore, we had to select models that could be modified for this task. We excluded models that cannot handle large font libraries [8] or were originally designed for unsupervised generation [11, 52]. Some few-shot learning methods, such as DFS [13] and AGIS-Net [10], require fine-tuning during testing; however, fine-tuning is computationally expensive and cannot be applied in some real-time systems. Most recent works [19, 20] require component labels as additional supervision, which makes a direct comparison less suitable. First, the comparison is unfair because it is much easier for models to learn the styles and structures of characters with component labels, and this type of label is expensive to obtain for both training and testing. Second, these methods cannot be applied to cross-language transfer because some languages, such as English, do not have compositional characters. Nevertheless, we visually compared our method with two compositionality-based methods, namely LF-Font [19] and MX-Font [20].

Finally, we chose four supervised few-shot style transfer models, EMD [12], DFS [13], LF-Font [19], and MX-Font [20], as our competitors. To ensure a fair comparison, none of the models were fine-tuned when processing unseen style or content images in the experiments; generation was merely a forward pass. We slightly modified the input and output channels of DFS to synthesize gray-scale images. LF-Font and MX-Font require component labels, which we also generated for our dataset (the C1, C2, and C3 parts). In addition, as LF-Font and MX-Font do not support font style transfer between English and Chinese, we only conducted the “Chinese2Chinese” experiment with these models. All methods were trained for the same number of iterations (20 epochs) as our proposed model.

4.4 Quantitative and qualitative evaluation

In this section, we evaluate both the numerical and visual results of the proposed model, comparing them with the competitors described in Section 4.3. Quantitative evaluation of generative models is inherently difficult because there is no universal rule for comparing ground truth images with generated images; moreover, there is no single correct answer for tasks like artistic style transfer. Several evaluation metrics [15, 53, 54] have been proposed to measure model performance based on different assumptions, but they remain controversial. Therefore, we evaluated the model comprehensively from three aspects: pixel level, perceptual level, and human level. In Table 1, we report the quantitative results, including mean absolute error (MAE), structural similarity (SSIM), multi-scale structural similarity (MS-SSIM), accuracy, and mFID for unseen characters and styles. For both the “English2Chinese” and “Chinese2English” parts, our model outperformed EMD [12] and DFS [13] on most evaluation metrics.

Table 1 Quantitative evaluation on the proposed multi-language dataset. Bold entries represent the best performance

Pixel-level evaluation

Pixel-wise evaluation compares pixels at the same position in the ground truth and generated images. We used MAE, SSIM, and MS-SSIM. Nevertheless, pixel-level metrics often contradict human intuition; therefore, we also used the other two levels of metrics to fully evaluate all models.
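As an illustration, MAE and SSIM can be computed as below. This is a sketch assuming scikit-image and images normalized to [-1, 1]; MS-SSIM is not shown and would require an additional implementation (e.g., the pytorch-msssim package).

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def pixel_level_metrics(x, x_hat):
    """x, x_hat: HxW float arrays with pixel values in [-1, 1]."""
    mae = float(np.abs(x - x_hat).mean())          # mean absolute error
    s = float(ssim(x, x_hat, data_range=2.0))      # value range spans 2.0 for [-1, 1]
    return mae, s
```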

Perceptual-level evaluation

[54] proposed a method to evaluate generative models by computing the Fréchet Inception Distance (FID) between the feature distributions of the ground truth and generated images. Liu et al. [11] modified it into a conditional version (mFID) by averaging the FID over each target class. To evaluate the generated images from both the style and content perspectives, we trained two ResNet-50 [55] networks on our proposed dataset to classify content (character) and style (font), respectively, and we report the top-1 accuracy and mean FID (mFID) based on these two networks. Thus, the perceptual-level evaluation consists of four components: style-aware accuracy, content-aware accuracy, style-aware mFID, and content-aware mFID.
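The class-conditional averaging behind mFID can be sketched as follows. Here `fid_fn` stands in for any FID implementation operating on the features of the style- or content-classification network described above; the dictionary layout is purely illustrative.

```python
import numpy as np

def mean_fid(fid_fn, real_by_class, fake_by_class):
    """Average FID over target classes, following the conditional variant (mFID).
    real_by_class / fake_by_class: dicts mapping a class label to a batch of images."""
    scores = [fid_fn(real_by_class[k], fake_by_class[k]) for k in real_by_class]
    return float(np.mean(scores))
```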

Human-level evaluation

Our ultimate goal is to synthesize images that satisfy users. Therefore, we randomly sampled 39 sets of images from the outputs of all methods and asked users to select their preferred images given the content and style references. We asked them to comprehensively evaluate the synthesized images in terms of both style matching and content recognizability. All experiments were completely anonymous, and the synthesized images were randomly shuffled so that participants could not know which model each image came from. We collected a total of 390 valid responses from 10 people fluent in both English and Chinese. Table 2 shows that most participants preferred the images synthesized by our model.

Table 2 User preference data based on 390 responses. Bold entries represent the best performance

Visual comparison

As shown in Fig. 7, we present results for both the “English2Chinese” and “Chinese2English” parts. We observed that EMD [12] erases strokes in some thinner fonts and performs poorly on highly artistic fonts, while DFS [13] does not perform well on printed fonts. Our method can synthesize high-quality images across different font types.

Fig. 7

Visual comparison of our FTransGAN (4th rows) with EMD [12] (2nd rows) and DFS [13] (3rd rows). The observed style images are illustrated in the 1st rows and the ground truth images are in the 5th rows. For each font, we randomly selected 6 generated images as references

The compositionality-based methods, LF-Font [19] and MX-Font [20], do not support font style transfer between Chinese and English. Thus, we only compared against them on “Chinese2Chinese” transfer. Because this comparison is not entirely fair (they require component labels), we only provide visual results in Fig. 8 for reference and do not report quantitative results. Even without component labels, our method still shows generative ability similar to theirs. LF-Font [19] sometimes failed on characters with complicated structures, and MX-Font failed on highly artistic fonts. Our FTransGAN performed well in both cases.

Fig. 8

Visual comparison with the compositionality-based methods. They do not support font style transfer between Chinese and English; thus, we only conducted the “Chinese2Chinese” experiment with these methods. We compared our FTransGAN (4th rows) with LF-Font [19] (2nd rows) and MX-Font [20] (3rd rows). The observed style images are illustrated in the 1st rows and the ground truth images are in the 5th rows. For each font, we randomly selected 6 generated images as references

4.5 Ablation study

To evaluate the contribution of each component in our FTransGAN, we gradually removed some modules and then examined the results. We also implemented a baseline, named the CAT model, that replaces the proposed multi-level attention module by concatenating all style images along the channel axis and feeding them into a style encoder. In Table 3, we show several evaluation metrics for the ablation study. The values of these metrics clearly demonstrate that the adversarial losses and the L1 loss play an important role in our model; they work together to synthesize high-quality font images. The capability of the model is further improved by our Layer Attention Network and Context-aware Attention Network. When these modules or losses are removed, performance on both pixel-level and perceptual-level metrics degrades rapidly. The results of the ablation study are visualized in Fig. 9.

An analysis of the hyper-parameter λ is also reported in Table 3. Specifically, we set different weights between the supervised L1 loss and the adversarial losses Lcontent and Lstyle. Without the L1 loss (see the row w/o L1), the model performed the worst. As the weight of the L1 loss was gradually increased (see the rows λ1 = 1, λ1 = 10, and Full model), the performance of the model improved. However, when the weight of the L1 loss was effectively infinite (i.e., without adversarial losses), the model performed poorly on the perceptual-level metrics (see the row w/o Lcontent and Lstyle). Thus, we recommend setting λ1 much greater than λc and λs. In this work, we set λ1 = 100 and λc = λs = 1.

Table 3 Ablation study on the proposed multi-language dataset. Bold entries represent the best performance
Fig. 9

Visualization of ablation study results. The observed style images are in the 1st rows. Without the GAN losses (2nd rows) or the L1 loss (3rd rows), the proposed model cannot generate high-quality images. For some artistic fonts, the CAT model (4th rows) generates blurred images. For most fonts, our FTransGAN (5th rows) performs well, but when dealing with some handwritten fonts, it is prone to producing noise (bottom left) or losing details (bottom right). The proposed model with knowledge transfer (6th rows) is more stable. The ground truth images are illustrated in the 7th rows

4.6 Knowledge transfer

In this section, we demonstrate that a simple transfer learning technique can further improve the generative ability of the proposed model. The main focus of this paper is cross-language font style transfer. In the “English2Chinese” part, we used a few English letters as style input and one Chinese character as content input in every iteration. However, compared with same-language font style transfer such as “Chinese2Chinese”, the diversity of the style input is limited: for each font in the dataset, we have more than 1000 Chinese characters but only 52 English letters. Similarly, in the “Chinese2English” part, we only had 46 English letters (the remaining 6 were reserved for testing) as content input to train the model. This may lead to overfitting. Humans can draw on the experience of same-language font style transfer and apply it to cross-language font style transfer. Inspired by this, we show that a simple transfer learning scheme can further boost performance.

Specifically, we first pretrained the FTransGAN model for 10 epochs with Chinese characters as both content and style input. Then, for “English2Chinese”, we retrained it for 20 epochs with English letters as style input and Chinese characters as content input; for “Chinese2English”, we retrained it for 100 epochs with Chinese characters as style input and English letters as content input. In principle, only the style encoder or the content encoder should need retraining. However, we observed in experiments that freezing the other parts degrades model performance, so we kept the entire model unfrozen during retraining. We denote our FTransGAN with knowledge transfer as Full model*. We did not evaluate Full model* on the unseen character images in the “English2Chinese” part because the model sees all Chinese characters during pretraining. Table 3 shows that Full model* performs best on most evaluation metrics; the transferred knowledge improves the model considerably. As observed in Fig. 9, Full model* is also more stable when generating images. Note that this knowledge transfer design is only used to show how transfer learning improves generative ability; in the comparison with the state-of-the-art models, we did not pretrain our model, to ensure a fair comparison.
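The two-stage schedule can be summarized as below. This is a sketch only: `run_stage` is a hypothetical routine that trains all parameters of FTransGAN on the given (style-language, content-language) pairing, with nothing frozen, and the language codes are illustrative.

```python
def train_with_knowledge_transfer(model, run_stage, direction="English2Chinese"):
    # Stage 1: pretrain with Chinese characters as both style and content input
    run_stage(model, style_lang="zh", content_lang="zh", epochs=10)
    # Stage 2: retrain on the cross-language task, keeping the whole model unfrozen
    if direction == "English2Chinese":
        run_stage(model, style_lang="en", content_lang="zh", epochs=20)
    else:  # "Chinese2English"
        run_stage(model, style_lang="zh", content_lang="en", epochs=100)
```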

4.7 Attention analysis

To further understand and analyze the proposed multi-level attention structure, we visualize the weights given by the Layer Attention Network. The three Context-aware Attention Networks have receptive fields of different sizes: the network with the smallest receptive field can only see a small region of the original image, while the network with the largest receptive field can see almost the entire image. The weights therefore indicate whether the model should focus more on local or global features for the current image. When dealing with handwritten fonts, our model tends to attend to local features; in contrast, when dealing with printed or artistic fonts, it tends to focus on global features. We speculate that this is because the characteristics of handwritten fonts are mostly concentrated in local regions (e.g., stroke or line thickness), while some artistic fonts must be considered globally. We illustrate several randomly selected fonts in Fig. 10.

Fig. 10

Analysis of the proposed Layer Attention Network. On the left are several observed style images. The bar charts on the right show the weights given by the Layer Attention Network. The horizontal axis shows the receptive field of each Context-aware Attention Net, and the vertical axis shows their weights

5 Conclusion

We proposed a cross-language font style transfer system that can synthesize an entire font library by using only a few samples from another language. We also built a large-scale multi-language dataset to train and evaluate our model. The experimental results demonstrate that our model has high generative ability compared with several state-of-the-art approaches, and the proposed Context-aware Attention Network and Layer Attention Network play an important role.

Nonetheless, our model has some drawbacks. First, although the number of style images is arbitrary during testing, the architecture of the proposed model can only receive a fixed number of style images during the training phase. Second, for some highly artistic fonts, our model does not perform well. In addition, we used paired data to supervise the training; however, collecting paired data is challenging in the real world, so we need to extend the model to an unsupervised learning setting. Addressing these issues offers exciting and challenging directions for future research.