Cross-language font style transfer

Akita University, 1-1 Tegata Gakuen-machi, Akita-shi, 010-8502, Japan
Published online: 8 February 2023. Applied Intelligence (2023) 53:18666-18680

In this paper, we propose a cross-language font style transfer system that can synthesize a new font by observing only a few samples from another language. Automatic font synthesis is a challenging task and has attracted much research interest. Most previous works addressed this problem by transferring the style of a given subset of characters to the content of unseen ones. Nevertheless, they only focused on font style transfer within the same language. In many cases, we need to learn a font style from one language and then apply it to other languages. This is difficult for existing methods because font style is abstract and languages differ. To address this problem, we designed the network with a multi-level attention structure that captures both local and global features of the font style. To validate the generative ability of our model, we constructed an experimental font dataset of 847 fonts, each containing English and Chinese characters with the same style. Results show that our model produced 80.3% of the images that users preferred, compared with state-of-the-art models.


Introduction
Fonts are important visual designs that often provide additional information, such as whether the current text content is formal or casual. Some artistic fonts can even create a frightening or playful atmosphere. However, it is time-consuming to design a new font because there are many factors to consider, such as strokes, decoration, and effects. In addition, all characters within the same font must be designed in a consistent style and appropriate size. Some font libraries may contain thousands of characters from multiple languages (e.g., Microsoft YaHei contains over 20,000 characters including Chinese, Korean, Japanese, Latin, and Greek). Artists usually spend considerable time maintaining a consistent style between these characters to ensure visual compatibility. This labor-intensive process has practical consequences: artists often design fonts in only one language, and it takes much time to extend the style to other languages later.
With the development of deep neural networks [1-3], automatic font synthesis without human intervention has become possible. Deep neural networks take both shape and texture transfer into account in an end-to-end manner. Early approaches [4,5] were proposed to synthesize an entire font library by observing a subset of it. These methods treat font style transfer as an image-to-image translation [6] or a cGAN [7] task. While these image-to-image translation approaches have shown remarkable generative ability, there remain several obvious drawbacks. First, they divide the training process into two phases. Usually, models are pre-trained on a large dataset, and the model then needs to be fine-tuned for the specific task. The fine-tuning process makes these methods less practical with limited computational resources. Second, hundreds of training samples are needed during the fine-tuning phase, and creating these training samples is another labor-intensive task. To reduce the training time and manual work during the fine-tuning phase, several few-shot learning methods have been proposed [8-13]. These models can synthesize a high-quality font library by observing only a few samples.
However, all the methods described above transfer styles only within the same language, and cross-language font style transfer is more challenging. In most cases, artists design fonts in only one language (e.g., for movie posters), and it is time-consuming to extend the font style to other languages. Therefore, a model that can learn font style from another language is necessary. However, characters can vary greatly from language to language. More specifically, some parts of Chinese characters are complex and do not appear in English letters. Furthermore, unlike artistic [14-16] or text [17] style transfer tasks that require consideration of only global style, the font style often consists of both local (e.g., stroke, decoration, and thickness) and global features (e.g., effect and shape). Therefore, it is difficult to learn font styles from one language and apply them to another, especially for font generation works [12,13] that only consider the global features.
To solve these issues, we propose FTransGAN (Font Translator GAN), which can synthesize a high-quality font library by observing only a small number of samples from other languages without fine-tuning. We use two encoders to extract the style and content representation separately. Then, we concatenate them and input them into the decoder. Two discriminators are used to check the degree of matching from style and content perspectives. Our style encoder contains two modules: the Context-aware Attention Network and the Layer Attention Network, which work together to capture local and global style features.
Experimental results on a collected multi-language dataset show high visual quality for both handwritten and printed fonts. We illustrate some application examples in Fig. 1. The remaining sections of this paper are organized as follows. In Section 2, we introduce existing works related to the topic of this paper. In Section 3, the proposed FTransGAN model is presented. Section 4 reports the experimental results to show the effectiveness of the model. We conclude the paper and discuss some limitations in Section 5.
This paper is an extension of [18]. The differences between the journal and conference versions are as follows: (1) A detailed ablation study including loss and hyperparameter analysis was added to the journal version. (2) In the journal version, we compare our study with two recent works [19,20]. (3) The conference version only included the "English2Chinese" experiment. Here, we show that the proposed model is also capable of "Chinese2English" transfer. (4) A transfer learning technique was adopted to improve the performance of the proposed FTransGAN.
Fig. 1 The English letters on the red background are the style images, and the Chinese characters on the green background are the content images. The rest are images synthesized by our FTransGAN, which extracts the strokes, thickness, decoration, or effect of the font from several observed English letters and automatically applies the extracted style to the given Chinese character. The Chinese meaning in the figure is "Computer Vision"

Related work
Font generation is a long-standing challenge that many studies have attempted to address. In this section, we selectively review some papers that are closely related to our work. In Sections 2.1 and 2.2, we introduce two main topics linked to our study: style transfer and image-to-image translation. We also present some related works on self-attention in Section 2.3, as self-attention is a key technique in the proposed model. In Section 2.4, we introduce some deep learning-based font generation methods.

Style transfer
Style transfer [14,15] is a long-standing problem in computer vision, where styles are extracted from one or several images and applied to another image. The input usually consists of a content image and a style image. The output is obtained by optimizing content loss and style loss. The content loss is the distance at each position of two feature maps. The style loss can be calculated by comparing the summary statistics of each layer. The limitation of these works is that they only support one-to-one style transfer. StyleBank [21] solved the problem of multiple style transfers by storing multiple style layers, and [22] proposed adaptive instance normalization that can be adapted to arbitrary new styles. In addition, [23] introduced an arbitrary style transfer method by reshuffling deep features of style images. Recently, StyleGAN [24-26] proposed a novel generator architecture that automatically divides attributes into different levels and provides intuitive, scale-specific control of the synthesis. However, these methods are mainly designed for artworks, and they usually define style as a set of colors and textures. As already noted, font styles are relatively abstract and consist of local and global features. Therefore, it is difficult to directly apply these methods to the font style transfer problem.

Image-to-image translation
Pix2Pix [6] and CycleGAN [27] proposed generalized image-to-image translation frameworks that aim to learn the mapping between two domains. The former requires paired data, while the latter works with unpaired data. Similarly, [28,29] proposed the coupled GAN (CoGAN), which learns a joint distribution of two domains by weight sharing. However, these models have a few problems. First, they can only translate images between two domains. Second, they usually require a large amount of training data, which is less practical. StarGAN [30] and MUNIT [31] solved the first problem by adding domain information to the generator. However, the generative ability of these models is limited to a few domains; they cannot synthesize images of unknown domains. Recently, FUNIT [11] was proposed as a few-shot unsupervised image generation method. It solved the second problem by using an additional encoder to extract domain information from several images. Inspired by these works, our model also adopts the few-shot generation structure and adversarial loss.

Attention mechanism
The attention mechanism [32,33] was first proposed in the field of natural language processing [34,35] to mitigate the damage caused by a fixed-length vector in the encoder-decoder architecture. This mechanism enables the machine to focus on certain words. Later, Xu et al. [36] applied an attention network to the computer vision field to solve the image captioning problem. Yu et al. [37] improved the attention network into a multi-level architecture to obtain both spatial and semantic information from a single image. Recently, [38] made several breakthroughs in the natural language field by exploiting the self-attention module, and SAGAN [39] also successfully applied the self-attention layer to the image generation task. Here, we directly use the self-attention block proposed by SAGAN [39] for our Context-aware Attention Network.

Font generation
Font generation can be considered a special case of image style transfer if we consider each character as an image. Existing works can be roughly divided into two groups: many-shot learning and few-shot learning.
Many-shot learning methods [4,5,40,41] mainly employ the image-to-image translation architecture. The condition is a character in the source font, and the output should match the corresponding character in the target font. However, these methods usually need many reference images (e.g., 700 glyph images) to learn the style.
Several recent models [8-10, 12, 13, 42, 43] have been proposed for few-shot font generation. These models can generate an entire font library with only a few samples (e.g., 6 glyph images). MC-GAN [8] was the first end-to-end approach to synthesize artistic fonts. However, its number of input and output images was fixed (26 English letters), meaning that the approach cannot handle a large font library (e.g., Chinese) because of the limitations of its architecture. AGIS-Net [10] and EMD [12] later solved this problem by combining the content and style features. They can also be considered image-to-image translation tasks. The difference is that these models require two conditions: style and content images. The output image should be a combination of the two conditions. To improve the quality of the generated images, [13] proposed the Deep Feature Similarity (DFS) architecture to leverage the feature similarity between the input content and style images to synthesize target images. Recently, researchers [9,19,20,44-46] have made significant progress by exploiting the compositionality of compositional scripts. However, our experimental results indicate that these methods perform poorly on the constructed multi-language dataset.

FTransGAN
In this section, we demonstrate the proposed model, FTransGAN. We introduce the problem setting and model overview in Section 3.1. We then describe the proposed Context-aware Attention Network and Layer Attention Network in Section 3.2 and Section 3.3, respectively. Finally, we describe the loss function in Section 3.4.

Problem setting and model overview
Like other few-shot font generation models, our goal is to generate font images from two conditions: style and content images. We consider the generation of font images as the process of estimating the conditional probability $p(x \mid s, c)$, where $x$ is the target image, $c$ is one content image in a standard style (e.g., Microsoft YaHei in Chinese or Times New Roman in English), and $s$ is a set of style images $\{s_1, s_2, \ldots, s_K\}$, which have the same style but different content. The content and style images should come from different languages. For example, if the content image is a Chinese character, the style images should consist of English letters, and vice versa.
As shown in Fig. 2, our model has a Generator $G$ and two discriminators: a Content Discriminator $D_{content}$ and a Style Discriminator $D_{style}$. The two discriminators follow the design of PatchGAN [6] and judge local patches as real or fake. The Generator $G$ takes one content image $c$ and $K$ style images $s$ as inputs and generates the target image $x$:
$$x = G(s_1, s_2, \ldots, s_K, c) \qquad (1)$$
The detailed structure of our generator can be found in Fig. 3. It consists of a style encoder $f_{style}$, a content encoder $f_{content}$, and a decoder $f_{decoder}$:
$$z_s = f_{style}(s_1, s_2, \ldots, s_K), \qquad z_c = f_{content}(c), \qquad x = f_{decoder}(z_s, z_c)$$
where $z_s$ and $z_c$ are the extracted feature codes for style and content, respectively. We specifically designed the style encoder $f_{style}$ as a multi-level attention structure [37,47], using two attention modules (the Context-aware Attention Network and the Layer Attention Network) to capture both local and global style features. More details of these two networks are given in the following sections.

Context-aware attention network
As shown in Fig. 3, the style encoder has three parallel Context-aware Attention Blocks. After multiple convolutions with 3×3 kernels, they have receptive fields of 13×13, 21×21, and 37×37, respectively. Thus, the shallower layer can only perceive local features, while the deeper layer can perceive almost the whole image. Figure 4 shows the details of the Context-aware Attention Block. The input is a feature map of size C×H×W given by the last convolution layer, where C, H, and W denote the number of channels, height, and width, respectively. Here, we denote the regions of the feature map as $\{v_r\}_{r=1}^{H \times W}$. Unlike previous works [37,47], which use an LSTM [48] or GRU [49] block to obtain the contextual information recurrently, we incorporated contextual information into the feature map of each region through a Self-attention [39] layer to improve computational efficiency. It is given by:
$$h_r = f_a(v_r)$$
where $f_a$ denotes the Self-attention layer. The new feature vectors $h_r$ contain information limited to their receptive fields and contextual information from other regions. We introduced the attention mechanism to give each region a score because not all regions contribute equally. Specifically,
$$u_r = f_c(h_r), \qquad a_r = \frac{\exp(u_r^\top u_c)}{\sum_{r'} \exp(u_{r'}^\top u_c)}, \qquad y = \sum_{r} a_r v_r$$
That is, we input each contextual vector $h_r$ one by one into a single-layer neural network $f_c$ to obtain $u_r$ as a latent representation of $h_r$. Then, a context vector $u_c$ is used to measure the importance of the current region. $u_c$ is randomly initialized and trained jointly with the entire model. After that, we obtain the normalized attention score $a_r$ through a softmax layer. Finally, we compute a feature vector $y$ as a weighted sum over the regions. Note that we have three parallel Context-aware Attention Networks. Thus, we obtain three feature vectors $y_1$, $y_2$, and $y_3$.
Fig. 2 An overview of our proposed FTransGAN. Each time we randomly select K style images from the dataset as style input, and one content image from the source font as content input. Content and style images are from different languages. We input them into the generator where two discriminators check the matching degree from a style and content perspective, respectively
Fig. 3 Overview of the proposed Generator G. It consists of a style encoder, a content encoder, and a decoder. The two encoders extract the style representation and content representation, respectively. The decoder then takes the extracted information and generates the target image. We specifically designed the style encoder to capture both local and global style features using three parallel Context-aware Attention Networks and a Layer Attention Network. The three Context-aware Attention Networks have the same structure, except for the size of receptive fields
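The scoring and pooling step of the Context-aware Attention Block can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the contextual vectors `h` are taken as given (the self-attention layer is not implemented here), the single-layer network is modeled as a linear map followed by `tanh` (the text does not name the activation), and the names `attention_pool`, `W_c`, and `b_c` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(v, h, W_c, b_c, u_c):
    """Pool H*W region features into a single feature vector.

    v   : list of region feature vectors v_r (each of length C)
    h   : list of contextual vectors h_r (output of the self-attention layer)
    W_c : weights of the single-layer network f_c (D rows, one per output unit)
    b_c : bias of f_c (length D)
    u_c : learned context vector (length D)
    """
    scores = []
    for h_r in h:
        # u_r = tanh(W_c h_r + b_c); the tanh is an assumption of this sketch
        u_r = [math.tanh(sum(w * x for w, x in zip(row, h_r)) + b)
               for row, b in zip(W_c, b_c)]
        # Importance of region r: inner product with the context vector
        scores.append(sum(a * b for a, b in zip(u_r, u_c)))
    weights = softmax(scores)
    channels = len(v[0])
    # y = sum_r a_r * v_r
    y = [sum(a_r * v_r[k] for a_r, v_r in zip(weights, v))
         for k in range(channels)]
    return y, weights
```

With a zero context vector every region receives the same weight, so the pooled vector reduces to the plain average of the regions; a trained context vector shifts the weight toward the regions that matter for the style.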

Layer attention network
Given a style image, should the machine focus on local or global features? We believe that depends on the image itself. Based on this assumption, we designed the Layer Attention Network.
As shown in Fig. 5, the Layer Attention Network receives four inputs: a feature map given by the last convolution layer, which we flatten to obtain a feature vector $y_m$, and the three feature vectors $y_1$, $y_2$, and $y_3$ given by the three Context-aware Attention Networks. We use a single-layer neural network $f_l$ to give each feature vector a score. These scores explicitly indicate the level of features that the model should focus on. Specifically,
$$(w_1, w_2, w_3) = f_l(y_m), \quad \sum_{i=1}^{3} w_i = 1, \qquad z = \sum_{i=1}^{3} w_i y_i$$
where $w_i$, $i \in \{1, 2, 3\}$, are three normalized scores given by the neural network, and $z$ is the weighted sum of the three feature vectors. Note that each time, the style encoder accepts $K$ images. Thus, the final latent code $z_s$ is the mean of all vectors $z$:
$$z_s = \frac{1}{K} \sum_{k=1}^{K} z_k$$
In addition, the size of the content code is C×H×W, but the style code $z_s$ is a C-dimensional vector. We copy $z_s$ across the spatial positions to match their sizes. After obtaining the content code $z_c$ and the expanded style code $z_s$, we simply concatenate them and feed them into the decoder. The decoder generates images based on the style code $z_s$ and the content code $z_c$.
Fig. 4 The architecture of the proposed Context-aware Attention Network. It takes a feature map given by the last convolution layer and performs a weighted sum over the feature maps to a feature vector. The weighted map is given by a self-attention block and a randomly initialized context vector
Fig. 5 The architecture of the proposed Layer Attention Network. The task of the Layer Attention Network is to do a weighted sum of three feature vectors. We input the feature map given by the last convolution layer to a neural network, and it generates three weights. We then sum the three feature vectors using these weights
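The three steps of this subsection (scoring the three feature vectors, averaging over the K style images, and expanding the style code before concatenation) can be sketched in plain Python. The sketch makes assumptions that the text does not confirm: the scores are normalized with a softmax over a linear layer, and all function and parameter names (`layer_attention`, `mean_style_code`, `concat_codes`, `W_l`, `b_l`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_attention(y_m, ys, W_l, b_l):
    """Blend the three feature vectors using weights predicted from y_m.

    y_m : flattened feature map (length M)
    ys  : [y1, y2, y3], each a feature vector of length C
    W_l : 3 x M weight matrix of the single-layer network (assumed linear)
    b_l : bias of length 3
    """
    logits = [sum(w * x for w, x in zip(row, y_m)) + b
              for row, b in zip(W_l, b_l)]
    w = softmax(logits)  # softmax normalization is an assumption
    z = [sum(w_i * y_i[k] for w_i, y_i in zip(w, ys))
         for k in range(len(ys[0]))]
    return z, w

def mean_style_code(zs):
    """Average the per-image codes z over the K style images."""
    k = len(zs)
    return [sum(z[i] for z in zs) / k for i in range(len(zs[0]))]

def concat_codes(z_c, z_s):
    """Copy z_s to every spatial position and stack it under z_c.

    z_c : content code as nested lists with shape [C][H][W]
    z_s : style code of length C
    Returns nested lists with shape [2C][H][W].
    """
    height, width = len(z_c[0]), len(z_c[0][0])
    tiled = [[[s] * width for _ in range(height)] for s in z_s]
    return z_c + tiled
```

Averaging over the K style images makes the style code independent of the order of the reference glyphs, which matches the description of the style input as a set.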

End-to-end training
Our model has two discriminators, $D_{content}$ and $D_{style}$. They have almost the same architecture and consist of several convolution layers. $D_{content}$ receives the generated image and the content image, and checks whether they show the same character. $D_{style}$ receives the generated image and a style image, and checks whether they have the same style. We concatenate these images along the channel dimension before feeding them to the discriminators, and the entire model is jointly trained in an end-to-end manner.
The loss function of our model consists of three terms: the $L_1$ loss, the style loss $L_{style}$, and the content loss $L_{content}$:
$$L = \lambda_1 L_1 + \lambda_s L_{style} + \lambda_c L_{content}$$
where $\lambda_1$, $\lambda_s$, and $\lambda_c$ are three weights for balancing these terms. For higher quality results and to stabilize GAN training, both $L_{content}$ and $L_{style}$ employ hinge loss [50] functions, defined over the ground truth image $\hat{x}$, the generated image $x$, and the content and style images $c$ and $s$. To stabilize our training, we also adopted an $L_1$ loss to calculate the pixel-wise error between the generated images and the ground truth images:
$$L_1 = \mathbb{E}\left[\lVert \hat{x} - x \rVert_1\right]$$
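As a rough illustration of how these terms combine, the sketch below uses the hinge formulation that is common in GAN training. Whether the paper's hinge losses have exactly this form is an assumption, and the function names are hypothetical. Each discriminator is assumed to return one score per patch, in line with the PatchGAN design mentioned earlier.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss, averaged over patch scores.

    real_scores: scores for (ground truth, condition) pairs
    fake_scores: scores for (generated, condition) pairs
    """
    real = sum(max(0.0, 1.0 - r) for r in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + f) for f in fake_scores) / len(fake_scores)
    return real + fake

def hinge_g_loss(fake_scores):
    """Generator side of the hinge loss: raise the scores of generated pairs."""
    return -sum(fake_scores) / len(fake_scores)

def l1_loss(generated, target):
    """Mean absolute pixel error between two flattened images."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(generated)

def generator_objective(l1, g_style, g_content,
                        lam_1=100.0, lam_s=1.0, lam_c=1.0):
    """Weighted sum of the three terms; default weights follow the
    hyperparameter settings reported in the experiments."""
    return lam_1 * l1 + lam_s * g_style + lam_c * g_content
```

The discriminator loss is zero once real pairs score at least 1 and generated pairs score at most -1, so confident discriminator outputs stop contributing gradient. This is one reason hinge losses are often reported to stabilize training.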

Experiments
In this section, we demonstrate the generative ability of the proposed FTransGAN from multiple perspectives. In Section 4.1, we introduce the proposed dataset and describe how we split it. Section 4.2 lists the hyperparameters that we used to train and test the models. In Section 4.3, we introduce the four state-of-the-art models, EMD [12], DFS [13], LF-Font [19], and MX-Font [20], that we compared with our model. After that, we report the quantitative and qualitative results in Section 4.4. In the remaining sections, we comprehensively analyze the proposed model. Specifically, in Section 4.5, we conduct an ablation study to show the contribution of each component in the proposed model. In Section 4.6, we demonstrate that a simple knowledge transfer method can improve the generative quality. Finally, in Section 4.7, we show that the proposed self-attention architecture also increases the model's explainability.

Font dataset and experiment settings
To evaluate the generative ability of our model, we constructed a dataset including 847 gray-scale fonts, each with approximately 1000 commonly used Chinese characters and 52 English letters of the same style. In addition, we used a common font, Microsoft YaHei, as the content images, and this font was only used to index the character categories. We processed the dataset by finding a bounding box around each image and resizing it so that the larger dimension reached 64 pixels. We then created 64×64 font images by padding. All pixel values were normalized to the range [-1, 1] before being fed into the model. Next, we divided the experiment into two parts: "Chinese2English" and "English2Chinese". In the "English2Chinese" part, the model needed to transfer the style of English letters to Chinese characters. We randomly chose 29 Chinese characters as unknown contents and 29 fonts as unknown styles, and the rest were the training data. Therefore, the whole dataset was divided into three parts: C1, images for training; C2, images with known content but unknown styles, used for testing; and C3, images with known styles but unknown content, used for testing. In the "Chinese2English" part, the model needed to transfer the style of Chinese characters to English letters. For this part, we randomly chose 29 fonts as unknown styles and 6 English letters as unknown contents, and left the rest as training data. This time, E1 was for training, E2 was used to test unknown styles, and E3 to test unknown content. Figure 6 shows several examples of the font dataset and the partition rules. Before the experiments, we made sure that there was no overlap between the training and the testing set by computing the nearest neighbor for all fonts.
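The preprocessing steps described above (bounding box, resize so that the larger side is 64 pixels, pad to 64×64, normalize to [-1, 1]) can be sketched in plain Python. Several details are assumptions of this sketch rather than statements from the text: the background is taken to be white (255), resizing uses nearest-neighbor sampling, and the glyph is centered when padding. The function name `preprocess` is hypothetical.

```python
def preprocess(img, size=64, background=255):
    """Crop, resize, pad, and normalize one glyph image.

    img: 2D list of gray values in [0, 255]
    Returns a size x size 2D list with values in [-1, 1].
    """
    # Bounding box of all non-background pixels
    rows = [i for i, row in enumerate(img) if any(p != background for p in row)]
    cols = [j for j in range(len(img[0]))
            if any(row[j] != background for row in img)]
    if not rows:
        raise ValueError("image contains no glyph pixels")
    crop = [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]
    h, w = len(crop), len(crop[0])

    # Scale so that the larger side becomes `size` pixels
    scale = size / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    # Nearest-neighbor resize (an assumption; any interpolation could be used)
    resized = [[crop[min(h - 1, int(i * h / new_h))]
                    [min(w - 1, int(j * w / new_w))]
                for j in range(new_w)] for i in range(new_h)]

    # Center the glyph on a background canvas
    off_y, off_x = (size - new_h) // 2, (size - new_w) // 2
    canvas = [[background] * size for _ in range(size)]
    for i in range(new_h):
        for j in range(new_w):
            canvas[off_y + i][off_x + j] = resized[i][j]

    # Map [0, 255] to [-1, 1]
    return [[p / 127.5 - 1.0 for p in row] for row in canvas]
```

Scaling by the larger side preserves the aspect ratio of each glyph, so narrow letters such as "l" remain narrow after padding instead of being stretched to fill the square.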

Hyperparameter settings
We provide the hyperparameter settings of our model in this section. Our basic setup follows Pix2Pix [6]. In the following experiments, we set $\lambda_1 = 100$, $\lambda_c = \lambda_s = 1$, and $K = 6$. The Generator $G$, Content Discriminator $D_{content}$, and Style Discriminator $D_{style}$ were all initialized with normal initialization. In the "English2Chinese" experiment, we trained both our model and the competitors for 20 epochs using the Adam optimizer [51] with $\beta_1 = 0.5$, $\beta_2 = 0.999$, and a learning rate of 0.0002 for the first 10 epochs followed by a linear decay over the remaining 10 epochs. In the "Chinese2English" experiment, we trained them for 200 epochs. Empirically, we set the batch size to 256. Unlike previous work [8] that applied dropout in its generator to obtain randomness, we did not use dropout because we observed in experiments that it reduces the generative ability of the model. Instead, we added slight random noise to the style code $z_s$.
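The learning-rate schedule for the "English2Chinese" experiment can be written as a small function. This is a sketch under one assumption that the text leaves open: the linear decay is taken to start from the base rate at epoch 10 and to reach zero at the end of training. The function name and the 0-based epoch convention are hypothetical.

```python
def learning_rate(epoch, base_lr=2e-4, n_constant=10, n_decay=10):
    """Learning rate for a 0-based epoch index.

    The rate is constant for the first `n_constant` epochs and then decays
    linearly over `n_decay` epochs. Reaching exactly zero after the last
    epoch is an assumption of this sketch.
    """
    if epoch < n_constant:
        return base_lr
    remaining = n_constant + n_decay - epoch
    return base_lr * max(0, remaining) / n_decay
```

For example, `learning_rate(15)` falls halfway through the decay phase and returns roughly half of the base rate.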

Competitors
To our knowledge, cross-language font style transfer has not been done before. Therefore, we had to select models that could be modified for this task. We excluded models that cannot handle large font libraries [8] or were originally designed for unsupervised generation [11,52]. Some few-shot learning methods, like DFS [13] and AGIS-Net [10], require fine-tuning during testing. However, fine-tuning is computationally expensive and cannot be applied to some real-time systems. Most recent works [19,20] require component labels as additional supervision. However, comparison with methods that need component labels is not suitable. First, the comparison is unfair because it is very easy for models to learn the styles and structures of characters with component labels, yet this type of label is expensive for both training and testing. Second, these methods cannot be applied to cross-language transfer because some languages, like English, do not have compositional characters. Nevertheless, we visually compared our method with two compositional-based methods, namely, LF-Font [19] and MX-Font [20].
Finally, we chose four supervised few-shot style transfer models (EMD [12], DFS [13], LF-Font [19], and MX-Font [20]) as our competitors. To ensure a fair comparison, none of the models were fine-tuned when processing unseen style or content images in the experiments. Generation was merely a forward propagation. Here, we slightly modified the input and output channels of DFS to synthesize gray-scale images. LF-Font and MX-Font require component labels, which we also generated for our dataset (C1, C2, and C3 parts). In addition, as LF-Font and MX-Font do not support font style transfer between English and Chinese, we only conducted the "Chinese2Chinese" experiment using these models. All methods were trained for the same number of iterations (20 epochs) as our proposed model.
Fig. 6 Examples of the font dataset that we constructed for our experiments and the partition rule. We undertook two experiments: "Chinese2English" and "English2Chinese". In the "Chinese2English" experiment, images in E1 are the labels used to train our network. E2 and E3 were used to evaluate unseen styles and content. In the "English2Chinese" experiment, images in C1 are the labels used to train our network, and images in C2 and C3 were used to evaluate unseen style and contents

Quantitative and qualitative evaluation
In this section, we evaluate both numerical and visual results of the proposed model, comparing them with the competitors mentioned in Section 4.3. Quantitative evaluation of generative models is inherently difficult because there is no universal rule for comparing ground truth images with generated images. Moreover, there is no single correct answer for tasks like artistic style transfer. Several evaluation metrics [15,53,54] have been proposed to measure model performance based on different assumptions, but they remain controversial. Therefore, we evaluated the model comprehensively at three levels: pixel level, perceptual level, and human level. In Table 1, we report quantitative results at these three levels, including mean absolute error (MAE), structural similarity (SSIM), multi-scale structural similarity (MS-SSIM), accuracy, and mFID for unseen characters and styles. For both the "English2Chinese" and "Chinese2English" parts, our model outperformed EMD [12] and DFS [13] on most evaluation metrics.
Pixel-level evaluation
Pixel-wise evaluation compares pixels at the same position between the ground truth image and the generated image. We used MAE, SSIM, and MS-SSIM. Nevertheless, evaluation metrics at the pixel level often contradict human intuition. Therefore, we also used the other two levels of metrics to fully evaluate all models.
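Of the three pixel-level metrics, MAE is the simplest to state. A minimal sketch for two gray-scale images stored as nested lists follows; the function name is hypothetical, and SSIM and MS-SSIM are not covered here.

```python
def mean_absolute_error(generated, target):
    """Mean absolute difference between two images of the same shape."""
    total, count = 0.0, 0
    for row_g, row_t in zip(generated, target):
        for g, t in zip(row_g, row_t):
            total += abs(g - t)
            count += 1
    return total / count
```

Because every pixel is compared only with the pixel at the same position, a glyph shifted by a few pixels can receive a large error even when it looks almost identical, which illustrates why pixel-level scores can disagree with human judgment.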
Perceptual-level evaluation
The authors of [54] proposed a method to evaluate generative models by computing the Fréchet Inception Distance (FID) between the feature maps of the ground truth and generated images. Liu et al. [11] modified it to a conditional version (mFID) by averaging FID for each target class. To evaluate the generated images from both the style and content perspective, we trained two ResNet-50 [55] networks on our proposed dataset to classify content (character) and style (font), respectively. We report the top-1 accuracy and mean FID (mFID) based on these two networks. Thus, the perceptual-level evaluation metric consists of 4 components: style-aware accuracy, content-aware accuracy, style-aware mFID, and content-aware mFID.
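The per-class averaging behind mFID can be illustrated with a simplified sketch. To stay dependency-free, it assumes diagonal covariance matrices, for which the Fréchet distance between two Gaussians has a closed form. The standard FID uses full covariance matrices and a matrix square root, so this is a simplification and not the metric as computed in the paper. Feature vectors are assumed to come from a classifier such as the ResNet-50 networks described above, and all function names are hypothetical.

```python
import math

def mean_and_var(features):
    """Per-dimension mean and population variance of a list of vectors."""
    n, d = len(features), len(features[0])
    mu = [sum(f[k] for f in features) / n for k in range(d)]
    var = [sum((f[k] - mu[k]) ** 2 for f in features) / n for k in range(d)]
    return mu, var

def frechet_distance_diag(real, fake):
    """Frechet distance between two Gaussians fitted to `real` and `fake`,
    assuming diagonal covariances (a simplification of the usual FID)."""
    mu_r, var_r = mean_and_var(real)
    mu_f, var_f = mean_and_var(fake)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b)
                   for a, b in zip(var_r, var_f))
    return mean_term + cov_term

def mfid(real_by_class, fake_by_class):
    """Average the per-class Frechet distance over all target classes.

    Both arguments map a class label to a list of feature vectors.
    """
    labels = sorted(real_by_class)
    total = sum(frechet_distance_diag(real_by_class[c], fake_by_class[c])
                for c in labels)
    return total / len(labels)
```

Averaging per class penalizes a generator that matches the overall feature distribution while mixing up individual fonts or characters, which a single global FID would not detect.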

Human-level evaluation
Our ultimate target is to synthesize images that satisfy users. Therefore, we randomly sampled 39 sets of images from the output of all methods and asked users to select their preferred images given the content and style references. We asked them to comprehensively evaluate the synthesized images in terms of both style matching and content recognizability. All experiments were completely anonymous, and the synthesized images were randomly shuffled so that participants could not know which model each image came from. We collected a total of 390 valid responses from 10 people who were fluent in English and Chinese. Table 2 shows that most participants preferred the images synthesized by our model.
Table 1 We evaluated the models on both the unseen characters and styles. ↑ indicates larger numbers are better, ↓ indicates smaller numbers are better

Visual comparison
As shown in Fig. 7, we present results for both the "English2Chinese" and "Chinese2English" parts. EMD [12] erases some thinner fonts and performs poorly on highly artistic fonts. DFS [13] does not perform well on printed fonts. Our method can synthesize high-quality images of different font types.
The compositional-based methods, LF-Font [19] and MX-Font [20], do not support font style transfer between Chinese and English. Thus, we only made comparisons with their methods for "Chinese2Chinese" transfer. Considering the comparison is unfair (they require component labels), we only provide visual comparison results in Fig. 8 for reference and do not report quantitative results. Even without the component labels, our method still has similar generative ability to these methods. LF-Font [19] sometimes failed with characters that have complicated structures, and MX-Font failed with highly artistic fonts. Our FTransGAN performed well in both those instances.

Ablation study
To evaluate the contribution of each component in our FTransGAN, we gradually removed some modules and then compared the results. We also implemented a baseline, named the CAT model, which replaces the proposed multi-level attention module by concatenating all style images along the channel axis and feeding them to a style encoder. In Table 3, we show several evaluation metrics from the ablation study. The values of these metrics show that the adversarial losses and the $L_1$ loss play an important role in our model. They work together to synthesize high-quality font images. The capability of the model is further improved by our Layer Attention Network and Context-aware Attention Network. When these modules or losses are removed, both pixel-level and perceptual-level metrics degrade sharply. The results of the ablation study are visualized in Fig. 9. An analysis of the hyperparameter $\lambda$ is also reported in Table 3. Specifically, we set different weights between the supervised $L_1$ loss and the adversarial losses $L_{content}$ and $L_{style}$. When there was no $L_1$ loss (see the row w/o $L_1$), the model performed the worst. When the weight of the $L_1$ loss was gradually increased (see the rows $\lambda_1 = 1$, $\lambda_1 = 10$, and Full model), the performance of the model improved. However, when the weight of the $L_1$ loss was infinite (that is, without adversarial losses), the model did not perform well on the perceptual-level metrics (see the row w/o $L_{content}$ and $L_{style}$ for reference). Thus, we recommend setting $\lambda_1$ much greater than $\lambda_c$ and $\lambda_s$. In this work, we set $\lambda_1 = 100$ and $\lambda_c = \lambda_s = 1$.
Fig. 7 Visual comparison of our FTransGAN (4th rows) with EMD [12] (2nd rows) and DFS [13] (3rd rows). The observed style images are illustrated in the 1st rows and the ground truth images are in the 5th rows. For each font, we randomly selected 6 generated images as references
Fig. 8 LF-Font [19] and MX-Font [20] do not support font style transfer between Chinese and English. Thus, we only conducted the "Chinese2Chinese" experiment with these methods. We compared our FTransGAN (4th rows) with LF-Font [19] (2nd rows) and MX-Font [20] (3rd rows). The observed style images are illustrated in the 1st rows and the ground truth images are in the 5th rows. For each font, we randomly selected 6 generated images as references

Knowledge transfer
In this section, we demonstrate that a simple transfer learning technique can further improve the generative ability of the proposed model. The main focus of this paper is cross-language font style transfer. In the "English2Chinese" part, we used a few English letters as style input and one Chinese character as content input in every iteration. However, compared with same-language font style transfer such as "Chinese2Chinese", the style input was not diverse enough: for each font in the dataset, we have more than 1000 Chinese characters but only 52 English letters. Likewise, in the "Chinese2English" part, we had only 46 English letters as content input to train the model (the remaining 6 English letters were used for testing). This may have led to overfitting. Humans can learn from the experience of same-language font style transfer and apply it to cross-language font style transfer. Inspired by this, we show that a simple transfer learning structure can further boost performance. Specifically, we first pretrained the FTransGAN model for 10 epochs with Chinese characters as both content and style input. Then, for "English2Chinese", we retrained it for 20 epochs with English letters as style input and Chinese characters as content input. For the "Chinese2English" part, we retrained [...] Fig. 9, Full model* is more stable when generating images. Note that this knowledge transfer design is used only to show the improvement in generative ability with transfer learning. In the section on comparison with the state-of-the-art models, we did not pretrain our model, for a fair comparison.
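The two-stage procedure for the "English2Chinese" setting can be written as a small schedule. The sketch below covers only that setting, whose stages and epoch counts are stated above. The `Stage` class, the `run` helper, and the `train_one_epoch` callback are hypothetical names introduced for illustration, and the callback is a placeholder rather than the actual training step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    style_lang: str    # language of the style reference images
    content_lang: str  # language of the content image
    epochs: int


def english2chinese_schedule():
    """Pretrain on same-language pairs, then retrain on cross-language pairs."""
    return [
        Stage("pretrain", style_lang="Chinese", content_lang="Chinese", epochs=10),
        Stage("retrain", style_lang="English", content_lang="Chinese", epochs=20),
    ]


def run(schedule, train_one_epoch):
    """Run every stage in order; model weights carry over between stages."""
    log = []
    for stage in schedule:
        for epoch in range(stage.epochs):
            train_one_epoch(stage, epoch)  # placeholder for the real update
            log.append((stage.name, epoch))
    return log


# Usage with a no-op training step, just to show the order of the stages.
log = run(english2chinese_schedule(), lambda stage, epoch: None)
print(len(log), log[0], log[10])
```

The point of the design is that the same model instance is reused across both stages, so the second stage starts from weights that have already seen the much larger set of Chinese style inputs.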

Attention analysis
To further understand and analyze the proposed multi-level attention structure, we visualize the weights given by the Layer Attention Network. The three Context-aware Attention Networks have different receptive field sizes. The layer with the smallest receptive field sees only a small region of the original image, while the layer with the largest receptive field sees almost the entire image. The weights therefore indicate whether the model should focus more on local or global features for the current image. When dealing with handwritten fonts, our model tends to attend to local features. In contrast, when dealing with printed or artistic fonts, our model tends to focus on global features. We speculate that this is because the features of handwritten fonts are mostly concentrated in local regions (e.g., stroke or line thickness), while some artistic fonts require global considerations. Fig. 10 illustrates some randomly selected fonts.
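To make the interpretation of these weights concrete, the following sketch shows one way that per-level weights could combine three feature vectors. It assumes that the Layer Attention Network produces one scalar score per level and that the scores are normalized with a softmax; this is an illustrative assumption, not the definition used in the model. All names and numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(level_features, level_scores):
    """Weight each level's feature vector and sum them element-wise.

    level_features: one feature vector per receptive-field level,
                    ordered from local (small field) to global (large field).
    level_scores:   one unnormalized score per level (assumed scalar).
    """
    weights = softmax(level_scores)
    dim = len(level_features[0])
    fused = [
        sum(w * feats[i] for w, feats in zip(weights, level_features))
        for i in range(dim)
    ]
    return weights, fused


# Hypothetical features for the three levels (local, middle, global).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A high score on the first level mimics the "handwritten font" case,
# where the model relies mostly on local features.
weights, fused = fuse_levels(features, [2.0, 0.0, 0.0])
dominant = max(range(len(weights)), key=lambda i: weights[i])
print(weights, fused, dominant)
```

Visualizing `weights` for each input font is then enough to see whether the local or the global level dominates.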

Conclusion
We proposed a cross-language font style transfer system that can synthesize an entire font library by using only a few samples from another language. We also built a large-scale multi-language dataset to train and evaluate our model. The experimental results demonstrate that our model has high generative ability compared with several state-of-the-art approaches, and that the proposed Context-aware Attention Network and Layer Attention Network play an important role. Nonetheless, our model has some drawbacks. First, although the number of style images can be arbitrary during testing, the architecture of the proposed model can only receive a fixed number of style images during training. Second, our model does not perform well on some highly artistic fonts. Third, we used paired data to supervise the training, but paired data are difficult to collect in the real world, so the model needs to be adapted to an unsupervised learning mode. Addressing these issues offers exciting and challenging directions for future research.

Declarations
Informed consent
Informed consent was obtained from all individual participants included in the study.

Conflict of Interests
The authors declare that they have no conflict of interest.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.