1 Introduction

Fonts are important visual designs that often convey additional information, such as whether the current text is formal or casual. Some artistic fonts can even create a frightening or playful atmosphere. However, designing a new font is time-consuming because there are many factors to consider, such as strokes, decoration, and effects. In addition, all characters within the same font must be designed in a consistent style and at an appropriate size. Some font libraries contain thousands of characters from multiple languages (e.g., Microsoft YaHei contains over 20,000 characters, including Chinese, Korean, Japanese, Latin, and Greek). Artists usually spend considerable time maintaining a consistent style across these characters to ensure visual compatibility. This labor-intensive process causes practical problems: artists often design a font in only one language, and extending its style to other languages later takes considerable time.

With the development of deep neural networks [1,2,3], automatic font synthesis without human intervention has become possible. Deep neural networks take both shape and texture transfer into account in an end-to-end manner. Early approaches [4, 5] were proposed to synthesize an entire font library by observing a subset of it. These methods treat font style transfer as an image-to-image translation [6] or cGAN [7] task. While these image-to-image translation approaches have shown remarkable generative ability, several obvious drawbacks remain. First, they divide the training process into two phases: models are pre-trained on a large dataset and then fine-tuned for the specific task. The fine-tuning process makes these methods less practical when computational resources are limited. Second, hundreds of training samples are needed during the fine-tuning phase, and creating these samples is itself labor-intensive. To reduce the training time and manual work of the fine-tuning phase, several few-shot learning methods have been proposed [8,9,10,11,12,13]. These models can synthesize a high-quality font library by observing only a few samples.

However, all the methods described above only transfer styles within the same language, and cross-language font style transfer is more challenging. Artists often design a font in only one language (e.g., for movie posters), and it is time-consuming to extend the font style to other languages. Therefore, a model that can learn a font style from another language is necessary. However, characters can vary greatly from language to language; for example, some components of Chinese characters are complex and do not appear in English letters. Furthermore, unlike artistic [14,15,16] or text [17] style transfer tasks that require consideration of only global style, font style often consists of both local features (e.g., stroke, decoration, and thickness) and global features (e.g., effect and shape). Therefore, it is difficult to learn font styles from one language and apply them to another, especially for font generation methods [12, 13] that only consider global features.

To solve these issues, we propose FTransGAN (Font Translator GAN), which can synthesize a high-quality font library by observing only a small number of samples from other languages, without fine-tuning. We use two encoders to extract the style and content representations separately, then concatenate them and input them into the decoder. Two discriminators check the degree of matching from the style and content perspectives, respectively. Our style encoder contains two modules, the Context-aware Attention Network and the Layer Attention Network, which work together to capture local and global style features. Experimental results on a collected multi-language dataset show high visual quality for both handwritten and printed fonts. We illustrate some application examples in Fig. 1. The main contributions are summarized as follows:

  • We developed a novel model, FTransGAN, which provides the first end-to-end solution to cross-language font style transfer. Two novel modules, the Context-aware Attention Network and the Layer Attention Network, were introduced to capture local and global style features simultaneously.

  • We constructed a new multi-language font dataset consisting of 847 fonts, each with 52 English letters and more than 1000 Chinese characters.

  • We demonstrate that the generative ability of the proposed model can be easily improved by a transfer learning technique.

Fig. 1

A few application examples. The English letters on the red background are the style images, and the Chinese characters on the green background are the content images. The rest are images synthesized by our FTransGAN, which extracts the strokes, thickness, decoration, or effect of the font from several observed English letters and automatically applies the extracted style to the given Chinese character. The Chinese meaning in the figure is “Computer Vision”

The remaining sections of this paper are organized as follows. In Section 2, we introduce existing works related to the topic of this paper. In Section 3, we describe the proposed FTransGAN model. Section 4 reports the experimental results to show the effectiveness of the model. We conclude the paper and discuss some limitations in Section 5.

This paper is an extension of [18]. The differences between the journal and conference versions are as follows: (1) A detailed ablation study, including loss and hyper-parameter analysis, was added to the journal version. (2) In the journal version, we compare our study with two recent works [19, 20]. (3) The conference version only included the “English2Chinese” experiment; here, we show that the proposed model is also capable of “Chinese2English” transfer. (4) A transfer learning technique was adopted to improve the performance of the proposed FTransGAN.

2 Related work

Font generation is a long-standing challenge that many studies have attempted to address. In this section, we selectively review papers that are closely related to our work. In Sections 2.1 and 2.2, we introduce two main topics linked to our study: style transfer and image-to-image translation. We also present related works on self-attention in Section 2.3, as self-attention is a key technique in the proposed model. In Section 2.4, we introduce some deep learning-based font generation methods.

2.1 Style transfer

Style transfer [14, 15] is a long-standing problem in computer vision, in which styles are extracted from one or several images and applied to another image. The input usually consists of a content image and a style image, and the output is obtained by optimizing a content loss and a style loss. The content loss is the distance between two feature maps at each position, and the style loss is calculated by comparing the summary statistics of each layer. The limitation of these works is that they only support one-to-one style transfer. StyleBank [21] solved the problem of transferring multiple styles by storing multiple style layers, and [22] proposed adaptive instance normalization, which can adapt to arbitrary new styles. In addition, [23] introduced an arbitrary style transfer method by reshuffling deep features of style images. Recently, StyleGAN [24,25,26] proposed a novel generator architecture that automatically divides attributes into different levels and provides intuitive, scale-specific control of the synthesis. However, these methods are mainly designed for artworks, and they usually define style as a set of colors and textures. As already noted, font styles are relatively abstract and consist of local and global features. Therefore, it is difficult to directly apply these methods to the font style transfer problem.

2.2 Image-to-image translation

Pix2Pix [6] and CycleGAN [27] proposed generalized image-to-image translation frameworks that aim to learn the mapping between two domains; the former uses paired data, while the latter is unpaired. Similarly, [28, 29] proposed the coupled GAN (CoGAN), which learns a joint distribution of two domains by weight sharing. However, these models have a few problems. First, they can only translate images between two domains. Second, they usually require a large amount of training data, which is less practical. StarGAN [30] and MUNIT [31] solved the first problem by adding domain information to the generator. However, the generative ability of such models is limited to a few domains; they cannot synthesize images of unknown domains. Recently, FUNIT [11] was proposed as a few-shot unsupervised image generation method. It solved the second problem by using an additional encoder to extract domain information from several images. Inspired by these works, our model also adopts the few-shot generation structure and adversarial loss.

2.3 Attention mechanism

The attention mechanism [32, 33] was first proposed in the field of natural language processing [34, 35] to mitigate the damage caused by a fixed-length vector in the encoder-decoder architecture. This mechanism enables the machine to focus on certain words. Later, Xu et al. [36] applied an attention network to the computer vision field to solve the image captioning problem. Yu et al. [37] improved the attention network into a multi-level architecture to obtain both spatial and semantic information from a single image. Recently, [38] made several breakthroughs in the natural language field by exploiting the self-attention module, and SAGAN [39] also successfully applied the self-attention layer to the image generation task. Here, we directly use the self-attention block proposed by SAGAN [39] for our Context-aware Attention Network.

2.4 Font generation

Font generation can be considered a special case of image style transfer if we consider each character as an image. Existing works can be roughly divided into two groups - many-shot learning and few-shot learning.

Many-shot learning methods [4, 5, 40, 41] mainly employ the image-to-image translation architecture. The condition is a character in the source font, and the output should match the corresponding character in the target font. However, these methods usually need many reference images (e.g., 700 glyph images) to learn the style.

Several recent models [8,9,10, 12, 13, 42, 43] have been proposed for few-shot font generation. These models can generate an entire font library with only a few samples (e.g., 6 glyph images). MC-GAN [8] was the first end-to-end approach to synthesize artistic fonts. However, its numbers of input and output images are fixed (26 English letters), so the approach cannot handle a large font library (e.g., Chinese) because of the limitations of its architecture. AGIS-Net [10] and EMD [12] later solved this problem by combining content and style features. They can also be regarded as image-to-image translation methods; the difference is that these models require two conditions, a style image and a content image, and the output should be a combination of the two. To improve the quality of the generated images, [13] proposed the Deep Feature Similarity (DFS) architecture, which leverages the feature similarity between the input content and style images to synthesize target images. Recently, researchers [9, 19, 20, 44,45,46] have made significant progress by exploiting the compositionality of characters, i.e., decomposing them into components. However, our experimental results indicate that these methods perform poorly on the constructed multi-language dataset.

3 FTransGAN

In this section, we describe the proposed model, FTransGAN. We introduce the problem setting and give a model overview in Section 3.1. We then describe the proposed Context-aware Attention Network and Layer Attention Network in Sections 3.2 and 3.3, respectively. Finally, we describe the loss function in Section 3.4.

3.1 Problem setting and model overview

Like other few-shot font generation models, our goal is to generate font images given two conditions: style and content images. We consider the generation of font images as the process of estimating the conditional probability p(x|s,c), where x is the target image, c is one content image in a standard style (e.g., Microsoft YaHei for Chinese or Times New Roman for English), and s is a set of style images {s1,s2,…,sK} that share the same style but have different content. The content and style images should come from different languages. For example, if the content image is a Chinese character, the style images should consist of English letters, and vice versa.
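For concreteness, the sketch below shows how one training triplet could be assembled under this setting. It is a minimal illustration, not the paper's data pipeline: the dictionary-based glyph storage and the helper name `sample_pair` are our own assumptions.

```python
import random
import torch

def sample_pair(style_font_glyphs, content_font_glyphs, style_chars, target_char, K=6):
    """Assemble one training triplet (s, c, x_hat) for cross-language transfer.

    style_font_glyphs / content_font_glyphs: dicts mapping characters to
    1x64x64 tensors rendered in the style font and in the standard content font.
    style_chars: characters of the *other* language available in the style font.
    """
    # K style references: same font, different characters, different language
    s = torch.stack([style_font_glyphs[ch] for ch in random.sample(style_chars, K)])
    # one content image: the target character rendered in a standard font
    c = content_font_glyphs[target_char]
    # ground truth: the target character rendered in the style font (paired data)
    x_hat = style_font_glyphs[target_char]
    return s, c, x_hat
```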

As shown in Fig. 2, our model has a Generator G and two discriminators: a Content Discriminator Dcontent and a Style Discriminator Dstyle. The two discriminators follow the design of PatchGAN [6] to check real and fake patches locally. The Generator G takes one content image c and K style images s as inputs and generates the target image x:

$$ x = G (s, c). $$
(1)

The detailed structure of our generator can be found in Fig. 3. It consists of a style encoder fstyle, a content encoder fcontent, and a decoder fdecoder:

$$ z_{s} = f_{\text{style}}(s), $$
(2)
$$ z_{c} = f_{\text{content}}(c), $$
(3)
$$ x = f_{\text{decoder}}(z_{s}, z_{c}). $$
(4)

where zs and zc are the extracted feature codes for style and content, respectively. We specifically designed the style encoder fstyle as a multi-level attention structure [37, 47], using two attention modules, the Context-aware Attention Network and the Layer Attention Network, to capture both local and global style features. More details of these two networks are given in the following sections.
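Before turning to those modules, a minimal structural sketch of (2)-(4) is given below; the sub-module names are placeholders for the networks shown in Fig. 3, and their internal layer configurations are not specified here.

```python
import torch.nn as nn

class FTransGANGenerator(nn.Module):
    """Structural sketch only: f_style, f_content, and f_decoder are assumed
    to be nn.Modules implementing the encoders and decoder of Fig. 3."""
    def __init__(self, f_style, f_content, f_decoder):
        super().__init__()
        self.f_style, self.f_content, self.f_decoder = f_style, f_content, f_decoder

    def forward(self, s, c):
        z_s = self.f_style(s)                # (2): style code from K style images
        z_c = self.f_content(c)              # (3): content code from one content image
        return self.f_decoder(z_s, z_c)      # (4): synthesized target image x
```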

Fig. 2

An overview of our proposed FTransGAN. Each time, we randomly select K style images from the dataset as the style input and one content image from the source font as the content input; the content and style images are from different languages. We input them into the generator, and two discriminators check the matching degree from the style and content perspectives, respectively

Fig. 3

Overview of the proposed Generator G. It consists of a style encoder, a content encoder, and a decoder. The two encoders extract the style representation and content representation, respectively. The decoder then takes the extracted information and generates the target image. We specifically designed the style encoder to capture both local and global style features using three parallel Context-aware Attention Networks and a Layer Attention Network. The three Context-aware Attention Networks have the same structure, except for the size of receptive fields

3.2 Context-aware attention network

As shown in Fig. 3, the style encoder has three parallel Context-aware Attention Blocks. After multiple convolutions with 3×3 kernels, they have 13×13, 21×21, and 37×37 receptive fields, respectively. Thus, the shallower layer can only perceive local features, while the deeper layer can perceive almost the whole image. Figure 4 shows the details of the Context-aware Attention Block. The input is a feature map of size C×H×W given by the last convolution layer, where C, H, and W denote the number of channels, height, and width, respectively. Here, we denote each region of the feature map as \(\{v_{r}\}_{r=1}^{H\times W}\). Unlike previous works [37, 47], which use an LSTM [48] or GRU [49] block to obtain contextual information recurrently, we incorporate contextual information into the feature map of each region through a self-attention [39] layer to improve computational efficiency. It is given by:

$$ h_{r}=f_{a}(v_{r}), $$
(5)

where fa denotes the self-attention layer, and the new feature vectors hr contain both the information limited to their receptive fields and contextual information from other regions.

Fig. 4

The architecture of the proposed Context-aware Attention Network. It takes a feature map given by the last convolution layer and performs a weighted sum over the feature map to produce a feature vector. The weight map is given by a self-attention block and a randomly initialized context vector

We introduced the attention mechanism to give each region a score because not all regions have the same contribution. Specifically,

$$ u_{r}= f_{c}(h_{r}), $$
(6)
$$ a_{r}=\text{softmax}({u_{r}^{T}} u_{c}), $$
(7)
$$ y=\sum\limits_{r=1}^{H\times W} a_{r} v_{r}. $$
(8)

That is, we input each contextual vector hr into a single-layer neural network fc to obtain ur as a latent representation of hr. Then, a context vector uc is used to measure the importance of the current region; uc is randomly initialized and trained jointly with the entire model. After that, we obtain the normalized attention score ar through a softmax layer. Finally, we compute a feature vector y as the weighted sum over all regions. Note that we have three parallel Context-aware Attention Networks; thus, we obtain three feature vectors y1, y2, and y3.
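A compact sketch of (5)-(8) is given below. It assumes `self_attn` is a SAGAN-style self-attention layer standing in for fa; the single linear layer for fc and the randomly initialized context vector follow the description above, while the remaining details (e.g., no extra nonlinearity) are our own simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareAttention(nn.Module):
    """Sketch of the region-scoring part of the Context-aware Attention Network."""
    def __init__(self, channels, self_attn):
        super().__init__()
        self.self_attn = self_attn                       # f_a in (5)
        self.fc = nn.Linear(channels, channels)          # single-layer network f_c in (6)
        self.u_c = nn.Parameter(torch.randn(channels))   # context vector, trained jointly

    def forward(self, feat):                             # feat: (B, C, H, W)
        v = feat.flatten(2).transpose(1, 2)              # region features v_r: (B, H*W, C)
        h = self.self_attn(feat).flatten(2).transpose(1, 2)  # (5): contextual features h_r
        u = self.fc(h)                                   # (6): latent representation u_r
        a = F.softmax(u @ self.u_c, dim=1)               # (7): attention score a_r per region
        y = (a.unsqueeze(-1) * v).sum(dim=1)             # (8): weighted sum -> (B, C)
        return y

# e.g., ContextAwareAttention(channels=256, self_attn=nn.Identity()) for a quick shape check
```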

3.3 Layer attention network

Given a style image, should the machine focus on local or global features? We believe that depends on the image itself. Based on this assumption, we designed the Layer Attention Network.

As shown in Fig. 5, the Layer Attention Network receives four inputs: a feature map given by the last convolution layer, which we flatten to obtain a feature vector ym, and three feature vectors y1, y2, and y3 given by the three Context-aware Attention Networks. We use a single-layer neural network fl to give each feature vector a score. These scores explicitly indicate the level of features that the model should focus on. Specifically,

$$ w_{1},w_{2},w_{3} = f_{l}(y_{m}), $$
(9)
$$ z = \sum\limits_{i=1}^{3} w_{i} y_{i}, $$
(10)

where wi, i ∈ {1,2,3}, are three normalized scores given by the neural network, and z is the weighted sum of the three feature vectors. Note that the style encoder accepts K images at a time. Thus, the final latent code zs is the mean of the K per-image codes zk:

$$ z_{s} = \frac{1}{K} \sum\limits_{k=1}^{K} z^{k}. $$
(11)

In addition, the size of the content code is C×H×W, while the style code zs is a C-dimensional vector. We therefore replicate zs across the spatial dimensions to match the size of the content code. After obtaining the content code zc and the expanded style code zs, we simply concatenate them and feed them into the decoder, which generates images based on the style code zs and the content code zc.
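The following sketch combines (9)-(11) with the expansion and concatenation step just described. Using softmax to normalize the three scores is our assumption based on the phrase "normalized scores"; the helper names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerAttention(nn.Module):
    """Sketch of (9)-(10): weight the three Context-aware Attention outputs."""
    def __init__(self, in_features):
        super().__init__()
        self.fl = nn.Linear(in_features, 3)              # f_l in (9)

    def forward(self, y_m, y1, y2, y3):
        w = F.softmax(self.fl(y_m), dim=-1)              # normalized weights w_1..w_3
        y = torch.stack([y1, y2, y3], dim=1)             # (B, 3, C)
        return (w.unsqueeze(-1) * y).sum(dim=1)          # (10): weighted sum z -> (B, C)

def fuse_style_and_content(z_per_image, z_c):
    """(11) plus the expansion step: average the K per-image codes, tile the
    result over the spatial grid of the content code, and concatenate."""
    z_s = torch.stack(z_per_image, dim=0).mean(dim=0)    # (B, C): mean over K style images
    B, C, H, W = z_c.shape
    z_s = z_s.view(B, -1, 1, 1).expand(-1, -1, H, W)     # replicate to match H x W
    return torch.cat([z_c, z_s], dim=1)                  # decoder input
```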

Fig. 5

The architecture of the proposed Layer Attention Network. Its task is to compute a weighted sum of the three feature vectors. We input the feature map given by the last convolution layer into a neural network, which generates three weights, and the feature vectors are then summed according to these weights

3.4 End-to-end training

Our model has two discriminators, Dcontent and Dstyle, which have almost the same architecture and consist of several convolution layers. Dcontent receives the generated image and the content image and checks whether they depict the same character. Dstyle receives the generated image and the style images and checks whether they share the same style. We directly concatenate these images along the channel dimension before feeding them to the discriminators, and the entire model is jointly trained in an end-to-end manner.

The loss function of our model consists of three terms: L1 loss, style loss Lstyle, and content loss Lcontent,

$$ L = \lambda_{1} L_{1} + \lambda_{s} L_{\text{style}} + \lambda_{c} L_{\text{content}}, $$
(12)

where λ1, λs, and λc are three weights for balancing these terms. For higher quality results and to stabilize GAN training, both Lcontent and Lstyle employ hinge loss [50] functions:

$$ L_{\text{content}}=L_{\text{contentD}}+L_{\text{contentG}}, $$
(13)
$$ L_{\text{contentG}}=-E_{x,c\sim P(x,c)}[D_{\text{content}}(x,c)], $$
(14)
$$ \begin{array}{@{}rcl@{}} L_{\text{contentD}}& = &-E_{\hat{x},c\sim P(\hat{x},c)}[\min(0,D_{\text{content}}(\hat{x},c) - 1)]\\ & & -E_{x,c\sim P(x,c)}[\min(0,-D_{\text{content}}(x,c) - 1)], \end{array} $$
(15)
$$ L_{\text{style}}=L_{\text{styleD}}+L_{\text{styleG}}, $$
(16)
$$ L_{\text{styleG}}=-E_{x,s\sim P(x,s)}[D_{\text{style}}(x,s)], $$
(17)
$$ \begin{array}{@{}rcl@{}} L_{\text{styleD}}&=&-E_{\hat{x},s\sim P(\hat{x},s)}[\min(0,D_{\text{style}}(\hat{x},s)-1)]\\ & & -E_{x,s\sim P(x,s)}[\min(0,-D_{\text{style}}(x,s)-1)], \end{array} $$
(18)

where \(\hat {x}\) is the ground truth image, x is the generated image, and c and s denote the content and style images, respectively. To stabilize our training, we also adopted an L1 loss in our loss function to calculate the pixel-wise error between generated images and the ground truth images:

$$ L_{1}=E_{\hat{x},x\sim P(\hat{x},x)}[\lVert x-\hat{x} \rVert_{1}]. $$
(19)
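These loss terms translate directly into a few lines of code. The sketch below assumes PyTorch and discriminators that output unbounded real-valued patch scores; the helper names and weight defaults (λ1 = 100, λc = λs = 1, as used in Section 4.2) are stated for illustration.

```python
import torch.nn.functional as F

def hinge_d_loss(d_real, d_fake):
    """(15)/(18): hinge loss for a discriminator, where d_real = D(x_hat, cond)
    on ground truth images and d_fake = D(x, cond) on generated images."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_fake):
    """(14)/(17): the generator's adversarial term."""
    return -d_fake.mean()

def total_generator_loss(x, x_hat, d_content_on_fake, d_style_on_fake,
                         lam_1=100.0, lam_c=1.0, lam_s=1.0):
    """(12): weighted sum of the L1 term (19) and the two adversarial terms."""
    return (lam_1 * F.l1_loss(x, x_hat)
            + lam_c * hinge_g_loss(d_content_on_fake)
            + lam_s * hinge_g_loss(d_style_on_fake))
```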

4 Experiments

In this section, we demonstrate the generative ability of the proposed FTransGAN from multiple perspectives. In Section 4.1, we introduce the proposed dataset and describe how we split it. Section 4.2 lists the hyperparameters that we used to train and test the models. In Section 4.3, we introduce four state-of-the-art models, EMD [12], DFS [13], LF-Font [19], and MX-Font [20], which we compared with our model. After that, we report the quantitative and qualitative results in Section 4.4. In the remaining sections, we comprehensively analyze the proposed model. Specifically, in Section 4.5, we conduct an ablation study to show the contribution of each component of the proposed model. In Section 4.6, we demonstrate that a simple knowledge transfer method can improve the generative quality. Finally, in Section 4.7, we show that the proposed self-attention architecture also increases the model’s explainability.

4.1 Font dataset and experiment settings

To evaluate the generative ability of our model, we constructed a dataset of 847 gray-scale fonts, each with approximately 1000 commonly used Chinese characters and 52 English letters of the same style. In addition, we used a common font, Microsoft YaHei, for the content images; this font was only used to index the character categories. We processed the dataset by finding a bounding box around each glyph, resizing it so that the larger dimension reached 64 pixels, and then creating 64×64 font images by padding. All pixel values were normalized to the range of -1 to 1 before being fed into the model.
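The preprocessing just described can be sketched as follows. This is a minimal illustration, assuming dark glyphs on a white background; the bounding-box threshold and centering are our own choices.

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size: int = 64) -> np.ndarray:
    """Crop to the glyph's bounding box, resize so the larger side is 64 px,
    pad to 64x64, and normalize pixel values to [-1, 1]."""
    g = np.array(img.convert("L"))
    ys, xs = np.where(g < 255)                          # bounding box of non-white pixels
    g = g[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = g.shape
    scale = size / max(h, w)
    g = np.array(Image.fromarray(g).resize((max(1, round(w * scale)),
                                            max(1, round(h * scale)))))
    canvas = np.full((size, size), 255, dtype=np.uint8)  # white 64x64 canvas
    top, left = (size - g.shape[0]) // 2, (size - g.shape[1]) // 2
    canvas[top:top + g.shape[0], left:left + g.shape[1]] = g
    return canvas.astype(np.float32) / 127.5 - 1.0       # values in [-1, 1]
```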

Next, we divided the experiments into two parts: “Chinese2English” and “English2Chinese”. In the “English2Chinese” part, the model needs to transfer the style of English letters to Chinese characters. We randomly chose 29 Chinese characters as unknown content and 29 fonts as unknown styles, and the rest were used as training data. The whole dataset was therefore divided into three parts: C1, images for training; C2, images with known content but unknown styles, used for testing; and C3, images with known styles but unknown content, also used for testing. In the “Chinese2English” part, the model needs to transfer the style of Chinese characters to English letters. For this part, we randomly chose 29 fonts as unknown styles and 6 English letters as unknown content, and left the rest as training data. This time, E1 was used for training, E2 to test unknown styles, and E3 to test unknown content. Figure 6 shows several examples of the font dataset and the partition rules. Before the experiments, we made sure that there was no overlap between the training and testing sets by computing the nearest neighbor for all fonts.

Fig. 6

Examples of the font dataset that we constructed for our experiments and the partition rule. We conducted two experiments: “Chinese2English” and “English2Chinese”. In the “Chinese2English” experiment, images in E1 are the labels used to train our network, and E2 and E3 were used to evaluate unseen styles and content. In the “English2Chinese” experiment, images in C1 are the labels used to train our network, and images in C2 and C3 were used to evaluate unseen styles and content

4.2 Hyperparameter settings

We provide the hyperparameter settings of our model in this section. Our basic setup follows Pix2Pix [6]. In the following experiments, we set λ1 = 100, λc = λs = 1, and K = 6. The Generator G, the Content Discriminator Dcontent, and the Style Discriminator Dstyle were all initialized with normal initialization. In the “English2Chinese” experiment, we trained both our model and the competitors for 20 epochs using the Adam optimizer [51] with β1 = 0.5, β2 = 0.999, and a learning rate lr = 0.0002 for the first 10 epochs, followed by a linear decay over the remaining 10 epochs. In the “Chinese2English” experiment, we trained them for 200 epochs. Empirically, we set the batch size to 256. Unlike previous work [8], which applied dropout in the generator to obtain randomness, we did not use dropout because we observed in experiments that it reduces the generative ability of the model. Instead, we added slight random noise to the style code zs.
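A minimal sketch of this optimization setup is shown below, assuming PyTorch; the helper name and the use of LambdaLR for the linear decay are our own choices rather than details from the paper.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(params, total_epochs=20, decay_start=10, lr=2e-4):
    """Adam with beta1=0.5, beta2=0.999; constant lr for the first `decay_start`
    epochs, then a linear decay toward zero over the remaining epochs."""
    opt = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        if epoch < decay_start:
            return 1.0
        return max(0.0, 1.0 - (epoch - decay_start) / float(total_epochs - decay_start))

    return opt, LambdaLR(opt, lr_lambda)
```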

4.3 Competitors

To our knowledge, cross-language font style transfer has not been attempted before. Therefore, we had to select models that could be modified for this task. We excluded models that cannot handle large font libraries [8] or were originally designed for unsupervised generation [11, 52]. Some few-shot learning methods, such as DFS [13] and AGIS-Net [10], require fine-tuning during testing; however, fine-tuning is computationally expensive and cannot be applied in some real-time systems. Most recent works [19, 20] require component labels as additional supervision, which makes a direct comparison less suitable. First, the comparison is unfair because it is much easier for models to learn the styles and structures of characters with component labels, and this type of label is expensive to obtain for both training and testing. Second, these methods cannot be applied to cross-language transfer because some languages, such as English, do not have compositional characters. Nevertheless, we visually compared our method with two compositionality-based methods, namely LF-Font [19] and MX-Font [20].

Finally, we chose four supervised few-shot style transfer models, EMD [12], DFS [13], LF-Font [19], and MX-Font [20], as our competitors. To ensure a fair comparison, none of the models were fine-tuned when processing unseen style or content images in the experiments; generation was merely a forward pass. We slightly modified the input and output channels of DFS to synthesize gray-scale images. LF-Font and MX-Font require component labels, which we also generated for our dataset (the C1, C2, and C3 parts). In addition, as LF-Font and MX-Font do not support font style transfer between English and Chinese, we only conducted the “Chinese2Chinese” experiment with these models. All methods were trained for the same number of iterations (20 epochs) as our proposed model.

4.4 Quantitative and qualitative evaluation

In this section, we evaluate both the numerical and visual results of the proposed model, comparing them with the competitors described in Section 4.3. Quantitative evaluation of generative models is inherently difficult because there is no universal rule for comparing ground truth images with generated images; moreover, there is no single correct answer for tasks like artistic style transfer. Several evaluation metrics [15, 53, 54] have been proposed to measure model performance based on different assumptions, but they remain controversial. Therefore, we evaluated the model comprehensively from three aspects: pixel level, perceptual level, and human level. In Table 1, we report the quantitative results, including mean absolute error (MAE), structural similarity (SSIM), multi-scale structural similarity (MS-SSIM), accuracy, and mFID for unseen characters and styles. For both the “English2Chinese” and “Chinese2English” parts, our model outperformed EMD [12] and DFS [13] on most evaluation metrics.

Table 1 Quantitative evaluation on the proposed multi-language dataset. Bold entries represent the best performance

Pixel-level evaluation

Pixel-wise evaluation compares pixels at the same position in the ground truth and generated images. We used MAE, SSIM, and MS-SSIM. Nevertheless, pixel-level metrics often contradict human intuition; therefore, we also used the other two levels of metrics to fully evaluate all models.
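As an illustration, MAE and SSIM can be computed as below. This is a sketch assuming scikit-image and images normalized to [-1, 1]; MS-SSIM is not shown and would require an additional implementation (e.g., the pytorch-msssim package).

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def pixel_level_metrics(x, x_hat):
    """x, x_hat: HxW float arrays with pixel values in [-1, 1]."""
    mae = float(np.abs(x - x_hat).mean())          # mean absolute error
    s = float(ssim(x, x_hat, data_range=2.0))      # value range spans 2.0 for [-1, 1]
    return mae, s
```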

Perceptual-level evaluation

[54] proposed a method to evaluate generative models by computing the Fréchet Inception Distance (FID) between the feature distributions of the ground truth and generated images. Liu et al. [11] modified it into a conditional version (mFID) by averaging the FID over each target class. To evaluate the generated images from both the style and content perspectives, we trained two ResNet-50 [55] networks on our proposed dataset to classify content (character) and style (font), respectively, and we report the top-1 accuracy and mean FID (mFID) based on these two networks. Thus, the perceptual-level evaluation consists of four components: style-aware accuracy, content-aware accuracy, style-aware mFID, and content-aware mFID.
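The class-conditional averaging behind mFID can be sketched as follows. Here `fid_fn` stands in for any FID implementation operating on the features of the style- or content-classification network described above; the dictionary layout is purely illustrative.

```python
import numpy as np

def mean_fid(fid_fn, real_by_class, fake_by_class):
    """Average FID over target classes, following the conditional variant (mFID).
    real_by_class / fake_by_class: dicts mapping a class label to a batch of images."""
    scores = [fid_fn(real_by_class[k], fake_by_class[k]) for k in real_by_class]
    return float(np.mean(scores))
```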

Human-level evaluation

Our ultimate goal is to synthesize images that satisfy users. Therefore, we randomly sampled 39 sets of images from the outputs of all methods and asked users to select their preferred images given the content and style references. We asked them to comprehensively evaluate the synthesized images in terms of both style matching and content recognizability. All experiments were completely anonymous, and the synthesized images were randomly shuffled so that participants could not know which model each image came from. We collected a total of 390 valid responses from 10 people fluent in both English and Chinese. Table 2 shows that most participants preferred the images synthesized by our model.

Table 2 User preference data based on 390 responses. Bold entries represent the best performance

Visual comparison

As shown in Fig. 7, we present results for both the “English2Chinese” and “Chinese2English” parts. We observed that EMD [12] erases strokes in some thinner fonts and performs poorly on highly artistic fonts, while DFS [13] does not perform well on printed fonts. Our method can synthesize high-quality images across different font types.

Fig. 7

Visual comparison of our FTransGAN (4th rows) with EMD [12] (2nd rows) and DFS [13] (3rd rows). The observed style images are illustrated in the 1st rows and the ground truth images are in the 5th rows. For each font, we randomly selected 6 generated images as references

The compositionality-based methods, LF-Font [19] and MX-Font [20], do not support font style transfer between Chinese and English. Thus, we only compared against them on “Chinese2Chinese” transfer. Because this comparison is not entirely fair (they require component labels), we only provide visual results in Fig. 8 for reference and do not report quantitative results. Even without component labels, our method still shows generative ability similar to theirs. LF-Font [19] sometimes failed on characters with complicated structures, and MX-Font failed on highly artistic fonts. Our FTransGAN performed well in both cases.

Fig. 8

Visual comparison with the compositionality-based methods. They do not support font style transfer between Chinese and English; thus, we only conducted the “Chinese2Chinese” experiment with these methods. We compared our FTransGAN (4th rows) with LF-Font [19] (2nd rows) and MX-Font [20] (3rd rows). The observed style images are illustrated in the 1st rows and the ground truth images are in the 5th rows. For each font, we randomly selected 6 generated images as references

4.5 Ablation study

To evaluate the contribution of each component in our FTransGAN, we gradually removed some modules and then examined the results. We also implemented a baseline, named the CAT model, that replaces the proposed multi-level attention module by concatenating all style images along the channel axis and feeding them into a style encoder. In Table 3, we show several evaluation metrics for the ablation study. The values of these metrics clearly demonstrate that the adversarial losses and the L1 loss play an important role in our model; they work together to synthesize high-quality font images. The capability of the model is further improved by our Layer Attention Network and Context-aware Attention Network. When these modules or losses are removed, performance on both pixel-level and perceptual-level metrics degrades rapidly. The results of the ablation study are visualized in Fig. 9.

An analysis of the hyper-parameter λ is also reported in Table 3. Specifically, we set different weights between the supervised L1 loss and the adversarial losses Lcontent and Lstyle. Without the L1 loss (see the row w/o L1), the model performed the worst. As the weight of the L1 loss was gradually increased (see the rows λ1 = 1, λ1 = 10, and Full model), the performance of the model improved. However, when the weight of the L1 loss was effectively infinite (i.e., without adversarial losses), the model performed poorly on the perceptual-level metrics (see the row w/o Lcontent and Lstyle). Thus, we recommend setting λ1 much greater than λc and λs. In this work, we set λ1 = 100 and λc = λs = 1.

Table 3 Ablation study on the proposed multi-language dataset. Bold entries represent the best performance
Fig. 9

Visualization of ablation study results. The observed style images are in the 1st rows. Without the GAN losses (2nd rows) or the L1 loss (3rd rows), the proposed model cannot generate high-quality images. For some artistic fonts, the CAT model (4th rows) generates blurred images. For most fonts, our FTransGAN (5th rows) performs well, but when dealing with some handwritten fonts, it is prone to producing noise (bottom left) or losing details (bottom right). The proposed model with knowledge transfer (6th rows) is more stable. The ground truth images are illustrated in the 7th rows

4.6 Knowledge transfer

In this section, we demonstrate that a simple transfer learning technique can further improve the generative ability of the proposed model. The main focus of this paper is cross-language font style transfer. In the “English2Chinese” part, we used a few English letters as style input and one Chinese character as content input in every iteration. However, compared with same-language font style transfer such as “Chinese2Chinese”, the diversity of the style input is limited: for each font in the dataset, we have more than 1000 Chinese characters but only 52 English letters. Similarly, in the “Chinese2English” part, we only had 46 English letters (the remaining 6 were reserved for testing) as content input to train the model. This may lead to overfitting. Humans can draw on the experience of same-language font style transfer and apply it to cross-language font style transfer. Inspired by this, we show that a simple transfer learning scheme can further boost performance.

Specifically, we first pretrained the FTransGAN model for 10 epochs with Chinese characters as both content and style input. Then, for “English2Chinese”, we retrained it for 20 epochs with English letters as style input and Chinese characters as content input; for “Chinese2English”, we retrained it for 100 epochs with Chinese characters as style input and English letters as content input. In principle, only the style encoder or the content encoder should need retraining. However, we observed in experiments that freezing the other parts degrades model performance, so we kept the entire model unfrozen during retraining. We denote our FTransGAN with knowledge transfer as Full model*. We did not evaluate Full model* on the unseen character images in the “English2Chinese” part because the model sees all Chinese characters during pretraining. Table 3 shows that Full model* performs best on most evaluation metrics; the transferred knowledge improves the model considerably. As observed in Fig. 9, Full model* is also more stable when generating images. Note that this knowledge transfer design is only used to show how transfer learning improves generative ability; in the comparison with the state-of-the-art models, we did not pretrain our model, to ensure a fair comparison.
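The two-stage schedule can be summarized as below. This is a sketch only: `run_stage` is a hypothetical routine that trains all parameters of FTransGAN on the given (style-language, content-language) pairing, with nothing frozen, and the language codes are illustrative.

```python
def train_with_knowledge_transfer(model, run_stage, direction="English2Chinese"):
    # Stage 1: pretrain with Chinese characters as both style and content input
    run_stage(model, style_lang="zh", content_lang="zh", epochs=10)
    # Stage 2: retrain on the cross-language task, keeping the whole model unfrozen
    if direction == "English2Chinese":
        run_stage(model, style_lang="en", content_lang="zh", epochs=20)
    else:  # "Chinese2English"
        run_stage(model, style_lang="zh", content_lang="en", epochs=100)
```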

4.7 Attention analysis

To further understand and analyze the proposed multi-level attention structure, we visualize the weights given by the Layer Attention Network. The three Context-aware Attention Networks have receptive fields of different sizes: the network with the smallest receptive field can only see a small region of the original image, while the network with the largest receptive field can see almost the entire image. The weights therefore indicate whether the model should focus more on local or global features for the current image. When dealing with handwritten fonts, our model tends to attend to local features; in contrast, when dealing with printed or artistic fonts, it tends to focus on global features. We speculate that this is because the characteristics of handwritten fonts are mostly concentrated in local regions (e.g., stroke or line thickness), while some artistic fonts must be considered globally. We illustrate several randomly selected fonts in Fig. 10.

Fig. 10

Analysis of the proposed Layer Attention Network. On the left are several observed style images. The bar charts on the right show the weights given by the Layer Attention Network. The horizontal axis shows the receptive field of each Context-aware Attention Net, and the vertical axis shows their weights

5 Conclusion

We proposed a cross-language font style transfer system that can synthesize an entire font library by using only a few samples from another language. We also built a large-scale multi-language dataset to train and evaluate our model. The experimental results demonstrate that our model has high generative ability compared with several state-of-the-art approaches, and the proposed Context-aware Attention Network and Layer Attention Network play an important role.

Nonetheless, our model has some drawbacks. First, although the number of style images is arbitrary during testing, the architecture of the proposed model can only receive a fixed number of style images during the training phase. Second, for some highly artistic fonts, our model does not perform well. In addition, we used paired data to supervise the training; however, collecting paired data is challenging in the real world, so we need to extend the model to an unsupervised learning setting. Addressing these issues offers exciting and challenging directions for future research.