Zero-shot domain paraphrase with unaligned pre-trained language models

Automatic paraphrase generation is an essential task of natural language processing. However, due to the scarcity of paraphrase corpora in many languages, Chinese, for example, generating high-quality paraphrases in these languages is still challenging. In domain paraphrasing especially, it is even more difficult to obtain in-domain paraphrase sentence pairs. In this paper, we propose a novel approach for domain-specific paraphrase generation in a zero-shot fashion. Our approach is based on a sequence-to-sequence architecture. The encoder uses a pre-trained multilingual autoencoder model, and the decoder uses a pre-trained monolingual autoregressive model. Because these two models are pre-trained separately, they have different representations for the same token. Thus, we call them unaligned pre-trained language models. We train the sequence-to-sequence model with an English-to-Chinese machine translation corpus. Then, when a Chinese sentence is fed into this model, it can, surprisingly, generate fluent and diverse Chinese paraphrases. Since the unaligned pre-trained language models have inconsistent understandings of the Chinese language, we believe that the Chinese paraphrasing is actually performed in a Chinese-to-Chinese translation manner. In addition, we collect a small-scale English-to-Chinese machine translation corpus in the domain of computer science. After fine-tuning with this domain-specific corpus, our model shows an excellent capability for domain paraphrasing. Experimental results show that our approach significantly outperforms previous baselines regarding Relevance, Fluency, and Diversity.


Introduction
Text Paraphrase can be viewed as the task of expressing the same semantic content of the original text in a different way [1]. It requires that, while keeping the core semantics of the original text, the paraphrased text be as diverse as possible. As an essential task of Natural Language Processing, Text Paraphrase has a wide range of applications. For example, Semantic Parsing [2], Relation Extraction [3,4], Question Answering [5,6], and Dialog Generation [7] all benefit from Text Paraphrase. However, the research progress on Text Paraphrase is not satisfactory. Natural Language Processing has recently made tremendous progress in many other sub-areas, such as Machine Translation, Open-ended Generation, and Question Answering. These advances mostly rely on large-scale neural models, enormous computing power, and large-scale corpora. For Text Paraphrase, the challenge is that it is difficult to acquire a paraphrase corpus that is both high-quality and large enough. Some researchers manage to extract paraphrase corpora from large-scale online text in an unsupervised manner [8,9]. Even without considering how inefficient these approaches are, the retrieved paraphrase texts still suffer from severe problems such as semantic incoherence and lack of diversity. Moreover, since domain-specific paraphrase texts are even scarcer and harder to retrieve, domain-specific paraphrase generation remains rarely explored.
Zheng Chen (zchen@uestc.edu.cn), School of Information and Software Engineering, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu 610054, Sichuan, China
Round-trip translation, which requires no paraphrase corpora, is a ready-to-use paraphrase generation method. In this approach, a machine translation system is used to translate an inputted source sentence in one language into a mediator sentence in a different language, and then translate this mediator sentence back into the source language. For example, to paraphrase " ", we first perform a Chinese-to-English translation, translating the source sentence to the mediator-sentence "The weather is really nice today!". Then, we translate the mediator-sentence back into the source language via an English-to-Chinese translation, to obtain the paraphrase sentence, " ". With this pivoting approach, the scarcity problem of the paraphrase corpus is avoided. However, the quality of the generated paraphrase text is relatively low. The reason is that the performance of this paraphrase approach is mainly dependent on the performance of the machine translation. However, the problem of machine translation is far from solved, especially in generating faithful, expressive, and elegant text. Moreover, the sentences need to be translated twice. This leads to double the loss of semantic information during translation, thus hurting the consistency between the paraphrase and the original sentence.
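The pivoting procedure described above can be sketched in a few lines of Python. This is only an illustration: the `Translator` interface and the two toy lookup tables are hypothetical stand-ins for real machine translation systems and do not come from the paper.

```python
from typing import Callable

# Hypothetical interface: a translator maps a sentence to a sentence.
Translator = Callable[[str], str]


def round_trip_paraphrase(sentence: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Paraphrase a sentence by translating it into a pivot language and back."""
    mediator = to_pivot(sentence)   # source language -> pivot language
    return from_pivot(mediator)     # pivot language -> source language


# Toy stand-ins for two MT systems (plain lookup tables, for illustration only).
TOY_SRC_TO_PIVOT = {"it is raining hard": "PIVOT: heavy rain"}
TOY_PIVOT_TO_SRC = {"PIVOT: heavy rain": "the rain is heavy"}


def toy_to_pivot(text: str) -> str:
    return TOY_SRC_TO_PIVOT[text]


def toy_from_pivot(text: str) -> str:
    return TOY_PIVOT_TO_SRC[text]


paraphrase = round_trip_paraphrase("it is raining hard", toy_to_pivot, toy_from_pivot)
```

Because the output is produced by two independent translation steps, any information lost in the first step cannot be recovered in the second, which is the consistency problem noted above.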
Language Model-based zero-shot paraphrase generation is considered to be one of the most promising approaches. Guo et al. [10] proposed the first zero-shot paraphrase generation model. They trained a Transformer-based language model [11] using multi-lingual parallel corpora. In the paraphrase generation phase, given input in one language, the model could be guided to generate a paraphrase output in the same language. Compared to round-trip translation, this method allows the model to learn the representation of an input sentence and paraphrase the given sentence directly, thus minimizing the semantic loss during the paraphrasing process. This method, however, also has some severe drawbacks. First, it must be trained over large-scale multi-lingual parallel corpora. It also draws on the idea of the denoising auto-encoder (DAE) to augment the training data. Training with the DAE is extremely time-consuming, making it difficult for small research groups to reproduce this approach. In addition, a language identifier needs to be added to guide the model to generate the paraphrase in the same language as the input sentence. We question the effectiveness of the language identifier, since in our reproduction, controlling the output language, especially Chinese, proved far more problematic than just adding a language identifier.
Inspired by the research of Guo et al. [10], we propose Zeppel, a zero-shot paraphrase method based on unaligned pre-trained language models, which significantly reduces the difficulty and cost of the training process. We use the sequence-to-sequence model as the base architecture, with the multilingual BERT [12] as the encoder and the Chinese GPT2 [13] as the decoder. We train our model over an English-Chinese bilingual parallel corpus (the source language is English, and the target language is Chinese). In the inference phase, when a sentence is input, regardless of the language of that sentence, the model could generate a Chinese sentence to express the same semantic content. Therefore, when inputting an English sentence, this model behaves like a state-of-the-art English-to-Chinese translation model. When inputting a sentence in another language, Japanese, for example, this model acts as a Japanese-to-Chinese translation model. However, when inputting a Chinese sentence, this model becomes a great Chinese-to-Chinese paraphrase model. There is no need to provide a language identifier to guide the generation since the output sentences are always in Chinese. By leveraging pre-trained language models, our approach could generate paraphrased text with both Relevance and Fluency without training over a large corpus or employing a DAE to augment the given training data. We further conduct research on a domain-specific paraphrase task using a relatively small academic corpus in the domain of Computer Science. The generated high-quality paraphrase texts illustrate the data efficiency of our approach.
Our contributions in this paper are as follows: (1) we propose an effective method to build a paraphrase model using a pre-trained multilingual auto-encoder language model and a pre-trained monolingual auto-regressive language model, and train it with a bilingual corpus ("The Zeppel model", "Model training" sections); (2) we propose a modified version of Diverse Beam Search for improving diversity not only between the output beam groups but also between the input and output sentences, thus more suitable for the task of paraphrase generation ("Paraphrase generation" section); (3) we construct an English-Chinese bilingual corpus in the domain of Computer Science ("Datasets" section), and then train a domain-specific paraphrase model on that corpus. The data, code, and model will be released upon acceptance of this paper.

Traditional paraphrase methods
Most traditional text paraphrasing is based on a thesaurus or predefined rules. A thesaurus-based paraphrase generation system generates paraphrases by replacing some words with their corresponding synonyms [14,15]. This replacement could be carried out at both lexical level and phrase level. Early rule-based paraphrase generation methods generally rely on manually written paraphrase rules or patterns [16]. Later, some researchers have proposed to extract paraphrase rules automatically [17,18] instead of manually writing them to support better diversity. Sentence splitting and combining [19] is also considered as a rule-based approach. These traditional paraphrase generation methods are often poor in terms of flexibility, stability, and textual quality.

Translation-based paraphrase methods
Due to their simplicity and convenience, translation-based paraphrase methods are among the most commonly used methods in real-world settings. Round-trip translation proposed by Mallinson et al. [20] and back-translation proposed by Wieting et al. [21] both translate the input sentence into a different language and then back-translate the translation result into the original language as the paraphrase text. These two-step text paraphrase methods utilizing ready-to-use translation systems benefit significantly from the rapid development of machine translation technology. However, they also come with the problem of losing semantic information during the translation process. Other than directly building a paraphrase system on top of a translation system, machine translation also helps to build corpora for Paraphrase Identification [22] or Evaluation [23].

Seq2seq-based paraphrase methods
With the development of deep learning technology, neural models based on the sequence-to-sequence (seq2seq) architecture [24] have been applied to paraphrase tasks. Based on a Stacked Residual LSTM Network [25], Prakash et al. [26] were the first to explore deep learning models for paraphrase generation. Inspired by the idea of CopyNet [27], Cao et al. [28] proposed a paraphrase generation model based on a copy mechanism. They also changed the backbone of the model into a Gated Recurrent Neural Network [29]. Gupta et al. [30] combined a Variational Autoencoder [31] with a Seq2Seq model to generate richer and more diverse paraphrase texts. Egonmwan et al. [32] integrated the Transformer model [11] into the seq2seq architecture to further improve the performance of paraphrase systems. In practice, by training over large-scale and high-quality paraphrase corpora, like PPDB [33], WikiAnswers [34], and MSCOCO [35], these seq2seq-based deep learning models could achieve satisfying results.
However, languages other than English do not have such paraphrase corpora. Hence, identifying paraphrase sentence pairs from large-scale online text becomes crucial when building a paraphrase system in a low-resource language [36][37][38]. Nevertheless, even the authors themselves admit that an automatically extracted paraphrase corpus can never reach the quality of human-written ones. Since we mainly explore the zero-shot approach in this paper, we will not give further details about the research on paraphrase identification.

Zero-shot paraphrase methods
Zero-shot paraphrasing, which is closely tied to the Pre-train and Fine-tune Paradigm, is the latest and the most promising approach. Guo et al. [10] proposed the first zero-shot paraphrase generation model by pre-training a multilingual language model and then fine-tuning it with a parallel corpus. Thompson and Post [39] also found that a well-trained Multilingual Translation system could generate paraphrases in a zero-shot fashion. They [40] further improved the diversity of the generated paraphrases by discouraging the production of n-grams that are present in the input. Fan et al. [41] borrowed the idea of unsupervised machine translation and proposed a purely zero-shot approach, which does not even need a parallel corpus. However, the iterative back-translation procedure is even more time-consuming. Because it employs a reinforcement learning procedure, Siddique et al.'s approach [42] is also extremely time-consuming. These studies deeply inspire our work. However, we aim to avoid the extremely resource-intensive and time-consuming training process of these works and make it possible to apply zero-shot paraphrasing in a low-resource language or domain.

The Zeppel model
Zeppel is based on a seq2seq architecture. The encoder is initialized by the multilingual BERT and the decoder is initialized by the Chinese GPT2 (see Fig. 1). The multilingual BERT and the Chinese GPT2 are pre-trained separately and thus have different representations of the same Chinese token. Therefore, we call them unaligned pre-trained language models. Since the models have no idea that the input Chinese tokens and the output Chinese tokens are in the same language, the Chinese paraphrasing in our model could be performed in a Chinese-to-Chinese translation manner. This is the key difference between our approach and the former zero-shot approaches.
In the training phase, Zeppel is trained to maximize the likelihood:
$$\mathcal{L}(\theta) = \sum_{t=1}^{|Y|} \log P(y_t \mid y_1, \ldots, y_{t-1}, X; \theta)$$
where $Y = (y_1, \ldots, y_{|Y|})$ is the target sequence, $y_t \in Y$ refers to the token generated by the model at time step $t$, $X$ is the input sentence to be paraphrased, and $\theta$ denotes the parameters of the model. In the inference phase, given an input sentence $X$, Zeppel paraphrases it as follows:
1. The encoder vectorizes $X$ with the vocabulary and embedding layer of multilingual BERT to get $E_X$, then encodes the vector $E_X$ to get the output of BERT, $H_{mBERT}$:
$$E_X = \mathrm{Embed}_{mBERT}(X), \qquad H_{mBERT} = \mathrm{mBERT}(E_X)$$
2. The input of the decoder is $H_{mBERT}$ and the sequence $Y_{t-1}$ generated up to the current time step. The decoder vectorizes $Y_{t-1}$ with the vocabulary and embedding layer of Chinese GPT2 to get $E_{Y_{t-1}}$. Then, through the decoder, the output token $y_t$ of the next time step is obtained:
$$E_{Y_{t-1}} = \mathrm{Embed}_{GPT2}(Y_{t-1}), \qquad y_t = \mathrm{GPT2}(H_{mBERT}, E_{Y_{t-1}})$$

Model training
The training of Zeppel is simple and straightforward. We feed an English sentence into the model and supervise its output with the corresponding Chinese sentence. When a domain-specific paraphrase model is needed, we could perform a two-step training. That is, we first use an English-to-Chinese general-domain parallel corpus to train the model, and then use a small-scale English-to-Chinese parallel corpus in a specific domain to fine-tune it. We will discuss the effectiveness of this two-step training strategy in the "Discussion" section.
For each training sample, special tokens are added according to the pre-trained model. For an English sentence, which will be input into the multilingual BERT, we add [CLS] and [SEP] to its head and tail, respectively. For the corresponding Chinese sentence, which will be input into the Chinese GPT2, we add [BOS] and [EOS] as its beginning and ending tokens. It is worth mentioning that there is no need to add additional language identifiers to the English or Chinese sentence. The decoder, Chinese GPT2, could only generate sentences in Chinese, while the encoder, multilingual BERT, could understand as many as 104 languages. Hence, even if the language of a sentence fed to the encoder in the inference phase differs from the language used in the training phase, the multilingual BERT's multi-language understanding ability yields similar representations for sentences with close semantics. The Chinese GPT2 learns from the English-to-Chinese training corpus to generate a Chinese sentence based on these input semantic representations. That is, the multilingual BERT to Chinese GPT2 model can perform machine translation from 104 languages to Chinese, including Chinese-to-Chinese machine translation. However, the multilingual BERT and the Chinese GPT2 have different dictionaries and different embedding matrices; therefore, they have an inconsistent understanding of the Chinese language. Thus, Zeppel could perform the Chinese paraphrasing in a $\text{Chinese}_{\text{mBERT}}$-to-$\text{Chinese}_{\text{GPT2-zh}}$ translation manner.

Paraphrase generation
In paraphrase generation, diversity is more important than in other constrained text generation tasks, such as Abstractive Summarization and Machine Translation. Nevertheless, traditional maximum likelihood decoding algorithms such as Beam Search do not take diversity into consideration. Diverse Beam Search [43], a diversity-enhanced beam search algorithm, only considers the diversity among the output beam groups. However, in the paraphrase generation task, the diversity between the input and the output sentences is more critical. Hence, we modify Diverse Beam Search to take into account the diversity not only between the output beam groups but also between the input and output sentences. We use Hamming distance, N-gram repetition, and Levenshtein distance as the diversity measures, and propose Diverse Beam Search for Paraphrase Generation (DBS-PG) as follows.
Let $G$ be the number of beam groups and $B$ be the beam size. When decoding the $t$-th token of the $b$-th beam, DBS-PG selects the word $w_t^b$ from the candidate words $Cand_t^b$ (of size $V$) as follows:
$$w_t^b = \arg\max_{v \in Cand_t^b} \left[ \theta(v) - H_{v,t}^b - N_{v,t}^b \right]$$
where $\theta$ is the logarithm of the conditional probability distribution over all words. For decoding the $t$-th token, $\theta(w_t) = \log \Pr(w_t \mid w_{t-1}, \ldots, w_1, x)$. $H_{v,t}^b$ denotes the Hamming distance penalty for the candidate word $v$. It is computed as follows:
$$H_{v,t}^b = \begin{cases} HDP, & v \in S_H \\ 0, & \text{otherwise} \end{cases}$$
where $S_H$ is the set consisting of the $t$-th tokens of the input sentence $X$ and of the $b-1$ beams that have already been generated. $HDP$ is the value of the Hamming distance penalty, which is 2.5 by default. $N_{v,t}^b$ denotes the N-gram repetition penalty. It is computed as follows:
$$N_{v,t}^b = \begin{cases} NRP, & (w_{t-n+1}^b, \ldots, w_{t-1}^b, v) \in S_N \\ 0, & \text{otherwise} \end{cases}$$
where $S_N$ is the set of n-grams from the input sentence $X$ and the $b-1$ beams that have already been generated. $NRP$ is the value of the N-gram repetition penalty, which is 2.5 by default.
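To make the token selection step concrete, the following Python sketch scores each candidate word by its log-probability minus the two penalties. It is an illustration under stated assumptions rather than the authors' implementation: the function name `select_token`, the representation of $S_H$ and $S_N$ as Python sets, the subtractive combination of the penalties, and the default n-gram order `n=2` are all assumptions made here.

```python
import math


def select_token(log_probs, position_tokens, seen_ngrams, prefix, hdp=2.5, nrp=2.5, n=2):
    """Choose the next token for one beam under the two diversity penalties.

    log_probs:       dict mapping candidate token -> log P(token | prefix, x)
    position_tokens: set S_H, tokens at this position in the input and earlier beams
    seen_ngrams:     set S_N, n-gram tuples from the input and earlier beams
    prefix:          list of tokens this beam has generated so far
    """
    best_token, best_score = None, -math.inf
    for token, log_p in log_probs.items():
        score = log_p
        # Hamming distance penalty: same token at the same position.
        if token in position_tokens:
            score -= hdp
        # N-gram repetition penalty: the n-gram ending in this token was seen before.
        if len(prefix) >= n - 1:
            ngram = tuple(prefix[len(prefix) - (n - 1):]) + (token,)
            if ngram in seen_ngrams:
                score -= nrp
        if score > best_score:
            best_token, best_score = token, score
    return best_token, best_score


# Toy example: "nice" is the most probable word but repeats the input.
candidates = {"nice": math.log(0.6), "fine": math.log(0.3), "bad": math.log(0.1)}
token, score = select_token(
    candidates,
    position_tokens={"nice"},
    seen_ngrams={("is", "nice")},
    prefix=["weather", "is"],
)
```

In this toy case the most likely word is penalized twice and the second most likely word is selected instead, which is the intended effect of pushing the output away from the input wording.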
After each beam reaches its [EOS], we obtain a list of generated sentences $Y_{list} = [Y_1, Y_2, \ldots, Y_n]$. The sentences in $Y_{list}$ are arranged in descending order according to their joint probabilities. Then, we need to select the final paraphrase $Y_{final}$ from the candidates in $Y_{list}$ so that it is sufficiently different from the input. We employ a Levenshtein distance threshold (LThreshold), whose value is 0.25 by default. We compute the Levenshtein distance between each $Y_i \in Y_{list}$ and $X$ in turn, and use $D_i$ to denote this distance divided by the length of $X$. The first $Y_i$ whose $D_i$ is greater than LThreshold is chosen as the final paraphrase $Y_{final}$. If no candidate exceeds the threshold, we fall back to the last sentence $Y_n$ as the final paraphrase $Y_{final}$.
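The candidate selection rule above can be sketched as follows. This is a minimal sketch, assuming character-level edit distance and the helper names `levenshtein` and `pick_final`, neither of which is specified in the paper.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (item_a != item_b),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def pick_final(source, candidates, threshold=0.25):
    """Return the first candidate whose normalized distance to the source exceeds the threshold.

    `candidates` is assumed to be sorted by descending joint probability.
    Falls back to the last candidate when none is different enough.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    for cand in candidates:
        ratio = levenshtein(cand, source) / max(len(source), 1)
        if ratio > threshold:
            return cand
    return candidates[-1]


final = pick_final("abcdefgh", ["abcdefgh", "abcdefgx", "abcxxxgh", "zzzzzzzz"])
```

Here the first two candidates differ from the source by 0 and 1 of 8 characters and are skipped, and the third (3 of 8, ratio 0.375) is returned even though a later candidate is more different, because candidates are visited in order of probability.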

Datasets
We employ translation2019zh from the Large Scale Chinese Corpus for NLP [44] for general-domain training. The translation2019zh corpus contains about 3.8 million high-quality English-Chinese parallel sentence pairs. As for domain-specific training, we crawled a small corpus in the computer science domain from Journal of Software 1 and Computer Science 2 . We crawled the Chinese and English abstracts of 6931 and 19782 papers from their websites, respectively. By performing sentence segmentation, aligning the sentences with vecalign 3 , and removing poorly aligned sentence pairs, we obtained 59562 English-Chinese parallel sentence pairs in the computer science domain. For automatic evaluation, we randomly split this domain-specific corpus into an evaluation set of 3000 sentence pairs and a training set consisting of the rest. Moreover, we also randomly selected 200 sentence pairs from the evaluation set for human evaluation. Table 1 shows the statistics of our dataset.
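The split sizes described above can be checked with a short sketch. The function name `split_corpus`, the fixed random seed, and the placeholder sentence pairs are assumptions for illustration; the paper does not state how the random split was implemented.

```python
import random


def split_corpus(pairs, eval_size=3000, human_size=200, seed=0):
    """Randomly split sentence pairs into train / eval sets and draw a human-eval subset."""
    rng = random.Random(seed)          # fixed seed is an assumption, for reproducibility
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    eval_set = shuffled[:eval_size]
    train_set = shuffled[eval_size:]
    human_set = rng.sample(eval_set, human_size)  # drawn from the evaluation set only
    return train_set, eval_set, human_set


# Placeholder pairs standing in for the 59562 aligned sentence pairs.
pairs = [(f"en-{i}", f"zh-{i}") for i in range(59562)]
train_set, eval_set, human_set = split_corpus(pairs)
```

With 59562 pairs in total and 3000 held out for evaluation, the training set contains 56562 pairs.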

Automatic evaluation
We compute the cosine similarity between the semantic representations of the paraphrases and the input sentences to evaluate the semantic consistency of the generated paraphrases. The semantic representation of a sentence is computed using text2vec 4 , with bert-base-chinese 5 serving as the word vector model. We also employ Distinct-2 6 and Inverse Self-BLEU (defined as 1 − Self-BLEU) [45] as metrics to evaluate the diversity of the generated paraphrases. Inverse Self-BLEU is calculated between an original sentence and its paraphrase to evaluate their dissimilarity, while Distinct-2 is calculated on the generated sentence only to evaluate whether a sentence could be expressed in diverse manners.
1 http://www.jos.org.cn.
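For readers unfamiliar with these metrics, here is a minimal sketch of two of them. It assumes that Distinct-2 is the ratio of unique bigrams to all bigrams in a tokenized sentence and that sentence vectors have already been computed by some encoder. The function names are illustrative and are not the API of text2vec or of the Distinct-N repository. Inverse Self-BLEU is omitted because it depends on a full BLEU implementation.

```python
import math


def distinct_n(tokens, n=2):
    """Ratio of unique n-grams to total n-grams in one tokenized sentence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# "a b a b" has bigrams (a,b), (b,a), (a,b): two unique out of three.
d2 = distinct_n(["a", "b", "a", "b"])
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A sentence that never repeats a bigram scores 1.0 on Distinct-2, which is why this metric says little about the relation between a paraphrase and its input and needs to be read alongside the similarity score.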

Human evaluation
We recruited six human annotators to rate the generated paraphrases with three metrics: Relevance, Fluency, and Diversity, respectively. Scores range from 0 to 5, the higher the better. Each paraphrase sample is rated by at least two annotators. During evaluation, the annotator is unaware of which model the paraphrase sample came from.

Implementation and baselines
We implement our model in PyTorch, using the Transformers 7 library provided by HuggingFace. The multilingual BERT we use as the encoder has 12 hidden layers, 12 attention heads, and 768 hidden state dimensions. The Chinese GPT2 we use as the decoder also has 12 hidden layers, 12 attention heads, and 768 hidden state dimensions. The vocabulary sizes of multilingual BERT and Chinese GPT2 are different: 119547 and 21128, respectively. Moreover, we follow the pseudo self-attention approach introduced by Ziegler et al. [46], thus minimizing the extra parameters in the sequence-to-sequence architecture. For training, we set the learning rate to 5e-5 and the training batch size to 32. It takes less than 24 hours to train our model using a single Nvidia RTX 3090 GPU. For paraphrase generation, we set the number of groups of Diverse Beam Search to 5, and each group's beam size is 2. The penalties of Hamming Diversity and n-gram Diversity are both 2.5.
To verify the effectiveness of our method, we employ round-trip translation and trans-paraphrasing as baselines. To implement round-trip translation, we introduce English as the pivot language, and train two translation models, Chinese-to-English and English-to-Chinese, separately. These translation models also use BERT as the encoder and GPT2 as the decoder. We use the same training dataset to train these models in the same two-step fashion. To perform paraphrase generation, a source sentence is first translated into English, and then translated back into Chinese.
6 https://github.com/neural-dialogue-metrics/Distinct-N.
7 https://huggingface.co/transformers/.
To implement trans-paraphrasing, we first translate the English sentences of the aligned bilingual corpus into Chinese, and then train a sequence-to-sequence paraphrase model using this Chinese-to-Chinese translated pseudo-paraphrase corpus. Since the corpus is obtained via translation, the generated paraphrase texts retain noticeable translationese characteristics. Hence, this approach is called trans-paraphrasing.
We also provide the results of Guo et al.'s approach [10] for comparison. We call their model unpretrained-paraphrasing, since the key difference between their model and ours is that we utilize pre-trained language models while they do not. As suggested, we train the unpretrained-paraphrasing model from scratch using MultiUN 8 and OpenSubtitles 9 . Then, for a fair comparison, we fine-tune it using the same corpus we used. The training process is augmented with the DAE, while the fine-tuning process is not. Guo et al.'s original implementation employed a Top-K sampling algorithm [47] for decoding. Stochastic decoding algorithms certainly provide better diversity while potentially compromising the quality of the generated paraphrase. Thus, we also report the results of decoding with the DBS-PG we propose in this paper. In addition, during decoding, we filtered out the tokens that never appeared in the Chinese corpus to ensure that the generated paraphrase texts are in Chinese. Generating non-Chinese characters can have a remarkably negative impact on the quality of the paraphrasing, and such a problem happens occasionally, especially with Top-K sampling decoding.
Table 2 shows the results of the automatic evaluation, where bold font indicates the best performance for each metric. We can see that Zeppel achieves 0.823 in cosine similarity, which is the highest score among all approaches. Such a result indicates that the paraphrase texts generated by Zeppel have much better semantic consistency than those generated by the baselines. Meanwhile, Zeppel also achieves competitive results in Distinct-2 and Inverse Self-BLEU. It should be pointed out that, although higher Distinct-2 and Inverse Self-BLEU scores are also better, they need to be judged under the same similarity level. Without a high similarity score, high diversity scores make no sense, since a randomly generated sentence could achieve the highest diversity score.
Generally speaking, a Distinct-2 higher than 0.75 and an Inverse Self-BLEU higher than 0.5 could be considered good enough. Unpretrained-paraphrasing with DBS-PG achieves a cosine similarity of 0.810, the second-highest among these approaches and 0.11 higher than that of the original Unpretrained-paraphrasing, which was decoded using Top-K Sampling. However, the original Unpretrained-paraphrasing showed better results in Distinct-2 and Inverse Self-BLEU. Such results corroborate our assumption that DBS-PG could produce paraphrases with better semantic similarity, while Top-K Sampling could provide better diversity.

Results and analysis
Comparing our Zeppel with Unpretrained-paraphrasing with DBS-PG, Zeppel performs better on Cosine similarity, which is the most critical metric, while performing slightly worse on Inverse Self-BLEU. The difference on Distinct-2 is marginal. We argue that the BERT and GPT2 within Zeppel have been pre-trained over a large corpus and thus have better capabilities in transferring their knowledge into the domain of Computer Science. The Unpretrained-paraphrasing model has only been trained on MultiUN and OpenSubtitles. Although these two corpora are relatively large, the knowledge within them is still limited and cannot be compared to the large corpus on which BERT and GPT2 were trained. Therefore, the Unpretrained-paraphrasing model struggled to produce paraphrases in the domain of Computer Science.
For further in-depth investigation of the paraphrase results, we randomly select 1000 samples, 200 by each model, from the automatic evaluation results and show them in Fig. 2. From this figure, we notice that the sample distributions of baselines are more scattered than those of Zeppel. This result indicates that the paraphrase text generated by Zeppel has better consistency in performance. In addition, the baseline models might have achieved remarkable consistency and diversity scores on average. However, a high consistency score does not always come with a high diversity score for a particular sample. A good paraphrased text requires consistency and diversity at the same time. For Zeppel, the coherence of these two indicators is much better.
The results of the human evaluation are shown in Table 3. From Table 3, we can tell that Zeppel outperforms the baselines significantly in Relevance and Fluency. Unpretrained-paraphrasing achieves the best Diversity scores due to the stochastic decoding algorithm it utilizes. However, the diversity hurts its quality, as we expected. Hence, Unpretrained-paraphrasing achieves the second-lowest Fluency score, while its Relevance score is only slightly higher than that of Round-trip translation. The DBS-PG decoding algorithm could improve its paraphrase quality, resulting in higher Relevance and Fluency scores, yet lowering its Diversity.
For the Diversity results, there is a surprisingly significant gap between the manual and automatic evaluations. The two translation-based baselines achieve outstanding diversity scores in automatic evaluation. However, their diversity scores in manual evaluation are surprisingly low. We suspect this is because they suffered severe information loss during generation, which shortened the generated sentences. Human judges may tend to give lower diversity scores when evaluating shorter sentences. Since beam search is one of the causes of the length problem [48], Unpretrained-paraphrasing decoded using Top-K Sampling could avoid this problem and still achieve remarkable performance in manual evaluation. In the appendix, we provide some comparative examples so that readers can perform a subjective evaluation on their own.

Comparison of different training strategies
We conduct an experiment to verify the effectiveness of the two-step training strategy. For comparison, we train two paraphrase models using only the training corpus in the general domain or only the one in the computer science domain, respectively. The result is shown in Table 4. From the table, we can tell that the two-step training strategy achieves the highest scores in Cosine similarity and Distinct-2. Meanwhile, $\text{Zeppel}_{\text{domain}}$, which is trained over a relatively small corpus, performs the worst. This means that the effect of the domain paraphrase is highly dependent on the scale of the corpus as well as on the domain-specific corpus.
The comparison with the aligned model is shown in Table 5. We can see from the table that the Inverse Self-BLEU obtained by the aligned model is far lower than that of Zeppel, while the two models do not differ much in cosine similarity and Distinct-2. This shows that, because the encoder and decoder of the aligned model use the same model, they have a consistent understanding of the Chinese language. Thus, the paraphrases generated by the aligned model are highly similar to the input sentence not only in semantics, but also in content.

Comparison with different decoding algorithms
We also run an experiment to verify the effectiveness of the Diverse Beam Search for Paraphrase Generation, the decoding algorithm we proposed for the purpose of improving the diversity of paraphrased text. We employ the Greedy Search, Beam Search, and Diverse Beam Search as baseline decoding algorithms to generate paraphrase text for comparison. The result is shown in Table 6.
As can be seen from the table, Greedy Search and Beam Search achieved the finest performances on Cosine similarity due to their maximum-likelihood decoding objective, meaning paraphrased texts generated by these algorithms are most semantically similar to input text. However, their performance on Inverse Self-BLEU is not satisfactory. Such a low Inverse Self-BLEU score indicates that a significant portion of the generated n-grams is repetitive with the input text. The Diverse Beam Search decoding algorithm remarkably Bold font indicates the best performance for each metric improves the Self-BLEU score of the generated paraphrased texts. That is why most existing text paraphrasing systems adopt it as the default decoding algorithm. However, the improvement of diversity harms the semantic expression accuracy of the generated text. The Cosine similarity score decreased by nearly 10 percent. By utilizing Hamming distance penalty, N-gram repetition penalty, and Levenshtein distance threshold between the input text and the output beam groups, our DBS-PG achieves the best diversity performance (Inverse Self-BLEU) while still gaining a remarkable improvement in semantic expression accuracy (Cosine similarity) over Diverse Beam Search. To better understand how these penalties and threshold contribute to the overall performance, we further perform three ablation studies that run DBS-PG without one of these penalties and threshold. The results are also shown in Table 6. We see that the Levenshtein distance threshold contributes a lot to the Inverse Self-BLEU. The DBS-PG without Levenshtein distance penalty decreases by 0.136 compared with the full-form DBS-PG since it forces the generated texts to be different from the input text. However, it shows a 0.020 increase in the Cosine similarity, which means introducing the Levenshtein distance threshold will slightly hurt the semantic representation of the produced paraphrases. 
Both the Hamming distance penalty and the N-gram repetition penalty improve the Cosine similarity. We argue that these penalties expand the search space of the beam search, which leads to finding paraphrases that are semantically closer to the input. The Hamming distance penalty and the N-gram repetition penalty also contribute to the Inverse Self-BLEU, since we take the input text into consideration rather than only considering the penalties between generated groups as in vanilla DBS. The N-gram repetition penalty also slightly improves Distinct-2. However, owing to the excellent text generation capability of the Chinese GPT2, the paraphrases generated by all decoding algorithms achieve high Distinct-2 scores, and the differences among them are minor. High Distinct-2 scores indicate that the generated texts are of high quality and informative from a stand-alone perspective.
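As a reminder of what Distinct-2 measures, the sketch below computes the commonly used definition: the number of unique bigrams divided by the total number of bigrams in the generated texts. The corpus-level aggregation and the character-level tokenization are assumptions made for illustration; the exact configuration used for Table 6 is not restated in this section.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over a list of token sequences."""
    total = 0
    unique = set()
    for tokens in texts:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    # Character-level tokens (an assumption; word-level tokens work the same way).
    repetitive = [list("abab")]  # bigrams: ab, ba, ab
    varied = [list("abcd")]      # bigrams: ab, bc, cd
    print(distinct_n(repetitive), distinct_n(varied))
```

A score near 1 means that almost no bigram is repeated, which is why fluent output from a strong language model tends to score high regardless of the decoding algorithm.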

Comparison of training costs of different models
In Table 7, we list the model size, the training corpus, and the training cost of our Zeppel model. For comparison, we also list two other deep neural network-based zero-shot paraphrase models: Thompson and Post's model and Guo et al.'s model. Thompson and Post employ a vanilla sequence-to-sequence transformer model with 745M parameters. They trained their model with 99.8 million sentences in 39 languages on a server with four Nvidia RTX 2080ti GPUs for about six weeks. According to their paper, the cost of a single training procedure is close to $13000. Guo et al.'s model is an auto-regressive transformer model, that is, a GPT-like model, with 110 million parameters. They trained their model with 125.9 million sentences, about 25% more than Thompson and Post. However, the training cost should be much lower since their model is relatively small. They did not disclose the hardware platform or the time consumption of their training procedure. A reasonable guess is that such a model may also take several days to converge on a multi-GPU server. We re-implemented their approach, which took eight days of training on a server with eight 3090 GPUs. The final loss we obtained was 0.821. Further training could possibly enhance its performance, but at additional expense. Since our Zeppel model is based on the multilingual BERT and Chinese GPT2 models, both well pre-trained, it takes less than 24 hours to fine-tune Zeppel on a single-GPU workstation. Moreover, the corpora used to pre-train BERT and GPT2 are much larger than the training corpora of Thompson and Post and of Guo et al. As a result, the Zeppel model can achieve better language understanding and generation capability than models trained from scratch, with significantly less training effort.
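The durations quoted above can be put on a common scale by converting them to GPU-hours. The sketch below does only this arithmetic. It is a rough comparison under clear assumptions: the GPU models differ across the three setups, "about six weeks" is taken as exactly 42 days, the Zeppel figure uses the stated 24-hour upper bound, and the pre-training cost of BERT and GPT2 is not counted.

```python
HOURS_PER_DAY = 24


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run on `num_gpus` devices lasting `days` days."""
    return num_gpus * days * HOURS_PER_DAY


# Durations and GPU counts as quoted in the text above.
runs = {
    "Thompson and Post": gpu_hours(4, 6 * 7),         # four GPUs, about six weeks
    "Guo et al. (re-implementation)": gpu_hours(8, 8),  # eight GPUs, eight days
    "Zeppel (upper bound)": gpu_hours(1, 1),          # one GPU, under 24 hours
}

if __name__ == "__main__":
    baseline = runs["Zeppel (upper bound)"]
    for name, hours in runs.items():
        print(f"{name}: {hours:.0f} GPU-hours ({hours / baseline:.0f}x Zeppel)")
```

Under these assumptions, the two from-scratch runs come to roughly 4032 and 1536 GPU-hours, against at most 24 GPU-hours for fine-tuning Zeppel.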

Conclusion
In this work, we propose a novel zero-shot domain paraphrase approach named Zeppel. We train it with an English-to-Chinese aligned bilingual corpus. Then, by inputting a Chinese sentence into it, this model could surprisingly generate fluent and diverse Chinese paraphrases. Experiment results show that our approach significantly outperforms baselines regarding Relevance, Fluency, and Diversity. In the future, we would like to explore its applicability in an online academic paraphrase system, like Langsmith [49]. Moreover, since it is much easier to acquire a machine translation corpus than a paraphrase corpus, other low-resource languages with a decent pre-trained autoregressive language model, like Japanese with japanese-gpt2 10, and Korean with KoGPT2 11, may also potentially benefit from our zero-shot paraphrasing approach.

Funding This research was supported by the Natural Science Foundation of Sichuan Province (2022NSFSC0503) and the Sichuan Science and Technology Program (2022ZHCG0007).

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.