HMPT: a human–machine cooperative program translation method

Program translation aims to translate programs from one programming language to another, e.g., from Python to Java. Because constructing translation rules with pure human effort (by software engineers) is inefficient, and machine-only translation yields low-quality results, it is natural to implement program translation in a human–machine cooperative way. However, existing human–machine program translation methods fail to utilize human ability effectively: they require the human to post-edit the results (i.e., to statically modify the model-generated code directly). To solve this problem, we propose HMPT (Human-Machine Program Translation), a novel method that achieves program translation based on human–machine cooperation. It can (1) reduce human effort by introducing a prefix-based interactive protocol that feeds the human's edit into the model as a prefix and regenerates better output code, and (2) reduce the interactive response time caused by excessive program length in the regeneration process from two aspects: avoiding duplicate prefix generation by caching attention information, and reducing invalid suffix generation by splicing the suffix of the first-round results. Experiments are conducted on two real datasets. Results show that, compared to the baselines, our method reduces human effort by up to 73.5% at the token level and reduces response time by up to 76.1%.


Introduction
In the field of intelligent technology, programming languages evolve much faster than natural languages. For example, over the decades, more than 2,500 high-level programming languages have emerged worldwide. There is often a succession of programming languages in the same field; for example, in web development, languages such as Java, Python, and Golang have emerged.
Companies often need to rewrite their programs in new programming languages in order to adapt to the trend of software development. As a result, there is significant demand for program translation. Program translation is to rewrite existing code from one programming language into another, e.g., from Python to Java. Effective program translation methods can help companies in many practical development scenarios, such as: (1) Application modernization. For example, the Commonwealth Bank of Australia spent five years and $750 million to convert its platform from COBOL to Java (Lachaux et al. 2020). In addition, during COVID-19, the massive volume of visits overwhelmed the unemployment benefit systems in some areas of the United States, all of which were coded in archaic languages (Kelly 2020). (2) Multi-language versions of applications on different platforms. For example, when an enterprise plans to port Windows-based desktop programs to the Android platform, it may need to rewrite its C++ programs in Java.
The traditional way to conduct program translation is to write a compiler from the source language to the target language in a purely manual way, which is expensive and inefficient since it requires experts with sophisticated bilingual knowledge and knowledge of compilation principles. The explosive growth of the amount of code on hosting platforms such as GitHub has facilitated machine learning-based program translation research. Such learning-based program translation employing only the machine has low cost and high efficiency, but it introduces intolerable program bugs: errors that could be overlooked in natural language translation can break the overall program. For example, the TransCoder-DOBF model released by Facebook in 2021 (Lachaux et al. 2021) has only 50% computational accuracy in translation between Python and Java, so it cannot be directly used for enterprise development. We also note that some simple human-machine collaboration methods have been proposed for program translation tasks, such as confidence highlighting and alternative translations (Weisz et al. 2021). However, except in a few cases (for example, the system automatically modifies all instances of a variable when the user renames it), these methods require the software engineer to post-edit the output code, and the information about the user's changes is not fed back to the model.
To solve this tricky problem, we propose to adopt the Human-Machine Computing (HMC) (Yu et al. 2021) framework in program translation. The HMC framework has yielded good results in many research areas (Yang et al. 2021; Chai et al. 2020; Kim and Pardo 2018; Yuksel et al. 2020). The core idea of HMC is to take the strengths of both human and machine so that the overall quality, i.e., accuracy and efficiency, can be improved through continuous iterative updates. However, this also poses two challenges. First, the currently proposed approach is simply to modify the output code statically (i.e., post-editing), which requires a large amount of human effort (i.e., number of corrections). To effectively reduce the human effort, how should software engineers collaborate iteratively with program translation models? Second, compared to translation tasks in natural language, program translation has much longer text (for example, the longest program in the TransCoder test dataset has 548 tokens), while translation models use the Transformer with large deep neural networks (Vaswani et al. 2017), which inevitably causes high response times during subsequent iterations and seriously affects the user's collaboration experience.
To solve these problems reasonably and effectively, first, we propose to feed the software engineer's edit on the currently generated code back into the model and to accomplish this task through interactive collaboration. Specifically, we introduce a prefix-based interactive protocol from the field of natural language interactive translation to the program translation task. Figure 1 shows an example of our proposed method and demonstrates that it effectively reduces human effort. Second, to reduce the response time, we avoid invalid calculations during the interaction. Specifically, during re-translation, we achieve this goal in two aspects: avoiding duplicate prefix generation with the attention information cached in the previous round of translation, and reducing invalid suffix (i.e., non-prefix parts) generation by aborting early at the appropriate time step and splicing the suffix of the output code of the first round.
In this paper, we propose a human-machine interactive program translation method, namely HMPT (Human-Machine Program Translation). The process of each round of interaction is as follows. First, when the program translation model translates the source program, the software engineer corrects the output code and feeds the edit back to the program translation model in the form of a prefix, based on which we determine the valid cached attention information. Second, based on the feedback, the model re-translates and reduces duplicate/invalid generation in two parts of the code: (1) Duplicate prefix: we use cached attention information to skip the generation of duplicate prefixes. (2) Invalid suffix: we observe that some parts of the generated suffix are invalid, so we splice a suffix instead of generating it. In each iteration, we use a two-step approach to determine whether the current position can be spliced with a certain suffix; if it can, we abort the inference early. The interaction is repeated until the software engineer is satisfied with the result.

Fig. 1 Example of the proposed HMPT method. The software engineer gets an initial program translation result (model-generated code, iteration 1). However, he prefers the "for" loop over the "while" loop and thus corrects the first unsatisfactory token "int" to "for" (software-engineer-edited code, iteration 1). Then, the HMPT-based program translation model re-translates based on his correction (model-generated code, iteration 2) and finally outputs a satisfactory result (software-engineer-edited code, iteration 2). Compared to the user post-editing the output code (which requires 11 edits), our method requires only 1 edit.

In summary, we make the following contributions:
• To reduce human effort, HMPT feeds the software engineer's edit into the model by introducing a prefix-based interactive protocol. To the best of our knowledge, this is the first attempt to introduce a natural language interactive translation approach to the program translation task.
• To reduce the interactive response time, HMPT avoids a large number of calculations by reducing duplicate prefix generation and invalid suffix generation.
• We have conducted extensive experiments on two datasets. The results show that our approach reduces human effort by up to about 73.5% at the token level compared to the human post-editing method, and our proposed optimizations for prefix and suffix generation reduce the response time by up to about 76.1%. It also provides the user with flexible trade-offs between human effort and response time.
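The token-level human-effort comparison above (one correction of the first mismatching token per interaction round, versus post-editing every wrong token) can be sketched as follows. This is a minimal illustration; the function and its input layout are our own illustrative assumptions, not the paper's evaluation code:

```python
def count_token_edits(rounds):
    """Token-level human effort under the interactive protocol: in each
    round the engineer corrects only the first token that deviates from
    the intended program, so effort = number of rounds with a mismatch.

    `rounds` is a list of (model_output, reference) token-list pairs,
    one pair per interaction round.
    """
    edits = 0
    for output, reference in rounds:
        for out_tok, ref_tok in zip(output, reference):
            if out_tok != ref_tok:
                edits += 1  # one correction in this round
                break
    return edits
```

Under this scheme, the Fig. 1 example costs one edit per round with a mismatch, whereas post-editing would cost one edit for every wrong token (11 in total).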

Related work
Language is the bridge of communication. To solve the problem of natural language translation, researchers put forward the idea of machine translation.
With the continuous efforts of scholars, machine translation has developed rapidly. Programming languages, as special sequences of text, are more structured than natural languages. The rapid development of programming languages leads to increasing demand for software iteration, and the huge cost of cross-platform porting makes the demand for program translation grow day by day. Some scholars have explored program translation, but the translation results are poor: translated programs often contain errors that prevent compilation, and the results usually need manual post-editing, which consumes a huge amount of manpower and time. How to guarantee the quality of program translation while effectively reducing human effort is the biggest challenge. To solve the above problems, inspired by interactive natural language translation, this paper proposes a human-machine collaborative interactive program translation system, which significantly reduces the cost of program translation. Different from previous studies, this paper feeds human corrections back into the model, realizing a new mode of human-machine cooperation.

Machine translation of natural language
Machine translation refers to the automatic conversion of source-language text to target-language text by intelligent equipment such as computers. Machine translation was originally used in the field of natural language translation. Machine translation methods are mainly divided into rule-based methods and data-driven methods. Rule-based approaches require formal grammar rules defined by experts with linguistic knowledge of both the source and target languages. Data-driven approaches rely on bilingual corpora and models.
Rule-based machine translation Rule-based machine translation (RBMT) systems basically consist of two components: the rules, which account for the syntactic knowledge, and the lexicon, which deals with the morphological, syntactic, and semantic information. Both rules and lexicons are grounded in linguistic knowledge and generated manually by expert linguists. Similarly, rule-based code translation needs experts to write rules manually. But programming languages are much more structured than natural languages, and the more complex the languages are, the harder it is to write rules. As a result, the accuracy of rule-based program translation depends on the quality of the written rules.
Data-driven machine translation Data-driven machine translation can be divided into statistical machine translation and neural machine translation. Statistical machine translation uses mathematical statistical models to complete the translation work: it equates the translation problem with solving a probability problem, essentially finding the target-language text with the maximum probability corresponding to the source-language text. Building on these basic ideas, neural machine translation implements direct translation from the source language to the target language using neural networks. In the 1990s, Castano and Casacuberta (1997) implemented a neural network-based translation approach using small-scale corpora, which is the basis for program translation. On the basis of the above research, Zamora-Mart (2010) describes a machine translation system that integrates a neural network language model in the decoding process. Sundermeyer et al. (2014) present two different translation models using recurrent neural networks: the first is a word-based approach using word alignments; the second is a rule-based translation model with phrase-based decoding. Because neural machine translation does not need complex feature engineering, it has become the mainstream machine translation method.
The above content introduces the development of natural language machine translation methods. Natural language machine translation was a bold attempt by scholars in the field of machine translation. The program translation studied in this paper is a further attempt built on natural language machine translation. Program translation imposes stronger structural requirements on the generated results, and scholars have therefore made great efforts in this direction.

Translation of programming languages
Current research on program translation can be divided into four main categories: supervised translation, unsupervised translation, pre-training and fine-tuning, and human-machine collaboration program translation.
Supervised program translation Earlier program translation research mainly used supervised translation methods on massively parallel corpora. Initially, phrase-based statistical machine translation models were primarily used for program translation between Java and C# (Nguyen et al. 2013; Karaivanov et al. 2014). Aggarwal et al. (2015) used a similar approach to train a translation model from Python 2 to Python 3 on a parallel corpus generated by the open-source library 2to3. Later, with the development of deep learning, research on program code translation based on the Tree2Tree model emerged. The main idea of such research is to take advantage of the fact that a program can be parsed into an abstract syntax tree by a syntax parser and to use the location information of the parse tree for translation. Among them, Chen et al. (2018) used the attention mechanism to locate the corresponding subtree in the source tree and used the information of the subtree to guide non-terminal expansion. Drissi et al. (2018) used grammar rules of the target language to generate grammatically correct programs. Ahmad et al. (2022) used program summaries as an intermediate language and performed a reverse translation similar to target-to-NL-to-source generation to train the model.
Unsupervised program translation A large amount of monolingual source code exists on code hosting platforms such as GitHub and Gitee. This monolingual source code represents a wealth of language information. Lachaux et al. (2020) proposed TransCoder for translation between C++, Python, and Java by training a completely unsupervised program translation model on the source code of GitHub projects. The core idea is to use the classical unsupervised machine translation method in three major stages: Masked Language Modeling (MLM) (Devlin et al. 2019), denoising auto-encoding (DAE) (Vincent et al. 2008), and back-translation (BT) (Sennrich et al. 2016). Roziere et al. (2021) used unit tests to evaluate the TransCoder results and fine-tuned the model on the correct translations to obtain greater accuracy.
Pre-training and fine-tuning Since the current datasets for program translation are characterized by large monolingual datasets (e.g., code in repositories such as GitHub) and small bilingual parallel corpora, the best solution is to pre-train the model on the monolingual set and then fine-tune on the downstream tasks with the parallel corpus. Among them, Lachaux et al. (2021) recover the original version of obfuscated source code to pre-train the model. Wang et al. (2021) proposed three pre-training tasks based on the T5 model: span prediction, identifier prediction, and identifier tagging. Span prediction randomly masks spans with arbitrary lengths and lets the model predict these masked spans combined with some sentinel tokens at the decoder. Identifier prediction replaces identifiers with a specific token for the model to predict. Identifier tagging aims to make the model judge whether a token is an identifier. Feng et al. (2020) used a BERT-like pre-training task on an open-source GitHub corpus, and Guo et al. (2021) used semantic-level code structure information such as data flow. Unlike abstract syntax trees, it encodes the "where-the-value-comes-from" relation between variables, which makes model pre-training more efficient. Ahmad et al. (2021) pre-trained the model by using denoising auto-encoding on collected functions and associated natural language texts. Zhu et al. (2022) used denoising auto-encoding of cross-lingual snippets and a Multilingual Snippet Translation (MuST) pre-training approach. Chakraborty et al. (2022) proposed the pre-training task of "de-naturalizing" the source code (i.e., converting it to a semantically identical equivalent form) and then naturalizing it. Niu et al. (2022) synthesized the source code, the structural information of the source code, and the relevant natural language description, and defined three pre-training tasks, namely Masked Sequence to Sequence (MASS), Code-AST Prediction (CAP), and Method Name Generation (MNG).

1 https://docs.python.org/2/library/2to3.html.
2 https://gitee.com/.
These pre-training methods use the model parameters of a previously learned task to initialize the model parameters of the new task. In this way, old knowledge helps the new model perform new tasks from old experience, rather than starting from scratch. In this paper we also use a pre-trained model, which improves the accuracy of program translation.
Human-machine collaboration program translation For translation between programming languages such as Java and Python, according to the research of Ahmad et al. (2021), the Seq2seq machine translation model with the attention mechanism has a compile accuracy of only 66.5%, and the Transformer machine translation model has a compile accuracy of only 61.5%. Accordingly, reasonable cooperation between the machine translation model and the software engineer has become an effective solution for actual project development. Weisz et al. (2021) proposed a preliminary human-computer collaboration scheme, i.e., confidence highlighting and alternative translations, to assist engineers in the program code translation process. Weisz et al. (2022) also clearly suggested the current need for interactive program translation tools.
However, the current human-machine cooperative method does not feed the correction information reviewed by the software engineer back to the translation model, resulting in low efficiency (Weisz et al. 2021).

Interactive-predictive machine translation in natural language
The results of natural language machine translation always contain errors and therefore often require manual modification. To improve the efficiency of manual editing, Foster et al. (1997) introduced interactive-predictive machine translation (IPMT), which investigates how human feedback can act on subsequent machine translation in the process of human-machine interaction to generate a better translation through continuous iterative correction. Subsequent research can be broadly divided into two categories: interactive-predictive statistical machine translation (IPSMT) and interactive-predictive neural machine translation (IPNMT).

IPSMT in natural language The initial research on interactive translation was mainly based on statistical machine translation. Barrachina et al. (2009) used manually verified prefixes to generate more suitable suffixes in SMT systems. Koehn et al. (2014) considered using more lenient matching on the search graph to increase accuracy, such as matching only the last word of the prefix or case-insensitive matching. Sanchis-Trilles et al. (2014) considered pruning strategies for word graphs to reduce the response time. Alabau et al. (2014) considered introducing multimodal information (e.g., online handwritten text recognition) into the IMT system.
IPNMT in natural language With the success of neural machine translation, Peris et al. (2017) reimplemented prefix-based protocols in a neural machine translation system and proposed a fragment-based collaboration approach. Later, they extended the granularity of the prefix-based collaborative approach from the word level to the character level and introduced online learning into interactive translation systems (Peris and Casacuberta 2019). Gupta et al. (2020) incorporated lexical syntactic descriptions into the Transformer model in the form of CCG supertags and implemented prefix-based interactive protocols with the Transformer model. Lam et al. (2018) considered the user's rating of translation results as a collaborative approach and used this quality feedback as a reward in reinforcement learning to guide the model's parameter updates. Zhao et al. (2020) used reinforcement learning to jointly predict whether a target word required human guidance based on the history of partial translations and human involvement. Weng et al. (2019) extended the interaction protocol to allow humans to correct words at arbitrary positions in the translation and enabled the model to learn from the modifications.
Compared with post-editing, interactive natural language machine translation, which introduces human-machine interaction, significantly reduces human effort. Inspired by it, we propose interactive program translation, likewise to reduce the human effort of correcting program translation results. Because of the stronger structure of programming languages, the program translation proposed in this paper has higher requirements than natural language translation in terms of interaction mode and model design.

Human-machine interactive program translation
This section describes the overall construction process of the proposed human-machine interactive program translation method, i.e., HMPT. We divide this section into two main parts according to the objectives.
First, to reduce human effort, we observe that structured texts like programs are more sensitive to different prefixes than natural language translations. Therefore, we introduce prefix-based interactive protocols from natural language translation into the program translation task, i.e., PHM (Prefix-based Human-Machine interactive program translation). It divides the prediction of the program translation model into two stages (Peris et al. 2017). If the user corrects the prefix, then in the next round of decoding, the tokens before the prefix position are fixed to the tokens verified by the user, i.e., restricted prediction; at the remaining positions, the token with the maximum probability calculated by the neural translation model is used as the output, i.e., free prediction. This approach can be very effective in reducing the human effort of software engineers. However, it introduces unbearable response times for users, which makes it almost impractical.
Next, to effectively reduce the response time, we improve the PHM method to significantly reduce response time while allowing a slight increase in human effort. First, we re-divide the two stages of the PHM method into three stages in terms of whether the software engineer gets the same translation results. We note that the first stage repeats a sub-process of the previous translation, which is unnecessary. Therefore, to maximally avoid duplicated computational overhead, we propose combining the attention cache technique of current Transformer frameworks and skipping the first stage directly. It is important to note that this improvement does not affect human effort. Then, we suggest pessimistically assuming that the generated program fragment contains a bug. This means that software engineers do not need some of the program suffixes generated in this round. Therefore, we suggest aborting the current translation process at an appropriate time to minimize this expense. Its core is choosing the abort point, for which we propose a two-step prediction method. After the abort point, the generated program fragment is spliced with a certain suffix of the output code of the first round. Finally, we propose a general inference process that combines these two optimization methods, i.e., HMPT. Figure 2 shows the overall process of HMPT. Next, we first introduce PHM in Section 3.1 and then introduce HMPT in Section 3.2. For convenience, all the notations used in this paper and their corresponding meanings are given in Table 1.

PHM
Almost all current research on program translation is based on the neural machine translation framework introduced by Castano and Casacuberta (1997), which has since been widely studied (Cho et al. 2014; Klein et al. 2017).
In the field of natural language translation research, interactive-predictive machine translation has superior properties compared to human post-editing, and it can better utilize feedback information from users to guide the model through the subsequent translation process (Peris et al. 2017). Clearly, interactive-predictive machine translation can effectively combine feedback from the user with the decoding phase of a translation model. However, to the best of our knowledge, there is no relevant research on interactive machine translation methods for program translation.
Considering structured text like program code, a few character changes can have a huge impact on the subsequent program fragments, especially keyword changes such as for to while or switch to if. Therefore, we introduce prefix-based interactive protocols from natural language translation into the program translation task. In this protocol, the software engineer checks the target program $y_{p_1}^{I_p} = y_{p_1}, \ldots, y_{p_{I_p}}$ generated by the translation model token by token and corrects the first incorrect token $y_{p_i}$ to $\hat{y}_{p_i}$. This indirectly means that the software engineer accepts the original prefix $y_{p_1}^{i-1}$, resulting in a valid prefix $\hat{y}_{p_1}^{i}$. The program translation model then searches in a more restricted space with the feedback $f = \hat{y}_{p_1}^{i}$ as a constraint to generate a better target program in the next round. The basic equation of this process is as follows:

$$\hat{y}_1^{\hat{I}} = \operatorname*{arg\,max}_{I,\, y_1^{I}} p\big(y_1^{I} \mid x_1^{J}, f\big) = \operatorname*{arg\,max}_{I,\, y_1^{I}} \prod_{t=1}^{I} p\big(y_t \mid y_1^{t-1}, x_1^{J}, f\big) \tag{1}$$

The first part of this equation means that, given the source program $x_1^{J}$ and the correct prefix $f$, the goal is to find the target program $\hat{y}_1^{\hat{I}}$ with maximum probability. The second part means that this probability is equivalent to the product of the probabilities at each time step.
Since almost all current program translation models are based on variants of the Transformer architecture, and inspired by research on the prefix-based interactive protocol with the Transformer model in natural language translation (Ott et al. 2019; Gupta et al. 2020), we introduce this method into the program translation task.
Specifically, the program translation process of the next round is divided into two stages. When the time step does not exceed $i$, the probability distribution of the decoder's output over the vocabulary is constrained. When the time step exceeds $i$, the decoder freely predicts the next token until the terminator is predicted. Since in prefix-based interactive protocols the tokens before the prefix position are already restricted to the user-verified tokens, the output probability of this part must satisfy:

$$p\big(y_t \mid y_1^{t-1}, x_1^{J}, f\big) = \delta\big(y_t, \hat{y}_{p_t}\big), \quad t \le i \tag{2}$$

where $\delta(\cdot,\cdot)$ is the Kronecker delta, $A_t$ is the output of the decoder of the Transformer model at time step $t$ (i.e., the sequence output), $V_o \in \mathbb{R}^{v \times d}$ is the weight matrix of the linear layer, $v$ is the size of the vocabulary, and $d$ is the hidden layer dimension. In the neural machine translation model, a softmax output layer is used to obtain the distribution over all words at the current moment, i.e., the probability of each word in the target vocabulary is calculated with the softmax function. Letting the output vector of the Transformer at time step $t$ be $A_t$, $p(y_t \mid y_1^{t-1}, x_1^{J})$ can be redefined as:

$$p\big(y_t \mid y_1^{t-1}, x_1^{J}\big) = \bar{y}_t^{T}\, \mathrm{softmax}\big(V_o A_t\big) \tag{3}$$

where $A_t$ is the output of the Transformer-based program translation model and $\bar{y}_t$ is the one-hot representation of the word $y_t$. Under this approach, we simply need to provide the feedback $f = \hat{y}_{p_1}^{i}$ to the model during the next iteration; accordingly, the probability of each word at each time step $t$ is:

$$p\big(y_t \mid y_1^{t-1}, x_1^{J}, f\big) =
\begin{cases}
\delta\big(y_t, \hat{y}_{p_t}\big), & t \le i \\
\bar{y}_t^{T}\, \mathrm{softmax}\big(V_o A_t\big), & t > i
\end{cases} \tag{4}$$

This formula means that when $t$ is less than or equal to $i$, the generated token is forced to the corresponding token of the verified prefix via the Kronecker delta, and when $t$ is greater than $i$, the best token is obtained based on the maximum probability.
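The two-stage (restricted/free) prediction described above can be sketched as a greedy decoding loop. This is a minimal sketch, assuming `step_logits` is a stand-in for the Transformer decoder (it returns next-token logits given the tokens generated so far) and token id 0 is the terminator; both are illustrative assumptions, not our actual model interface:

```python
def constrained_decode(step_logits, prefix, eos=0, max_len=50):
    """Prefix-based interactive decoding sketch.

    For time steps inside the user-verified prefix, the output token is
    forced to the verified token (restricted prediction); afterwards the
    decoder predicts freely until the terminator (free prediction).
    """
    tokens = []
    for t in range(max_len):
        if t < len(prefix):
            tokens.append(prefix[t])      # restricted prediction
            continue
        logits = step_logits(tokens)      # free prediction
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens
```

Each interaction round re-runs this loop with a prefix that is one verified token longer than before.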

HMPT
This section optimizes the PHM method to reduce the response time. First, Sections 3.2.1 and 3.2.2 mainly describe caching attention to maximally avoid the computation caused by duplicate prefix generation. Then, Sections 3.2.3 and 3.2.4 mainly describe reducing invalid suffix generation by splicing suffixes. Finally, Section 3.2.5 describes the inference process that combines the two.

Stage redivision
In the PHM method, when the time step is less than or equal to the user-edit position $i$, the probability distribution of the output is forcibly changed. In fact, when the time step is strictly less than $i$ (i.e., the editing position has not been reached), the purpose is to reproduce the state context information up to time step $i$ of the previous round of translation, which affects the inference of subsequent time steps due to the attention mechanism.
We note that when $t \in [0, i-1]$, the token corresponding to the maximum value of the generated output probability distribution is consistent with the user's requirements. Although its probability distribution does not strictly conform to the first case of Eq. (4), the two probability distributions are equivalent from the point of view of their impact on the subsequent inference of the current round (because the token fed into the decoder at the next time step is the same in both cases). When $t = i$, the user makes a token correction; in this case, the token corresponding to the maximum value of the output probability distribution produced by the model does not meet the user's requirements, so the output probability distribution must be changed forcibly.
Therefore, we propose to re-divide Eq. (4) into the following stages, which produce exactly the same token output as Eq. (4) (note that it is not the same probability distribution output; however, it is the exact same translation result for the software engineer):

$$p\big(y_t \mid y_1^{t-1}, x_1^{J}, f\big) =
\begin{cases}
\bar{y}_t^{T}\, \mathrm{softmax}\big(V_o A_{p_t}\big), & t < i \\
\delta\big(y_t, \hat{y}_{p_t}\big), & t = i \\
\bar{y}_t^{T}\, \mathrm{softmax}\big(V_o A_t\big), & t > i
\end{cases} \tag{5}$$

This formula means that when $t$ equals $i$, the generated token is forced to the corrected token via the Kronecker delta; when $t$ is less than $i$, the best token generated is the same as the result of the previous round and can be computed directly using $A_{p_t}$; and when $t$ is greater than $i$, the best token is obtained based on the maximum probability. $A_{p_t}$ is the output of the decoder at time step $t$ during the previous round of translation of the program translation model. We propose to skip the sub-process with time steps $t \in [0, i-1]$ entirely in order to avoid repeated calculations, in which case Eq. (1) is rewritten as:

$$\hat{y}_1^{\hat{I}} = \operatorname*{arg\,max}_{I,\, y_1^{I}} \prod_{t=i}^{I} p\big(y_t \mid y_1^{t-1}, x_1^{J}, f'\big) \tag{6}$$

where $f'$ contains, in addition to the user-verified prefix $\hat{y}_{p_1}^{i}$, the relevant state information of the previous round of translation at time step $i$ that affects subsequent generation, thus maximizing the avoidance of duplicate computation.
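The re-divided process above can be sketched as follows. This is a minimal illustration, again assuming a stand-in `step_logits` decoder: tokens before the edit position are copied from the previous round instead of being regenerated, so decoding restarts directly at the edited position:

```python
def regenerate(step_logits, prev_tokens, edit_pos, new_token, eos=0, max_len=50):
    """Re-translation sketch that skips time steps t in [0, edit_pos - 1].

    The verified prefix is reused from the previous round, so the cost of a
    round drops from O(I_hat) to O(I_hat - i) decoder steps.
    """
    tokens = list(prev_tokens[:edit_pos])  # t < i: reuse previous output
    tokens.append(new_token)               # t = i: force the user's correction
    for _ in range(max_len):               # t > i: free prediction
        logits = step_logits(tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens
```

In the real model, reusing the prefix also requires restoring the cached attention states for those positions (Section 3.2.2); here only the token reuse is shown.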

Attention cache
Almost all program translation models proposed in current studies are based on the Transformer architecture (Vaswani et al. 2017), which consists mainly of an encoder and a decoder, where the decoder is a stack of identical layers, each with a self-attention sub-layer and an encoder-decoder attention sub-layer. Crucial in these sub-layers is the calculation of attention (Vaswani et al. 2017):

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K and V are the query matrix, the key matrix and the value matrix, respectively, whose values affect the subsequent decoding. Q and K are dot-multiplied to obtain a score, which is normalized by dividing by √d_k (the square root of the key dimension). The normalized scores are turned into weights by the softmax function, and the weights and V are dot-multiplied to obtain the final feature output.
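This attention computation can be sketched in a few lines of NumPy (a minimal illustration of the standard Vaswani et al. formula, not the paper's model code; the function name and the random test matrices are ours):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al. 2017)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # dot-product scores, normalized
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per query row
    return weights @ V                       # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 8)
```

Because the softmax weights of each query row sum to one, each output row is a convex combination of the value vectors.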
In general, as shown in Fig. 3, to allow the model to learn in different representation subspaces, the query, key and value matrices are sliced along the hidden dimension d, i.e., a multi-head attention mechanism (Vaswani et al. 2017):

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_j = Attention(Q W^Q_j, K W^K_j, V W^V_j)

where j denotes the id of the head; W^Q_j ∈ ℝ^{d×(d/h)}, W^K_j ∈ ℝ^{d×(d/h)} and W^V_j ∈ ℝ^{d×(d/h)} are parameter matrices that linearly transform the Q, K and V matrices to the dimension d/h assigned to each head; and W^O is the parameter matrix used to fuse the attention results computed by the heads.
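The multi-head slicing can likewise be sketched as follows (a toy illustration in which the parameter matrices W^Q_j, W^K_j, W^V_j and W^O are randomly initialized; in a real model they are learned):

```python
import numpy as np

def multi_head_attention(Q, K, V, h, rng):
    # Project Q, K, V into h heads of size d/h, attend per head,
    # concatenate the head outputs, and fuse them with W_O.
    d = Q.shape[-1]
    assert d % h == 0, "hidden dimension must be divisible by the head count"
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(size=(d, d // h)) for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        s = q @ k.T / np.sqrt(d // h)                 # scaled dot-product
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # softmax weights
        heads.append(w @ v)
    W_o = rng.normal(size=(d, d))                     # fusion matrix W^O
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # 5 positions, d = 8
out = multi_head_attention(X, X, X, h=4, rng=rng)
assert out.shape == (5, 8)
```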
We note that, to avoid repetitive computation at each time step during inference, a cache mechanism (Vaswani et al. 2018) is implemented in most frameworks. This mechanism can be considered an acceleration technique within a single round of inference.
Inspired by this mechanism, we propose to directly use the cached state information to replace the first stage of Eq. (5), maximally avoiding computation repeated from the previous round. As shown in the "Feedback" and "Attention Cache Module" parts of Fig. 2, the position where inference starts in the next round skips directly to the position i edited by the user. The time complexity of the next inference process therefore changes from O(Î) to O(Î − i), where i increases monotonically with the number of user interactions, so the response time of this process becomes shorter and shorter as the interaction proceeds. Specifically, we propose dividing the state information into two parts depending on whether it relates to the editing position. The key matrices and the value matrices of all encoder-decoder attention sub-layers remain unchanged after the first round of interaction. In the decoder self-attention sub-layers, each position attends to all positions from the start up to the current one, so the key and value matrices are related to the time step; thus the next round of translation requires the state information of the decoder self-attention sub-layers up to position i (excluding position i itself).
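The effect of reusing the cache can be illustrated with a toy decoder loop (the stand-in arithmetic and names below are ours, not the paper's model): the second round starts from the cached states of the verified prefix, so it performs only Î − i decoding steps instead of Î.

```python
def regenerate(prefix, cache, out_len=10):
    # Toy re-translation with an attention cache (illustration only).
    # `cache` holds one state entry per already-decoded prefix position,
    # so decoding resumes at the edit position and performs only
    # out_len - len(prefix) steps.
    state, out, steps = list(cache), list(prefix), 0
    tok = prefix[-1]
    while len(out) < out_len:
        # One "decoder step": derive this position's state from cache + token.
        state.append((len(state) * 31 + sum(map(ord, tok))) % 97)
        tok = f"t{state[-1]}"
        out.append(tok)
        steps += 1
    return out, state, steps

# First round: empty cache, near-full-length decode (O(I) steps).
out1, cache1, steps1 = regenerate(["<s>"], [])
# The user edits position 4; reuse the cached states for positions 0..3,
# so the second round costs O(I - i) steps only.
out2, cache2, steps2 = regenerate(out1[:4] + ["fixed"], cache1[:4])
assert steps1 == 9 and steps2 == 5
```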
Therefore, f′ of Eq. (6) should be defined to comprise two kinds of state information. S^0_a is the state information of the encoder-decoder attention sub-layers in the decoder at the first round of interaction; K^0_{l,j,a} ∈ ℝ^{J×(d/h)} denotes the key matrix of the encoder-decoder attention sub-layer of the j-th head of the l-th layer after the first decoder inference, and analogously for the remaining matrices. S^p_{s,1..i−1} is the state information of the self-attention sub-layers in the decoder from the most recent inference; K^p_{l,j,s,1..i−1} ∈ ℝ^{(i−1)×(d/h)} denotes the sliced sub-matrix of the key matrix of the self-attention sub-layer of the j-th head of the l-th layer after the most recent inference, and analogously for the rest. Equations (8) and (9) give the specific composition of these two types of feedback information, and Eqs. (10) and (11) state that the K and V matrices have a linear splicing property.

Early abort
We observe that in Eq. (5), once the generated length exceeds position i, the program translation model autoregressively translates to the end of the output code. However, this rests on the optimistic assumption that the user obtains the desired target program after the most recent edit and stops interacting. In most cases, the number of interactions is greater than one, which means part of the inference process is not actually useful to the user and wastes decoding computation. Specifically, suppose the user corrects position i_p of the most recent result and the model produces a new alternative translation y_1, …, y_I. The user then checks and corrects again at position i; the actually valid part is only the program fragment ŷ_{i_p}, …, ŷ_i, while y_{i+1}, …, y_I is invalid translation. Therefore, we propose to abort the inference process early at an appropriate time. To ensure that a complete target program is still presented to the user, a program suffix should be spliced in at the abort point, and this suffix can be taken from the output code of the first round, y^0_1, …, y^0_{I_0}. Therefore, Eq. (6) should be rewritten with e denoting the position to be aborted (i.e., the abort point) and e_0 denoting the corresponding position in the first-round output code. This means the new intermediate result is a combination of three parts: the user-validated prefix ŷ^p_1..i, the freely decoded part, and the suffix of the first-round translation result y^0. For example, if the user-verified prefix is abc and the freely decoded part is def, and the suffix gh can be spliced in when decoding reaches position f, then the intermediate result of this round is abcdefgh.
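The abc/def/gh example can be written out directly (a toy sketch with 0-based positions; the function name `splice` is ours):

```python
def splice(prefix, decoded, first_round, abort_pos0):
    # Intermediate result = user-verified prefix + freely decoded part
    # + the first-round suffix after the matched abort point.
    return prefix + decoded + first_round[abort_pos0 + 1:]

first_round = list("abXdefgh")        # first-round output code
prefix = list("abc")                  # user corrected position 2 (X -> c)
decoded = list("def")                 # freely decoded up to token 'f'
# 'f' matches first_round[5], so the remaining suffix "gh" is spliced in.
result = splice(prefix, decoded, first_round, abort_pos0=5)
assert "".join(result) == "abcdefgh"
```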

Choice of abort point
The choice of the abort point is crucial for interactive program translation. A premature abort point leads to output code that does not fully apply the user's feedback and thus increases the number of human-machine interactions. A late abort point leads to too much invalid inference and thus increases the response time.
The optimal abort position is the first spliceable position after the user's next modification position. We therefore divide this work into two stages. When the generated position is t (note that the time step at this position in the current round is t − i), we first judge whether the generated program fragment y_i..t already contains a potentially wrong token. We then evaluate whether y_i..t can be spliced with some suffix y^0_{e_0+1}, …, y^0_{I_0}. If both conditions are satisfied, the suffix splicing is performed at position t.
For the sake of generation speed (which calls for a lightweight approach), we follow Ueffing and Ney (2007) in using confidence scores to assess translation quality. Specifically, we check whether any token in y_i..t has a maximum output probability P lower than a certain threshold. If the confidence check C_qe(x_1..J, y_i..t) = 0, the fragment y_i..t is determined to contain a potential error; the threshold is a user-defined parameter, and we show the effect of different values in the experimental results. As for whether the generated fragment y_i..t can be spliced with some first-round suffix, there are two main conditions: 1. The token y^0_{e_0} is the same as the predicted token y_t. 2. Considering that similar sequence outputs are more likely to yield similar subsequent decoder outputs, the cosine similarity between the sequence output at e_0 and the sequence output at the current position must be large enough. Algorithm 1 details the full procedure.
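The confidence-based error check reduces to a simple threshold test (a minimal sketch; the function name and the example probabilities are ours):

```python
def contains_potential_error(token_probs, threshold):
    # Confidence-based quality estimate (after Ueffing & Ney 2007): the
    # fragment is flagged as soon as any token's maximum output
    # probability falls below the user-defined threshold.
    return any(p < threshold for p in token_probs)

# Max output probabilities of the tokens decoded so far in this round.
probs = [0.97, 0.91, 0.62, 0.88]
assert contains_potential_error(probs, threshold=0.8)       # 0.62 < 0.8
assert not contains_potential_error(probs, threshold=0.6)
```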
Algorithm 1 judges whether the generated fragment can be spliced with a certain translation suffix. Require: A_t, the sequence output of the current position; y_t, the predicted token of the current position; At_dictionary, the sequence output dictionary of all time steps in the first round; token_list, the generated token list of all time steps in the first round. Ensure: Return the suffix y^0_{e_0+1}, …, y^0_{I_0} that can be spliced at the current position, and return None if the conditions are not met. Here At_dictionary is a hash table created during the first round. Its key is a token y^0_t, and its value is a list of two-tuples (t_i, A^0_{t_i}) recording the positions t_i of that token in the first-round output code and the decoder output A^0_{t_i} at each such position, with t_i sorted from smallest to largest.
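A minimal sketch of Algorithm 1's hash-table lookup, assuming cosine similarity on the decoder outputs (the 0.99 default follows the paper's setup; the function names and toy vectors are our own illustration):

```python
import numpy as np

def build_dictionary(first_round_tokens, first_round_outputs):
    # Hash table keyed by first-round token; values are (position, decoder
    # output) pairs in increasing position order, as in Algorithm 1.
    d = {}
    for t, (tok, a) in enumerate(zip(first_round_tokens, first_round_outputs)):
        d.setdefault(tok, []).append((t, a))
    return d

def find_splice_suffix(A_t, y_t, At_dictionary, token_list, sim_threshold=0.99):
    # Return the first-round suffix spliceable at the current position,
    # or None. Conditions: the predicted token matches a first-round token,
    # and the two decoder outputs are cosine-similar enough.
    for t0, A_0 in At_dictionary.get(y_t, []):
        cos = np.dot(A_t, A_0) / (np.linalg.norm(A_t) * np.linalg.norm(A_0))
        if cos >= sim_threshold:
            return token_list[t0 + 1:]
    return None

outs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0]),
        np.array([0.5, 0.5]), np.array([0.2, 0.8])]
toks = ["a", "b", "f", "g", "h"]
d = build_dictionary(toks, outs)
assert find_splice_suffix(np.array([1.0, 1.0]), "f", d, toks) == ["g", "h"]
assert find_splice_suffix(np.array([1.0, -1.0]), "f", d, toks) is None
```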

General inference process
The overall process of our proposed method is shown in Fig. 2. The software engineer checks the output code of the previous round and corrects it. The resulting valid prefix and the two categories of attention cache information are used as feedback and fed into the decoder of the program translation model. The attention cache information serves as the initial state of the heads in each decoder layer for the following inference. Specifically, the last token of the prefix is used to generate the Q matrix, which, together with the sliced K and V matrices of all self-attention sub-layers from the previous round, forms the Q, K and V matrices of the self-attention layer. The K and V matrices of the encoder-decoder attention sub-layers from the first round are used as the K and V matrices of the encoder-decoder attention layer. At each time step of inference, it is checked whether the two conditions for suffix splicing are satisfied. If not, inference continues. Otherwise, the inference process is aborted early, and the result (consisting of the human-edited prefix, the generated tokens, and the suffix from the first-round output code) is output. The software engineer then continues to check the output code.
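The overall check-correct-regenerate loop can be simulated end to end (a toy sketch: `regenerate` stands for any model callback that completes a verified prefix, and the simulated "user" corrects the first mismatching token, as in the paper's evaluation protocol; all names are ours):

```python
def interactive_translate(regenerate, first_round, reference, max_rounds=50):
    # Check the candidate, correct the first wrong token, feed the verified
    # prefix back, and regenerate until the candidate matches the target.
    candidate, rounds = list(first_round), 0
    while candidate != list(reference) and rounds < max_rounds:
        # First position where candidate and reference disagree.
        i = 0
        while i < min(len(candidate), len(reference)) and candidate[i] == reference[i]:
            i += 1
        prefix = list(reference[:i + 1])     # verified prefix incl. the edit
        candidate = regenerate(prefix)
        rounds += 1
    return candidate, rounds

# Toy "model": keeps the prefix, then copies a fixed first-round draft's tail.
draft = list("int f(x){return x+1;}")
target = list("int g(y){return y*2;}")
model = lambda prefix: prefix + draft[len(prefix):]
out, rounds = interactive_translate(model, draft, target)
assert out == target
```

Each round the verified prefix grows past the newest correction, so the loop terminates once every discrepancy has been corrected once.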
When the generated program fragment y_i..t can splice a certain suffix, the spliced result y = ŷ^p_1, …, ŷ^p_i, …, y_t, y^0_{e_0+1}, …, y^0_{I_0} is returned directly. We propose to distinguish two cases, according to whether the user's correction position for this result falls within the decoding range, in order to determine the effective attention cache.
In the next round of interaction, when the user's correction position i falls within the decoding range, i.e., i ∈ [i_p + 1, e_p + 1], the attention cache of the model covers the valid range, so the method of Eq. (6) applies. When the user's correction position i lies beyond the decoding range, i.e., i ∈ [e_p + 2, I_p], the effective attention cache only reaches e_p, and the remaining attention results in the range [e_p + 1, i] need to be computed in the next round. Therefore, we propose to redefine the state information f′ fed back to the next inference: the expanded feedback consists of the verified prefix ŷ^p together with the attention cache that is still valid. The total number of time steps required for the next round of inference is then e − min(i − 1, e_p), and the probability of the token at each position is defined as follows: if the time step t is less than or equal to i, the generated token is constrained to the token in the target program via the Kronecker-delta operation; otherwise, when the time step is greater than i and the abort point has not yet been reached, the best token is obtained from the maximum probability.
Algorithm 2 details the procedure of the HMPT method. Its inputs include the source program x_1..J and the verified prefix ŷ^p; while the generated length is smaller than i, the distribution of output probabilities is adjusted so that ŷ^p_t is the token corresponding to the maximum output probability.

Interactive interface
To display the results of the program translation and the user settings, we designed an interactive interface that integrates the general inference process described above. The user sets the direction of translation and the source program and clicks a button to start the first round of translation. The user then checks the result, selects the first incorrect token position, inputs the correct token, and clicks the button to translate again. This is repeated until the user is satisfied with the result.
The detailed operation demonstration is as follows: 1. As shown in Fig. 4a, on the initialization interface displayed after starting, HMPT has already completed the internal loading and deployment of the translation models for all translation directions.

Experiments and results
We conducted experiments and evaluated our proposed approach on two program translation datasets. The experiments were conducted on a server with 8 NVIDIA TITAN RTX GPUs, each with 24,219 MB of memory.

Dataset
Published parallel corpora for program translation tasks are currently quite scarce, and we evaluate our proposed approach on two datasets in two translation directions. The TransCoder test dataset (Lachaux et al. 2020) was created to test the performance of the TransCoder model and consists mainly of a set of parallel functions in C++, Java, and Python extracted from solutions on the GeeksforGeeks website. The AVATAR dataset (Ahmad et al. 2021) comprises solutions to programming problems written in Java and Python from open-source programming contest sites and online platforms. To ensure the same function form as the TransCoder test dataset, we mainly use a subset, AVATAR-g4g, consisting of Java-Python parallel functions collected from GeeksforGeeks. Since the TransCoder test set consists mainly of single parallel functions, in order to keep the data distribution consistent when the model is trained, we require that the training set also be in the form of functions. "Function form" means that each program sample is a single self-contained function, similar to static int function(int A){return A*2;}.
The subset that has this form is the AVATAR-g4g dataset, obtained by selecting the Java-Python parallel functions collected from GeeksforGeeks.
Table 2 shows the details of the two datasets. Since TransCoder does not have a parallel corpus for training, we use the TransCoder test dataset in our experiments for evaluation.

Baselines
Our work has two objectives: reducing human effort and reducing response time. For response time, since the response time problem exists only in interactive scenarios, we use PHM, as illustrated in Section 3.1, as our baseline. As for human effort, previous research has rarely proposed human-machine program translation methods, and we mainly set the following baselines:
• Human post-editing (HPE). The most common form of collaboration: after the program translation model has translated the program, the user makes static corrections to that result until a satisfactory result is achieved.
• Automatic downstream code editing (ADCE). Weisz et al. (2021) proposed a UI that performs relevant downstream editing when the user selects an alternative translation. If the user corrects a variable's name, the UI automatically renames all instances of the variable; similar features exist in some IDE tools. This approach can reduce the duplication of human effort.

Experimental setup
In this subsection, we describe the settings of the program translation model and the user settings in the experiment. In the simulation, we assume that the target reference program corresponding to the source program in the parallel corpus is the final result the software developer wants to obtain. In the interactive program translation scenario, since a prefix-based interactive protocol is used, we compare the current translation result with the target reference program token by token and find the first token that does not match. We then use this token of the target reference program as the edit from the software engineer, after which the translation model generates the next round of results according to our method. An exact match between the translation result and the target reference program is considered acceptance of the entire target program by the user. For HPE, we use the output code of the first round as the program the user needs to edit. For ADCE, the best simulation would run on a programming IDE; however, this is difficult, so we adopt an approximate measure. Since it is not convenient to distinguish the variables of a program directly, we distinguish a superset of variables, i.e., identifiers. We used the syntax parsing tool tree-sitter to parse the program and obtain the list of identifiers that need to be replaced to turn the first-round output code into the reference program. The difference between the number of identifiers in the original list and the number in the de-duplicated list approximates the user's saved edit actions.
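The ADCE approximation can be sketched without the full tree-sitter pipeline (we substitute a toy regex tokenizer for the parser; the saving formula len(list) − len(set) follows the description above, and the sample code and identifier choice are ours):

```python
import re

def adce_saved_edits(identifiers_to_replace):
    # With automatic downstream editing, repeated occurrences of the same
    # identifier need only one manual edit, so the saving is the number of
    # duplicate occurrences: len(list) - len(deduplicated list).
    return len(identifiers_to_replace) - len(set(identifiers_to_replace))

# Toy stand-in for the tree-sitter parse: pull identifier-like tokens via regex.
code = "int foo(int a){ return a + a; }"
idents = re.findall(r"[A-Za-z_]\w*", code)
to_replace = [t for t in idents if t == "a"]   # suppose 'a' must become 'x'
assert adce_saved_edits(to_replace) == 2        # 3 occurrences, 1 manual edit
```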
• Real interaction environment. The best evaluation method for our interactive program translation scenario is to involve real software engineers in the interactive process. We test the efficiency of collaboration in a real environment using the interactive interface tool described in Sect. 3.2.5. To avoid the bias a software engineer may have toward the same coding jobs across different tools, we require each software engineer to set a modification rule consistent with their programming habits and follow it throughout the test. We invited four real software engineers as participants. The four software engineers are computer professionals with at least four years of programming experience. They have software and hardware expertise in the computer field and understand the basic workings of computers. They are familiar with a variety of programming languages, including C, C++, Java and Python; in their work, the languages they most commonly use are Java and Python. As our dataset is made up of parallel functions collected from GeeksforGeeks, we also require these engineers to be proficient in subjects such as data structures and algorithm design, so that they are not unfamiliar with the content of the dataset. To select appropriate participants, we designed a set of mock questions covering Python and Java knowledge points, and those who achieved a pass mark were accepted as real participants. After the selection, we informed them in advance about the cooperation rules and the instructions for using the interactive interface. We randomly selected a representative batch of programs and then had the software engineers complete the interactive translation on the tool until the correct target program was obtained. The two parameters of the suffix splicing module of the interface tool are set to 0.99 and 0.8, respectively: the former determines the similarity threshold for successful splicing, and the latter determines whether the decoded segment contains possible errors. The values of both can significantly affect the quality of the intermediate results, so these parameters are important.

Metrics
The main goal of our approach is to minimize human effort while mitigating the large response time. Therefore, we divide the evaluation metrics into two main categories: response time and human effort. Metrics to assess the response time. We set two timers before and after the translation process and use the time difference between them to approximate the response time. We test all interactive program translation methods sequentially on one sample before moving to the next, ensuring that the environment tested on each sample has a similar machine load (e.g., memory usage, volatile GPU utilization, etc.). The response time is the amount of time a user has to wait for the next round of translation after making a change, and it is crucial to the user experience. We consider the response time of the existing approach unacceptable: because programs are usually much longer than natural language text, each round of re-translation requires a significant amount of time, and excessive response times degrade the user experience. We use two modules (the Attention Cache module and the Suffix Splicing module) to reduce the response time by reducing the computational overhead (i.e., reducing the time complexity of the problem). Therefore, we define "effectiveness" as a significant reduction in response time with little change in the quality of the intermediate results.
Metrics to assess the human effort. In general, the number of user actions from the first-round output code to the final satisfactory target program indicates the level of effort required by the user. The method with the current state-of-the-art "human effort" is the human-machine cooperation method proposed by Weisz et al. (2021). Here we use the number of rounds of interaction, the number of modified tokens, the number of mouse clicks and the number of keystrokes to measure "human effort".
• Word stroke ratio (WSR) (Tomás and Casacuberta 2006). Mainly used in interactive scenarios. It is the number of words the user needs to correct, from the first-round output code to the desired target result, divided by the number of reference words. Since program code contains many special symbols, we directly count the required number of tokens in our experiments.
• Word error rate (WER) (Och and Ney 2003). Mainly used in HPE scenarios. It is the token-level editing distance between the first-round result and the reference, divided by the number of reference tokens.
• Keystroke and mouse action ratio (KSMR) (Barrachina et al. 2009). It is the number of keystrokes and mouse clicks required from the first-round output code to the desired target result. For the number of mouse clicks, we follow the suggestion of Peris et al. (2017): correcting a word and verifying a prefix requires one mouse action, and accepting the result requires one mouse action. The number of keystrokes is the number of characters of the tokens that need to be corrected. KSMR equals the total number of keystrokes and mouse clicks divided by the number of characters of the reference.
• Character error rate (CER) (Civera et al. 2004). Mainly used in HPE scenarios. It is the character-level editing distance between the first-round output code and the reference, divided by the total number of characters of the reference. However, since the number of characters in a program is large, computing the character-level editing distance directly is expensive. In our work, it is approximated by the sum of the numbers of characters of the tokens involved in the token-level editing path.
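The edit-distance metrics above reduce to a standard Levenshtein computation over token sequences, sketched here (the example token lists are ours):

```python
def edit_distance(a, b):
    # Token-level Levenshtein distance (insertions, deletions, substitutions)
    # computed with a rolling one-row dynamic-programming table.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def wer(hypothesis, reference):
    # Word error rate: edit distance between the first-round output and the
    # reference, normalized by the reference length (token level here).
    return edit_distance(hypothesis, reference) / len(reference)

hyp = ["def", "f", "(", "a", ")", ":"]
ref = ["def", "g", "(", "a", ",", "b", ")", ":"]
assert edit_distance(hyp, ref) == 3        # 1 substitution + 2 insertions
```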

Metrics in software engineering
As program translation is essentially a problem in the field of software engineering, we need to consider evaluation metrics of software engineering.
• Weighted n-gram match (BLEU_weight) (Ren et al. 2020). This metric distinguishes the importance of different tokens: it gives a higher weight to key tokens (like keywords) and a lower weight to literals (like string literals).
• Syntactic AST match (Match_ast) (Ren et al. 2020). This metric measures grammatical similarity. It uses the tree-sitter tool to parse out all subtrees, and evaluates syntactic similarity by comparing the subtrees of the candidate program with the subtrees of the target program.
• Semantic data-flow match (Match_df) (Ren et al. 2020). This metric measures semantic similarity. It uses the semantic graph of Guo et al. (2021) to represent the data-flow features of the program and computes the semantic similarity between the candidate program and the reference program.
• CodeBLEU (Ren et al. 2020). This metric measures comprehensive similarity. It is calculated as a weighted sum of BLEU (Papineni et al. 2002), BLEU_weight, Match_ast and Match_df; we set the coefficients of all four to 0.25 in our experiments.
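With the four component scores in hand, the CodeBLEU combination used here is just a weighted sum (the component values in the example are made up for illustration):

```python
def codebleu(bleu, bleu_weight, match_ast, match_df,
             coeffs=(0.25, 0.25, 0.25, 0.25)):
    # CodeBLEU (Ren et al. 2020): weighted sum of BLEU, weighted n-gram
    # match, AST match and data-flow match; all coefficients are 0.25 here.
    a, b, c, d = coeffs
    return a * bleu + b * bleu_weight + c * match_ast + d * match_df

score = codebleu(0.6, 0.7, 0.8, 0.9)
assert abs(score - 0.75) < 1e-9
```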

Results
This subsection describes the experimental results of HMPT and the baselines. To evaluate the effect of different values of the threshold parameter of Eq. (15), we set the confidence threshold to {1, 0.8, 0.6, 0.4}. To ensure that A_t and A^0_t are sufficiently similar in Algorithm 1, the similarity threshold is set to 0.99. In addition, we evaluate the contribution of the suffix-splicing method by comparing HMPT with a variant that removes it, called HMPAC. Additionally, samples whose length exceeds the maximum value during the interaction are not considered.

Comparison on real environment
We tested Python-Java and Java-Python translation in the real environment on the HMPT-based interactive tool. To better assess the improvement of HMPT, we also invited the same software engineers to complete the same code with HPE and ADCE. Table 3 shows the final results of these methods.
• HPE requires the largest number of modified tokens and the most time. ADCE yields only a small reduction, because scenarios involving variable modification are rare in program translation and most software engineers do not subconsciously use ADCE-related tools.
• HMPT achieves the smallest number of modified tokens and the least time spent. Compared to HPE, the average number of tokens modified with HMPT is reduced by about 79.3% and the average time spent is reduced by about 30.5%.

Comparison on human effort
To evaluate the human effort required by different human-machine collaboration methods, we calculated the effort-related metrics for these methods on the two datasets separately. Tables 4 and 5 show the WSR and KSMR results of the different human-machine collaboration methods (WER and CER for HPE and ADCE), respectively. Table 6 shows the number of interactions required by the different interactive program translation methods. From the results, we can observe that:
• HPE requires the most human effort. ADCE yields a slight reduction compared to HPE. This method helps when an identifier needs to be modified and appears more than once in the program, but this situation does not occur frequently in program translation scenarios.
• The WSR of PHM is only about a quarter of that of HPE. This indicates that introducing the prefix-based interactive protocol from natural language translation into program translation performs well. In addition, the ratio between PHM and HPE rises to about 1:2 in terms of KSMR, mainly because we account for mouse usage in interactive program translation.
• HMPAC guarantees the same optimal effect as PHM. This means that our proposed optimization of prefix generation does not adversely affect the translation results.
• We also observe how the human effort of HMPT varies with its threshold parameter.
In general, decreasing the threshold, which implies stricter suffix splicing conditions, reduces human effort. The HMPT-0.4 method achieves almost the same results as PHM. HMPT-0.8 performs similarly to PHM, with an increase in the average number of interactions of only 0.3. HMPT-1, however, shows a 1.1 increase in the average number of interactions. This means that choosing a suitable threshold is very important.
In summary, our proposed PHM has the best effect on human effort, and HMPAC matches PHM because caching attention information does not affect translation quality. HMPT can also achieve a similar effect to PHM when an appropriate threshold parameter is chosen.

Comparison on response time
To evaluate the response time of different interactive program translation methods, we calculated the average response time of these methods on GPUs on the two datasets separately, as shown in Table 7 (note that changes in machine load may change the response time, and a busy server makes response times larger; however, the ratios of the response times of these interactive program translation methods remain similar, and the following analysis is based mainly on the ratios). Our main analysis is as follows:
• PHM has the greatest average response time. In our evaluation, its average response time is over 10 s. Such a long wait is intolerable for the user and leads to a poor user experience, so this interactive method is almost unusable in practice.
• HMPAC reduces the average response time by about 40%. This proves the effectiveness of our proposed attention cache method, since the reduction is precisely the overhead caused by repeatedly regenerating prefixes the user has already verified. Figure 5 shows the average response time of these two methods for different numbers of interaction rounds. We observe that the response time of PHM increases as the number of interactions increases. This is because a rise in the number of interactions means the source program is longer. At the same time, as the number of interactions increases, the difference in response time between the two methods grows, and HMPAC even shows a decreasing trend. This is because the length of the reusable prefix increases while the length to be generated decreases.
• As for HMPT, the average response time decreases as the threshold parameter increases. This is reasonable, because a larger threshold means the suffix splicing condition is more relaxed and the inference process is aborted earlier at more possible locations. HMPT-1 has the shortest response time, about 76.1% less than PHM.
Figure 6 shows the average response time of these methods for different suffix lengths. We observe that as the suffix length increases, the differences between them grow. Overall, HMPT-0.8 performs close to HMPT-1, and HMPT-0.4 is slightly better than HMPAC.
In summary, HMPT significantly reduces response time and may increase human effort, but this increase is usually negligible with an appropriate threshold. For example, compared to HMPAC, HMPT-1 reduces the average response time by about 60% while increasing the average number of interactions by 1.1. HMPT-0.8 reduces the average time by about 50% while increasing the average number of interactions by only 0.3. HMPT-0.4 reduces the average response time by about 6.6% with almost no increase in the number of interactions. These three cases correspond, respectively, to program translation systems whose priority is response time, a balanced trade-off, or human effort. Fig. 6 The average response time (s) of HMPAC and HMPT with different threshold values for different suffix lengths. Since the response time of the first round of translation is the same for all interactive program translation methods, we filter it out.

Comparison on lightweight transformers
To assess whether a lightweight transformer yields better experimental results, we conducted a comparison experiment on lightweight transformers. We use the CodeT5-small model, which has a smaller hidden dimension, as the lightweight transformer model (it has only about a quarter of the parameter count of the normal CodeT5 model). For comparison, we designed the lightweight-HMPT-0.8 method, which uses the CodeT5-small model as the code translation model (i.e., the "machine"), while the rest of the settings are exactly the same as in HMPT-0.8. The results of the experiment are shown in Table 8. From the results, lightweight-HMPT-0.8 has lower response latency than HMPT-0.8, as it requires fewer computational resources. In addition, we found that lightweight-HMPT-0.8 worked better for Java-Python translation. We attribute this mainly to the fact that Java is stricter than Python (Java has a more complex syntax, requires variable types to be specified, etc.), so Java-Python translation is noticeably easier than Python-Java translation and requires a model with fewer parameters. The experiments show that both the lightweight model and the standard model have their own advantages. Readers who want to test more lightweight models can follow this recipe: implement the initial translation with a lightweight model whose input is a complete sequence in the source language and whose output is a complete sequence in the target language, and implement the result updating with a lightweight model whose input includes the user-validated prefix.

Comparison on similarity measures
In order to evaluate the impact of different similarity measures in Algorithm 1 on the experimental results, we designed a comparison experiment based on other similarity measures. Specifically, we chose two classical measures, the Pearson correlation coefficient (PCC) and the Euclidean distance (ED). The PCC-based method was set to the same threshold as the cosine similarity, 0.99, and the ED-based method was set to a threshold of 0.1; the rest of the settings were identical to HMPT-0.8. The results of the experiments are shown in Table 13. We found that the results of the PCC-based approach were almost identical to those of HMPT (i.e., the cosine-similarity-based method), implying that replacing the similarity measure in HMPT with the PCC is also feasible. However, the ED-based method performs much worse, with a much higher response time than the cosine-similarity-based and PCC-based methods, as it cannot effectively determine whether the program suffix can be spliced.
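The contrast between the three measures can be illustrated on a pair of hypothetical encoder-output vectors that point in the same direction but differ slightly in magnitude (the vectors and thresholds below are illustrative; only the 0.99 and 0.1 thresholds come from the text). Cosine similarity and the PCC are scale-invariant, so they still accept the splice, while Euclidean distance is scale-sensitive and rejects it.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical encoder outputs: identical direction, 10% larger magnitude.
u = [1.0, 2.0, 3.0, 4.0]
v = [1.1, 2.2, 3.3, 4.4]

splice_cos = cosine_similarity(u, v) >= 0.99   # scale-invariant: accepts
splice_ed = euclidean(u, v) <= 0.1             # scale-sensitive: rejects
```

This scale sensitivity is one plausible reason the ED-based method fails to recognize splice-able suffixes in practice.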

Comparison on different program length
In order to assess the effect of program length on the experimental results, we calculated all metrics at different program lengths. Specifically, we tested the performance of HMPT and PHM on g4g-Python2Java for programs of different lengths; the results are similar on the rest of the datasets.
The experimental results are shown in Table 14. From them, it can be seen that HMPT achieves almost exactly the same performance as PHM on the metrics evaluating human effort, such as the number of interactions, WSR, and KSMR. In terms of response time, as the length of the program increases, HMPT reduces the response time by a greater margin. Therefore, the advantage of our proposed HMPT is more pronounced on long programs.

Conclusion
In this work, we propose an interactive program translation method based on human-machine cooperation. We introduce the prefix-based interactive translation protocol from natural language translation into the program translation scenario and propose PHM, which largely reduces the human effort on the program translation task. However, due to the excessive length of programs, PHM also incurs considerable response time, so we propose HMPT. First, to avoid the duplicate prefix generation process, we propose cache attention, which passes cached attention information to the next round of inference according to the location of the prefix. Second, to avoid the generation of invalid suffixes, we propose suffix splicing, which aborts the invalid inference process in advance. Finally, we conducted extensive experiments on two real datasets to demonstrate the effectiveness of our approach in reducing both human effort and response time.
In the future, we plan to extend our work by introducing more diverse human-machine cooperation modes for program translation, allowing a freer cooperative experience for software engineers. In addition, ideally, HMPT and HMPAC should have the same effect in terms of human effort, so we should find a more efficient way to predict the presence of potential errors in a given program segment, for example by considering machine-learning-based classifiers or adding additional network layers.

Fig. 2
The overall process of the proposed HMPT

2. Before users cooperate, as shown in Fig. 4b, they need to set the direction of the translation in the "Direction" text box (e.g., Java to Python) and then enter the source program to be translated in the "Source Code" text box.
3. As shown in Fig. 4c, the user clicks the "Translate" button and enters the interactive mode to collaborate with the machine to complete the translation: a. The machine first displays the translation result in the "Output Code" text box; b. The user then reviews the output program, selects the first incorrect token, and clicks the "Choose" button, and the system automatically fills the token into the "Edit from" box; c. The user types the correct token in the "to" box.
4. As shown in Fig. 4d, the user clicks the "Translate" button to start the second translation, and the "Output Code" text box shows the new result, with the prefix shown in green to distinguish it. This process repeats until the user is satisfied.
5. If the user has other programs to translate, they click the "Next Code" button to start a new interaction; otherwise they click the "Finish" button to end the translation.
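The review step in 3b, where the user locates the first incorrect token, can also be mimicked programmatically (this is essentially what the automated user simulation does against a reference translation). The function and the tokenized Java snippets below are illustrative assumptions.

```python
from typing import List, Optional, Tuple

def first_error(hypothesis: List[str],
                reference: List[str]) -> Optional[Tuple[int, str, str]]:
    """Return (position, wrong_token, correct_token) for the first token
    where the hypothesis diverges from the reference, or None if the
    hypothesis matches the reference token-for-token."""
    for i, ref_tok in enumerate(reference):
        hyp_tok = hypothesis[i] if i < len(hypothesis) else ""
        if hyp_tok != ref_tok:
            return i, hyp_tok, ref_tok
    return None

hyp = ["public", "int", "add", "(", "int", "a", ",", "int", "c", ")"]
ref = ["public", "int", "add", "(", "int", "a", ",", "int", "b", ")"]

# The simulated user edits "c" -> "b" at position 8; the validated
# prefix for the next round is then ref[:9].
err = first_error(hyp, ref)
```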

• Program translation model We use the source code published by Ahmad et al. (2021) when they released the AVATAR dataset, and we mainly use the CodeT5 model (Wang et al. 2021) as our program translation model. This model follows the same architecture as T5 (Colin et al. 2020), and its implementation is based on the Transformers framework. We use the tokenizer published with the CodeT5 model, which mainly follows the BPE tokenizer trained by Radford et al. (2019); the vocabulary size is 32,000. The CodeT5 pretrained model is fine-tuned on the AVATAR-g4g training dataset, with parameters following the settings in the source code published by Ahmad et al. (2021). We use the Adam optimizer (Kingma and Ba 2015) to update the model parameters, with a learning rate of 0.00005, a batch size of 2, a hidden-layer dimension of 768, and a maximum of 20 training epochs. We finally select the model parameters with the lowest loss on the validation dataset during training. We keep the same configuration for training in both the Python-to-Java and Java-to-Python directions. In testing, we set the beam size to 1 and the maximum number of program tokens to 510.
• User simulation We follow the automated simulation scheme proposed by Barrachina et al. (2009); Peris and Casacuberta (2019); Peris et al. (2017); González-Rubio et al. (2013); Tomás and Casacuberta (2006) in the natural language interactive translation task.
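For readers reproducing the setup, the fine-tuning and decoding settings listed above can be collected in one place. The dictionary keys and the model identifier are our own naming for illustration; the authoritative training script is the one released by Ahmad et al. (2021).

```python
# Hyperparameters reported in the text; key names are illustrative.
FINETUNE_CONFIG = {
    "model": "codet5-base",     # CodeT5 (Wang et al. 2021), T5 architecture
    "dataset": "AVATAR-g4g",
    "optimizer": "adam",        # Kingma and Ba (2015)
    "learning_rate": 5e-5,
    "batch_size": 2,
    "hidden_size": 768,
    "max_epochs": 20,
    "vocab_size": 32_000,       # BPE tokenizer following Radford et al. (2019)
    "selection": "lowest validation loss",
}

DECODE_CONFIG = {
    "beam_size": 1,             # i.e., greedy decoding
    "max_output_tokens": 510,
}
```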

Fig. 5
The average response time (s) of PHM and HMPAC for different numbers of interaction rounds

Table 1
Notations and their meanings:
j — Number of heads in the transformer architecture
l — Number of layers in the transformer architecture
e — The position to be aborted (i.e., abort point)
e_0 — Position corresponding to the abort point in the output code of the first round
P — Max output probability
C_qe — Translation quality
δ — Kronecker delta
Similarity threshold for encoder output
Threshold for maximum output probability
ȳ^T — The one-hot representation of the word y
A — Output of the decoder of the Transformer model

Table 2
Detailed information about the AVATAR-g4g dataset and the TransCoder test dataset.

Table 3
Test results in a real environment

Table 13
Results of other similarity measures on all tasks (Java-Py and Py-Java)

Table 14
Results for different program lengths of Py-Java on AVATAR-g4g