BCGen: a comment generation method for bytecode

Bytecode is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecode is hard for programmers and researchers to understand. Bytecode has been widely used in various software tasks such as malware detection and clone detection. In order to understand the meaning of bytecode more quickly and accurately, and to further help programmers in more software activities, we propose a bytecode comment generation method (called BCGen) using a neural language model. Specifically, to get the structured information of the bytecode, we first generate the control flow graph (CFG) of the bytecode and serialize the CFG together with the bytecode semantic information. Then a Transformer model combined with a gated recurrent unit (GRU) is proposed to learn the features of the bytecode and generate comments. We obtain the bytecode by building the Jar packages of well-known open-source projects in the Maven repository and construct a bytecode dataset to train and evaluate our model. Experimental results show that the BLEU score of BCGen reaches 0.26, which outperforms several baselines and demonstrates the effectiveness and practicability of our method. We conclude that it is possible to generate natural language comments directly from bytecode, and that it is important to take structural and semantic information into account when generating bytecode comments.


Introduction
As an intermediate code, bytecode has been widely used in various software tasks, such as malware detection (Zhao et al. 2021; McLaughlin et al. 2017), clone detection (Yu et al. 2017; Keivanloo et al. 2012), and so on. In these tasks, developers often need to analyze and understand the bytecode. Therefore, there is a need for an easy, fast, and accurate way to understand the bytecode. If we can generate natural language comments for bytecode, developers will need much less time to understand it. Furthermore, generating comments for bytecode has practical implications. For example, (1) in app sensitive-resource-access detection (Nguyen et al. 2021; Hăjmăsan et al. 2019), some existing tools detect the API from bytecode to determine its specific behavior (e.g., accessing the phone's contact list). However, if an attacker deliberately names an API to hide its true functionality, we cannot analyze the API's behavior by its name. In this case, the comments of the bytecode of the API can help determine the real behavior of the API. (2) When programmers face the task of decompiling a large-scale system (Krogmann et al. 2010; Jackson and Waingold 2001), the comments of the bytecode can provide a bird's-eye view of the system's functionalities, which can be used as important supporting information alongside decompilation tools.
At the source code level, researchers have carried out many studies on automatic comment generation methods (Alon et al. 2018; Haiduc et al. 2010b; Haque et al. 2020; Hu et al. 2018a; Moreno et al. 2013). At present, the most prevalent methods are based on deep learning (Sridhara et al. 2011). The input data used by these methods are the information extracted from the source code, such as API information (Hu et al. 2018b) and the abstract syntax tree (AST) (LeClair et al. 2019). At the bytecode level, however, generating comments is very challenging. Bytecode is a binary file that contains an executable program, consisting of a sequence of opcode/data pairs, and is a kind of intermediate code (Dahm 1999). Compared with source code, bytecode is more abstract, and the code structural information is missing. In this paper, we propose BCGen, a Transformer-based model (Vaswani et al. 2017), to generate comments for bytecode. Specifically, we first disassemble the bytecode to obtain the assembly instructions. Then, we mainly use the information in the Code area of the disassembled bytecode. The Code area mainly contains opcodes, which are similar to assembly language. To capture the code structural information, methods that generate comments at the source code level convert the code into an AST (LeClair et al. 2019) and model the code structural information from the AST. Inspired by this, we convert the bytecode into a CFG and capture the bytecode structural information from the CFG. In addition, the token sequence in the Code area reflects the program semantic information of the bytecode. Thus, we input the token sequence and the CFG of the bytecode into a customized deep learning model, and generate the comment through the trained model. In order to better capture the useful information, BCGen employs an encoder-decoder model to analyze the token sequence in the Code area and the CFG.
As far as we know, there is no ready-made dataset available for generating bytecode comments, so we need to build the dataset ourselves. We choose the Maven Repository 1 as our data source. The Maven repository stores a wide range of Jar packages, which contain most of the world's popular open-source project components, and we can freely download the Jar packages we need from it. By decompressing the Jar packages, we can obtain the Java bytecode files stored in them and their corresponding Java source code files. We use heuristic rules to extract the bytecode in the Jar packages and the comments of its corresponding source code to build the dataset, and finally obtain a dataset of size 50k. We conduct several experiments on the dataset, and the results show that the BLEU score of BCGen reaches 0.26, which outperforms several baselines, such as Seq2Seq (Sutskever et al. 2014) and Hybrid-deepcom (Hu et al. 2020). We also conducted a user study, whose results showed that BCGen also performed better than the baselines. To facilitate research and application, our replication package 2 and dataset 3 are released.
The main contributions of this study are as follows:
• To the best of our knowledge, BCGen is the first method to generate code comments at the bytecode level, while existing methods generate comments at the source code level.
• A dataset for generating bytecode comments is published. We collect Java Jar packages and use their bytecode to build the dataset.
• We propose using the CFG to capture the structural information of bytecode, which can improve the accuracy of bytecode comment generation.
• The comprehensive evaluation results demonstrate the feasibility and effectiveness of our bytecode comment generation method.
The rest of this paper is organized as follows. Section 2 describes the background of bytecode. Section 3 details the proposed method and BCGen model. The experimental setup is discussed in Sect. 4. Section 5 analyzes the experimental results. Section 6 is the user study. The result discussion is presented in Sect. 7. Section 8 surveys and summarizes related research work. Section 9 is Threats to validity. Finally, Sect. 10 concludes the paper and points out possible future directions.

Background
In this section, we mainly introduce the basics of bytecode, especially the structure of Java bytecode and what we can get from it. Bytecode is a form of instruction set designed for efficient execution by a software interpreter (Li and Fraser 2011). Unlike human-readable source code, bytecode consists of compact numeric codes, constants, and references (normally numeric addresses) that encode the result of the compiler's parsing and semantic analysis of things like the type, scope, and nesting depth of program objects.

Java bytecode
Java bytecode is compiled intermediate code that needs to be converted to machine code by the Java Virtual Machine (Lindholm et al. 2013). The Java bytecode instructions contained in Java class files carry rich semantic information and direct the execution of the program. The bytecode is usually executed directly on the virtual machine, which further compiles the bytecode into machine code to improve performance. To run a Java program, the source code must first be compiled into a CLASS file with the suffix .class; that is, the compiler produces a binary bytecode file, which the virtual machine then interprets and executes. We can compile the source code into a CLASS file through an IDE or the command line, so that we can get the bytecode corresponding to the source code. Figure 1 shows an example of the source code of a Java method and its bytecode. The upper left of Fig. 1 is the Java source code, a |getString()| method whose main function is to return the given string. The upper right part is the hexadecimal form of the bytecode generated after compilation, which can be inspected with a hexadecimal viewer. Since this bytecode contains a lot of other information, we highlight (the blue highlighted area) the hexadecimal form corresponding to the Code area of the |getString()| method bytecode. The part outlined in the red box corresponds to the hexadecimal code of the opcode sequence in the Code area.
The bottom of Fig. 1 is the result of disassembling the bytecode. We can use the |javap| command (a disassembler that comes with the |JDK|) to disassemble the bytecode and obtain this content. We can see that the core part of the bytecode corresponding to the Java method is the Code area, which contains the bytecode instructions, the line number table (LineNumberTable), and the local variable table (LocalVariableTable). The LineNumberTable describes the correspondence between Java source line numbers and bytecode offsets, and the LocalVariableTable describes the local variables of the Java method.
Fig. 1 Example of a Java source code, compiled bytecode and disassembled bytecode
In the bytecode, each opcode occupies one byte. For example, the hexadecimal form of the opcode "|aload_1|" is "0x2B". An instruction may or may not have operands. If it has operands, the bytes that follow the opcode are its operands; otherwise, the following byte is the next instruction. As shown in Fig. 1, |invokevirtual| means: use the object referenced on the operand stack as the method receiver and invoke an instance method on it, with the target method selected according to the runtime class of that object (instance constructors and parent-class methods are invoked by |invokespecial| instead). The following "|#3|" is the index of the constant referenced by this instruction, and the text after "|//|" gives a readable form of this constant. This shows that bytecode instructions mainly contain information about program execution. Therefore, these instructions provide us with rich semantic information for understanding the bytecode.
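To make the one-byte opcode encoding concrete, the following Python sketch decodes a short byte sequence into mnemonics. The opcode table is a small illustrative subset of the JVM instruction set; the `decode` helper and the sample bytes are hypothetical and are not part of BCGen or |javap|.

```python
# Illustrative subset of JVM opcodes: value -> (mnemonic, number of operand bytes).
OPCODES = {
    0x2A: ("aload_0", 0),
    0x2B: ("aload_1", 0),
    0xB0: ("areturn", 0),
    0xB1: ("return", 0),
    0xB4: ("getfield", 2),
    0xB6: ("invokevirtual", 2),
    0xB7: ("invokespecial", 2),
}


def decode(code: bytes) -> list[str]:
    """Decode a byte string into mnemonics, reading operand bytes where required."""
    instructions = []
    pc = 0
    while pc < len(code):
        mnemonic, operand_len = OPCODES[code[pc]]
        pc += 1
        if operand_len:
            # The two operand bytes form a big-endian constant pool index.
            index = int.from_bytes(code[pc:pc + operand_len], "big")
            instructions.append(f"{mnemonic} #{index}")
            pc += operand_len
        else:
            instructions.append(mnemonic)
    return instructions


# Hypothetical sample: load local variable 1, call the method at constant #3, return it.
sample = bytes([0x2B, 0xB6, 0x00, 0x03, 0xB0])
print(decode(sample))  # ['aload_1', 'invokevirtual #3', 'areturn']
```

The byte after `0x2B` is read as the next opcode because |aload_1| has no operands, whereas the two bytes after `0xB6` are consumed as the operand of |invokevirtual|.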

Proposed approach
The generation process between source code and comments is similar to the translation process between different natural languages. The existing research (Hu et al. 2018a) has applied machine translation methods to translate the source code (method-level) to comment. In this paper, BCGen translates method-level bytecode to comment.
The overall framework of BCGen is shown in Fig. 2. BCGen mainly includes four stages: data collection, data preprocessing, model training and comment generation. The bytecode we extract from the Maven repository is parsed and preprocessed into a parallel corpus of Java methods and their corresponding comments. To learn the structural information, we convert the Java bytecode to CFG sequences through a special traversal method. To capture the semantic information, we extract the token sequences from the bytecode, and use them together with the CFG of the bytecode as the input to the model. We build and train a generative neural model using the parallel corpus of token sequences, CFG sequences and comments. The main challenges in this process are how to build a dataset and how to extract the important information in the bytecode to represent the main characteristics of the Java method. In the following subsections, we present the detailed steps that address these challenges.

Dataset collection
The first part of Fig. 2 shows the process of collecting the dataset. The dataset originates from the most popular projects in the Maven repository. We first manually download the jar packages of Java projects from the Maven repository according to their popularity ranking (the usages of the jar packages), collecting 665 jar packages in total. Since the directly downloaded projects may not meet our needs, it is necessary to further process and analyze them. Our goal is to generate the corresponding comment for method-level bytecode. However, the bytecode itself does not contain comments, so we need to find the source code corresponding to the bytecode to extract the comments. Specifically, we use regular expression matching to find all the methods in a Java class and get the comment located before each method header. However, many comments are too long to be used directly as our data. Take the |getImage()| method as an example, whose comment is shown in Fig. 3. This example shows that the comment includes the description of the method's function, parameters and return value, as well as a detailed introduction of the method, so the number of words in the comment is very large. Using such a comment directly is not conducive to model learning. According to the Javadoc guidance, 4 the first sentence of a Java method comment usually describes the functionality of the method. Following Hu et al. (2020), we also take the first sentence of the comment as the target comment.
In order to reduce the vocabulary size, reduce the dimension of the language model, and obtain more accurate text analysis and expression, we further process the extracted comments. Specifically, we perform stemming and lemmatization on the comment text as follows.
• Use Python's nltk 5 library to perform stemming and lemmatization on the comment text.
• Replace the punctuation of the sentence with the <SEG> token.
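The comment extraction steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names `first_sentence` and `normalize`, the regular expressions, and the sample Javadoc text are hypothetical, and the nltk-based stemming and lemmatization step is omitted to keep the example dependency-free.

```python
import re


def first_sentence(javadoc: str) -> str:
    """Return the first sentence of a Javadoc comment."""
    # Strip the comment delimiters and the leading '*' on each line.
    text = re.sub(r"^\s*/\*\*|\*/\s*$", "", javadoc.strip())
    lines = [re.sub(r"^\s*\*\s?", "", line) for line in text.splitlines()]
    text = " ".join(line.strip() for line in lines if line.strip())
    # Drop block tags such as @param and @return.
    text = re.split(r"\s@\w+", " " + text, maxsplit=1)[0].strip()
    # A sentence ends at the first period followed by whitespace or end of text.
    match = re.search(r"\.(\s|$)", text)
    return text[: match.start() + 1] if match else text


def normalize(sentence: str) -> str:
    """Replace every punctuation character with the <SEG> token."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", sentence)
    return " ".join(t if re.match(r"[A-Za-z0-9_]", t) else "<SEG>" for t in tokens)


# Hypothetical Javadoc comment used only for illustration.
example = """/**
 * Returns the image for the given URL. The image is loaded lazily.
 *
 * @param url the location of the image
 * @return the image
 */"""

target = first_sentence(example)
print(target)
print(normalize(target))
```

Only the functional description survives as the target comment; the parameter and return-value descriptions are discarded.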
After extracting the comment, we can locate the corresponding bytecode fragment through the method name. In the end, 498 jar packages meet our data requirements, i.e., these projects contain both bytecode and the corresponding source files (with comments).
In addition, we performed template deduplication on the collected dataset. Assuming that <> represents any character, the sentences "get the type of error" and "get the type of event" can both be represented by the template "get the type of <>". If the dataset contains too many such similar sentences, it is difficult to judge whether a model with a high score (e.g., BLEU) has really learned the knowledge or has just correctly guessed the repeated words in the template. Therefore, we need to perform template deduplication. Specifically, we use the AEL tool (Jiang et al. 2008), one of the state-of-the-art log parsing approaches for template detection, whose template detection accuracy can reach 90%. The AEL tool is used to perform template detection on the bytecode-comment pairs we obtained (a bytecode-comment pair is called an instance). Among the instances of each template that share more than 80% of their tokens, we keep only two. This ensures that, across the entire dataset, at most two instances are very similar. Template detection changes neither the tokens of the original dataset nor the model structure; its accuracy only affects the outcome of the template-based deduplication and thus the size of the deduplicated dataset. In the end, we collect 55,130 bytecode-comment pairs.

Key information extraction from bytecode
Take the |peek()| method as an example, which is shown in Fig. 4. Its comment is "return the first value of this queue, null if empty". It can be found that the bytecode is composed of an opcode sequence. Compared with the source code, it is difficult to understand the information described by bytecode only.
Bytecode files are unreadable binary files. In order to generate code comments, we need to extract key information from them. By compiling Java source code, we can get the Java bytecode. As described in Sect. 2, the core part of the bytecode corresponding to the Java method is the Code area, which contains the bytecode instruction, the line number table (LineNumberTable, in Fig. 1), and the local variable table (LocalVariableTable, in Fig. 1). The bytecode instructions consist of operators and operands, which contain rich semantic information about the program. Therefore, we mainly extract the semantic information and the CFG in the Code area to express the characteristics of the bytecode.
For the semantic information extraction, we mainly focus on the tokens in the Code area. After disassembly, the Code area may contain a large number of tokens. Using them directly would make the vocabulary too large, and many unknown words might appear in the prediction phase, which would reduce the prediction accuracy of the model. However, most of these tokens follow the camel case naming rule or are connected by ".", so we split the tokens at camel case boundaries and at "." to reduce the vocabulary and alleviate the out-of-vocabulary problem.
We deal with the tokens as follows:
• Delete unnecessary punctuation marks that have no impact on code operation semantics, such as "," and ";";
• For the specific form of a constant, segment the token into natural semantic words based on the naming rule of camel case segmentation;
• Replace single-character tokens, such as "a", "b" and "c", with "simplevar";
• If the token is a pure number, represent it by the |<NUM>| token;
• Finally, add start and end identifiers, i.e., |<BEG>| and |<END>|, at the beginning and the end of the sequence.
Fig. 4 The CFG of the peek() method
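The token rules above can be sketched in Python as follows. This is a minimal illustration, not the released implementation: the helper names, the additional split on "/" (as it appears in disassembled class names), the lowercasing, and the sample token list are all assumptions.

```python
import re


def split_identifier(token: str) -> list[str]:
    """Split a token on '.', '/' and camel case boundaries."""
    words = []
    for part in re.split(r"[./]", token):
        # Boundary between a lowercase letter or digit and an uppercase letter.
        part = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part)
        # Boundary between an acronym and a following capitalized word.
        part = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", part)
        words.extend(part.split())
    return words


def preprocess(tokens: list[str]) -> list[str]:
    """Apply the token rules and wrap the result in <BEG> ... <END>."""
    out = ["<BEG>"]
    for token in tokens:
        token = token.strip(",;")  # drop punctuation without semantic impact
        if token.isdigit():
            out.append("<NUM>")
        elif len(token) == 1 and token.isalpha():
            out.append("simplevar")
        else:
            # Lowercasing is an assumption made for this sketch.
            out.extend(word.lower() for word in split_identifier(token))
    out.append("<END>")
    return out


# Hypothetical token list resembling a disassembled Code area.
tokens = ["aload_1", "invokevirtual", "java/lang/String.toString", "a", "42", "areturn;"]
print(" ".join(preprocess(tokens)))
```

Opcodes such as |aload_1| pass through unchanged because they contain no camel case boundary, while qualified names are broken into vocabulary-friendly words.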

Control flow graph construction
Compared with source code, the gap between bytecode and comment is larger, so the translation between bytecode and comment is challenging. An easy way to model bytecode is to treat it as plain text. However, this ignores structural information and may reduce the accuracy of the generated comments. In order to learn the structural information of the bytecode, we introduce the CFG into the model.
A CFG is a graphical representation of the control flow during the execution of a program. It represents the possible flow of execution through all basic blocks of the code, and it can also reflect the actual execution of a piece of code. It contains code structural information, which helps the model understand bytecode and generate comments. However, we cannot simply feed the CFG into BCGen, which is a sequence processing model. Several existing methods serialize the CFG and capture its features (Ding et al. 2014; Phu et al. 2019; Ramu et al. 2020). Ding et al. (2014) constructed an executable tree by removing the back edges of the CFG, and then connected the paths from all root nodes to the leaf nodes of the executable tree as a sequence representation of the graph. Finally, n-grams are used to extract sequence features. Because the paths from the root node to the leaf nodes overlap, this may lead to a large number of repeated tokens, and the final sequence representation is too long. Phu et al. (2019) proposed an improved version of Ding et al. (2014) that reduces the time complexity of feature extraction. If we directly applied these methods to serialize the CFG, the generated sequence would be too redundant and long, and model training would be ineffective and memory-consuming. Ramu et al. (2020) proposed a customized control flow analysis method (BCFA) to make source code analysis over large-scale CFGs more efficient. However, this method needs to perform static analysis on the CFG before selecting the traversal strategy, and constructs a decision tree to select the optimal traversal strategy. On the one hand, for our relatively small dataset, the time saved by traversal optimization may not be worth the time cost of strategy selection.
On the other hand, different traversal strategies will lead to inconsistent representations of the final generated CFG sequences, which may have an impact on the performance of the model. Therefore, a simple traversal method is used to construct the sequences. Different from the above methods, we propose to traverse the CFG using a preorder traversal and convert it into a sequence. The details are presented in Algorithm 1. The input of Algorithm 1 is the root node of the CFG and a list used to mark the visited nodes. This list prevents the program from looping infinitely when the graph contains cycles. The algorithm is implemented recursively, traversing each CFG node in order and appending it to the output string, which finally yields the converted string sequence.
Algorithm 1: Preorder traversal for CFG
Input: n, v
  n is the node of CFG
  v is used to mark whether the node has been traversed
Output: seq
  seq is the token sequence of the CFG after traversal
def TraversalCFG(n, v)
  seq = ∅ // Initialize seq to an empty string
  v.append(n.num) // num is the node number
  if !n.hasChild then
    seq += n.label // label is the node content
  else
    seq += n.label
    for child in n.Children do
      if child.num not in v then
        seq += TraversalCFG(child, v)
  return seq

We generate a control flow graph in the form of a DOT file from the bytecode. The DOT file is a text file that describes the constituent elements of a diagram and the relationships between them. In addition, the DOT file can be visualized with the |Graphviz| tool. 6 The principle of generating the CFG from the bytecode file is to convert the bytecode into three-address code and identify the basic blocks. The blocks are connected to form a control flow graph, which can be described in the DOT language. In our work, we use the Soot tool (Lam et al. 2011) to generate DOT files from bytecode files. Soot is a software engineering tool for analyzing and optimizing Java programs. It is often used for static or dynamic analysis of code or logs. The input of Soot is a bytecode file. Using its CFGViewer method, the bytecode can be parsed into three-address code and the basic blocks identified without additional parameter settings, and the final generated CFG is saved in a DOT file. During dataset collection, Soot succeeded in generating the CFG in 80% of cases. We use Soot to convert the CLASS file into a DOT file, whose content is the CFG. Then we serialize the CFG in the DOT file to get the text that can be input to the model. Figure 4 shows the CFG corresponding to the |peek()| method.
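The preorder traversal of Algorithm 1 can be sketched in Python as follows. This is an illustrative version under stated assumptions: the `Node` class, the list-based return value, and the small example graph (loosely modeled on a branch with a back edge) are hypothetical, and parsing of the DOT file produced by Soot is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A CFG node: `num` is the node number, `label` is the node content."""
    num: int
    label: str
    children: list["Node"] = field(default_factory=list)


def traverse_cfg(node: Node, visited: list[int]) -> list[str]:
    """Return the labels of all reachable nodes in preorder."""
    visited.append(node.num)  # mark before descending so cycles terminate
    seq = [node.label]
    for child in node.children:
        if child.num not in visited:
            seq.extend(traverse_cfg(child, visited))
    return seq


# Hypothetical CFG with a back edge from n2 to n1.
n3 = Node(3, "return r2")
n2 = Node(2, "r2 = r1.value")
n1 = Node(1, "if r1 == null goto label1", [n2, n3])
n0 = Node(0, "r0 := @this", [n1])
n2.children.append(n1)  # back edge that would cause infinite recursion without `visited`

print(" ".join(traverse_cfg(n0, [])))
```

Each node is emitted exactly once, so the serialized sequence grows linearly with the number of basic blocks rather than with the number of root-to-leaf paths.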
Finally, we process the sequence obtained from the preorder traversal of the CFG as follows:
• Segment the identifier into natural semantic words based on the naming rule of camel case segmentation;
• Remove some useless symbols, such as "@", "#" and "$";
• Replace single-character variables with "simplevar";
• If the token is a pure number, represent it by the |<NUM>| token;
• Finally, add the start and end identifiers |<BEG>| and |<END>| at the beginning and end of the sequence.
The final CFG sequence of the |peek()| method is shown below:
<BEG> r0 = this r3 = akka dispatch abstract bound node queue r0 r1 = r3 peek node if r1 == null goto label1 r2 = r1 value goto label2 label return r2 if r1 == null goto label1 label r2 = null <END>

BCGen model
In this part, we introduce the details of BCGen. Figure 5 illustrates the model framework of BCGen. BCGen is mainly composed of three parts: two encoders (called |Token Encoder| and |CFG Encoder|, respectively) and one decoder. Since semantic and structural information are different modalities of bytecode, we need to feed the two sequences (the token sequence of the bytecode and the CFG sequence) to the model separately. BCGen uses two encoders to encode the token and CFG sequences, respectively, so that one encoder can learn the semantic information of the bytecode from the token sequence, and the other can learn the structural information from the CFG sequence. The ability of the encoder and decoder to extract features from input sequences determines the performance of the model. Common sequence feature extraction models include RNN, LSTM, GRU, Transformer (Wang et al. 2019) and so on. One way to improve accuracy is to strengthen the neural architecture. Among the above models, the Transformer can better handle long-term dependencies, while GRU has fewer model parameters. In the NLP community, the Transformer has achieved state-of-the-art results (Devlin et al. 2018; Dong et al. 2019; Radford et al. 2019), outperforming RNNs on a variety of NLP tasks. Therefore, we adopt the Transformer-based encoder-decoder structure as the model backbone. However, the model requires two encoders to encode the input sequences, and our early attempts showed that using Transformers for both encoders would result in a model with too many parameters and poor accuracy. Therefore, we replace one of the encoders with a GRU to reduce the number of model parameters while maintaining model accuracy. Next, we introduce the details of these three parts separately.

Token encoder
|Token Encoder| is a Transformer-based encoder whose input is the token sequence of the bytecode. For a token sequence $X^b = x^b_1, ..., x^b_n$, before inputting it into the encoder, it is necessary to obtain the representation vector $e_i$ of each token $x^b_i$ of this input sequence. Because the Transformer does not adopt a recurrent structure, it cannot by itself use the order information of the tokens. Therefore, when constructing the representation vectors of the tokens, it is necessary to use the positional encoding of the token sequence to preserve the absolute and relative position information of each token in the sequence. Finally, $e_i$ is computed as

$$e_i = WE(x^b_i) + PE(i)$$

in which $WE$ and $PE$ are the embedding layer of the token and the embedding layer of the position of the token, respectively. Then, the obtained token representation matrix is passed into the encoder. To be precise, the |Token Encoder| is a stack of Attention blocks (|AttnBlk|), and each |AttnBlk| contains a Multi-Head Attention layer (|MulAttn|) and a Feed Forward layer (|FeedForward|). The output of both layers passes through the Add layer and the Layer Normalization layer (|Norm|), where the residual structure of the ResNet (He et al. 2016) is used to prevent the gradient from vanishing, as shown in Fig. 5. The output $m_i$ of the |MulAttn| is computed as

$$m_i = \mathrm{Norm}(e_i + \mathrm{MulAttn}(e_i))$$

and the output of one |AttnBlk| is computed as

$$c_i = \mathrm{Norm}(m_i + \mathrm{FeedForward}(m_i))$$

Through the MulAttn, FeedForward, Add and Norm described above, an Encoder block can be constructed. After stacking 12 |AttnBlk|, the encoding information matrix $C$ of all tokens in the sequence can be obtained, which contains the opcode information and other potential features. Inspired by transfer learning, we use the pretrained BERT (Devlin et al. 2018) encoder as the |Token Encoder|.

CFG encoder
The second encoder is a GRU (Chung et al. 2014), which is an effective variant of the LSTM (Hochreiter and Schmidhuber 1997) network. Compared with LSTM, it achieves similar results with a greatly simplified network structure, and it also alleviates the long-term dependency problem of RNNs. The |CFG Encoder| is used to learn the characteristics of the input CFG sequence $X^c = x^c_1, ..., x^c_m$. At each time step $i$, it reads one term $x^c_i$ of the CFG sequence, and then updates and records the current hidden state $g_i$, namely,

$$g_i = f(x^c_i, g_{i-1})$$

where $f$ is a GRU unit that maps a term $x^c_i$ of the CFG sequence into a hidden state $g_i$. The GRU can learn structural information from the CFG and finally encodes this information into the vector $g$.

Decoder
The |Decoder| is used to compute the probability of the word $y_i$ of the target comment conditioned on the representation vectors $c$, $g$ and the previously generated words $y_1, ..., y_{i-1}$ of the comment. The composition of the |Decoder| is very similar to that of the |Token Encoder|: it is also a stack of Attention blocks (|AttnBlk|), except that each Attention block of the |Decoder| contains two attention layers. The first Multi-Head Attention layer of the |AttnBlk| adopts the masked operation (|MaskMulAttn|), which prevents the $i$-th token from attending to the tokens from position $i+1$ onward, i.e.,

$$md^1_i = \mathrm{Norm}(e^y_i + \mathrm{MaskMulAttn}(e^y_i))$$

where $e^y_i$ is the representation vector of the comment word at position $i$. The second Multi-Head Attention layer (|MulAttn|) attends over the output $C$ of the |Token Encoder| and is computed as

$$md^2_i = \mathrm{Norm}(md^1_i + \mathrm{MulAttn}(md^1_i, C))$$

After obtaining the output vector $md^2_i$ of |MulAttn|, we sum the output semantic vectors of the |Token Encoder| and the |CFG Encoder| to fuse the bytecode semantic and structural features extracted by these two encoders, and feed the result into the feed forward layer. So, the output of the |Decoder| is

$$d_i = \mathrm{Norm}(md^2_i + \mathrm{FeedForward}(md^2_i + g))$$

The last part of the |Decoder| block uses Softmax to predict the output words of the comment. Softmax predicts the next comment word according to each row of the output matrix of the linear layer, and finally generates the predicted comment sequence $y$, i.e.,

$$p(y_i \mid y_1, ..., y_{i-1}, c, g) = \mathrm{Softmax}(w d_i + b)$$

where $w$ and $b$ are the learnable parameters of the linear layer.
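Finally, the goal of this model is to minimize the cross-entropy, i.e., to minimize the following objective function:

$$\mathcal{L} = -\sum_{i=1}^{T} \log p(y_i \mid y_1, ..., y_{i-1}, c, g)$$

where $T$ is the length of the target comment. By optimizing the objective function using optimization algorithms such as gradient descent, BCGen can learn and tune its own parameters.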

Baselines
We used the basic Seq2Seq model (Sutskever et al. 2014), the attention-based Seq2Seq model (Luong et al. 2015) and the Hybrid-deepcom model (Hu et al. 2020) as the baselines. The Seq2Seq model is often used in machine translation, text summarization and other tasks. In general, the Seq2Seq model achieves better results after introducing the attention mechanism. Hybrid-deepcom is one of the state-of-the-art models for code comment generation. In essence, it is also a Seq2Seq model, which uses GRUs as the encoder and decoder, and contains two encoders to extract the syntactic and semantic information of the source code, respectively.

Parameter settings
The hardware platform of our experiment adopts NVIDIA RTX 3090, Intel i9 10980XE CPU @ 3GHz with 256GB memory. The training parameters are set as follows: the training batch size is 32, the learning rate is 0.001, the maximum number of iterations is 10. We use twelve Transformer blocks with eight heads in each block, and the embedding dimension is set to 768. The length of the input sequence is an important parameter of the model. In order to set the input sequence length reasonably, we counted the length distribution of token sequence, CFG and the comment of each method, as shown in Fig. 6. By calculation, when the maximum length of token and CFG sequences is set to 384, and the maximum length of the comment is set to 64, more than 90% of the methods can be covered. Therefore, we finally set the token and CFG sequences' maximum length to 384, and the comment maximum length to 64.
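The choice of the maximum sequence length from a length distribution can be sketched in Python as follows. The helper names and the toy length list are hypothetical and serve only to illustrate how a cutoff that covers at least 90% of the methods could be derived; they are not the statistics of the actual dataset.

```python
import math


def coverage(lengths: list[int], max_len: int) -> float:
    """Fraction of sequences whose length does not exceed max_len."""
    return sum(length <= max_len for length in lengths) / len(lengths)


def length_covering(lengths: list[int], target: float = 0.9) -> int:
    """Smallest observed length that covers at least `target` of the sequences."""
    ordered = sorted(lengths)
    k = math.ceil(target * len(ordered))
    return ordered[k - 1]


# Hypothetical sequence lengths for ten methods.
toy_lengths = [12, 40, 75, 120, 200, 310, 350, 380, 384, 900]
print(coverage(toy_lengths, 384))
print(length_covering(toy_lengths))
```

Sequences longer than the chosen cutoff would be truncated, so a small number of very long methods does not inflate the input size for all others.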

Evaluation criteria
In this paper, we use BLEU (Papineni et al. 2002), METEOR (Banerjee and Lavie 2005) and ROUGE-L (Lin 2004) to measure the quality of the comments generated by the model.

Fig. 6 Length distribution of Bytecode, CFG and Comment
BLEU is an indicator commonly used in the NLP field to measure the similarity of two texts. We use BLEU to calculate the similarity between the comments generated by the model and the reference comments, so as to measure the comment generation quality of the model. The BLEU score ranges from 0 to 100 as a percentage value. The higher the score, the higher the similarity between the two comments. It is calculated as follows:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the ratio of subsequences of length $n$ in the candidate sequence that are also in the reference sequence, $w_n$ is the weight of the $n$-gram precision (uniform, $w_n = 1/N$, in the standard setting), and $N$ is the maximum number of grams. $BP$ is the brevity penalty:

$$BP = \begin{cases} 1 & \text{if } l_c > l_s \\ e^{1 - l_s / l_c} & \text{if } l_c \le l_s \end{cases}$$

where $l_c$ is the length of the candidate translation and $l_s$ is the length of the effective reference sequence. To better illustrate the results, we evaluate the generated comments with smoothing. Chen and Cherry (2014) introduced 7 smoothing techniques that work better for sentence-level evaluation. In this study, we use smoothing function 3.
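The BLEU definition above can be sketched in pure Python as follows. This is a simplified, unsmoothed sentence-level version with uniform weights and a single reference; the function names and the example sentences are assumptions, and smoothing function 3 of Chen and Cherry (2014) is deliberately not implemented here.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Unsmoothed BLEU with uniform weights 1/max_n and one reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0:
            return 0.0  # without smoothing, one zero precision makes the score zero
        log_precisions.append(math.log(overlap / total))
    l_c, l_s = len(candidate), len(reference)
    bp = 1.0 if l_c > l_s else math.exp(1 - l_s / l_c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "return the first value of this queue null if empty".split()
candidate = "return the first value of this queue".split()
print(bleu(candidate, reference))
```

In this example every candidate n-gram also occurs in the reference, so the score is determined entirely by the brevity penalty for the shorter candidate.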
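METEOR calculates the similarity score between a generated sequence and a reference sequence by:

$$\mathrm{METEOR} = (1 - Pen) \cdot F_{mean}$$

where $Pen$ is the penalty term, which is calculated according to the number of chunks ($ch$) and the number of matches ($m$):

$$Pen = \gamma \left(\frac{ch}{m}\right)^{\beta}$$

and $F_{mean}$ is the harmonic mean of precision ($P$) and recall ($R$):

$$F_{mean} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}$$

The $\alpha$, $\beta$ and $\gamma$ above are three penalty parameters whose default values are 0.9, 3.0 and 0.5, respectively.

ROUGE-L computes an F-score based on the length of the longest common subsequence (LCS). Suppose the lengths of the generated comment $X$ and the reference comment $Y$ are $m$ and $n$; it is computed as:

$$P_{lcs} = \frac{LCS(X, Y)}{m}, \qquad R_{lcs} = \frac{LCS(X, Y)}{n}, \qquad F_{lcs} = \frac{(1 + \beta^2)\, P_{lcs} \cdot R_{lcs}}{R_{lcs} + \beta^2 \cdot P_{lcs}}$$

The parameter $\beta = P_{lcs} / R_{lcs}$ (not to be confused with the METEOR parameter of the same name) and $F_{lcs}$ is the value of ROUGE-L. One advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence-level word order as n-grams.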
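The ROUGE-L computation can be sketched as follows, assuming whitespace-tokenized comments and the setting $\beta = P_{lcs} / R_{lcs}$ stated in the text. The function names are hypothetical, and the sketch is meant only to make the LCS-based definition concrete.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated, reference):
    """F_lcs with P = LCS/m, R = LCS/n and beta = P/R."""
    if not generated or not reference:
        return 0.0
    lcs = lcs_length(generated, reference)
    if lcs == 0:
        return 0.0
    p = lcs / len(generated)   # precision over the generated comment (length m)
    r = lcs / len(reference)   # recall over the reference comment (length n)
    beta_sq = (p / r) ** 2
    return (1 + beta_sq) * p * r / (r + beta_sq * p)
```

For example, the token lists `a b c d` and `a c d e f` share the subsequence `a c d`, which is counted even though `b` interrupts it in the generated comment.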

Research questions
In our experiments, we mainly focus on the following research questions:
• RQ1: How effective is the proposed approach when compared with the state-of-the-art baselines?
• RQ2: Is the CFG helpful to improve the performance of the code comments generation?
• RQ3: Does the bytecode length (i.e., token) affect the accuracy of BCGen?
We ask RQ1 to evaluate the BCGen model compared with other deep learning models. We ask RQ2 in order to evaluate the impacts of the CFG on the accuracy of the comment generation. For RQ3, we want to examine the effect of the bytecode length on the quality of comments generated by BCGen.

Results
In this section, we evaluate the performance of different approaches to generate comments for bytecode. The ratio of the training set, test set, and validation set is 8:1:1. We measure the gap between the automatically generated comments and the human-written comments by BLEU, METEOR and ROUGE-L scores.

RQ1: How effective is the proposed approach when compared with the state-of-the-art baselines?
Since the original input of these baseline models is based on source code, in order to measure the performance of these models on bytecode comment generation, we replace the input of these models with bytecode. In particular, since the Seq2Seq model has only one encoder, we take the concatenation of the token sequence and the CFG sequence as its input, while the Hybrid-deepcom model has two encoders and can therefore take the token sequence and the CFG sequence as input at the same time. Table 1 shows the scores of the different approaches for generating comments for Java methods at the bytecode level. BCGen outperforms all the baselines on all three metrics. The basic Seq2Seq models have low accuracy and cannot effectively learn the semantic and structural information of bytecode. Compared with the attention-based Seq2Seq and Hybrid-deepcom, the BLEU-4 score of BCGen is higher by 8.46 percentage points (from 17.16 to 25.62%) and 7.36 percentage points (from 18.26 to 25.62%), respectively. METEOR and ROUGE-L scores show similar results. This may be due to the ability of BCGen to handle long-term dependencies. As bytecode and CFG sequences are much longer than natural language sentences, the self-attention mechanism adopted by the Transformer helps BCGen capture meaningful words over long distances, and then learn features related to the bytecode operations.

RQ2: Is the CFG helpful to improve the performance of the bytecode comments generation?
In this RQ, we compare the performance of the competing approaches with different inputs, i.e., token and CFG sequences alone, or both of them. The results are shown in Table 2, where the parentheses indicate the input of the model. After adding the CFG to the model, the BLEU-4 of attention-based Seq2Seq improved by 2.85 percentage points (from 14.31 to 17.16%), while that of Hybrid-deepcom improved by 2.77 percentage points (from 15.49 to 18.26%). For our model, the BLEU-4 of BCGen improved the most after adding the CFG to the input (from 21.30 to 25.62%, an improvement of 4.32 percentage points). METEOR and ROUGE-L scores also improved to varying degrees. In terms of the degree of improvement, BCGen can learn more useful features from the CFG than the baselines. This further supports the advantage of BCGen in generating comments for bytecode. Regardless of which model is used, the performance of comment generation improves after adding CFG information.
To explore whether this improvement is entirely due to CFG or a combination of token and CFG sequences, we conducted further experiments. We only take the CFG as the model input, and then observe the performance of the model. It can be seen from Table 2 that except for the basic Seq2Seq model, the result of simply using CFG as the model input is better than the result of simply using the token as the model input, but inferior to the result of using both as the model input. This shows that the improvement of model performance depends on the fusion of token and CFG, that is, the model can learn useful information from token and CFG sequences, respectively.

RQ3: Does the bytecode length (i.e., the number of tokens) affect the accuracy of BCGen?
To examine the effect of bytecode length on the quality of comments generated by BCGen, we divided the test set into subsets based on the number of tokens per bytecode method. Specifically, we divide the test set into 4 subsets by calculating the quartiles of the bytecode length of the test set, and calculate the BLEU-4 score corresponding to each subset. For comparison purposes, we also plot the performance of the baseline models on each subset, and the results are shown in Fig. 7. We observe that about 75% of the test cases have bytecode fewer than 144 tokens in length, and only a few test cases are particularly long. Moreover, regardless of the subset, the quality of comments generated by BCGen is higher than that of the baseline models. Furthermore, as the bytecode length increases, the accuracy of our model shows an overall upward trend until the bytecode becomes too long, whereas the accuracy of the baseline models generally decreases with the increase of bytecode length. This indicates that our model is better able to deal with longer bytecode sequences, that is, BCGen can better remember the long-term dependencies of bytecode sequences. Overall, the length of the bytecode does have a certain impact on the model accuracy: for very long sequences in particular, the model accuracy drops significantly.
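The quartile split described above can be sketched as follows. The sketch assumes each test sample is reduced to a hypothetical `(token_count, bleu4_score)` pair and uses the default quantile method of Python's `statistics` module; the paper does not state which quantile convention was used, so the cut points here are illustrative.

```python
from bisect import bisect_left
from statistics import mean, quantiles


def split_by_length_quartiles(samples):
    """Group (token_count, score) pairs into four buckets by length quartile.

    Returns the three cut points (Q1, Q2, Q3) and the list of four score buckets.
    """
    lengths = [length for length, _ in samples]
    cuts = quantiles(lengths, n=4)
    buckets = [[] for _ in range(4)]
    for length, score in samples:
        # Bucket 0 holds lengths <= Q1, bucket 3 holds lengths > Q3.
        buckets[bisect_left(cuts, length)].append(score)
    return cuts, buckets


def mean_score_per_bucket(buckets):
    """Average score of each bucket, or None for an empty bucket."""
    return [mean(bucket) if bucket else None for bucket in buckets]
```

Plotting the four per-bucket averages for each model gives the kind of comparison shown in Fig. 7.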

User study
In order to verify the effectiveness of the proposed model, we conduct a user study to measure the generated comments of BCGen. Because the bytecode is difficult to understand, we cannot require people to judge whether the generated comments reflect the meaning of the bytecode. Instead, we examine whether the generated comments reflect the meaning of the source code corresponding to the bytecode. Since the basic Seq2Seq model does not work well, and in RQ1 we also confirmed that the attention-based Seq2Seq model has better performance than the basic Seq2Seq model, here we only focus on the attention-based Seq2Seq model and Hybrid-deepcom in the user study.
The evaluation considers two aspects of the generated comments: naturalness (grammaticality and fluency of the generated comments) and informativeness (the amount of content carried over from the input code to the generated comments, ignoring the fluency of the text). Specifically, we sent out invitations for manual evaluation to programmers who graduated from our research group, and eventually 15 programmers (not co-authors) accepted the invitation. These programmers (with an average of 4.5 years of programming experience) are all computer science majors with a bachelor's degree or above, and some are working programmers from well-known Internet companies such as Alibaba and Tencent. They are asked to rate the generated comments in terms of grammaticality and fluency (i.e. naturalness) and whether the comments accurately reflect the functionality of the source code (i.e. informativeness). Both ratings are on a scale between 0 and 4. To facilitate evaluation, we used the Questionnaire Star website platform to make a questionnaire for participants to evaluate the generated comments. Specifically, we first randomly sample 40 bytecodes from the test set as input to the three models, and each model generates comments for these 40 bytecodes individually. Then we retrieve the source code corresponding to these bytecodes and put it in the questionnaire so that participants can understand the meaning of the bytecode. The information that the participants can see in the questionnaire is the source code corresponding to the bytecode, and the comments generated by each anonymized model for each bytecode. Figure 8 illustrates an example question in our questionnaire.
The results are presented in Table 3, which shows that BCGen scores higher than the baselines in both naturalness (BCGen exceeds Hybrid-deepcom by 15.1% on average) and informativeness (BCGen exceeds Hybrid-deepcom by 15.2% on average). We also perform a statistical analysis of the scores of each model. Specifically, we use the Mann-Whitney U test to verify the statistical significance. The improvement of BCGen relative to the other models is statistically significant. In terms of naturalness, the p-value between BCGen and Hybrid-deepcom is 0.011, while that between BCGen and attention-based Seq2Seq is 0.00002. In terms of informativeness, the p-value between BCGen and Hybrid-deepcom is 0.009, while that between BCGen and attention-based Seq2Seq is 0.000017. All the p-values are less than 0.05, hence the differences between BCGen and the other models are statistically significant.
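For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation with tie correction and no continuity correction. It is an illustration under these assumptions, not the procedure used to obtain the p-values above; the paper does not specify the exact variant, and the function name is hypothetical. Tie handling matters here because ratings on a 0 to 4 scale contain many equal values.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Return (U_x, U_y, two-sided p-value) for two independent samples."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(list(x) + list(y))

    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank_of, start = {}, 1
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = start + (c - 1) / 2
        start += c

    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x

    # Standard deviation of U under the null hypothesis, corrected for ties.
    tie_term = sum(c ** 3 - c for c in counts.values()) / (n * (n - 1))
    variance = max(0.0, n1 * n2 / 12 * ((n + 1) - tie_term))
    if variance == 0:
        return u_x, u_y, 1.0  # all observations identical

    z = (u_x - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u_x, u_y, p_value
```

For small samples an exact test would be preferable to the normal approximation used in this sketch.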

Discussion
In this section, we mainly discuss whether the comments generated by BCGen really reflect the knowledge of bytecode.
To show the effect of BCGen in generating comments for bytecode more intuitively, we perform interval statistics on the BLEU-4 scores of the test set. Specifically, we divide the BLEU score into three intervals, (0.8, 1], (0.2, 0.8] and [0, 0.2], which correspond to the high, medium and low score intervals respectively. In addition, we count the number of samples with a BLEU score of 1, that is, the generated comments completely match the reference comments. Table 4 shows the statistical results of BCGen and the baselines, and Table 5 shows some of the test samples we selected. Due to space constraints, only relatively short bytecode segments are selected here. Note that the human-written comments (that is, reference comments) are not the original comments, but have been processed through stemming and part-of-speech recovery. Since the bytecode is difficult to understand, we attach the source code here for reference. The samples also illustrate that bytecode is more difficult to understand than source code. From Table 4, we can see that for BCGen most of the BLEU scores of the generated comments are concentrated in the medium range, accounting for 64.81%. However, for the baseline models, most of the BLEU scores fall in the low range (78.0% for basic Seq2Seq, 73.27% for attention-based Seq2Seq and 71.44% for Hybrid-deepcom). This also shows that BCGen outperforms the baseline models in the quality of comment generation. For samples with BLEU scores in the medium range, the auto-generated comments differ from the reference comments in wording. They have similar meanings despite having only a small fraction of matching words, so we believe they can correctly describe the meaning of the bytecode. Case 1 in Table 5 is a test sample of BCGen that falls in the medium score range. Although the generated comment does not have a high BLEU score, it describes the meaning of the bytecode in more detail.
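The interval statistics can be reproduced with a short sketch, assuming per-sample BLEU-4 scores in the range [0, 1]. The function names are hypothetical; the interval boundaries follow the text.

```python
from collections import Counter


def bleu_interval(score):
    """Map a BLEU score in [0, 1] to the high, medium or low interval."""
    if score > 0.8:
        return "high"      # (0.8, 1]
    if score > 0.2:
        return "medium"    # (0.2, 0.8]
    return "low"           # [0, 0.2]


def interval_stats(scores):
    """Share of samples per interval, plus the share of exact matches (score 1)."""
    counts = Counter(bleu_interval(s) for s in scores)
    total = len(scores)
    stats = {name: counts[name] / total for name in ("high", "medium", "low")}
    stats["exact"] = sum(1 for s in scores if s == 1.0) / total
    return stats
```

Exact matches are counted separately but are also part of the high interval, which is why the two shares should not be added.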
Compared with the medium and low score ranges, the high score range contains fewer samples, accounting for only 12.05%. However, 8.8% of the samples perfectly match the reference comments. This shows that most of the generated comments in the high partition perfectly match the reference comments. Case 2 in Table 5 is a test sample of BCGen that falls in the high score range. We carefully observed the corresponding methods of these high-scoring cases, and most of their functions are

It is worth mentioning that, compared with the first two score ranges mentioned above, we pay more attention to the test samples falling in the low partition. Cases 3 and 4 in Table 5 are test examples for which the BLEU score of BCGen falls in the low partition. These poor-quality comments generated by BCGen can be roughly divided into two categories. The first consists of comments that are essentially irrelevant or contain many repeated words. The second consists of comments that are related to the content and meaning of the bytecode, but whose wording is very different from the reference sentences. For case 3, BCGen generates an incorrect comment, while for case 4, the meaning of the comment generated by BCGen is acceptable. For the baselines, comments with low BLEU scores tend to contain excessively repeated or irrelevant words.

Related work
BCGen employs a deep learning model to generate natural language comments for bytecode, which is mainly related to two types of work. One is code comment generation, and the other is bytecode analysis.

Code comment generation
Different from the current code comment generation, our work directly generates comments for the bytecode, that is, directly uses the bytecode as the model input, and outputs the corresponding natural language comments. Since there is little relevant bytecode comments generation research, we mainly discuss the approaches of comment generation at the source code level.
The general method of code comment generation is shown in Fig. 9. First, the code is downloaded from the open-source website and processed to form a dataset, and then the comments are generated by methods such as heuristic templates (Haiduc et al. 2010a; Rodeghero et al. 2015; McBurney and McMillan 2015), information retrieval (Wong et al. 2013; Rodeghero et al. 2017) and deep learning (Alon et al. 2018; Hu et al. 2018a).

Fig. 9 Basic framework of the code comment generation task
Traditional automatic code comment generation methods (Wong et al. 2013; Haiduc et al. 2010b; Moreno et al. 2013; Rodeghero et al. 2017) are mainly based on heuristic templates and information retrieval techniques. They directly extract useful information from source code to generate comments. Most of these methods have limited generalization ability. In recent years, deep learning techniques have developed rapidly, and many studies have been devoted to generating code comments using deep learning methods (Alon et al. 2018; Haque et al. 2020; Hu et al. 2018a; Iyer et al. 2016). Moreover, deep learning methods have also promoted the development of traditional retrieval-based comment generation methods (Zhang et al. 2020; Wei et al. 2020). Powerful deep learning models can extract more abstract information from the source code, and further improve comment generation accuracy. Since we also employ a deep learning model to generate bytecode comments, we discuss the approaches based on deep learning models.
Iyer et al. (2016) proposed an RNN network with attention to generate comments describing C# code snippets and SQL queries. The idea is to treat the source code as plain text, and then model the conditional distribution of comments through an RNN network. This work is the first one to use an encoder-decoder framework for code comment generation.
Many researchers believe that using source code information alone is not enough to generate high-quality comments, so they use ASTs that contain structural and semantic information to generate code comments. Hu et al. (2018a) observed that an abstract syntax tree (AST) is a tree, whereas neural machine translation models rely on sequence input. They therefore proposed a traversal method called structure-based traversal (SBT) to convert an AST into a sequence, and then used an attention-based Seq2Seq model to convert the AST sequence into a summary. Different from Hu et al. (2018a), Shido et al. (2019) directly proposed a Tree-based LSTM model, which starts from the leaf nodes and encodes from the bottom up. Furthermore, Alon et al. (2018) randomly selected leaf node pairs on the abstract syntax tree to generate multiple paths, and then encoded each path using an RNN model to obtain the information of these paths.
The above methods are all based on one kind of information to generate comments. In recent years, many models fuse multiple kinds of information as input. Hu et al. (2020) proposed a Hybrid-deepcom model. This model has two encoders, and utilizes both the lexical information of the source code and the grammatical information of the AST. LeClair et al. (2019) proposed a neural model that combines the AST source code structure and words in the code to generate coherent comments of Java methods. The specific approach is to design a model architecture to process word sequences and SBT/AST sequences in different recurrent networks with attention mechanisms. Haque et al. (2020) argued that using only the internal information of the code fragment would limit the performance of the model. Therefore, they proposed a method to use file context information to help generate code comments. What's more, Hu et al. (2018b) combined the API document information to generate code comments on the basis of the source code.
In addition to the above methods, there are many other methods of code comment generation. Allamanis et al. proposed to model source code using graph-based neural networks (Allamanis et al. 2017) as well as convolutional networks with attention (Allamanis et al. 2016). Wang et al. (2020) presented a code comment generation approach using a hierarchical attention network that incorporates multiple code features.

Bytecode analysis
Currently, bytecode has been used for many software tasks, such as binary code search (Yang et al. 2021; Xue et al. 2018; Tian et al. 2021), malware detection (Daoudi et al. 2021; Rozi et al. 2020; Zhang et al. 2021), vulnerability detection (Guo et al. 2019; Tian et al. 2020), code clone detection (Yu et al. 2017; Keivanloo et al. 2012), and code similarity detection (Liu 2021). Yang et al. (2021) proposed the Codee model. For processing binary code, they formulated the generation of basic block embeddings as a network representation learning problem, in order to capture the semantic information of basic blocks and the structural information of the CFG. BinDeep, proposed by Tian et al. (2021), obtained the instruction sequence of each function in the binary code by decomposing the instructions, and then used an instruction embedding model to vectorize the extracted instructions. Rozi et al. (2020) proposed a method to detect malicious JavaScript using deep neural networks to extract bytecode sequence information. It used a compiler to generate a sequence of bytecode that corresponds to an abstract form of machine code. Guo et al. (2019) constructed a software vulnerability detection system based on deep learning and bytecode, in which bytecode slices were converted into numeric vectors as the input of neural networks. Yu et al. (2017) proposed a novel code clone detection method based on Java bytecode. This method can simultaneously detect code clones at both the method level and the block level. To improve accuracy, the similarities of both method call sequences and instruction sequences are considered during code clone detection. A common process of the above work is to convert the bytecode into vector sequences to extract features, and then apply them to different software tasks, while none of these works focuses on comment generation at the bytecode level.

Threats to validity
External validity One threat to external validity is that we only conduct experiments on a Java bytecode dataset. The bytecode of different programming languages may differ. Therefore, the effectiveness of our approach on the bytecode of other programming languages remains unknown. In the future, further investigation on bytecode compiled from other programming languages is needed to mitigate this threat.
Internal validity The internal validity refers to the collection and processing of comments. We collect comments for the Java bytecode by extracting the first sentence of the comment before the header of the source code method corresponding to the bytecode. Despite our further denoising work, there may still be some poor-quality comments in the dataset. What's more, stemming and part of speech processing of comments to reduce vocabulary size and model dimensionality might lead to grammatical errors and might negatively impact developers' understanding of comments. In the future, we will investigate better techniques to build the dataset.
Construct validity In this paper, we have only discussed using CFG sequence and token sequence of bytecode to represent bytecode. It is not known whether other information can be used to represent the bytecode and how they affect the result. This will also serve as our future work to further address this threat.
Conclusion validity The choice of evaluation metrics may have a considerable impact on the results of dataset-based evaluation, because different metrics emphasize different aspects of quality. In this paper, we use three metrics to measure the model performance. More evaluation metrics should be used in the future to mitigate this threat. For the results of human-based evaluation, subjectivity has an impact. Different evaluators affected by subjective factors have different understandings of bytecode and comments. More evaluators should be invited to mitigate this threat in the future.

Conclusion
Generating comments for bytecode will benefit many software activities, but it is a challenging task. In this paper, we propose BCGen, which to our knowledge is the first method to generate comments at the bytecode level. BCGen is a transformer-based model that takes bytecode token information and CFG information as input. The CFG is generated from the bytecode and converted to a sequence through preorder traversal. The method achieves good results on machine translation metrics compared with several baselines, demonstrating the effectiveness and feasibility of BCGen. In the future, we plan to extract more effective information from bytecode to further improve the accuracy of comment generation at the bytecode level.