A methodology for refined evaluation of neural code completion approaches

Code completion has become an indispensable feature of modern Integrated Development Environments. In recent years, many approaches have been proposed to tackle this task. However, it is hard to compare the models without explicitly re-evaluating them, due to differences in the benchmarks used (e.g. datasets and evaluation metrics). Besides, almost all of these works report the accuracy of code completion models as aggregated metrics averaged over all types of code tokens. Such evaluations make it difficult to assess the potential improvements for particularly relevant types of tokens (e.g. method or variable names) and blur the differences between the performance of the methods. In this paper, we propose a methodology called Code Token Type Taxonomy (CT3) to address the issue of using aggregated metrics. We identify multiple dimensions relevant for code prediction (e.g. syntax type, context, length), partition the tokens into meaningful types along each dimension, and compute individual accuracies by type. We illustrate the utility of this methodology by comparing the code completion accuracy of a Transformer-based model in two variants: with closed and with open vocabulary. Our results show that the refined evaluation provides a more detailed view of the differences and indicates where further work is needed. We also survey the state of the art of Machine Learning-based code completion models to show that there is a demand for a set of standardized benchmarks for code completion approaches. Furthermore, we find that the open vocabulary model is significantly more accurate for relevant code token types such as usages of (defined) variables and literals.


Introduction
Code completion is a widely used feature of modern IDEs, which offers the most likely next token based on the code already present up to the cursor position (Kim et al. 2021). This feature not only helps developers save typing effort, but also assists them in learning new libraries, as it offers information about available functions or attributes. Machine learning (ML) approaches for code completion are leading the field; in particular, Transformer models excel here by outperforming recurrent neural networks (RNNs). Multiple state-of-the-art solutions use Transformers with variations of the code representation and/or the attention mechanisms (Kim et al. 2021; Liu et al. 2020; Svyatkovskiy et al. 2020).
However, most of the proposed code prediction approaches use aggregated metrics (i.e. averaged over all types of code tokens) to evaluate their accuracy (e.g. Li et al. 2018; Wang and Li 2021; Chirkova and Troshin 2021). This eliminates valuable information about the improvements for relevant code token types and, as a consequence, makes it more difficult to compare approaches or identify weaknesses of a method. A typical case of this loss of information is that prediction results for variables and function/method calls contribute to the overall score together with less demanding, easily predictable tokens like keywords and standard library names. Figure 2 presents an example of the detailed evaluation gained by using refined accuracy, in comparison to the traditional method of using aggregated accuracy in Fig. 1. We used a code snippet from the Python150k 1 dataset and omitted long strings for simplicity. The prediction results are obtained from experiments of Schumacher et al. (2020). The incorrect predictions are highlighted in these figures by double-dashed underlines.
We also use various colors as an additional means of discrimination. Statistics of correct and incorrect predictions are presented in Figs. 1b and 2d. Since we only use a small code snippet as an example to demonstrate the advantage of the refined accuracy, the following interpretations are conjectures; a valid evaluation of the completion model would have to be conducted on a proper dataset.
When using aggregated accuracy, there are no hints for developers about where to improve the accuracy or how to inspect the drawbacks of the used method. Meanwhile, using refined accuracy gives a more detailed view of the completion results along multiple dimensions (i.e. aspects of code token properties).
Firstly, considering the code tokens based on their purpose, we identified five token types for the code snippet as displayed in Fig. 2a: (i) attribute tokens in bold, (ii) variable usages in italics, (iii) method calls in bold and italics, and (iv) argument definitions and (v) method definitions, each with their own highlighting. The statistics in Fig. 2d show that nine out of 20 predicted tokens are variable usages and seven of those are suggested correctly. This can be considered an advantage of the used completion model, since predicting usages of identifiers is in general significant for developers (Hellendoorn et al. 2019). However, the prediction of the other four token types still needs to be improved, especially for method calls with no true completions.

The next dimension analyzed in the example is token length. We consider code tokens with fewer than 11 characters as short or medium tokens and format them in bold and italics in Fig. 2b. The rest of the predicted code tokens are long tokens and marked in italics. The analysis results in Fig. 2d reveal that more than 80% of the correct predictions belong to short/medium tokens, which help only slightly in reducing the typing effort of developers in the example.
Finally, we inspected the frequency of code tokens based on their number of repetitions in the example. The highly frequent tokens in Fig. 2c (i.e. self) are highlighted in bold. Figure 2d shows that nearly half of the correct predictions are highly frequent tokens, which represent one of the easy cases of code completion.
The above analysis brings valuable information which cannot be obtained when using aggregated accuracy. It leads to the idea of focusing on method calls and long, low-frequency tokens for improving the completion model, since predictions of other code token types already have acceptable accuracies. This example illustrates that aggregated metrics can make evaluation and development difficult by omitting important information.
To our knowledge, only a few previous works consider token categories and evaluate the metrics (e.g. mean reciprocal rank, error rate) per category (Kim et al. 2021; Bielik et al. 2016). However, the token subdivision (e.g. attribute access, numeric constant, string, variable/module name, and expression) in these works is rather crude, since they only focus on some cases of token purposes and disregard other dimensions (e.g. length and frequency). Besides, it is difficult to apply their taxonomy to other works without re-implementation (see Sect. 2.2). We aim to provide more refined types of code tokens which can be used for evaluating existing and future code completion approaches with minimum effort. Our contributions are as follows:

• We review the state of the art of ML-based code completion approaches (mostly from 2018 to 2021), not only to substantiate the prevalence of aggregated metrics but also to give an overview of the current research progress in this field. Despite using the same modeling methods (e.g. Transformers) for the code completion task, it is still hard to compare models without explicitly re-evaluating them, due to differences in input representations (e.g. sequences of tokens or Abstract Syntax Trees), the used datasets, evaluation metrics (e.g. accuracy, MRR), and the scope of prediction (e.g. next n code tokens or block of code).

• To obtain a refined evaluation of code completion accuracy, we introduce a methodology called Code Token Type Taxonomy (CT3) by proposing multiple dimensions for identifying code token types. For each dimension, we propose the types by analyzing the Abstract Syntax Tree (AST) and the relationships between tokens in the AST. CT3 can be used for a comprehensive comparison between approaches, to gain a detailed view of the impact of each component in a prediction model, and to identify model challenges.
• We demonstrate the utility of this methodology by conducting an empirical study on the Python150k 2 dataset with a Transformer-based code completion approach. We compare the impact of using a closed vocabulary vs. an open vocabulary (Karampatsis et al. 2020), and find significantly better accuracy of the latter for relevant token types.

• To facilitate reproducibility and reuse of our methodology, we publish the Python150k dataset with pre-computed token types according to CT3. 3 The scripts of our experiments and the CT3 source code 4 are also available.

The rest of the paper is organized as follows. Section 2 outlines background and related work, while Sect. 3 describes our approach. The experimental evaluation is discussed in Sect. 4. We summarize the challenges of CT3 and threats to validity in Sect. 5. Additional statistics and results of tuning experiments are presented in Sect. 6. Finally, we conclude in Sect. 7.

Background and related work
In this section, we present the state of the art of code completion, followed by an overview of the usage of aggregated metrics for evaluating code completion models in previous research. We end the section with a brief introduction to the Out-of-Vocabulary (OOV) issue and its current possible solutions. Table 1 summarizes the notable works in code completion, mostly from 2018 to 2021. We divide the prediction level (the fourth column of the table) into: (i) token, the possible next token, (ii) tokens, the next n tokens up to the end of the code line, (iii) code line, the entire code line, (iv) construct, specific code constructs, e.g. an if condition, and (v) block, code blocks, e.g. a for loop or an entire function. The programming languages used in datasets are documented in the Dataset column (except for the widespread datasets Python150k and JavaScript150k 5 ). We refer to the original papers for more details. The column Eval. ter. & non-ter. shows how authors handled the results on terminal and non-terminal nodes in ASTs, which is only applicable when the input data is represented in an AST-related form.
The table is sorted by year and grouped into three parts: recent methods (from RNN to Codex), popular tools (from Kite to GitHub Copilot), and two possible solutions for the OOV issue (the last two rows of the table). Due to space limits, citations for methods and tools are omitted from the table and presented in the following subsections. Besides, further information (e.g. evaluation results, handling of type and value of nodes in ASTs) is included in an online version of Table 1. 6
Recent works such as IntelliCode Compose (Svyatkovskiy et al. 2020) and Feeding Trees to Transformers (Kim et al. 2021; Liu et al. 2020) use Transformer models (Vaswani et al. 2017), which outperform RNNs. One of the reasons is that Transformers better capture long-term dependencies. There are also several empirical studies on the capabilities of ML-based code completion models which use Transformers and their variants for experiments, including Transformers for Source Code (Chirkova and Troshin 2021), BERT for Code Completion (Ciniselli et al. 2021b), and Transformers for Code Completion (Ciniselli et al. 2021a).
Generative Pre-trained Transformer 2 and 3 (GPT-2 and GPT-3) are also well-known methods in this domain. Kite 7 and TabNine, 8 which are based on GPT-2, are noteworthy tools supporting code completion for multiple programming languages. Chen et al. (2021) introduced Codex, a GPT model fine-tuned on publicly available code, which powers GitHub Copilot. There are also works that focus on modeling methods other than Transformers. Wang and Li (2021) proposed a model named CCAG based on a graph neural network (GNN) to fully capture the sequential and repetitive patterns of code, together with the structural information in ASTs. The evaluation results show that CCAG outperforms state-of-the-art approaches, including Transformers.
In general, ML-based models for code completion have recently attracted a lot of attention. However, it is still hard to fairly compare the models, since many approaches use proprietary evaluation benchmarks. Two models are comparable only if they at least use the same evaluation metrics and datasets. For instance, considering the models using the accuracy metric and the Python150k dataset, we can narrow down the list of models in Table 1 to Pointer Mixture Network, Extended Network, and CCAG. These models have the same prediction level, a similar form of input data, and the same way of separating results for terminal and non-terminal nodes. Still, Extended Network and CCAG cannot be compared to each other, since they use different experimental setups and different sizes of test datasets. The two remaining comparisons are presented explicitly in the respective papers by re-evaluating all models. Therefore, there should be a set of standardized benchmarks for code completion models to make the models comparable with a reasonable effort.

Table 1 shows that almost all prediction results use aggregated metrics to evaluate the accuracy (i.e. averaging over all code token types). Exceptions are Kim et al. (2021) and Bielik et al. (2016), who provide a rough analysis, which is indicated by value categories in column Eval. Metrics of Table 1. However, their evaluations only focus on some general values of code token purposes.

Aggregated metrics and refined metrics for evaluation
Hellendoorn et al. (2019) reveal that there are large differences between the code completions that are relevant from the point of view of developers and the distributions of completions in synthetic datasets. Their results and insights inspired us to determine criteria for a refined evaluation in code completion. Table 2 presents the criteria together with their corresponding motivations.
For instance, in contrast to punctuation-like tokens, identifier tokens represent the code completions most relevant for developers (Hellendoorn et al. 2019). This finding motivated us to analyze the syntax type information of code tokens in prediction results. Moreover, based on our experimental observations (see Sect. 4.3), it is hard to predict identifier names in declarations due to the arbitrariness of identifiers, but only a small portion of code tokens in the dataset falls into this syntax type. In other words, distinguishing between definition and usage of identifiers is essential. Therefore, the values of the dimension syntax type should not only cover token purposes but also support the division between definition and usage of identifiers. Similarly, Hellendoorn et al. (2019) also highlight the importance of local context, typing effort, origin, and rare completions (i.e. long and not too frequent tokens). These motivate our criteria for the dimensions context, length, origin, and frequency, respectively, as expressed in Table 2. Analyzing completion results with aggregated metrics clearly fails to satisfy these criteria.
To clarify that none of the prior works provides evaluations with a sufficient level of detail, Table 3 presents a qualitative comparison between the methods of Kim et al. (2021), Bielik et al. (2016), and our methodology, based on the defined criteria. We consider one additional criterion concerning the implementation of these approaches (last column of Table 3): the refined evaluation should be implemented in a way that token type data can be combined with completion logs obtained from prediction models with a reasonable effort.
Bielik et al. (2016) only considered some specific values in the dimension syntax type (e.g. identifier, property, and number). They lacked a refined categorization of identifiers (e.g. function, variable, and class) as well as the separation between definition and usage of identifiers. Besides, the authors implemented an individual predictor for each token type, which makes it hard to apply their evaluation approach to other prediction models.
Meanwhile, Kim et al. (2021) provided subcategories for identifiers (i.e. attribute access, variable/module name, and function parameter name). However, the subdivision is rather crude and still leaves out several important types, e.g. function/method definition, class definition, and imported library.

(Legend of Table 3: '−' the criterion is not supported; '+' the methodology considers some values in the dimension but does not meet all criteria of the dimension; a check mark: the criterion is fundamentally accomplished; a: with a reasonable effort.)

The published source code from their paper 10 shows that the authors might group attribute, class usage, and method usage into the attribute category. Similarly, variable/module name might contain variable, class usage, and function usage. Grouping different code token purposes into one category might omit crucial information because of the diverse effect of various code token types on the efficiency of completion models (see Sect. 4.3). In addition, the authors also examined some values in the dimension context but still ignored some essential contexts, e.g. in-class-definition or in-function-definition. Other dimensions (i.e. origin, length, and frequency) are not supported in their methodology. Ultimately, the authors provided scripts to identify token IDs for each of their token types, which can be applied to other completion models. However, these Python scripts comprise only 119 LOC and consider just two categories, i.e. the dimension syntax type with four values and the dimension context with six values. In addition, these values are derived directly from the value or type of a target node in an AST, regardless of whether the node is a terminal or non-terminal node.
In this paper, we propose a methodology to effectively evaluate code completion models while satisfying all the mentioned criteria. Our prototypical implementation for Python covers 18 values for the dimension syntax type, 16 values for the dimension context, and four values for the origin dimension; the dimensions length and frequency both contain three distinct values. The syntax type dimension comprises significant token types as well as a subdivision into definition and usage of identifiers.
We construct our implementation in an extensible way, in terms of the number of dimensions, the number of values of each dimension, and independence of prediction models. Furthermore, values of different dimensions can be combined to create even more specific code token types (e.g. determining usages of imported libraries requires both the syntax type and origin dimensions, see Sect. 5.1). Besides, we provide a sophisticated analysis for each dimension value by considering not only the target node but also its neighbors and ancestors along the path to the root node in the AST. We separate terminal and non-terminal nodes, since we only evaluate code tokens that developers will receive from a completion model (i.e. values of terminal nodes). Moreover, we publish pre-computed data of token types, which can be combined with completion logs created by various prediction models that use ASTs as input data.
The refined evaluation results obtained from our methodology can facilitate the comparison and improvement of prediction models for relevant cases. A detailed explanation with examples of the proposed approach is presented in Sect. 3.1.

Out-of-vocabulary issue
An important aspect of the prediction approaches is code representation. While some works use as input a sequence of AST nodes linearized by a tree traversal (Bielik et al. 2016; Li et al. 2018; Schumacher et al. 2020), more recent approaches attempt to capture the high-level structural representation (Alon et al. 2020; Kim et al. 2021). Chirkova and Troshin (2021) indicate that only syntactic information is needed to make meaningful predictions.
Another crucial factor of code representation is capturing the code identifiers. Even with coding conventions, identifiers defined by software developers can be overly diverse and complex. As a result, prediction models might have to be trained with an exceedingly large and sparse vocabulary. Furthermore, if the size of the vocabulary is limited, rare words in the training set are unlikely to be included in the vocabulary and are therefore hard to predict. This is known as the out-of-vocabulary (OOV) issue.
With the original method, the vocabulary of a prediction model is constructed as a fixed set of tokens, commonly based on the frequency of tokens in the dataset, which creates the OOV issue. Karampatsis et al. (2020) address this issue with an open vocabulary model based on Byte-Pair Encoding (BPE), in which rare tokens are split into more frequent subtokens. Chirkova and Troshin (2020) introduced another approach to deal with the OOV issue by anonymizing all the OOV tokens with special placeholders. However, since these approaches use different types of input representation and datasets, the comparison of their efficiency needs further investigation and is out of the scope of this paper.
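The closed-vocabulary construction described above can be sketched in a few lines; the `<UNK>` placeholder name and the tiny token list are illustrative only, not taken from any of the cited implementations:

```python
from collections import Counter

def encode_closed_vocab(tokens, max_size):
    """Sketch of closed-vocabulary encoding: keep only the most frequent
    tokens and map everything else to an <UNK> placeholder -- this frequency
    cutoff is exactly what creates the OOV issue for rare identifiers."""
    vocab = {tok for tok, _ in Counter(tokens).most_common(max_size)}
    return ['<UNK>' if t not in vocab else t for t in tokens]

# a rare identifier falls out of a size-2 vocabulary
print(encode_closed_vocab(['x', 'x', 'y', 'rare_name'], max_size=2))
# ['x', 'x', 'y', '<UNK>']
```

An open vocabulary avoids this cutoff by splitting rare tokens into subtokens that do fit in the vocabulary.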

Methodology for a refined evaluation
We divide this section into two parts: (i) a general workflow of implementing and utilizing CT3, and (ii) a comprehensive explanation of each dimension and its values in the CT3 schema used for Python. Figure 3 illustrates the implementation and usage of CT3. Given a programming language, we identify the code token properties relevant for effective code completions. This gives rise to multiple dimensions, i.e. criteria for partitioning, and the token types within each dimension, i.e. a complete subdivision of all tokens into types. Table 4 shows such a schema for Python.

A general workflow
In the next step, a static code analyzer is implemented. Given a dataset (e.g. Python150k), it assigns to each code token a type for each dimension. Table 5 displays an example of CT3 information for code tokens in the Python150k dataset. The values of the dimensions Syntax Type, Origin, Length, and Frequency are represented by their indices in the lists of possible values (list indices start at 1, see Table 4). A special value −1 is used to indicate that the code token is a non-terminal node or that its token type is not under our consideration. The values of the Context dimension are displayed as a list of tuples (context_index, quantity), which is explained later on. The value −1 is also used to denote the context value of non-terminal nodes. However, if a terminal node does not belong to any context we are focusing on (see Table 4), then its context value is [0]. Besides, we also record the indices of nodes in an AST and the AST indices in a dataset (for easily tracking back to the original code files).

For instance, the code token ClassDef is a non-terminal node, so its token types are −1 for all dimensions. Meanwhile, the token tabs is a terminal node and a base class in a class definition (i.e. class NetworkProfileTab(tabs)). Its token types are: 4/class_usg for Syntax Type, 2/from_extlib for Origin, 2/medium for Length, and 1/high_frequent for Frequency. Ultimately, the value [(4, 1), (11, 1)] of the Context dimension illustrates that tabs is in the context of in_class_def once and in_parameter once. More details of the CT3-data and the storage formats are presented in our published dataset. 11

Continuing the workflow in Fig. 3, after applying the analyzer, we combine the results with the logs of code completions provided by a prediction model (see the middle part of Fig. 3). The result is a log file which contains each code token's CT3-data and information on the correctness of the completion. Table 6 presents an example of this log file.
The prediction results are obtained from Transformer-based models with closed vocabulary (i.e. first and second columns) and with open vocabulary (i.e. third and fourth columns).
For example, the token tabs is predicted correctly in both the closed and open vocabulary cases, while only the first part of the django.utils.translation token can be predicted by the open vocabulary variant. The logs also contain special symbols such as <UNKNOWN>.
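As a sketch, the per-token CT3 record described above could be represented as follows; the class and field names are hypothetical, but the encoding (indices into the value lists, −1 as the non-terminal sentinel, context as (context_index, quantity) tuples) follows the description of Table 5:

```python
from dataclasses import dataclass, field

@dataclass
class CT3Record:
    """Hypothetical container mirroring one row of CT3-data. A value of -1
    marks a non-terminal node or a type outside consideration; context is a
    list of (context_index, quantity) tuples."""
    syntax_type: int = -1
    origin: int = -1
    length: int = -1
    frequency: int = -1
    context: list = field(default_factory=lambda: [-1])

# the `tabs` token from the example: class_usg (4), from_extlib (2),
# medium (2), high_frequent (1), in in_class_def once and in_parameter once
tabs = CT3Record(syntax_type=4, origin=2, length=2, frequency=1,
                 context=[(4, 1), (11, 1)])
```

The default record (all −1) then corresponds to a non-terminal node such as ClassDef.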

Table 6: An example of a combined log of CT3-data and prediction results for code tokens in the Python50k dataset.

Table 6 also shows that CT3-data can be combined with any completion logs, as long as they specify the order of evaluated code tokens in the dataset. 12 This makes our methodology more flexible than the related works mentioned in Sect. 2.2.
In the final step of the workflow, the log is partitioned according to the types per dimension, and the desired evaluation metric (e.g. MRR, accuracy) is computed for each type (per dimension). Examples for this step are discussed in Sect. 4.3. Our prototypical implementation for Python analyzes the abstract syntax tree (AST) and the relationships between its nodes. It is worth recalling that we consider only terminal (leaf) nodes in the AST.
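This partitioning step can be sketched as a simple group-by over the combined log; the dict-based row format below is an assumption for illustration, not the actual log format:

```python
from collections import defaultdict

def accuracy_by_type(log, dimension):
    """Group a combined log by the token type of one dimension and compute
    the per-type accuracy (sketch; rows are assumed to be dicts)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in log:
        token_type = row[dimension]
        totals[token_type] += 1
        hits[token_type] += row['correct']
    return {t: hits[t] / totals[t] for t in totals}

log = [
    {'syntax_type': 'var_usg', 'correct': 1},
    {'syntax_type': 'var_usg', 'correct': 1},
    {'syntax_type': 'method_usg', 'correct': 0},
]
print(accuracy_by_type(log, 'syntax_type'))
# {'var_usg': 1.0, 'method_usg': 0.0}
```

The same grouping applied to the context, origin, length, or frequency columns yields the refined accuracies per dimension.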

CT3 schema for Python
In the following paragraphs, we discuss the CT3 schema used for Python (see Table 4). A detailed explanation with examples for each value of the Syntax Type and Context dimensions is presented in Appendix A.1.
Syntax Type refers to the syntactic category of a token in the source code. The values of this dimension (i.e. first column in Table 4) can be generalized for various programming languages and offer information regarding the code token's purpose. The majority of these values describe identifiers since predicting them is the most relevant function of code completion in practice (Hellendoorn et al. 2019).
Syntax type values are derived from the syntactic information of the source code and the patterns in ASTs. The difficulty of identifying these token types varies greatly: simpler types mostly depend on conditions over AST node types, while complex types are aggregations of conditions which identify feature-specific AST patterns. Currently, we define 18 values for this dimension.
The arg_def token type refers to code tokens which are arguments in definitions of functions or methods. These code tokens can be regular arguments, keyword arguments, or *args and **kwargs in Python. Code tokens which are attributes of packages or classes have attribute as their syntax type.
Definitions and usages of classes are denoted by the class_def and class_usg token types, respectively; analogously for functions, methods, and variables (i.e. func_def and func_usg, method_def and method_usg, var_def and var_usg, respectively). While the token types func_usg and method_usg indicate function and method calls, the type class_usg expresses usages of defined classes, e.g. a class instantiation or a class being used as an inheritance parameter. The var_usg type comprises usages of defined variables, declared arguments, imported libraries, and function keywords in function/method calls.
In the code snippet in Listing 1, the tokens get_info and std_id in line 1 are identified as the func_def and arg_def types. While the variables student (line 2), s_name (line 3), and s_class (line 4) are classified as the var_def token type, the std_id (line 2) and student (line 3) variables are grouped into the var_usg type. The tokens Student and StudentClass are labeled as the class_usg type. Ultimately, the attribute token type is used to represent the remaining tokens, i.e. profile and name in line 3 as well as info and name in line 4.

Listing 1: Example for the dimension Syntax Type

The syntax type dimension also contains numeric and string constants (the const_num and const_str token types) and keywords defined for the target programming language (the keyword type). Code tokens referring to exceptions defined in try/except blocks are assigned to the exception token type, which covers all built-in exceptions. Lastly, the token types imp_lib, imp_sublib, and imp_alias are used to identify imported libraries, imported sublibraries, and their aliases, respectively. The 18th value of the dimension (i.e. unknown) indicates tokens which cannot be categorized under the other values. Most of these tokens are empty lists of parameters, which are still present in ASTs but do not affect the code completion accuracy.
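A heavily simplified sketch of such syntax-type labeling with Python's ast module is shown below; unlike the full CT3 analyzer, it maps every loaded name to var_usg and ignores classes, methods, attributes, and imports. The snippet is a shortened stand-in for Listing 1, and lookup is a hypothetical helper:

```python
import ast

def classify(source):
    """Heavily simplified syntax-type labeling (a sketch, not the CT3
    analyzer): function definitions, their arguments, and plain names only."""
    labels = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            labels.setdefault(node.name, 'func_def')
            for a in node.args.args:
                labels.setdefault(a.arg, 'arg_def')
        elif isinstance(node, ast.Name):
            # first occurrence wins: a definition (Store) seen before any
            # usage (Load) labels the token var_def
            kind = 'var_def' if isinstance(node.ctx, ast.Store) else 'var_usg'
            labels.setdefault(node.id, kind)
    return labels

snippet = (
    "def get_info(std_id):\n"
    "    student = lookup(std_id)\n"
    "    s_name = student.profile.name\n"
    "    return s_name\n"
)
print(classify(snippet))
# {'get_info': 'func_def', 'std_id': 'arg_def', 'student': 'var_def',
#  's_name': 'var_def', 'lookup': 'var_usg'}
```

The real analyzer additionally inspects neighbors and ancestors in the AST to separate, e.g., func_usg from var_usg.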
Context describes the surrounding code structures (e.g. loop body, condition expression) in which the token is found. The context values (i.e. second column in Table 4) aim to reflect the local context, which plays a large role in code completions (Hellendoorn et al. 2019). We propose 16 values corresponding to different levels of code structure.
The values in_arithmetic_op, in_assign, in_bool_op, and in_comparison indicate code tokens found in arithmetic, assignment, boolean, and comparison operations, respectively. Code tokens located in class or function/method definitions are represented by the context values in_class_def or in_func_def. The parameters used in these definitions, as well as in function/method calls, are assigned to the in_parameter value.
Code tokens used in if or else statements are labeled with the in_if or in_else values; analogously for the values in_try, in_except, and in_raise. The context values in_for and in_while refer to tokens occurring in loop structures (i.e. for and while). Finally, the in_return value expresses tokens in return statements and the in_with value indicates code tokens in with statements.
Since the code structures represented by these context values can be nested, we record how many contexts of a given value each code token is included in. Listing 2 illustrates the token var_example (line 7), which is in the context of in_class_def, in_func_def, in_if, in_for (twice), in_try, in_parameter, and in_arithmetic_op. We provide comprehensive examples for each context value in Appendix A.1.
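Counting nested contexts can be sketched by walking the AST with a stack of enclosing structures; the mapping below covers only statement-level contexts (operation contexts such as in_arithmetic_op and in_parameter are omitted for brevity), and the example snippet is illustrative, not Listing 2:

```python
import ast
from collections import Counter

# AST node classes mapped to (assumed) CT3 context labels
CONTEXT_OF = {
    ast.ClassDef: 'in_class_def', ast.FunctionDef: 'in_func_def',
    ast.If: 'in_if', ast.For: 'in_for', ast.While: 'in_while',
    ast.Try: 'in_try', ast.Return: 'in_return', ast.With: 'in_with',
}

def contexts_of(source, target):
    """Count how many contexts of each value enclose the `target` name."""
    found = {}
    def visit(node, stack):
        label = CONTEXT_OF.get(type(node))
        if label:
            stack = stack + [label]
        if isinstance(node, ast.Name) and node.id == target:
            found.update(Counter(stack))  # counts for this occurrence
        for child in ast.iter_child_nodes(node):
            visit(child, stack)
    visit(ast.parse(source), [])
    return found

code = (
    "class C:\n"
    "    def f(self):\n"
    "        for i in range(2):\n"
    "            for j in range(2):\n"
    "                if i:\n"
    "                    x = var_example + 1\n"
)
print(contexts_of(code, 'var_example'))
# {'in_class_def': 1, 'in_func_def': 1, 'in_for': 2, 'in_if': 1}
```

The doubled in_for count shows how nesting is recorded as a quantity rather than a boolean.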

Listing 2: Example for the dimension Context
Origin labels (i.e. third column in Table 4) indicate the location where an identifier or a keyword is defined. A code token's origin can provide valuable information about its purpose and importance. While built-in code tokens (e.g. keywords or methods available without importing) are frequently used and have a rather general purpose, code tokens originating from within the same file serve a specific purpose which is closely related to the task the code solves.
The from_builtin label represents built-in code tokens, which do not require an explicit import such as keywords (e.g. True/False). Tokens categorized as from_extlib originate from external (non-standard) libraries or packages. Identifiers defined in the same file have from_infile as their origin label. Ultimately, the from_stdlib label refers to identifiers defined in standard libraries.
The above labels are obtained by analyzing the import statements in each AST. Built-in code tokens are those within a predefined python_keyword set. Tokens from within the file are determined by exclusion, as these tokens are neither built-in nor from the standard library or an external library. Code tokens appearing as attributes of a particular library are categorized according to the origin information of that library. For example, considering code line 4 in Listing 1, if the token StudentClass is assigned the from_infile label, then the info and name tokens also have from_infile as their origin label.
Length describes the number of characters in a code token. This dimension is motivated by the fact that long code tokens benefit more from code completions than short ones, since they save more typing effort (Hellendoorn et al. 2019). The length also correlates with the importance of a code token. Short tokens usually hold temporary values (e.g. i as a loop counter), which are less significant.
A code token is labeled as short if it has up to 3 characters, the label medium is used for 4 to 10 characters, and the label long indicates tokens longer than 10 characters. These thresholds are derived from the statistics of the terminal length distribution in the Python150k dataset (see Sect. 6.1). Around 19.7% of terminal tokens in the dataset have a length of 4 characters, and this fraction decreases dramatically for greater lengths. Besides, among the top 15 most common (terminal) token lengths in the dataset, most are between 4 and 10 characters.
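These thresholds translate directly into a small labeling function:

```python
def length_label(token):
    """Label a token's length per the thresholds above:
    <= 3 characters short, 4-10 medium, > 10 long."""
    n = len(token)
    if n <= 3:
        return 'short'
    return 'medium' if n <= 10 else 'long'

print(length_label('i'), length_label('student'), length_label('NetworkProfileTab'))
# short medium long
```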
Finally, Frequency refers to a code token's frequency relative to the frequency distribution of all code tokens within each individual AST. We use three values here: low, medium, and high, based on intervals explained in Fig. 4. Long and frequent code tokens are likely to be significant. On the other hand, while short code tokens can be frequent, in general they carry insignificant (e.g. temporary) values.
The three intervals specifying the frequency of occurrence as low, medium, and high are adjusted based on the min, mean, and max values of the frequency distribution of all tokens within the AST. Equations (1a) to (1e) show the calculation for these intervals. Except the high_interval, which is a closed interval, the low_interval and medium_interval are half-open intervals (i.e. the higher endpoints are not included, see Eqs. (1c) and (1d)).
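Equations (1a) to (1e) are not reproduced here; as a rough illustration of the idea, the sketch below partitions per-AST token frequencies into three intervals derived from the min, mean, and max of the frequency distribution. The midpoint boundaries are a hypothetical choice, not the paper's actual endpoints; only the half-open/closed structure follows the text.

```python
# Sketch of per-AST frequency labeling. The boundary formulas b1 and b2
# are illustrative assumptions; the actual endpoints are defined by
# Eqs. (1a)-(1e) in the paper.
from collections import Counter

def frequency_labels(tokens):
    counts = Counter(tokens)
    freqs = counts.values()
    lo, hi = min(freqs), max(freqs)
    mean = sum(freqs) / len(counts)
    b1 = (lo + mean) / 2   # assumed boundary between low and medium
    b2 = (mean + hi) / 2   # assumed boundary between medium and high

    def label(tok):
        f = counts[tok]
        if f < b1:          # low_interval: half-open
            return "low"
        if f < b2:          # medium_interval: half-open
            return "medium"
        return "high"       # high_interval: closed

    return {tok: label(tok) for tok in counts}
```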

Open vocabulary for transformers
Since Transformers recently gained a lot of attention with promising results on the code completion task, we chose them as the basis of our evaluation. To evaluate how using an open vocabulary enhances the original Transformers, or more generally, to evaluate whether CT3 is beneficial for improving code completion models, we performed two steps. Firstly, we implemented a Transformer-based code completion approach in two variants: with a closed vocabulary model (i.e. the Transformer learns on a fixed set of strings), and with an open vocabulary model using Byte-Pair Encoding (BPE) as suggested by Karampatsis et al. (2020). In the latter variant, each token can be encoded as several subtokens (potentially even single letters). Secondly, we evaluated the results using both the aggregated accuracy and the CT3 data to observe what additional information can be gained by utilizing CT3.
It is essential to note that the open vocabulary model uses greedy search to find the next possible subtoken. We consider a prediction in this model correct only if all subtokens of the original token are suggested accurately. As a consequence, the accuracy after training (which only scores the next subtoken in the sequence) is typically higher than the accuracy obtained in the evaluation step (see Sect. 6.3).
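The strict evaluation rule can be sketched as follows; `predict_next_subtoken` is a hypothetical stand-in for the trained model's greedy decoding step, and the token counts as correct only if every subtoken matches:

```python
# Sketch of the open-vocabulary evaluation rule: a token prediction is
# correct only if every subtoken, decoded greedily one at a time, matches
# the gold subtoken sequence.
def token_correct(context, gold_subtokens, predict_next_subtoken):
    for gold in gold_subtokens:
        pred = predict_next_subtoken(context)
        if pred != gold:
            return False            # one wrong subtoken fails the whole token
        context = context + [gold]  # extend the context and continue decoding
    return True
```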

Research questions
In our evaluation, we address the following two research questions:

Experimental setup
We conduct the experiments on the Python150k and JavaScript150k datasets. The model is fitted on the original training sets (i.e. 100k each). We use Python 3.7.9, TensorFlow 2.3.0, and an open-source Cosy 13 implementation of a Transformer for code completion. Instead of translating between two sequences of tokens in different languages, as the vanilla Transformer does (Vaswani et al. 2017), our task is to predict the most likely next token based on the previous tokens. Therefore, the Transformer model used in our experiments features the same architecture as the encoder of the vanilla Transformer, i.e. a stack of identical layers where each layer has two sub-layers. We use six layers and six heads like other related works (Kim et al. 2021; Chirkova and Troshin 2021; Ciniselli et al. 2021b, a). Since these related works used various values for the embedding size (e.g. 256, 300, and 384), we choose a 300-dimensional embedding, which is the commonly used embedding size of the Word2Vec model (Mikolov et al. 2013). Besides, we also select 300 as the size of the hidden layers (i.e. hidden units).
Our experiments include main and tuning experiments. The former focus on emphasizing the benefits of refined evaluations by using the proposed methodology to compare completion results obtained from the closed and open vocabulary models. The latter assess the effects of some parameters (e.g. token length and window size) on prediction accuracy. To avoid distraction from the main experiments and their results, we present the tuning experiments at the end of the paper (see Sect. 6).
Data preprocessing The code token and subtoken sequences are generated by traversing the ASTs in Depth-First Search (DFS) order. We eliminate all white spaces, tabs, and new lines to reduce noise before collecting tokens for building the encoders and creating the input files for our Transformer model. Each type or value in the ASTs is represented as an AST node.
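The linearization step can be sketched as a pre-order traversal. For simplicity, the sketch assumes a nested-dict node with a `type`, an optional `value`, and a `children` list (the actual Python150k format stores children as indices into a flat node list, so this is an illustrative simplification):

```python
# Sketch of AST linearization: a pre-order (DFS) walk that emits the
# node type, then the (whitespace-stripped) value if present, then the
# children. The nested-dict node layout is an assumption for clarity.
def ast_to_tokens(node):
    tokens = [node["type"]]
    if "value" in node:
        tokens.append(node["value"].strip())  # drop surrounding whitespace
    for child in node.get("children", []):
        tokens.extend(ast_to_tokens(child))
    return tokens
```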
Building closed and open vocabulary There are special cases that require reserved values in the vocabulary. We use the <UNKNOWN> token to indicate tokens that cannot be encoded by the vocabulary. The <PADDING> token is used to fill up the data windows in the data files (explained below). Finally, the <ENDTOKEN> token marks the end of a sequence of subtokens and is therefore only applicable in the open vocabulary case. In other words, if n is the vocabulary size, which is a hyperparameter defined before building the vocabulary, then n −2 and n −3 are the numbers of regular tokens/subtokens in the closed and open vocabulary, respectively.
The closed vocabulary is assembled based on the frequency of the code tokens in the training dataset. The vocabulary size (e.g. 10,000) determines the number of most frequent tokens selected for the vocabulary after subtracting the two special tokens mentioned above. In this way, rare and/or very long tokens hardly contribute to the vocabulary. Hence, a length threshold is not necessary for building the closed vocabulary.
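Closed-vocabulary construction reduces to a frequency count; the sketch below (the `build_closed_vocab` helper is ours, for illustration) reserves two slots for the special tokens and keeps the n−2 most frequent training tokens:

```python
# Sketch of closed-vocabulary construction: keep the (vocab_size - 2)
# most frequent training tokens, reserving two slots for the special
# tokens <UNKNOWN> and <PADDING>.
from collections import Counter

def build_closed_vocab(token_stream, vocab_size):
    counts = Counter(token_stream)
    most_common = [tok for tok, _ in counts.most_common(vocab_size - 2)]
    return ["<UNKNOWN>", "<PADDING>"] + most_common
```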
For the open vocabulary built with the BPE approach, we use HuggingFace Tokenizers. 14 All tokens in the training dataset are pre-tokenized to build a set of unique words. The frequency of each word is then calculated based on its occurrences in the training set. The open vocabulary first encompasses all characters in the set of unique words and is then expanded by merge rules: the pair of adjacent symbols whose merge has the highest frequency, compared to all other combinations, is merged into a new vocabulary entry. This process continues until the vocabulary reaches the desired size.
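The merge rule can be illustrated with a toy BPE learner. This is not the HuggingFace implementation; it is a minimal, self-contained sketch of the algorithm the text describes, operating on a word-frequency dictionary:

```python
# Toy BPE illustration: repeatedly merge the adjacent symbol pair with
# the highest corpus frequency until the vocabulary reaches the target
# size. Not the HuggingFace implementation, just the core merge loop.
from collections import Counter

def learn_bpe(words, target_vocab_size):
    # words: {word: frequency}; each word starts as a tuple of characters
    corpus = {tuple(w): f for w, f in words.items()}
    vocab = {ch for w in corpus for ch in w}
    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        vocab.add(a + b)
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)   # apply the learned merge rule
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
    return vocab
```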
All tokens participate in forming the vocabulary, which leads to the risk of increasing noise due to very long and rare tokens. To address this issue, we experimented with several thresholds of token length, including unlimited length down to 200, 100, and 50 (see Sect. 6.2). The obtained results show that limiting token length to 50 can create a BPE vocabulary which contains all single letters and a minimum number of missing non-terminal types, which are rarely used.
Creating input data files To create input data for the Transformer models, i.e. sequences of tokens/subtokens, we traverse the ASTs in the dataset in DFS order. Similarly to building the vocabulary, rare and/or long tokens are most likely encoded as the <UNKNOWN> token and omitted in the evaluation step. Consequently, a length threshold is again not needed in this case.
For the open vocabulary case, we performed additional experiments on the effect of token length on prediction accuracy (see Sect. 6.3). The experimental results show that there should be a threshold for token length when creating data files (e.g. tfrecord files if using TensorFlow). 0.9% of the tokens in the Python dataset are longer than 30 characters; the same applies to 0.3% of the tokens in the JavaScript dataset. Together with other reasons (e.g. length distribution and accuracy after training, see details in Sect. 6.3), we consider 30 as the length threshold for creating our tfrecord files. In addition, while vanilla Transformer models can theoretically handle arbitrarily long input sequences, in practice all input data fed to the model should have the same length, with <PADDING> tokens added depending on the lengths of the input sequences (Vaswani et al. 2017). Determining the length of the input data involves various factors, such as the length of the ASTs and memory capacity. Since the ASTs in the datasets are diverse and some can be exceedingly large, it is infeasible to set a value sufficiently high to embed an AST of any length, especially with limited memory. To integrate large ASTs into the input data, we use the same technique as Kim et al. (2021), which splits large ASTs into smaller chunks with a sliding window while keeping overlaps between windows to preserve context. Each window is defined by a window_size (i.e. the number of tokens/subtokens within the window) and a step_size of the sliding window. To ensure all windows have the same size, <PADDING> tokens are used to pad the last windows of the sequences.
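The sliding-window chunking can be sketched as follows. The `make_windows` helper is an illustrative assumption of how the splitting works, not the actual implementation: consecutive windows overlap by `window_size − step_size` tokens, and the last window is padded to full size:

```python
# Sketch of sliding-window chunking for long (sub)token sequences:
# overlapping windows preserve context, and the final window is padded
# with <PADDING> so all windows have the same size.
def make_windows(seq, window_size, step_size, pad="<PADDING>"):
    windows = []
    for start in range(0, max(len(seq) - step_size, 1), step_size):
        w = seq[start:start + window_size]
        w += [pad] * (window_size - len(w))  # pad the last, shorter window
        windows.append(w)
    return windows
```

With window_size=1000 and step_size=500 (the closed vocabulary setup), each window shares its first 500 tokens with the previous one.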
We use the same experimental setup as Kim et al. (2021) for our closed vocabulary case, with a window_size of 1000 and a step_size fixed to 500. However, these values need to be adjusted for the open vocabulary case to prevent dropping contextual information in the window, since the sequences of subtokens (i.e. with open vocabulary) can be much longer than the original sequences of code tokens (i.e. with closed vocabulary). After performing several experiments (see Sect. 6.4), we selected 2000 and 1000 for the window_size and step_size of the open vocabulary case, respectively.

Experiment configuration Table 7 presents the settings used for the conducted experiments. The model is trained and evaluated on one GPU (GeForce RTX 2080 Ti or TITAN Xp). The preprocessing and training processes take days 15 while the evaluation for each token prediction with the closed and open vocabulary models requires around two weeks (Python dataset) to more than one month (JavaScript dataset) to finish.

Evaluation results
In this section, we present our evaluation results and address the two research questions from Sect. 4.1.

RQ1: The refined evaluation gives more detailed information about the completion models
The experimental results show that evaluating the accuracy of individual token types provides a better view of the prediction approaches than the aggregated metrics. The first line of Table 8 compares the aggregated accuracy results of the closed and open vocabulary models on the Python50k dataset. The refined evaluation conducted on this dataset is presented in part (a) of Figs. 5, 6, 7 and 8 for the dimensions Syntax Type, Origin, Length, and Frequency, respectively. The analysis for the Context dimension is similar to the other dimensions and omitted in this paper. We use the value non_leaf to refer to code tokens that are non-terminal nodes or terminal nodes with a token length greater than 30 (see the length threshold for creating input data files in Sect. 4.2). In part (a) of the mentioned figures, the bars show accuracies while the line shows the relative fraction of each token type over all tokens (i.e. value_frac). The pie charts in part (b) of these figures are another representation of the lines in part (a).
While the aggregated metric indicates that the open vocabulary model increases the prediction accuracy by only 5.88%, the refined evaluation clearly reveals that the open vocabulary model outperforms the closed vocabulary model in every dimension of CT3, with improvements of up to a factor of 2.03. One of the reasons why the aggregated metrics show only a moderate improvement is the token type non_leaf (internal AST nodes and long-token terminal nodes), since it makes up more than half of the test instances but does not benefit much from the open vocabulary model. Figure 5a shows that for the dimension Syntax Type the open vocabulary model achieves a higher accuracy for all token types except for the exception, imp_lib, and keyword types. There is a significant improvement for the class_usg type (95.2%). The accuracies of the types var_usg and const_str, which are two of the most relevant completions in practice, increased by 28.6% and 40.1%, respectively. Other notable enhancements are achieved for the labels attribute (40.2%), func_usg (22.5%), and method_usg (22.3%). Both the closed and open vocabulary models perform quite well for tokens of the keyword type. However, for this type, the closed vocabulary model slightly outperforms the open vocabulary model by 1.2%. For simplicity, the unknown type (0.1% of all tokens) is left out of the bar chart.
The refined evaluation in Fig. 5a also illustrates that it is much harder to predict definition than usage token types. The most impressive evidence for this is the great disparity of accuracy between the func_def and func_usg types. Their accuracies differ by a factor of 7.4 (closed vocabulary model) and 7.8 (open vocabulary model) in favor of the usage token type. The differences of accuracy between code completion for definition and usage of types related to class, method, and variable are also remarkable.
Apart from the non-leaf label in Fig. 5b, the top-5 most frequent values of the Syntax Type dimension consist of var_usg, const_str, keyword, attribute, and method_usg. For instance, 27.3% of terminal tokens or 9.2% of all code tokens are categorized as the usages of variables, arguments, libraries and function keywords (i.e. var_usg). This statistic also emphasizes the importance of predicting already defined identifiers in practice.
In conclusion, the above analysis can inspire the right approaches for improving accuracy, which cannot be derived from the aggregated evaluation. The interpretation for the remaining dimensions works similarly. Figure 6a shows that the open vocabulary model outperforms the closed one for most values of the dimension Origin. The closed vocabulary model only slightly outperforms the open one for the values from_builtin and from_stdlib, by 0.8% and 0.6%, respectively. The notable point is the result for the from_infile value. Even though 72.7% of terminal tokens are located in the same file (Fig. 6b), the accuracy of suggesting these tokens is low when using the closed vocabulary model (ca. 38.6%). A possible reason for this is that the source code file (from_infile), as well as the external libraries (from_extlib), contain an immense variety of tokens defined by developers, which causes the OOV issue in the closed vocabulary model. The open vocabulary model improves the prediction accuracies for the labels from_infile and from_extlib by 28.6% and 40.3%, respectively. However, the results are still lower than those for the from_builtin and from_stdlib labels. This reflects the fact that recommending tokens from the same file or from other source files is harder than recommending tokens from the built-in and standard libraries of the target programming language.
The refined evaluation for the Length dimension is displayed in Fig. 7a. For longer code tokens, the difference between the accuracies of the open and closed vocabulary model increases. Although the overall accuracy for long tokens is not high (ca. 32.8%), the improvement by a factor of 3.03 is still impressive. This finding is crucial, as long code tokens are likely to be requested for code completion. The tokens tagged with the medium label make up the largest fraction of terminal tokens in this dimension (Fig. 7b). In this case, the closed vocabulary model is outperformed by the open one with a margin of 15.7%. For short code tokens, the accuracy difference is not substantial.
The open vocabulary model also surpasses the closed one for all values of the Frequency dimension (Fig. 8a). The low frequency tokens not only constitute a relatively significant fraction of the dataset (Fig. 8b), but are also quite difficult to predict when using the closed vocabulary model. The increase in accuracy for these tokens (44.1%) between the two models is particularly impressive for the usages of identifiers (e.g. classes or variables), tokens from in-file or external libraries, and low frequency long tokens. These token types are closely related to the OOV issue, which is caused by the arbitrariness of identifiers. The refined evaluation results confirm the expectation that the open vocabulary model can alleviate the OOV issue in the traditional approaches. To complement this point, we count the number of OOV tokens encountered when using the closed vocabulary model and determine how many of these tokens can still be recommended correctly by the open vocabulary model. We also calculate the number of tokens encoded as OOV with the open vocabulary to compare with the statistic of the closed one. It is worth noting that since there are no included tokens (i.e. tokens shorter than the length threshold) encoded as OOV with the open vocabulary (see Sect. 6.2), the number of OOV tokens in the open vocabulary case is also the number of excluded tokens. The last three columns of Table 8 present our calculation. The open vocabulary model accurately predicts 27.0% and 28.1% of the OOV tokens in the Python50k and JavaScript50k datasets, respectively. Moreover, the open vocabulary decreases the number of tokens encoded as OOV with the closed vocabulary by factors of 12.2 and 16.9 for the Python and JavaScript datasets, respectively. This again confirms the advantage of the open vocabulary approach.

CT3 Challenges
The first and most important step of constructing the Code Token Type Taxonomy (CT3) is to determine the dimensions and the possible values of each dimension. In this section, we discuss two challenges that occurred while constructing the CT3 schema for Python.
More specifically, considering the Syntax Type dimension, the main purpose is to distinguish between various token types and their definitions and usages (e.g. the definition and usage of functions and methods). While the prediction of token definitions is harder to achieve, it is also less crucial than the prediction of token usages. In other words, distinguishing between the usages of different token types is important. However, for some token types it is hard to accomplish this distinction. One example is that the usages of imported libraries, defined variables, and declared arguments are all treated as variable usages in Python. In these cases, the analysis can be refined based on the interests of developers. For instance, a token can be considered an imported library usage if it is identified as a var_usg in the Syntax Type dimension and its Origin value is from_extlib or from_stdlib, see Table 4.
Another challenge is to identify the definitions of local variables. We assume that local variables are defined by assignments in the ASTs and only the first assignment of each variable within a scope (i.e. file, class, method, or function) is captured. The definitions of local variables inside lambda functions, if/else blocks, or loops might be obtained by slightly modifying the AST parser.
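The first-assignment-per-scope heuristic can be sketched with the standard library's `ast` module (rather than the Python150k AST format, so this is an illustrative approximation of our analyzer, and the `first_assignments` helper is hypothetical):

```python
# Hedged sketch of the local-variable-definition heuristic: walk a
# Python AST and record only the FIRST assignment of each name within
# each module/function/class scope.
import ast

def first_assignments(source):
    defs = []  # (scope_name, variable_name) in order of first assignment

    def visit(node, scope, seen):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            scope, seen = node.name, set()  # entering a new scope
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id not in seen:
                    seen.add(target.id)
                    defs.append((scope, target.id))
        for child in ast.iter_child_nodes(node):
            visit(child, scope, seen)

    visit(ast.parse(source), "<module>", set())
    return defs
```

As the text notes, assignments inside lambdas, conditionals, or loops would need additional handling; this sketch treats them like any other assignment in the enclosing scope.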
Surmounting these challenges would further optimize the CT3 schema; this is not in the scope of this paper and will therefore be investigated in our future work. Besides, the effect of these challenges on the evaluation results is negligible, since CT3 still provides fundamental and essential dimensions of code token types (e.g. definitions and usages of identifiers, token length, and token origin). Moreover, analyzing other token types based on the ASTs is straightforward and can be applied to other programming languages with minimal effort (e.g. adjusting the types of the AST nodes between languages). We have already published the scripts for constructing the CT3 schema for Python, which allow users to easily add or modify dimensions and their values.

Threats to validity
There are several threats to the validity of our work, which we analyze as follows. Training corpus We use the original train-test split of the Python150k and JavaScript150k datasets to easily compare with the state-of-the-art. However, Chirkova and Troshin (2021) suggested that the datasets should be redistributed and deduplicated to avoid data leaks. We also observe that the datasets are quite noisy, with many empty or long tokens and auto-generated files. These characteristics might affect the semantics of the prediction. This also underlines the need for a set of standardized benchmarks for code completion.
Language specificity The utility of the CT3 methodology is only evaluated for Python. The code token types are obtained by analyzing the ASTs and the relationships between tokens, which differ slightly between programming languages. However, the dimensions of CT3 are determined by the demands of developers for code completion in practice, e.g. predicting identifiers over punctuation and distinguishing between definitions and usages of identifiers (i.e. Syntax Type dimension), saving typing effort (i.e. Length dimension), perceiving the importance of local context/information in the predictions (i.e. Origin and Context dimensions), and ultimately, suggesting rare and difficult completions as vital cases for real world efficacy (i.e. Frequency dimension). These interests apply to every programming language. Furthermore, most of the values identified for the dimensions of CT3 are critical elements shared among programming languages. Accordingly, we expect that our results will generalize to other languages.
Model comparison We choose a Transformer-based model in two variants (i.e. with open and with closed vocabulary) as the representative ML-based model for this work due to the dominance of Transformers over other models. Besides, Transformers also achieve notable results on other tasks, such as classification (Kanade et al. 2020), vulnerability/code clone detection (Ding et al. 2021), and natural language to code (Ahmad et al. 2021), with leading tools GitHub Copilot 16 and AlphaCode. 17 With the rapid development of Transformers, it is unlikely that previously known models can surpass them. We assume that a comparison between other approaches (e.g. Transformers vs. GNNs) using CT3 would further confirm the advantages of our proposed methodology.
Out-of-Vocabulary (OOV) issue Handling the OOV issue by BPE is not the main focus in this work and is only used to demonstrate the proficiency of CT3 in supporting a refined evaluation. In the paper of Chirkova and Troshin (2020), the authors mention that splitting tokens into subtokens makes it hard to apply structure-aware models. But our results show that the open vocabulary model constructed with BPE outperforms the closed one in every category of token types. However, the efficiency of solutions for the OOV issue needs further investigation and is out of scope of this paper.

Auxiliary experiments
This section presents our additional statistics and results of tuning experiments to clarify some values and thresholds chosen for the main experiments above.

Length distribution of terminal tokens in the Python150k dataset
While preprocessing the code tokens in the dataset, we recorded their length distribution to identify thresholds for the three values in the Length dimension of the CT3 schema used for Python (i.e. short, medium, and long). Figure 9 displays the top-15 common terminal token lengths in the Python150k dataset.
Around 19.7% of terminal tokens in both the training and evaluation datasets have a length of 4 characters, and longer lengths make up only a small portion of the dataset. Figure 9 shows that most tokens assume lengths between 4 and 10 characters. Besides, there are nearly 4k and 3k different length values in the training and evaluation datasets, respectively. Therefore, the longer a terminal token, the less often it appears in the dataset, which makes it unlikely to be predicted.
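This statistic amounts to a frequency count over token lengths; a minimal sketch (the `top_lengths` helper is ours, for illustration) would be:

```python
# Sketch of the length-distribution statistic behind Fig. 9: count
# terminal tokens by length and report the most common lengths together
# with their relative fraction of the dataset.
from collections import Counter

def top_lengths(tokens, k=15):
    dist = Counter(len(t) for t in tokens)
    total = len(tokens)
    return [(length, count / total) for length, count in dist.most_common(k)]
```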
Moreover, since typing effort plays an important role in the code completion task (Hellendoorn et al. 2019), we prefer to focus on the code tokens that can help with (i) saving typing effort as well as (ii) covering a large part of the dataset. Based on these criteria and the analysis above, we chose 4 and 10 as the thresholds for the Length dimension of CT3 (for Python). Medium-length tokens (i.e. 4 ≤ token_length ≤ 10) indeed account for nearly 60% of the terminal tokens in the dataset.

Length threshold for building the open vocabulary
As mentioned in Sect. 4.2, building an open vocabulary can be disturbed by very long tokens. For this reason, we investigated several values of token length while using HuggingFace Tokenizers 18 to build the open vocabulary on the Python100k dataset (i.e. the training set). Table 9 summarizes the obtained results under three criteria: (i) single letters in the vocabulary (second column), (ii) the number of missing non-terminal types (third column), and (iii) how the code tokens are encoded, considering the maximum length, most frequent length, and average length of the generated sequences of subtokens, and the number of included tokens encoded as OOV in the training dataset (the last four columns of the table, respectively). Included tokens are tokens whose lengths do not exceed the length threshold. For the first criterion, since each code token is encoded as several subtokens with the main purpose of addressing the OOV issue, the learned vocabulary should contain all single letters 19 (in both uppercase and lowercase) to make sure that code tokens can always be encoded. Secondly, non-terminal tokens in the training dataset are included while building the vocabulary and training the completion model. Besides, the types of non-terminal tokens are more frequent than the values of terminal tokens. As a result, we expect the learned vocabulary to also cover these non-terminal types at an acceptable rate. Finally, we assume that a code token is predicted properly in the open vocabulary case if all of its subtokens are suggested correctly. Consequently, the shorter the sequence of subtokens, the higher the probability of a correct prediction. Moreover, the learned vocabulary should reduce the number of OOV tokens in the dataset after encoding.
18 https://huggingface.co/docs/tokenizers/python/latest/index.html.
19 In our experiments, we mostly focused on English letters.
Since our main purpose in this paper is to propose a methodology for a refined evaluation and to illustrate the utility of the proposed approach, finding optimal values for all thresholds is out of the scope of the paper. Therefore, we only experimented with token lengths from unlimited down to 50, which is higher than the threshold for creating data files in Sect. 6.3, to make sure that long tokens have fewer subtokens. While the vocabularies learned with all thresholds cover single letters and produce zero included tokens encoded as OOV, limiting the length to 50 brings acceptable results with 63 missing non-terminal types and the smallest maximum/average length of subtoken sequences, compared to the other thresholds in the table. Besides, we examined the missing non-terminal types and most of them are rarely used, e.g. CompareLtLtLt, CompareLtLtEEqLtLtLt, or AugAssignLShift.
We assume that the token lengths in the JavaScript100k dataset follow the same trend as those in the Python set. Hence, we analyzed the JavaScript100k dataset using the above criteria while limiting the token length to 50. The result verifies our inference: the obtained vocabulary has only one missing non-terminal type, a maximum subtoken-sequence length of 31, and an average length of 2.17. As expected, there are no included tokens encoded as OOV when using the built open vocabulary. Accordingly, we chose 50 as the length threshold for building the open vocabulary for both datasets.

Length threshold for creating input data files in the open vocabulary case
For the code completion task, the quality of the input data is an essential factor, since very long tokens might increase noise while training the models. To minimize this risk, we eliminated very long tokens from the dataset before creating the data files and treated them as <UNKNOWN> tokens after the preprocessing step. We again studied the length distribution of code tokens in the Python and JavaScript datasets, considering both non-terminal 20 and terminal tokens, to identify the threshold for very long tokens. Figure 10 presents the top-15 most common token lengths in the Python and JavaScript datasets for both the training and evaluation sets. Most code tokens in the Python150k dataset have a length of 4 characters, and tokens with lengths between 4 and 10 make up more than 60% of the dataset. The thresholds 4 and 14 can be applied similarly to the JavaScript150k dataset, where the most common length is 10. In other words, code tokens longer than 10 for Python and 14 for JavaScript occur less and less often in the dataset.
To determine the acceptable longest length for the code tokens in our experiments, we first tested several lengths from 200 down to 30 for code tokens in the Python150k dataset and obtained the results in Table 10. If the allowed maximum length of code tokens is 200, only 0.1% of code tokens have a length above the threshold. Similarly, reducing the threshold to 100 and 30 leaves 0.2% and 0.9% of code tokens longer than the maximum length, respectively. However, after training the model for 10 epochs, limiting the maximum length to 30 achieves the highest accuracy for the 10th epoch among the tested values. This means the model might be affected by noise created by the long tokens. Besides, discarding 0.9% of code tokens by excluding them from the dataset does not affect the evaluation much. Therefore, we chose 30 as the length threshold for creating data files in the Python dataset.
20 Non-terminal tokens are also fed to the completion models.
Since the most common token length in the JavaScript150k dataset is longer than in the Python150k dataset, we expected the threshold for the JavaScript dataset to be higher. However, only 0.3% of code tokens in the JavaScript150k dataset have a length longer than 30. Furthermore, in this case the accuracy after training is 0.8642, which is sufficiently high. Accordingly, we still selected 30 as the length threshold for creating data files in the JavaScript150k dataset. It is worth noting that the accuracy after training is higher than the one obtained in the evaluation step, especially in the open vocabulary case. More specifically, with the length threshold fixed to 30, the accuracy after training on the Python100k dataset is 0.83, but it drops to 0.72 on the evaluation set, see Table 8. While the former accuracy is obtained by TensorFlow Keras, which only considers whether the next subtoken is predicted correctly during training, the latter accuracy is calculated with a stricter rule, i.e. all subtokens of the original token must be suggested correctly.

Window size for creating input data files in the open vocabulary case
As mentioned in Sect. 4.2, we chose 1000 and 500 as the window_size and step_size, respectively, for creating input data with the closed vocabulary in both the Python and JavaScript datasets. Table 11 presents our experiments to determine these two values for the open vocabulary case, taking into account the following factors:
• step_size is inferred as half of window_size for the sake of simplicity.
• Each token is encoded to at least two subtokens (i.e. the token itself and the mark symbol <ENDTOKEN>). As a result, to prevent omitting information, the window used to incorporate sequences of subtokens (i.e. with open vocabulary) should be larger than the one used for sequences of tokens (i.e. with closed vocabulary).
• The value of window_size should be restricted due to the limited capacity of the used GPUs.
• batch_size, i.e. the number of data windows grouped in a batch, also affects the training process. We establish a batch_size threshold for each considered window size, since values higher than the threshold exceed the capacity of our GPUs.
• Another important element is the Estimated Time of Arrival (ETA), i.e. the estimated time that the model needs to complete one epoch (one training iteration). In other words, the training time can be measured by multiplying the ETA by the number of epochs. We expect our model to be trained in an acceptable duration of one to two days.
• Ultimately, the accuracy obtained after training should be sufficiently high within an adequate amount of time (e.g. two days). Besides, as mentioned at the end of Sect. 6.3, the accuracy declines in the evaluation step since all subtokens of a token must be predicted correctly to form a proper prediction in the open vocabulary case.
We investigated three different window sizes (i.e. 5k, 3k, and 2k) with corresponding values of epochs and batch_size. Table 11 shows that setting window_size to 5k makes training per epoch faster than window_size 3k due to the reduced batch_size. However, the accuracies after training for both sizes 5k and 3k are only about half of the one obtained with window_size 2k and the same number of epochs. For size 3k, the accuracy improves after 10 additional epochs but is still lower than the one for size 2k. Moreover, the ETA values for window_size 3k are quite high. Consequently, 2k is the most suitable window size based on the above criteria and in comparison to the other values, and step_size is adjusted to 1k accordingly. We apply the same window_size for the open vocabulary in the JavaScript150k dataset, since the maximum, most frequent, and average lengths of subtoken sequences in both datasets are very close (see Sect. 6.2) and both datasets use the same setup in the closed vocabulary case.
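The training-budget criterion above reduces to a simple calculation: total training time is the per-epoch ETA multiplied by the number of epochs, and a configuration is acceptable only if this total stays within the one-to-two-day budget. A sketch, using illustrative figures rather than the actual values from Table 11:

```python
def training_hours(eta_minutes_per_epoch, epochs):
    # Total training time = ETA per epoch x number of epochs.
    return eta_minutes_per_epoch * epochs / 60

def fits_budget(eta_minutes_per_epoch, epochs, budget_hours=48):
    # Accept a configuration only if it finishes within the budget
    # (two days by default, matching the criterion in the text).
    return training_hours(eta_minutes_per_epoch, epochs) <= budget_hours

# Hypothetical configurations (not the measured values from Table 11):
print(fits_budget(120, 20))  # 40 h total -> within a two-day budget
print(fits_budget(240, 15))  # 60 h total -> exceeds the budget
```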

Conclusions
We proposed a methodology called Code Token Type Taxonomy (CT3) for a refined evaluation and comparison of code completion models. Our empirical study shows that CT3 is helpful in characterizing and comparing the accuracy of approaches. As a side effect, we demonstrated that the open vocabulary method significantly enhances the accuracy of Transformer-based models on code completion for relevant tokens such as usages of (defined) variables and literals. We also compared the state-of-the-art of ML-based code completion approaches. The overview shows that there is a demand for a set of standardized benchmarks for a rigorous and reproducible comparison of prediction models. We published the CT3 information 21 and the analyzer source code 22 for the Python150k dataset. Further work will include extending the method to other programming languages and datasets, as well as the implementation of specialized code predictors according to the proposed CT3 schema.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Tables 12 and 13 summarize the explanation with examples for each label of these two dimensions. The target code token is highlighted in every example. For the sake of simplicity, we just point out some common positions of code tokens in the context examples, since all the tokens under the considered scope have the same context value.