Automated variable renaming: are we there yet?

Identifiers, such as method and variable names, form a large portion of source code. Therefore, low-quality identifiers can substantially hinder code comprehension. To support developers in using meaningful identifiers, several (semi-)automatic techniques have been proposed, mostly being data-driven (e.g., statistical language models, deep learning models) or relying on static code analysis. Still, limited empirical investigations have been performed on the effectiveness of such techniques for recommending developers with meaningful identifiers, possibly resulting in rename refactoring operations. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two DL-based models. The three approaches have been trained and tested on three datasets we built with the goal of evaluating their ability to recommend meaningful variable identifiers. Our quantitative and qualitative analyses show the potential of such techniques that, under specific conditions, can provide valuable recommendations and are ready to be integrated in rename refactoring tools. Nonetheless, our results also highlight limitations of the experimented approaches that call for further research in this field.


Introduction
Low-quality identifiers, such as meaningless method or variable names, are a recognized source of issues in software systems [1].Indeed, choosing an expressive name for a program entity is not always trivial and requires both domain and contextual knowledge [2].Even assuming a meaningful identifier is adopted in the first place while coding, software evolution may make the identifier not suitable anymore to represent a given entity.Moreover, the same entity used across different code components may be named differently, leading to inconsistent use of identifiers [3].For these reasons, rename refactoring has become part of developers' routine [4], as well as a standard built-in feature in modern integrated development environments (IDEs).While IDEs aid developers with the mechanical aspect of rename refactoring, developers remain responsible for identifying low-quality identifiers and choosing a proper rename.
To support developers in improving the quality of identifiers, several techniques have been proposed [5,6,3,7].Among those, data-driven approaches are on the rise [6,3,7].This is also due to the recent successful application of these techniques in the code completion field [8,9,10,11,12], which is a more general formulation of the variable renaming problem.Indeed, if a model is able to predict the next code tokens that a developer is likely to write (i.e., code completion), then it can be used to predict a token representing an identifier.Nevertheless, strong empirical evidence about the performance ensured by such data-driven techniques for supporting developers in identifier renaming is still minimal.
In this paper, we investigate the performance of three data-driven techniques in supporting automated variable renaming.We experiment with: (i) an n-gram cached language model [13]; (ii) the Text-to-Text Transfer Transformer (T5) model [14], and (iii) the Transformer-based model presented by Liu et al. [11].The n-gram cached language model [13] has been experimented against what were state-of-the-art deep learning models in 2017, showing its ability to achieve competitive performance in modeling source code.Thus, it represents a light-weight but competitive approach for the prediction of code tokens (including identifiers).Since then, novel deep learning models have been proposed such as, for example, those based on the transformer architecture [15].Among those, T5 has been widely studied to support code-related tasks [16,17,18,19,20,21].Thus, it is representative of transformer-based models exploited in the literature.Finally, the approach proposed by Liu et al. [11] has been specifically tailored to improve code completion performance for identifiers, known to be among the most difficult tokens to predict.These three techniques provide a good representation of the current state-of-the-art of data-driven techniques for identifiers prediction.
All experimented models require a training set to learn how to suggest "meaningful" identifiers.To this aim, we built a large-scale dataset composed of 1,221,193 instances, where an instance represents a Java method with its related local variables.This dataset has been used to train, configure (i.e., hyperparameters tuning), and perform a first assessment of the three techniques.In particular, as done in previous works related to the automation of code-related activities [7,22,23,24,25,26], we considered a prediction generated by the models as correct if it resembles the choice made by the original developers (i.e., if the recommended variable name is the same chosen by the developers).However, this validation assumes that the identifiers selected by the developers are meaningful, which is not always the case.
To mitigate this issue and strengthen our evaluation, we built a second dataset using a novel methodology we propose to increase the confidence in the dataset quality (in our case, in the quality of the identifiers in code).In particular, we mined variable identifiers that have been introduced or modified during a code review process (e.g., as a result of a reviewer's comment).These identifiers result from a shared agreement among multiple developers, thus increasing the confidence in their meaningfulness.This second dataset is composed of 457 Java methods with their related local variables.
Finally, we created a third dataset aimed at simulating the usage of the experimented tools for rename variable refactoring: We collected 400 Java projects in which developers performed a rename variable refactoring.By doing so, we were able to mine 442 valid commits.For each commit c in our dataset, we checked-out the system's snapshot before (s c−1 ) and after (s c ) the rename variable implemented in c.Given v the variable renamed in c, we run the three techniques on the code in s c−1 (i.e., before the rename variable refactoring) to predict v's identifier.
Then, we check whether the predicted identifier is the one implemented by the developers in s c .If this is the case, this means that the approach was able to successfully recommend a rename variable refactoring for v, selecting the same identifier chosen by developers.
Our quantitative analysis shows that the Transformer-based model proposed by Liu et al. [11] is by far the best performing model in the literature for the task of predicting variable identifiers.This confirms the effort performed by the authors that aimed at specifically improve the performance of DL-based models in this task.This approach, named CugLM, can correctly predict the variable identifier in ∼ 63% of cases when tested on the large scale dataset we built.Concerning the other two datasets, the performance of all models drop, with CugLM still ensuring the best performance with ∼ 45% of correct predictions on both datasets.
We also investigate whether the "confidence of the predictions" generated by the three models (i.e., how "confident" the model are about the generated prediction) can be used as a proxy for prediction quality.We found that when the confidence is particularly high (> 90%), the predictions generated by the models, and in particular by CugLM, have a very high chance of being correct (>80% on the large-scale dataset).This suggests that the recommendations generated such tools, under specific conditions (i.e., high confidence) are ready to be integrated in rename refactoring tools.
We complement the study with a qualitative analysis aimed at inspecting "wrong" predictions to (i) see whether, despite being different from the original identifier chosen by the developer, they still represent meaningful identifiers for the variable provided as input; and (ii) distill lessons learned from these failure cases.
Concerning the first point, it is indeed important to clarify that even wrong predictions may be valuable for practitioners.This happens, for example, in the case in which the approach is able to recommend a valid alternative for an identifier (e.g., surname instead of lastName) or maybe even suggesting a better identifier, thus implicitly recommending a rename refactoring operation.Such an analysis helps in better assessing the actual performance of the experimented techniques.
Finally, we analyze the circumstances under which the experimented tools tend to generate correct and wrong predictions.For example, not surprisingly, we found that these approaches are effective in recommending identifiers that they have already seen used, in a different context, in the training set.Also, the longer the identifier to predict (e.g., in terms of number of terms composing it), the lower the likelihood of a correct prediction.
Significance of research contribution.To the best of our knowledge, our work is the largest study at date experimenting with the capabilities of state-of-the-art data-driven techniques for variable renaming across several datasets, including two high-quality datasets we built with the goal of increasing the confidence in the obtained results.The three datasets we built and the code implementing the three techniques we experiment with are publicly available [27].Our findings unveil the potential of these tools as support for rename refactoring and help in identifying gaps that can be addressed through additional research in this field.

Related Work
We start by discussing works explicitly aiming at improving the quality of identifiers used in code.Then, we briefly present the literature related to code completion by discussing approaches that could be useful to suggest a variable name while writing code.For the sake of brevity, we do not discuss studies about the quality of code identifiers and their impact on comprehension activities (see e.g., [28,29,30,31]).

Improving the Quality of Code Identifiers
Before diving into techniques aimed at improving the quality of identifiers, it is worth mentioning the line of research featuring approaches to split identifiers [32,33] and expand abbreviations contained in them [33,34,35].These techniques can indeed be used to improve the expressiveness and comprehensibility of identifiers since, as shown by Lawrie et al. [28], developers tend to better understand expressive identifiers.
On a similar research thread, Reiss [36] and Corbo et al. [37] proposed tools to learn coding style from existing source code such that it can then be applied to the code under development.The learned style rules can include information about identifiers such as the token separator to use (e.g., Camel or Snake case) and the presence of a prefix to name certain code entities (e.g., OBJ_varName).The rename refactoring approaches proposed by Feldthaus and Møller [38] and by Jablonski and Hou [39], instead, focus on the relations between variables, inferring whether one variable should be changed together with others.Although the above-described methods can improve identifiers' quality (e.g., by expanding abbreviated words and increasing consistency), they cannot address the use of meaningless/inappropriate identifiers as program entities' names.
To tackle this problem, Caprile and Tonella [40] proposed an approach enhancing the meaningfulness of identifiers with a standard lexicon dictionary and a grammar collected by analyzing a set of programs, replacing non-standard terms in identifiers with a standard one from the dictionaries.
Thies and Roth [41] and Allamanis et al. [42] also presented techniques to support the renaming of code identifiers.Thies and Roth [41] exploit static code analysis: if a variable v 1 is assigned to an invocation of method m (e.g., name = getFullName), and the type of v 1 is identical to the type of the variable v 2 returned by m, then rename v 1 to v 2 .
Allamanis et al. [42] proposed NATURALIZE, a two-step approach to suggest identifier names.In the first step, the tool extracts, from the code AST, a list of candidate names and, in the second step, it leverages an n-gram language model to rank the name list generated in the previous step.The authors evaluated the meaningfulness of the recommendations provided by their approach through analyzing 30 methods (for a total of 33 recommended variable renamings).Half of these suggestions were identified as relevant.Building on top of NATURALIZE, Lin et al. [3] proposed lear, an approach combining code analysis and n-gram language models.The differences between lear and NATURALIZE are: (i) while NATURALIZE considers all the tokens in the source code, lear only focuses on tokens containing lexical information; (ii) lear also considers the type information of variables.Note that these techniques are meant to promote a consistent usage of identifiers within a given project (i.e., renaming variables that represent the same entity but are named differently within different parts of the same project).Thus, they cannot suggest naming a variable using an original identifier learned, for example, from other projects.Differently from previous works, we empirically compare the effectiveness of datadriven techniques to support variable renaming, aiming to assess the extent to which developers could adopt them for rename refactoring recommendations.
From this perspective, the most similar work to our study is the one by Lin et al. [3].However, while their goal is to promote a consistent usage of identifiers within a project, we aim at supporting identifier rename refactoring in a broader sense, not strictly related to "consistency".Moreover, differently from Lin et al. [3], we include in our study recent DL-based techniques.
Finally, a number of previous works explicitly focused on the method renaming.
Daka et al. [43] designed a technique to generate descriptive method names for automatically generated unit tests by summarizing API-level coverage goals.Høst and Østvold [44] identify methods whose names do not reflect the responsibilities they implement.Alon et al. [7] proposed a neural model to represent code snippets and predict their semantic properties.Then, they use their approach to predict a method's name starting from its body.Differently from these works [44,43,7], we are interested in experimenting with data-driven techniques able to handle the renaming of variables.
A precursor of modern code completion techniques is the Prospector tool by Mandelin et al. [54].Prospector aims at suggesting, within the IDE, variables or method calls from the user's codebase.Following this goal, other tools such as InSynth by Gvero et al. [55] and Sniff by Chatterjee et al. [56] have added support for type completion (e.g., expecting a type at a given point in the source code, the models search for type-compatible expressions).
Tu et al. [57] introduced a cache component that exploits code locality in n-gram models to improve their support to code completion.Results show that since code is locally repetitive, localized information can be used to improve performance.This enhanced model outperforms standard n-gram models by up to 45% in terms of accuracy.On the same line, Hellendoorn and Devanbu [58] further exploit the cached models considering specific characteristics of code (e.g., unlimited, nested, and scoped vocabulary).They also showed the superiority of their model when compared to deep learning for representing source code.This is one of the three techniques we experiment with (details in Section 3.1).Karampatsis et al. [59], a few years later, came to a different conclusion: Neural networks are the best language-agnostic algorithm for representing code.To overcome the out of vocabulary problem, they propose the use of Byte Pair Encoding (BPE) [60], achieving better performance as compared to the cached n-gram model proposed in [58].Also Kim et al. [9] showed that novel DL techniques based on the Transformers neural network architecture can effectively support code completion.Similarly, Alon et al. [61] addressed the code completion problem with a language-agnostic approach leveraging the syntax to model the code snippet as a tree.Based on Long short-term Memory networks (LSTMs) and Transformers, the model receives an AST representing a partial statement with some missing tokens to complete.The versatility of addressing general problems with deep learning models pushed researchers to explore the potential of automating various code-related tasks.
For example, Svyatkovskiy et al. [10] created IntelliCode Compose, a generalpurpose code completion tool capable of predicting code sequences of arbitrary tokens.
Liu et al. [11] presented a Transformer-based neural architecture pre-trained on two development tasks: code understanding and code generation.Successively, the model is specialized with a fine-tuning step on the code completion task.With this double training process, the authors try to incorporate a generic knowledge of the project into the model with the explicit goal of improving performance in predicting code identifiers.This is one of the techniques considered in our study (details in Section 3.3).
Ciniselli et al. [12] demonstrated how deep learning models can support code completion at different granularities, not only predicting inline tokens but even complete statements.To this aim, they adapt and train a BERT model in different scenarios ranging from simple inline predictions to entire blocks of code.
Finally, Mastropaolo et al. [16] 1 recently showed that T5 [14], properly pre-trained and fine-tuned, achieves state-of-the-art performance in many code-related tasks, including bug-fixing, mutants injection, code comment and assert statement generation.This is the third approach considered in our study (details in Section 3.2).
Companies have also shown active interest in supporting practitioners while developing sources with code completion tools.For example, IntelliCode was introduced by Svyatkovskiy et al. [10], a group of researchers working at the Microsoft Corporation.
It is multilingual code completion tool that aims at predicting sequences of code tokens such as entire lines of syntactically correct code.IntelliCode leverages a transformer model and has been trained on 1.2 billion lines of diverse programming languages.Similarly, GitHub Copilot has been recently introduced as a state-of-the-art code recommender [62,63].It has been experimentally released through their API that aims at translating natural language descriptions into source code [64].Copilot is based on a GPT-3 model [65] fine-tuned on publicly available code from GitHub.Nonetheless, the exact dataset used for its training is not publicly available.This hinders the possibility to use it in our study, since we cannot check for possible overlaps between the Copilot's training set and the test sets used in our study (also collected from GitHub open source projects).
The successful applications of DL for code completion suggest its suitability for supporting rename refactoring.Indeed, if a model is able to predict the identifier to use in a given context, it can be used to boost identifiers' quality as well.As a representative of DL-based techniques to experiment with variable rename refactoring task, we chose (i) the model by Liu et al. [11], given its focus on the predicting code identifiers; and (ii) the T5 model as used in [16], given its state-of-the-art performance across many code-related tasks.We acknowledge that other choices are possible given the vast literature in the field.We focused on two DL-models recently published in top software engineering venues (i.e., ASE'20 [11] and ICSE'21 [16]).

Data-driven Variable Renaming
In our study, we aim at assessing the effectiveness of data-driven techniques for automated variable renaming.We focus on three techniques representative of the state-ofthe-art.The first is a statistical language model that showed its effectiveness in modeling source code [13].The second, T5 [14], is a recently proposed DL-based technique already applied to address code-related tasks [16].The third is the Transformer-based model presented by Liu et al. [11] to boost code completion performance on identifiers.work at method-level granularity: For each local variable v declared in a method m, we mask every v's reference in m asking the experimented techniques to recommend a suitable name for v.If the recommended name is different from the original one, a rename variable recommendation can be triggered.
We provide an overview of the experimented techniques, pointing the reader to the papers introducing them [13,14,11] for additional details.Our implementations are based on the ones made available by the original authors of these techniques and are publicly available in our replication package [27].The training of the techniques is detailed in Section 4.

N-gram Cached Model
Statistical language models can assess a probability of a given sequence of words.The basic idea behind these models is that the higher the probability, the higher the "familiarity" of the scored sequence.Such familiarity is learned by training the model on a large text corpus.An n-gram language model predicts a single word following the n − 1 words preceding it.In other words, n-gram models assume that the probability of a word only depends on the previous n − 1 words.
Hellendoorn and Devanbu [13] discuss the limitations of n-gram models that make them suboptimal for modeling code (e.g., the unlimited vocabulary problem due to new words that developers can define in identifiers).To overcome these limitations, the authors present a dynamic, hierarchically scoped, open vocabulary language model [13], showing that it can outperform Recurrent Neural Networks (RNN) and LSTM in modeling code.While Karampatsis et al. [59] showed that DL models can outperform the cached n-gram model, the latter ensures good performance at a fraction of the DL models training cost, making it a competitive baseline for code-related tasks.The T5 model has been introduced by Raffel et al. [14] to support multitask learning in Natural Language Processing.The idea is to reframe NLP tasks in a unified text-to-text format in which the input and output of all tasks to support are always text strings.For example, a single T5 model can be trained to translate across a set of different languages (e.g., English, German) and identify the sentiment expressed in sentences written in any of those languages.This is possible since both these tasks (i.e., translation and sentiment identification) are text-to-text tasks, in which a text is provided as input (i.e., a sentence in a specific language for both tasks) and another text is generated as output (i.e., the translated sentence or a label expressing the sentiment).T5 is trained in two phases: pre-training, which allows defining a shared knowledge-base useful for a large class of text-to-text tasks (e.g., guessing masked words in English sentences to learn about the language), and fine-tuning, which specializes the model on a specific downstream task (e.g., learning the translation of sentences from English to German).As previously said, T5 already showed its effectiveness in code-related tasks [16].However, its application to variable renaming is a premier.Among the T5 variants proposed by Raffel et al. [14] that mostly differ in terms of architectural complexity, we adopt the smallest one (T5 small ).The choice of such architecture is driven by our limited computational resources.However, we acknowledge that bigger models have been shown to further increase performance [14].

Deep-Multi-Task code completion model
Liu et al. [11] recently proposed the Code Understanding and Generation pre-trained Language Model (CugLM), a BERT-based model for source code modeling.Albeit under-the-hood CugLM still features a Transformer-based network [66] as T5, such an approach has been specifically conceived to improve the performance of language models in identifiers, thus making it very suitable for our study on variable renaming.CugLM is pre-trained using three objectives.The first asks the model to predict masked identifiers in code (being thus similar to the one used in the T5 model, but focused on identifiers).The second task asks the model to predict whether two fragments of code can follow each other in a snippet.Finally, the third is a left-to-right language modeling task, in which the classic code completion scenario is simulated, i.e., given some tokens (left part), guess the following token (right part).Once pre-trained, the model is fine-tuned for code completion in a multi-task learning setting, in which the model has first to predict the type of the following token and, then, the predicted type is used to foster the prediction of the token itself.As reported by the authors, such an approach achieves state-of-the-art performance when it comes to predicting code identifiers.

Study Design
The goal is to experiment the effectiveness of data-driven techniques in supporting automated variable renaming.The context is represented by (i) the three techniques [13,14,11] introduced in Section 3 and (ii) three datasets we built for training and evaluating the approaches.Our study answers the following research question: To what extent can data-driven techniques support automated variable renaming?

Datasets Creation
To train and evaluate the experimented models, we built three datasets: (i) the largescale dataset, used to train the models, tune their parameters, and perform a first assessment of their performance; (ii) the reviewed dataset and (iii) the developers dataset used to further assess the performance of the experimented techniques.Our quantitative evaluation is based on the following idea: If, given a variable, a model is able to recommend the same identifier name as chosen by the original developers, then the model has the potential to generate meaningful rename recommendations.
Clearly, there is a strong assumption here, namely that the identifier selected by the developers is meaningful.For this reason, we have three datasets.The first one (large-scale dataset) aims at collecting a high number of variable identifiers that are needed to train the data-driven models and test them on a large number of data points.
The second one (reviewed dataset) focuses instead on creating a test set of high-quality identifiers for which our assumption can be more safely accepted: These are identifiers that have been modified or introduced during a code review process.Thus, more than one developer agreed on the appropriateness of the chosen identifier name for the related variable.Finally, the third dataset (developers dataset) focuses on identifiers that have been subject to a rename refactoring operation (i.e., the developer put effort in improving the quality of the identifier through a refactoring).Again, this increases our confidence in the quality of the considered identifiers.
In this section, we describe the datasets we built, while Section 4.2 details how they have been used to train, tune, and evaluate the three models.

Large-scale Dataset
We selected projects on GitHub [67] by using the search tool by Dabic et al. [68].This tool indexes all GitHub repositories written in 13 different languages and having at least 10 stars, providing a handy querying interface [69] to identify projects meeting specific selection criteria.We extracted all Java projects having at least 500 commits and at least 10 contributors.We do so as an attempt to discard toy/personal projects.We decided to focus on a single programming language to simplify the toolchain building needed for our study.Also, we excluded forks to reduce the risk of duplicated repositories in our dataset.
Such a process resulted in 5,369 cloned Java projects from which we selected the 1,425 using Maven 2 [70] and having their latest snapshot being compilable.Maven allows to quickly verify the compilability of the projects, which is needed to extract information about types needed by one of the experimented models (i.e., CugLM ).CugLM leverages identifiers' type information to improve its predictions.To be precise and comprehensive in type resolution, we decided to rely on the JavaParser library [71], running it on compilable projects.This allows to resolve also types that are implemented in imported libraries.We provide the tool we built for such an operation as part of our replication package [27].
We used srcML [72] to extract from each Java file contained in the 1,425 projects all methods having #tokens ≤ 512, where #tokens represents the number of tokens composing a function (excluding comments).The filter on the maximum length of the method is needed to limit the computational expense of training DL-based models (similar choices have been made in previous works [22,25,26], with values ranging between 50 and 100 tokens).All duplicate methods have been removed from the dataset to avoid overlap between training and test sets we built from them.
From these 1,425 repositories, we set apart 400 randomly selected projects for constructing the developers dataset (described in Section 4.1.3).Concerning the remaining 1,025, we use ∼40% of them (418 randomly picked repositories) to build a dataset needed for the pre-training of the T5 [14] and of the CugLM model [11] (pre-training dataset).Such a dataset is needed to support the pre-training phase that, as shown in the literature, helps deep learning models to achieve better performance when dealing with code-related tasks [73,18,19,12].Indeed, the pre-training phase conveys two major advantages summarized as follows: (i) once the model has been pre-trained, it can learn general representations and patterns of the language the model is working with, (ii) the pre-trained model yields to a more robust model initialization of the neural network weights that can then support the specialization phase (i.e., fine-tuning).
The remaining 615 projects (large-scale dataset) have been further split into training (60%), evaluation (20%), and test (20%).The training set has been used to finetune the two DL-based models (i.e., T5 and CugLM ).This dataset, joined with the pre-training dataset, has also been used to train the n-gram model.In this way, all models have been trained using the same set of data, with the only difference being that the training is organized in two steps (i.e., pre-training and fine-tuning) for T5 and CugLM, while it consists of a single step for the n-gram model.
For the T5 model, we used the evaluation set to tune its hyperparameters (Section 4.2), since no previous work applied such a model for the task of variable renaming.
Instead, for CugLM and n-gram we used the best configurations reported in the original works presenting them [13,11].Finally, the test set has been used to perform a first assessment of the models' performance.Table 1 shows the size of the datasets in terms of the number of extracted methods (reviewed dataset and developers dataset are described in the following).

Reviewed Dataset
Also in this case, we selected GitHub projects using the tool by Dabic et al. [68].Since the goal for the reviewed dataset is to mine code review data, we added on top of the selection criteria used for the large-scale dataset a minimum of 100 pull requests per selected project.Also in this case we only selected Maven projects having their latest snapshot successfully compiling.We then mined from the 948 projects we obtained information related to the code review performed in their pull requests.Let us assume that a set of files C s is submitted for review.A set of reviewer comments R c can be made on C s possibly resulting in a revised version of the code C r1 .
Such a process is iterative and can consists of several rounds each one generating a new revised version C ri .Eventually, if the code contribution is accepted for merging, this concludes the review process with a set of C f files.This whole process "transforms" C s → C f .We use srcML to extract from both C s and C f the list of methods in them and, by performing a diff, we identify all variables that have been introduced or modified in each method as result of the review process (i.e., all variables that were not present in C s but that are present in C f ).We conjecture that the identifiers used to name these variables, being the output of a code review process, have a higher chance of representing high quality data that can be used to assess the performance of the experimented models.
Also in this case we removed duplicate methods both (i) within the reviewed dataset, and (ii) between it and the previous ones (pre-training dataset and training set of large-scale dataset), obtaining 457 methods usable as a further test set of the three techniques.

Developers' Dataset
We run Refactoring miner [74] on the history of the 400 Java repositories we previously put aside.Refactoring miner is the state-of-the-art tool for refactoring detection in Java systems.We run it on every commit performed in the 400 projects, looking for commits in which a Rename Variable refactoring has been performed on the local variable of a method.This gives us, for a given commit c i , the variable name at commit c i−1 (i.e., before the refactoring) and the renamed variable in commit c i .We use this set of commits as an additional test set (developers dataset) to verify if, by applying the experimented techniques on the c i−1 version, they are potentially able to recommend a rename (i.e., they suggest, for the renamed variable, the identifier applied with the rename variable refactoring).After removing duplicated methods from this dataset as well (similarly, we also removed duplicates between developers dataset, pre-training dataset, and the training set in large-scale dataset), we ended up with 442 valid instances.

N-gram model
The n-gram model has been trained on the instances in pre-training dataset ∪ largescale dataset.This means that all Java methods contained in pre-training dataset and in the training set of large-scale dataset have been used for learning the probability of sequences of tokens.We use n = 3 since higher values of n have been proven to result in marginal performance gains [13].

T5
To pre-train the T5, we use a self-supervised task similar to the one by Raffel et al. [14], in which we randomly mask 15% of code tokens in each instance (i.e., a Java method) from the pre-training dataset, asking the model to guess the masked tokens.Such a training is intended to give the model general knowledge about the language, such that it can perform better on a given down-stream task (in our case, guessing the identifier of a variable).The pre-training has been performed for 200k steps (corresponding to ∼13 epochs on our pre-training dataset) since we did not observe any improvement going further.We used a 2x2 TPU topology (8 cores) from Google Colab to train the model with a batch size of 128.As a learning rate, we use the Inverse Square Root with the canonical configuration [14].We also created a new SentencePiece model (i.e., a tokenizer for neural text processing) by training it on the entire pre-training dataset so that the T5 model can properly handle the Java language.We set its size to 32k word pieces.
In order to find the best configuration of hyper-parameters, we rely on the same approach used by Mastropaolo et al. [16].Specifically, we do not tune the hyperparameters of the T5 model for the pre-training (i.e., we use the default ones), because the pre-training itself is task-agnostic, and tuning may provide limited benefits.Instead, we experiment with four different learning rate schedulers for the fine-tuning phase.Since this is the first time T5 is used for recommending identifiers, we also perform an ablation study aimed at assessing the impact of pre-training on this task.Thus, we perform the hyperparameter tuning for both the pre-trained and the non pre-trained model, experimenting with the four configurations in Table 2: constant (C-LR), slanted triangular (ST-LR), inverse square (ISQ-LR), and polynomial (PD-LR) learning rate.We experiment the same configurations for the pre-trained and the non-pretrained models, with the only difference being the LR starting and LR end of the PD-LR.Indeed, in the non pre-trained model (ablation study), we had to lower those values to make the gradient stable (see Table 2).We fine-tune the T5 for 100k steps for each configuration.Then, we compute the percentage of correct predictions (i.e., cases in which the model can correctly predict the masked variable identifier) achieved in the evaluation set.The achieved results reported in Table 3 showed a superiority of the ST-LR (second column) for the non pre-trained model, while for the pre-trained model, the PD-LR works slightly better.Thus, we use these two scheduler in our study for fine-tuning the final models for 300k steps.The fine-tuning of the T5 required some further processing to the large-scale dataset.Given a Java method m having n distinct local variables, we create n versions of it m 1 , m 2 , . . ., m n each one having all occurrences of a specific variable masked with a special token.Such a representation of the dataset allows to fine-tune the T5 model by providing it pairs (m j , i j ), where m j is a version of m having all occurrences of variable v j replaced with a <MASK> token and i j is the identifier selected by the developers for v j .This allows the T5 to learn proper identifiers to name variables in specific code contexts.The same approach has been applied on the large-scale dataset evaluation and test set, as well as on the reviewed dataset and developers dataset.In these cases, an instance is a method with a specific variable masked, and the trained model is used to guess the masked identifier.Table 4 reports the number of instances in the datasets used for the T5 model.Note that such a masking processing was not needed for the n-gram model nor for CugLM, since they just scan the code tokens during training, and they try to predict each code token sequentially during testing.Still, it is important to highlight that all techniques have been trained and tested on the same code.

CugLM
To pre-train and fine-tune the CugLM model we first retrieved the identifiers' type information for all code in the pre-training dataset and large-scale dataset.Then, we leveraged the script provided by the original authors in the replication package [75] to obtain the final instances in the format expected by the model.For both pre-training and fine-tuning (described in Section 3.3), we rely on the same hyper-parameters configuration used by the authors in the paper presenting this technique [11].

Performance Assessment
We assess the performance of the trained models against the large-scale test set, the reviewed dataset, and the developers dataset.For each prediction made by each model, we collect a measure acting as "confidence of the prediction", i.e., a real number between 0.0 and 1.0 indicating how confident the model is about the prediction.For the ngram model, such a measure is a transformation of the entropy of the predictions.
Concerning the T5, we exploited the score function to assess the model's confidence on the provided input.The value returned by this function ranges from minus infinity to 0 and it is the log-likelihood (ln) of the prediction.Thus, if it is 0, it means that the likelihood of the prediction is 1 (i.e., the maximum confidence, since ln(1) = 0), while when it goes towards minus infinity, the confidence tends to 0. Finally, CugLM outputs the log-prob for each predicted tokens.Hence, we normalize this value throught the exp function.
We investigate whether the confidence of the predictions represents a good proxy for their quality.If the confidence level is a reliable indicator of the predictions' quality (e.g., 90% of the predictions having c > 0.9 are correct), it can be extremely useful in the building of recommender systems aimed at suggesting rename refactorings, since only recommendations with high confidence could be proposed to the developer.We split the predictions by each model into ten intervals, based on their confidence c going from 0.0 to 1.0 at steps of 0.1 (i.e., first interval includes all predictions having 0 ≤ c < 0.1, last interval has 0.9 ≤ c).Then, we report for each interval the percentage of correct predictions generated by each model in each interval.To assess the performance of the techniques overall, we also report the percentage of correct predictions generated by the models on the entire test datasets (i.e., by considering predictions at any confidence level).
A prediction is considered "correct" if the predicted identifier corresponds to the one chosen by developers in the large-scale dataset and in the reviewed dataset, and if it matches the renamed identifier in the developers dataset.However, a clarification is needed on the way we compute the correct predictions.We explain this process through Fig. 2, showing the output of the experimented models, given an instance in the test sets.The grey box in Fig. 2 represents an example of an instance in the test set: a function having all references to a specific variable originally named sum masked.The T5 model, given such an instance as input, predicts a single identifier (i.e., sum) for all three references of the variable.Thus, for the T5, it is easy to say whether the single generated prediction is equivalent to the identifier chosen by the developers or not.The n-gram and the CugLM model, instead, generate three predictions, one for each of the masked instances, despite they represent the same identifier.
Thus, for these two models, we use two approaches to compute the percentage of correct predictions in the test sets.The first scenario, named complete-match, considers the prediction as correct only if all three references to the variable are correctly predicted.Therefore, in the example in Fig. 2, the prediction of the CugLM model (2 out of 3) is considered wrong.Similarly, the n-gram prediction (1 out of 3 correct) is considered wrong.The second scenario, named partial-match, considers a prediction as correct if at least one of the instances is correctly predicted (thus, in Fig. 2 both the n-gram and the CugLM predictions are considered correct).
We also statistically compare the performance of the models in terms of correct predictions: We use the McNemar's test [76], which is a proportion test suitable to pairwise compare dichotomous results of two different treatments.We statistically compare each pair of techniques in our study (i.e., T5 vs CugLM, T5 vs n-gram, CugLM vs n-gram).To compute the test results for two techniques T 1 and T 2 , we create a confusion matrix counting the number of cases in which (i) both T 1 and T 2 provide a correct prediction, (ii) only T 1 provides a correct prediction, (iii) only T 2 provides a correct prediction, and (iv) neither T 1 nor T 2 provide a correct prediction.We complement the McNemar's test with the Odds Ratio (OR) effect size.Also, since we performed multiple comparisons, we adjusted the obtained p-values using the Holm's correction [77].
We also manually analyzed a sample of wrong predictions generated by the approaches with the goal of (i) assessing whether, despite being different from the original identifiers used by the developers, they were still meaningful; and (ii) identifying scenarios in which the experimented techniques fail.To perform such an analysis, we selected the top-100 wrong predictions for each approach (300 in total) in terms of confidence level.Three of the authors inspected all of them, trying to understand if the generated variable name could have been a valid alternative for the target one, while the fourth author solved conflicts.Note that, given 100 wrong predictions inspected for a given model, we do not check whether the other models correctly predict these cases.This is not relevant for our analysis, since the only goal is to understand the extent to which the "wrong" predictions generated by each model might still be valuable for developers.
Finally, we compare the correct and wrong predictions in terms of (i) the size of the context (i.e., number of tokens composing the method and number of times a variable is used in the method), (ii) the length of the identifier in terms of number of characters and number of words composing it (as obtained through camelCase splitting); (iii) the number of times the same identifier appears in the training data.We show these distributions using boxplots comparing the two groups of predictions (e.g., compare the length of identifiers in correct and wrong predictions).We also compare these distributions using the Mann-Whitney test [78] and the Cliff's Delta (d) to estimate the magnitude of the differences [79].We follow well-established guidelines to interpret the effect size: negligible for |d| < 0.10, small for 0.10 ≤ |d| < 0.33, medium for 0.33 ≤ |d| < 0.474, and large for |d| ≥ 0.474 [79].

Results Discussion
Table 5 reports the results achieved by the three experimented models for each dataset in terms of correct predictions.For T5, both pre-trained and non pre-trained versions are presented.For the n-gram and CugLM, we report the results both when using  Fig. 3 depicts the relationship between the percentage of correct predictions and the confidence of the models.Similarly, to Fig. 4, the orange line represents the ngram model, while the purple and red lines represent CugLM and the T5 pre-trained model, respectively.Within each confidence interval (e.g., 0.9-1.0) the line shows the percentage of correct predictions generated by the model (e.g., ∼80% of predictions having a confidence higher than 0.9 are correct for CugLM in the large-scale dataset).
The achieved results show a clear trend for all models: Higher confidence corresponds to higher prediction quality.The best performing model (CugLM) is able, in the highest confidence scenario, to obtain 66% of correct predictions on the developers dataset, 71% on the reviewed dataset, and 82% on the large-scale dataset.These results have a strong implication for the building of rename refactoring recommenders on top of these approaches: Giving the possibility to the user (i.e., the developer) to only receive recommendations when the model is highly confident can discard most of the false positive recommendations.Concerning the manual analysis we performed on 100 wrong recommendations generated by each model on the large-scale dataset, a few findings can be distilled.First, the three authors observed that the T5 was the one more frequently generating, in the set of wrong predictions we analyzed, identifiers that were meaningful in the context in which they were proposed (despite being different from the original identifier used by the developers).For example, value was recommended instead of number or harvestTasks instead of tasks.The three authors agreed on 31 meaningful identifiers proposed by the T5 in the set of 100 wrong predictions they inspected.Surprisingly, this was not the case for the other two models, despite the great performance we observed for CugLM.However, a second observation we made partially explains such a finding: We found that several failure cases of CugLM and of the n-gram model are due the recommendation of identifiers already used somewhere else in the method and, thus, representing wrong recommendations.We believe this is due to the different prediction mechanism adopted by the T5 as compared to the other two models.As previously explained, the T5 generates a single prediction for all instances of the identifier to predict, thus considering the whole method as a context for the prediction and inferring that identifiers already used in the context should not be recommended.

0-50
#Tokens per method The other two models, instead, scan the method token by token predicting each identifier instance in isolation.This means that if an identifier x is used for the first time in the method after the first instance of the identifier p to predict (e.g., p appears in line 2 while x appears in line 7), the existence of x is not considered when generating the prediction for p.This reduces the information available to CugLM and to the ngram model.Also, CugLM has limitations inherited from the fixed size of its vocabulary set to the 50k most frequent tokens [11], which are the only ones the model can predict.This means that CugLM is likely to fail when dealing with rare identifiers composed by several words.The T5, using the SentencePiece tokenizer, can instead compose complex identifiers.Finally, Fig. 5 shows the comparison of five characteristics between correct (in green) and wrong (in red) predictions.The comparison has been performed on the three datasets (see labels on the right side of Fig. 5), and for correct/wrong predictions generated by the three models (see labels on the left side of Fig. 5).The characteristics we inspected are summarized at the top of Fig. 5.For each comparison (i.e., each pair of boxplots in Fig. 5) we include a * if the Mann-Whitney test reported a significant difference (p-value < 0.05) and, if this is the case, the magnitude of the Cliff's Delta is reported as well.
Concerning the length of the target identifier (#Characters per identifier and #Tokens per identifier ), the models tend to perform better on shorter identifiers, with this difference being particularly strong for CugLM.Indeed, this is the only model for which we observed a large effect size in the length of identifiers between correct and wrong predictions.Focusing for example on the length in terms of number of tokens (#Tokens per identifiers), it is clear that excluding rare exceptions, CugLM mostly succeeds for one-word identifiers.This is again likely to be a limitation dictated by the fixed size of the vocabulary (50k tokens) that cannot contain all possible combinations of words used in identifiers.The size of the coding context (i.e., method) containing the identifier to predict (#Tokens per method ) does not seem to influence the correctness of the prediction, with few significant differences accompanied by a negligible or small effect size.This is also visible by looking at Fig. 4, which portrays the relationship between the percentage of correct predictions and the number of tokens composing the input method.The orange line represents the n-gram model, while the purple and red lines represent CugLM and the T5 pre-trained model, respectively.Within each interval (e.g., 0-50), the line shows the percentage of correct predictions generated by the model for methods having a tokens length falling in that bucket.The only visible trend is that of CugLM on the large-scale dataset, which shows a clear downward trend in correct predictions with the increase in length of the input method.This is indeed the only scenario for which the statistical tests reported a significant differences in the method length of correct and wrong predictions with a small effect size (in all other cases, the difference is not significant or accompanied by a small effect size).
Differently, identifiers appearing in the training set tend to help the prediction (#Overlapping identifiers within the training set).This is particularly true for CugLM (large effect size on all datasets), since its vocabulary is built from the training set.The boxplot for the wrong predictions is basically composed only by outliers, with its third quartile equal 0. This indicates that the predictions on which CugLM fails are usually those for identifiers never seen in the training set.
Finally, the number of times that an identifier to predict appears in the context (i.e., #Occurences of identifier within methods), only has an influence for the T5 on the large-scale dataset.However, there is no strong trend to discuss for this characteristic.

Implications of our Findings
Our findings have implications for practitioners and researchers.For the first, our results show that modern DL-based techniques presented in the literature may be already suitable to be embedded in rename refactoring engines.Clearly, they still suffer of limitations that we will discuss later.However, especially when the confidence of their predictions is high, the generated identifiers are often meaningful, matching the ones chosen by developers.
In terms of research, there are a number of improvements these tools can benefit from.First, we noticed that the main weakness of the strongest approach we tested (i.e., CugLM) is the fixed vocabulary size.Such a problem has been addressed in other models using tokenizers such as byte pair encoding [60] or the SentencePiece tokenizer exploited by the T5.Integrating these tokenizers in CugLM (or similar techniques) could help in further improving performance.Second, we noticed that several falsepositive recommendations could be avoided by just integrating into the models more contextual information.
For example, if the model is employed to recommend an identifier in a given location l, other identifiers having l in their scope do not represent a viable option, since they are already in use.Similarly, the integration of type information in CugLM demonstrated the boost of performance that can be obtained when the prediction model is provided with richer data.Also, the employed models are predicting identifier names without exploiting information such as (i) the original identifier name that could be improved via a rename refactoring, and (ii) the naming convention adopted in the project.Augmenting the context provided to the models with such information might substantially boost their prediction performance.Finally, while we performed an extensive study about the capabilities of datadriven techniques for variable renaming, our experiments have been performed in an "artificial" setting.The (mostly positive) achieved results encourage the natural next step represented by case studies with developers to assess their perceived usefulness of these techniques.

Threats to Validity
Construct validity.Our study is largely based on one assumption: The identifier name chosen by developers is the correct one the models should predict.We addressed this threat when building two of our datasets: (i) we ensure that the variable identifiers in the reviewed dataset have been checked in the context of a code review process involving multiple developers; (ii) we built developers dataset by looking for identifiers explicitly renamed by developers.Thus, it is more likely that those identifiers are actually meaningful.
Internal validity.An important factor that influences DL performance is hyperparameters tuning.Concerning T5, for the pre-training phase we used the default T5 parameters selected in the original paper [14] since we expect little margin of improvement for such a task-agnostic phase.For the fine-tuning, due to feasibility reasons, we did not change the model architecture (e.g., number of layers), but we experimented with different learning rates-scheduler as did before by Mastropaolo et al. [16].For the other two techniques we relied on the parameters proposed in the papers presenting them.
External validity.While the datasets used in our study represent hundreds of software projects, the main threat in terms of generalizability is represented by the focus on the Java language.It is important to notice that the experimented models are language agnostic, but would require the implementation of different tokenizers to support specific languages.

Conclusion
We presented a large-scale empirical study aimed at assessing the performance of datadriven techniques for variable renaming.We experimented with three different techniques, namely the n-gram cached language model [13], the T5 model [14], and the Transformer-based model presented by Liu et al. [11].We show that DL-based models, especially when considering predictions they generate with high confidence, represent a valuable support for variable rename refactoring.Our future research agenda is dictated by the implications discussed in Section 5.1.

Data Availability
All the code and data we used in our study is publicly available in our replication package [27].In particular, we provide: (i) the code needed to train the models, (ii) the datasets we built for training, validating, and testing the models, (iii) all predictions generated by the three different approaches, and (iv) our trained models.

Fig. 2 :
Fig. 2: Example of models' outputs for a test instance

Table 5 :
Correct predictions: C-match indicates the complete-match heuristic, P-match the

Fig. 3 :
Fig. 3: Percentage of correct predictions by confidence level

Table 1 :
Num. of methods in the datasets used in our study

Table 2 :
Hyperparameters tuning for the T5 Model

Table 4 :
Instances in the datasets used for training, evaluating, testing the T5 model

Table 6 :
McNemar's test (adj.p-value and OR) considering complete matches as correct predictions.P-t=pre-trained