Styler: learning formatting conventions to repair Checkstyle violations

Ensuring the consistent usage of formatting conventions is an important aspect of modern software quality assurance. While formatting convention violations can be automatically detected by format checkers implemented in linters, there is no satisfactory solution for repairing them. Manually fixing formatting convention violations is a waste of developer time, and code formatters do not take into account the conventions adopted and configured by developers for the linter in use. In this paper, we present Styler, a tool dedicated to fixing formatting rule violations raised by format checkers, using a machine learning approach. For a given project, Styler first generates training data by injecting violations of the project-specific rules into violation-free source code files. Then, it learns fixes by feeding long short-term memory neural networks with the training data encoded into token sequences. Finally, it predicts fixes for real formatting violations with the trained models. Currently, Styler supports a single checker, Checkstyle, a highly configurable and popular format checker for Java. In an empirical evaluation, Styler repaired 41% of 26,791 Checkstyle violations mined from 104 GitHub projects. Moreover, we compared Styler with the IntelliJ plugin CheckStyle-IDEA and the machine-learning-based code formatters Naturalize and CodeBuff. We found that Styler fixes violations of a diverse set of Checkstyle rules (24/25 rules), generates smaller repairs than the other systems, and predicts repairs in seconds once trained on a project. Through a manual analysis, we identified cases in which Styler does not succeed in generating correct repairs, which can guide further improvements of Styler. Finally, the results suggest that Styler can be useful to help developers repair Checkstyle formatting violations.

Keywords Coding conventions · Linter · Format checker · Checkstyle · Formatting violations · Automatic repair

Introduction

Coding conventions are widely recognized as a means to improve the internal quality of software systems (Prause and Jarke, 2015). They are rules that developers agree on for writing code, which encode best coding practices, widely adopted standards, or developers' preferences. The usage of coding conventions helps to reduce style deviations, which are nothing but distracting noise when reading code (Spinellis, 2011; Prause and Jarke, 2015).
However, keeping all source code files of a project compliant with the coding conventions agreed on by a development team is a challenge. For that, two main activities must be performed: the detection and the repair of coding convention violations. The detection of coding convention violations can be automatically performed using linters. A linter is a tool that statically analyzes code to check its compliance with rules and warns software developers when rule violations are found. The usage of linters also brings challenges, because the developers need to create a configuration according to their adopted conventions so that the linter detects the right violations (not more and not less). Nevertheless, in this paper, we focus on the latter task, i.e., the repair of violations, which is a little-researched problem.
To repair coding convention violations, developers can either perform fixes manually or use automated solutions that produce fixes. Manually fixing these violations is a waste of valuable developer time. Considering formatting convention violations, which are the focus of this paper, developers could use code formatters as an automated solution. However, this alternative is also not satisfactory. With code formatters, the key problem is that they do not take into account the project-specific rules, those that are configured by developers for the linter in use.
Inspired by the problem statement of program repair (Monperrus, 2018), we state in this paper the problem of automatically repairing formatting violations: given a program, a set of format checker rules, and one rule violation, the goal is to modify the source code formatting so that no violation is raised by the format checker. A format checker is a linter, or a part of a linter, that focuses on formatting checks, since linters cover several classes of coding conventions, e.g., naming and formatting.
In this paper, we explore that problem in the context of Checkstyle, a popular format checker for the Java language. We present Styler, a tool dedicated to fixing formatting violations in Java source code. The uniqueness of Styler is its applicability to any formatting convention, because its approach is not based on specific format checker rules. The key idea behind Styler is the usage of machine learning to learn the formatting conventions that are used in a software project. The learning is based on training data generated by Styler through the modification of source code files to trigger violations of the formatting rules configured by developers for a given project. Once trained, Styler predicts changes to formatting characters (e.g., whitespace) to fix formatting convention violations happening in the wild. Technically, Styler encodes Java source code containing formatting violations into abstract token sequences and uses sequence-to-sequence machine learning models based on long short-term memory neural networks (LSTMs).
To evaluate Styler, we conducted a large-scale experiment using 26,791 Checkstyle formatting violations mined from 104 GitHub projects. Based on our research questions, we found out that Styler repairs many violations (41%) from a diverse set of formatting rules (24/25). It generally performs better at fixing violations related to horizontal whitespace between Java tokens than violations related to tabulations and line length. Moreover, Styler produces smaller repairs compared to the state-of-the-art machine learning formatters (Allamanis et al., 2014; Parr and Vinju, 2016) and the IntelliJ plugin CheckStyle-IDEA (CheckStyle-IDEA, 2021). Finally, Styler repairs violations in seconds, once it is trained for a given project.
To sum up, our contributions are:

- A novel approach to fixing violations of code formatting conventions, based on machine learning. The approach is able to learn project-specific formatting rules with a self-training data generation strategy and to repair formatting rule violations with a sequence-to-sequence machine learning model;
- A tool, called Styler, which implements our approach in the context of Java and Checkstyle, to repair Checkstyle formatting violations. The tool is made publicly available for future research and usage;
- A dataset of real-world Checkstyle violations mined from GitHub repositories. The dataset is publicly available for future research;
- A comparative experiment on the performance of Styler against the state-of-the-art code formatters (CheckStyle-IDEA, 2021; Allamanis et al., 2014; Parr and Vinju, 2016). The results of the experiment are also publicly available for the sake of open science.
The remainder of this paper is organized as follows. Section 2 presents the background of this work. Section 3 presents Styler in detail, including its workflow and technical principles. Section 4 presents the design of our experiment for evaluating Styler and comparing it with three code formatters. The experimental results are then presented in Section 5. Section 6 presents discussions, and Section 7 presents the related works. Finally, Section 8 presents the concluding remarks of this work.

Background
Coding conventions play an important role in software development and maintenance. In this section, we present a background on coding conventions and tools that help developers enforce them. In addition, we report on a study of the usage of Checkstyle, a tool that statically checks Java code against a specified set of coding conventions.

Coding conventions
Coding conventions, also known as coding style or coding standards, are rules that developers agree on for writing code. The usage of coding conventions does not affect the behavior of software systems. Instead, developers use them to improve code readability and maintainability. Although not all coding practices are perceived by developers as enhancing code readability (Santos and Gerosa, 2018), they help to reduce style deviations, which are nothing but distracting noise when reading code (Spinellis, 2011; Prause and Jarke, 2015).
There are several kinds of coding conventions, e.g., conventions related to naming and formatting. In this paper, we focus on the latter, i.e., formatting conventions. Formatting refers to the appearance or the presentation of the source code. One can change the formatting by using, for instance, non-printable characters, such as spaces, tabulations, and line breaks. In free-format languages such as Java and C++, the code formatting does not change the abstract syntax tree of programs. In non-free-format languages, such as Python and Haskell, the formatting even relates to behavior, which means that correcting formatting issues can fix bugs.
To exemplify formatting conventions, consider Figure 1, which shows two well-known ways that developers may follow when placing left curly braces in code blocks. One way is to place the left curly brace on a new line (Figure 1a), while another is to place it at the end of the conditional expression line (Figure 1b). Which way is actually used in a software project depends on what the project's development team chooses. Agreeing on the coding conventions to be followed in a software project is important to avoid edit wars and endless debates.

Detection of coding convention violations
A challenge faced by developers is to keep their code compliant with the agreed coding conventions. Basically, every new change in the code must satisfy the adopted coding conventions. Manually analyzing code changes to check that they do not violate the adopted coding conventions is time-consuming and error-prone. To do so automatically, one can use linters. A linter is a tool that statically analyzes code to check its compliance with rules and warns software developers when rule violations are found. The rules might be related to functional problems, such as resource leakage or incorrect logic, or to maintainability problems, such as non-compliance with best practices or violations of style conventions (Beller et al., 2016). As a side note, the literature does not consistently relate linters and automated static analysis tools (abbreviated as ASATs, also known as static analyzers). However, we understand an automated static analysis tool to be any tool that analyzes source code without the need to run it, including, for instance, tools for software analytics. Therefore, in this paper, we consider that every linter belongs to the family of automated static analysis tools, but not every automated static analysis tool analyzes code against a set of rules.
Linters can usually be integrated into IDEs and build tools. On the one hand, when a linter is integrated into an IDE, developers may manually run it before they commit their changes. If they do not, they might face a lot of violations raised by the linter after the building step for a release or for shipping the program. On the other hand, when a linter is integrated into build tools, it might be automatically executed in Continuous Integration (CI) environments. The important coding conventions might be configured to make CI builds break when they are violated. This way, developers are forced to repair coding convention violations early in the software development process.

Usage of Checkstyle in the wild
Linters have been the subject of investigation in recent research (Zampetti et al., 2017; Vassallo et al., 2018; Marcilio et al., 2019). However, existing studies did not investigate, at scale, how style checking tools are specifically used. In this section, we present a study focused on the usage of Checkstyle, which is a popular linter for Java that checks source code style.

Checkstyle usage in open-source projects
Method. To measure the usage of Checkstyle, we queried GitHub 10 to retrieve only Java projects with at least five stars, because stars have been shown to be meaningful for sampling projects from GitHub (Beller et al., 2017). We found 171,195 Java projects. Then, we searched each of them 11 for a Checkstyle ruleset file. Note that a Checkstyle ruleset file can have any name, but we followed a conservative approach towards identifying true positive files by using a set of commonly used names 12 . For simplicity, hereafter, we refer to a Checkstyle ruleset file as checkstyle.xml.
Results. We found 4,334 Java projects containing a checkstyle.xml file, which is 2.53% of all Java projects with at least five stars on GitHub. Table 1 shows the proportion of these projects that use Maven, Gradle, or Ant as their build tools, and the Travis or Circle CI services. We note that build tools are widely used among projects using Checkstyle: 98% of the projects use at least one build tool. Moreover, 44% of the projects use a continuous integration service, which shows the software engineering maturity of the sampled projects.
10 On June 23, 2021. 11 On June 23-24, 2021. 12 Commonly used names for Checkstyle ruleset files: 'checkstyle.xml', '.checkstyle.xml', 'checkstyle_rules.xml', 'checkstyle_config.xml', 'checkstyle_configuration.xml', 'checkstyle_checker.xml', 'checkstyle_checks.xml', 'google_checks.xml', 'sun_checks.xml'. Variants obtained by replacing '_' with '-' were also used.

Results. We found out that all Checkstyle rules are used. Figure 2 shows the top-10 most used rules. The bars in dark gray represent formatting-related rules, and the bars in light gray represent non-formatting rules. In addition, the bar in gray with a dot pattern represents a rule that can be about formatting, but it depends on how it is configured, since it is a regex rule. In the top-10 most used rules, there are three rules related to formatting and one that can be. Notably, the top-3 most used rules are, or can be, formatting-related. Therefore, we conclude that formatting-related rules are important for developers, which validates the relevance of our work.
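As a side illustration, the conservative name matching described above can be reproduced in a few lines of Python. The list below copies the footnoted names; the helper name is ours, not part of the original mining scripts.

```python
# Commonly used Checkstyle ruleset file names (from the footnote above),
# plus the variants obtained by replacing '_' with '-'.
BASE_NAMES = [
    "checkstyle.xml",
    ".checkstyle.xml",
    "checkstyle_rules.xml",
    "checkstyle_config.xml",
    "checkstyle_configuration.xml",
    "checkstyle_checker.xml",
    "checkstyle_checks.xml",
    "google_checks.xml",
    "sun_checks.xml",
]

def ruleset_name_candidates(base_names=BASE_NAMES):
    """Return the base names together with their '-' variants, deduplicated."""
    candidates = set(base_names)
    candidates.update(name.replace("_", "-") for name in base_names)
    return sorted(candidates)
```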

Styler
Styler is a tool dedicated to helping developers keep their source code compliant with their adopted formatting conventions by automatically fixing formatting violations in Java source code. Styler could be used in different software development workflows. For instance, it could be run locally as a pre-commit hook when developers are about to release projects. It could also be used in continuous integration environments, where pull requests could be automatically opened with suggestions of formatting fixes. In this section, we present the workflow and the technical principles of Styler.

Targeted violation types
Styler is about learning and repairing violations related to formatting conventions. For instance, consider that a developer specified that left curly braces must always be placed at the end of lines (as shown in Figure 1b). If this rule is not satisfied (e.g., as in Figure 1a), a given linter triggers a formatting-related violation: for instance, Checkstyle would output the violation presented in Figure 4a, and SonarJava would find a violation of the rule "An open curly brace should be located at the end of a line". In order to fix this violation, the line break before the token "{" must be replaced by a single space.
As mentioned in Section 2.1, there are different classes of conventions, e.g., formatting and naming, and consequently different automated checks in linters. In Styler, we exclusively focus on formatting checks related to non-printable characters, such as indentation and whitespace before and after punctuation. Hereafter, we refer to the linter part related to these formatting checks as format checker. Figure 3 shows the workflow of Styler. It is composed of two main components: 'Styler training' for learning how to repair formatting violations and 'Styler prediction' for repairing a real formatting violation raised by a format checker. Styler receives as input a software project, including its source code and its format checker ruleset.

Styler workflow
The component 'Styler training' is responsible for learning how to repair formatting violations in the given project according to the project-specific format checker ruleset. It creates the training data by injecting formatting violations into violation-free source code files belonging to the project (step A). Then, it encodes the training data into abstract token sequences (step B) in order to train LSTM neural networks (step C). The learned LSTM models are later used to predict repairs.

The component 'Styler prediction' is responsible for predicting fixes for real formatting violations. It first detects formatting violations by running the format checker on the project (step D). Then, Styler encodes the violating code into an abstract token sequence (step E), which is given as input to the previously learned LSTM models (step F). The models predict fixes for the given formatting violation. These fixes are in the format of formatting token sequences, so they are translated back to Java code (step G). Styler then runs the format checker on the new Java code containing the predicted fixes (step H). Finally, among the predicted fixes for which no violation is raised by the format checker, Styler selects one formatting fix to give as output (step I). As Styler only impacts the formatting of source code, its repairs do not change the behavior of the program under consideration.

Styler in action
Consider the formatting violation presented in Figure 4a. This violation is about the Checkstyle LeftCurly rule, which was configured to enforce that left curly braces are placed at the end of lines. The Java source code that caused such a violation is presented in Figure 4b.
For that violation, Styler encodes the incorrectly formatted lines (Figure 4b) into the abstract token sequence shown in Figure 4c. Then, this abstract token sequence is given as input to LSTM models, which predict alternative formatting token sequences, such as the one shown in Figure 4d, that may fix the current formatting violation. These predicted formatting token sequences are then used to modify the formatting tokens of the original abstract token sequence. This results in predicted abstract token sequences, such as the one shown in Figure 4e. The difference between Figure 4c and Figure 4e is the replacement of the formatting token 1_NL by 1_SP. This predicted repair means that the line break before the token "{" should be replaced by a single space. Then, the predicted abstract token sequence (Figure 4e) is translated back to Java code (Figure 4f). Finally, when running Checkstyle on the new Java code, no Checkstyle violation is raised, meaning that Styler successfully repaired the violation.

Java source code encoding
Styler encodes Java source code into an abstract token sequence that is required to predict formatting changes. An abstract token sequence is composed of pairs of abstract Java tokens and abstract formatting tokens. Styler represents each Java token as an abstract token by keeping the value of Java keywords, separators, and operators (e.g., + → +), and by replacing the other token kinds, such as literals, comments, and identifiers, by their types (e.g., x → Identifier). For each pair of subsequent Java tokens, Styler creates an abstract formatting token, which depends on the presence of a new line. If there is no new line, Styler counts the number of whitespace characters and represents it as n_SP when the characters are spaces and n_TB when the characters are tabulations, where n is the number of whitespace characters (e.g., ␣ → 1_SP). If there is no whitespace between two Java tokens (e.g., x=), Styler adds 0_None between the two Java tokens. If there are new lines between two Java tokens, Styler first counts the number of new lines and represents it as n_NL, where n is the number of new lines. Then, Styler calculates the indentation delta (∆), i.e., the indentation difference, between the line containing the first Java token and the line containing the second Java token. Positive indentation deltas are represented by ∆_ID (indent), negative ones by ∆_DD (dedent), and deltas equal to zero, i.e., no indentation change between the lines, by the absence of an indentation delta representation. The complete representation after the calculation of the number of new lines and the indentation delta is n_NL_∆_(ID|DD)_(SP|TB). For instance, in Figure 4b, the new line between lines 7 and 8 is represented by 1_NL_4_ID_SP, i.e., one new line and an indentation delta of +4 spaces.
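The encoding rules above can be sketched as follows. This is a simplified illustration, not Styler's actual implementation (which relies on the javalang tokenizer); the helper names and the abbreviated keyword/operator lists are our assumptions.

```python
# Simplified sketch of the abstract token encoding described above.
JAVA_KEYWORDS = {"if", "else", "return", "public", "class", "int", "void"}
SEPARATORS_OPERATORS = set("+-*/=<>!&|%^~?:;,.(){}[]") | {"+=", "-=", "==", "&&", "||"}

def abstract_java_token(token):
    """Keep keywords/separators/operators; abstract the other token kinds."""
    if token in JAVA_KEYWORDS or token in SEPARATORS_OPERATORS:
        return token
    if token.isdigit():
        return "Literal"
    return "Identifier"

def formatting_token(ws, indent_delta=0, indent_char="SP"):
    """Encode the whitespace string `ws` between two consecutive Java tokens.

    indent_delta is the indentation difference between the two lines (only
    meaningful when `ws` contains new lines); indent_char is "SP" or "TB".
    """
    if ws == "":
        return "0_None"
    newlines = ws.count("\n")
    if newlines == 0:
        kind = "TB" if "\t" in ws else "SP"
        return f"{len(ws)}_{kind}"
    token = f"{newlines}_NL"
    if indent_delta > 0:
        token = f"{token}_{indent_delta}_ID_{indent_char}"
    elif indent_delta < 0:
        token = f"{token}_{-indent_delta}_DD_{indent_char}"
    return token
```

For instance, a line break followed by four extra spaces of indentation encodes to 1_NL_4_ID_SP, matching the example in Figure 4b.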

Self-supervised training data generation
Styler does not use predefined templates for repairing formatting violations. Styler uses machine learning for inferring a model to repair formatting violations and, consequently, it needs training data. One option would be to mine past commits from the project under consideration to collect training data. However, there might not exist enough data in the history of the project for training models.
Therefore, to have enough data for training, our key insight is to generate the training data in a self-supervised manner. The idea is to modify violation-free Java files belonging to the project under analysis to trigger formatting rule violations. A similar idea has been explored by Yasunaga and Liang (2020). Then, one obtains a pair of files (α_orig, α_err): α_orig is the file without the formatting violation, and α_err is the file with the formatting violation. α_orig is a repaired version of α_err, and we can use supervised machine learning to predict α_orig given α_err. We explore this idea in two different ways to generate training data, hereafter referred to as formatting violation injection protocols. The protocols are named Styler random and Styler 3grams.
The Styler random protocol for injecting formatting violations in a project consists of the automated insertion or deletion of a single formatting character (space, tabulation, or new line) in Java source files. These modifications require a careful procedure so that 1) the project still compiles and 2) its behavior is not changed. For this, we specify the locations in the source code files that are suitable for the modifications. For insertions, the suitable locations are before or after any token. For deletions, the suitable locations are 1) before or after any punctuation (".", ",", "(", ")", "[", "]", "{", "}", and ";"), 2) before or after any operator (e.g., "+", "-", "*", "=", "+="), and 3) in any sequence of indentation characters longer than one character.
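A minimal sketch of this protocol, under the simplifying assumption that a file is already split into its Java tokens and the whitespace strings between them; the helper and parameter names are ours, not Styler's.

```python
import random

PUNCTUATION = set(".,()[]{};")
OPERATORS = {"+", "-", "*", "/", "=", "+=", "-=", "==", "!=", "<", ">"}

def inject_random_violation(tokens, whitespace, rng=None):
    """tokens: the Java tokens of a file; whitespace: the len(tokens)-1
    formatting strings between consecutive tokens. Returns a copy of
    `whitespace` with a single formatting character inserted or deleted,
    following the suitable-location rules described above."""
    rng = rng or random.Random()
    ws = list(whitespace)
    # Deletions are only safe next to punctuation/operators or in runs of
    # more than one whitespace character.
    deletable = [
        i for i, w in enumerate(ws)
        if w and (tokens[i] in PUNCTUATION or tokens[i] in OPERATORS
                  or tokens[i + 1] in PUNCTUATION or tokens[i + 1] in OPERATORS
                  or len(w) > 1)
    ]
    if deletable and rng.random() < 0.5:
        i = rng.choice(deletable)
        ws[i] = ws[i][:-1]            # delete one formatting character
    else:
        i = rng.randrange(len(ws))    # insertions are safe anywhere
        ws[i] += rng.choice(" \t\n")  # insert one formatting character
    return ws
```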
The Styler 3grams protocol is meant to produce likely violations. Instead of directly changing the Java source code as Styler random does, Styler 3grams performs modifications at the abstract token level. The idea is to replace formatting tokens with the ones used by developers in the same context, i.e., between the same surrounding Java tokens. For that, we use 3-grams, where a 3-gram = {Java token, formatting token, Java token}. So, given a violation-free Java file, the task of Styler 3grams is the following. First, the Java file is tokenized (see Section 3.4), and a random formatting token is picked and used to form a 3-gram, which is 3-gram_orig. Then, given a corpus of 3-grams previously created from software projects, Styler 3grams finds a 3-gram_i in the corpus that matches the Java tokens of 3-gram_orig. Several matches can be found, but the selection of a 3-gram_i is random, weighted by its frequency in the corpus. Then, the formatting token of 3-gram_orig is replaced by the formatting token of 3-gram_i. Finally, Styler 3grams performs a de-tokenization so that a violating Java version of the original violation-free Java file is created.
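The 3-gram replacement can be illustrated as follows, assuming the corpus is represented as a map from a pair of surrounding Java tokens to the frequencies of the formatting tokens observed between them; this representation and the helper names are our assumptions.

```python
import random

def replace_with_3gram(java_tokens, fmt_tokens, corpus, rng=None):
    """Pick a random formatting token and replace it with a formatting token
    observed between the same surrounding Java tokens, sampled by frequency.

    corpus: dict mapping (left Java token, right Java token) to a dict of
    {formatting token: frequency}.
    """
    rng = rng or random.Random()
    i = rng.randrange(len(fmt_tokens))
    context = (java_tokens[i], java_tokens[i + 1])
    alternatives = corpus.get(context)
    if not alternatives:
        return list(fmt_tokens)  # no matching 3-gram in the corpus
    choices, weights = zip(*alternatives.items())
    new_fmt = list(fmt_tokens)
    new_fmt[i] = rng.choices(choices, weights=weights)[0]
    return new_fmt
```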
Algorithm 1 presents the algorithm that Styler uses to generate one training dataset per formatting violation injection protocol (Styler random and Styler 3grams ). The input of the algorithm is the format checker ruleset of the project, a corpus of violation-free Java files taken from the project, the number of violating files to be generated, the injection protocol to be used, and the maximum duration of the process. Then, in each batch iteration (line 7), a file is randomly selected from the corpus of violation-free Java files (line 12), and the specified injection protocol is applied to it (line 13). Once a batch is completed, the format checker is executed on the resulting modified files (line 16) so that the algorithm selects the ones that contain a single violation (line 17). The algorithm ends when the desired number of files with violations is reached or when the process reaches the specified maximum duration.

Violation encoding
In order to repair formatting violations, the Java source code encoded as an abstract token sequence must capture both the violation in the code and the context surrounding the violation. So, for a given violation, Styler considers a token window of k source code lines before and after the violation location provided by the format checker for creating an abstract token sequence (see Section 3.4). Once the violating line and the ones surrounding it are tokenized, Styler places two tags around the tokens related to the origin of the violation so that the violation location and its type can be further identified. The tags consist of the name of the format checker rule that was violated. For instance, the violation presented in Figure 4a is about the Checkstyle LeftCurly rule, so the tags around the violation are <LeftCurly> and </LeftCurly> as shown in Figure 4c.
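The column-known case of this tagging strategy can be sketched as follows (a hypothetical helper, not Styler's code):

```python
def tag_violation(tokens, rule, violation_index):
    """Place <Rule> one token before the violating token and </Rule> one
    token after it. This is the case where the format checker reports a
    column; for line-only violations such as LineLength, Styler instead
    places the tags around the whole line."""
    start = max(0, violation_index - 1)
    end = min(len(tokens), violation_index + 2)
    return (tokens[:start] + [f"<{rule}>"] + tokens[start:end]
            + [f"</{rule}>"] + tokens[end:])
```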

Algorithm 1 Injection of formatting violations in Java files.
Input: ruleset - format checker configuration of the project under consideration
Input: files - corpus of violation-free Java files taken from the project
Input: numberOfViolations - number of files with one violation to be generated
Input: protocol in [Styler random, Styler 3grams]
Input: maxDuration - maximum duration of the process
Output: dataset with Java files containing formatting violations raised by a format checker
1: var dataset ← {}
2: var maxTime ← time.now + maxDuration
3: var timeout ← false
4: while dataset.size < numberOfViolations and not timeout do
5:    var modifiedFiles ← {}
6:    ▷ generate one batch of modified files
7:    for i ← 1 to batchSize do
8:       if time.now >= maxTime then
9:          timeout ← true
10:         go to line 16
11:      end if
12:      var file ← pickRandomFile(files)
13:      var modifiedFile ← injectViolation(file, protocol)
14:      modifiedFiles.append(modifiedFile)
15:   end for
16:   formatCheckerResults ← runFormatChecker(modifiedFiles, ruleset)
17:   violatingFiles ← selectFilesWithOneViolation(formatCheckerResults)
18:   dataset.append(violatingFiles)
19: end while
20: return dataset

The strategy to place the tags in the abstract token sequence is primarily based on the fact that the tags should surround the tokens related to the origin of the violation. At the same time, the number of tokens between the two tags should be minimal so as to keep precise information about the violation location. Thus, Styler places the tags according to the location information given by the format checker. When the format checker provides the line and the column, Styler places <ViolationType> one token before the violation and </ViolationType> one token after. When the format checker provides the line but not the column (e.g., when the violation is about the Checkstyle LineLength rule), Styler places <ViolationType> one token before the line and </ViolationType> one token after the end of the line.

Machine learning model
Learning (Figure 3-step C). Styler aims to translate a token sequence with a formatting violation (input sequence) into a new token sequence with no formatting violation (output sequence). Styler uses sequence-to-sequence translation based on long short-term memory (LSTM) recurrent neural networks, similar to what is used for natural language translation. Thanks to the token abstraction employed by Styler to encode Java source code (see Section 3.4 and Section 3.6), the input and output vocabularies are small (∼150 and ∼50 tokens, respectively), hence well handled by LSTM models. Styler uses LSTMs with bidirectional encoding, which means that the embedding is able to capture information around the formatting violation in both directions. For instance, a violation triggered by the Checkstyle WhitespaceAround rule, which checks that a token is surrounded by whitespace, requires the contexts before and after the token.
Repairing (Figure 3-step F). Once the LSTM models are trained (one per formatting violation injection protocol, see Section 3.5), Styler can be used for predicting fixes for a token sequence I as in Figure 4c. For an input sequence I, an LSTM model predicts x alternative formatting token sequences using a technique called beam search, which we use off-the-shelf. These alternatives are all potential repairs for the formatting violation (e.g., Figure 4d).
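Styler uses OpenNMT-py's beam search off-the-shelf; the toy implementation below only illustrates the general idea of keeping the most probable partial sequences at each step and returning several complete alternatives (all names are ours).

```python
import math

def beam_search(next_token_probs, beam_width, max_len, eos="</s>"):
    """next_token_probs(prefix) returns a dict mapping each candidate next
    token to its probability. Returns complete sequences, best first."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    ranked = sorted(finished + beams, key=lambda c: c[1], reverse=True)
    return [seq for seq, _ in ranked]
```

With beam_width = x, the x returned sequences play the role of the x alternative formatting token sequences predicted per model.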
Note that the LSTM models predict formatting token sequences (e.g., Figure 4d), but the goal is to have abstract token sequences containing Java and formatting tokens (e.g., Figure 4e), so they can further be translated back to Java code. For that, Styler generates a new abstract token sequence (O_i) for each formatting token sequence (F_i), based on the original input I, as in Figure 5a. Recall that I is composed of pairs of Java tokens and formatting tokens (see Section 3.4), therefore its number of formatting tokens is L_I = length(I)/2. However, an LSTM model does not enforce the output size, thus we cannot guarantee that the length of a predicted formatting token sequence (L_Fi = length(F_i)) is equal to L_I. If L_Fi > L_I, Styler uses the first L_I formatting tokens from F_i and ignores the remaining ones to generate O_i, as in Figure 5b. If L_Fi < L_I, Styler uses all formatting tokens from F_i and copies the remaining original formatting tokens (at positions L_Fi + 1, L_Fi + 2, ..., L_I) from I, as in Figure 5c. Finally, after creating x abstract token sequences O, Styler continues its workflow (Figure 3-step G).
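This length reconciliation can be written compactly; the sketch below assumes I is a flat list alternating Java tokens and formatting tokens (the helper name is ours).

```python
def merge_prediction(I, F):
    """Replace the formatting tokens of I by those of F, truncating F when
    it is too long and copying the original tail of I when it is too short."""
    java = I[0::2]       # Java tokens
    fmt_orig = I[1::2]   # original formatting tokens (L_I of them)
    L_I = len(fmt_orig)
    # L_F > L_I: keep only the first L_I predicted tokens.
    # L_F < L_I: append the missing original formatting tokens from I.
    fmt = list(F[:L_I]) + fmt_orig[len(F):]
    merged = []
    for j, f in zip(java, fmt):
        merged += [j, f]
    return merged
```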

Repair verification and selection
Styler performs x predictions per LSTM model (i.e., the Styler random -based model and the Styler 3grams -based model), so in the end Styler generates x × 2 predictions to repair a single violation. After the translation of these predictions back to Java source code (Figure 3-step G), Styler performs a verification (Figure 3-step H), where the format checker is executed on the resulting Java source code files. Finally, among the files that do not result in formatting violations, Styler selects the one that has the smallest source code diff to give as output (Figure 3-step I).
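A minimal sketch of this selection step, using Python's difflib to approximate the size of a source code diff (the helper names are ours; the exact diff metric used by Styler may differ):

```python
import difflib

def diff_size(original, candidate):
    """Number of added/removed lines between two source strings."""
    diff = difflib.unified_diff(
        original.splitlines(), candidate.splitlines(), lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

def select_repair(original, valid_candidates):
    """Among candidates that pass the format checker, pick the smallest diff."""
    return min(valid_candidates, key=lambda c: diff_size(original, c))
```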

Implementation
The approach employed by Styler is independent of the considered format checker. The current implementation uses Checkstyle, which is a popular format checker for Java. Other format checkers can be integrated into Styler. However, they must output the violation type and the violation location. This is necessary for the violation encoding (see Section 3.6).
Styler is implemented in Python. We use javalang for parsing and OpenNMT-py for the machine learning part. Styler is publicly available at https://github.com/KTH/styler/. The current calibration of Styler is presented in Section 4.4.1.

Evaluation design
We conducted an empirical study to evaluate Styler from different perspectives (see Section 4.1), including a comparison against three state-of-the-art code formatting systems (see Section 4.2). We first built a dataset of Checkstyle violations mined from GitHub repositories (see Section 4.3), and then we gave these violations as input to all four tools (see Section 4.4) to measure their repairability. In this section, we present the design of our study.

Research questions
Our goal is to answer the following six research questions. Understanding the cases in which Styler fails to generate a correct repair is important so that i) Styler can be further improved, ii) hard-to-repair violations are identified and, consequently, researchers might study them and develop tools specialized in repairing them, and iii) the limitations of Styler can be taken into account by developers when deciding whether or not to use Styler in their projects. To discover the most frequent cases in which Styler fails to generate a correct repair, we manually analyzed violations of the rules for which Styler does not perform well.

RQ #4 [Quality]: What is the size of the repairs generated by Styler, compared to other systems? There may be several alternative repairs that fix a given Checkstyle violation, including ones that change source code lines other than the ill-formatted line. In this research question, we compare the size of the repairs generated by Styler against the repairs generated by the other systems.

RQ #5 [Performance]: How fast is Styler for learning and predicting formatting repairs? To investigate if Styler could be applicable in practice, we measure its performance for fixing the mined Checkstyle violations. This is valuable information for those who could be interested in using Styler as a pre-commit hook in IDEs or continuous integration services.
RQ #6 [Technical analysis]: How do the two training data generation techniques of Styler contribute to its repairability? Finally, we perform a comparison between the two formatting violation injection protocols used to generate training data (see Section 3.5). This comparison is done through the LSTM models trained with the two different training sets. We investigate if, and to what extent, one of the models contributes more to the repairability of Styler. This is an important investigation from the point of view of users who might want to use Styler with only one model for performance reasons.

Systems under comparison
We selected three systems to compare Styler with: CheckStyle-IDEA (CheckStyle-IDEA, 2021), Naturalize (Allamanis et al., 2014), and CodeBuff (Parr and Vinju, 2016). CheckStyle-IDEA, also referred to as CS-IDEA in this paper, is an IDE-based code formatter plugin for the IntelliJ IDE. It provides IDE-integrated feedback against a given Checkstyle ruleset and fixes Checkstyle violations through the IntelliJ formatter, taking a Checkstyle ruleset into consideration. Naturalize is a tool dedicated to assisting developers in fixing coding conventions related to naming and formatting in Java programs. It learns coding conventions from a codebase and, based on an n-gram model, suggests fixes to developers, such as formatting modifications. CodeBuff is a code formatter applicable to any programming language with an ANTLR grammar. Instead of formatting code according to ad-hoc rules for a language, CodeBuff aims to infer the formatting rules from a grammar for the language and a set of files following the same formatting conventions. For each token, a KNN model decides whether to indent it or to align it with another token, based on the abstract syntax tree of the source file.
All three systems are code formatters. CheckStyle-IDEA takes a Checkstyle ruleset into consideration, while Naturalize and CodeBuff are state-of-the-art machine learning formatters that aim to help developers fix formatting-related issues without any prior or ad-hoc formatting rules.

Data collection
To execute Styler and the systems under comparison and, consequently, answer our research questions, we created a dataset of Checkstyle formatting violations by mining open-source projects. The first step was to build a list of projects, which was done based on the data previously collected for the study presented in Section 2.3. We selected all the projects that have exactly one Checkstyle ruleset file and use Maven. This resulted in 2,143 projects.
For each project, we tried to reproduce Checkstyle violations with the following automated lightweight approach. First, the remote repository of the project is cloned from GitHub. Then, a sanity check is performed on the checkstyle.xml file contained in the project. If the file contains variables, the project is discarded. Otherwise, the history of the project is searched for the last commit (c_i) that modifies the checkstyle.xml file, which is used as the starting point for the reproduction of real violations. Then, c_i is checked out, and all the files of the project are submitted to a process that checks whether our automated approach can successfully execute Checkstyle on the project, and with which version of Checkstyle. The latter is necessary because new versions of Checkstyle might break backward compatibility and thus fail to parse a checkstyle.xml file that was used with previous versions of Checkstyle. This process consists of executing multiple Checkstyle versions on the project, from newer to older, until finding one version that does not fail or until no versions remain to try. If a successful Checkstyle execution is found, the last tested Checkstyle version, x, is chosen to be used on the project. All commits since c_i (inclusive) are then gathered, so that all commits to be analyzed are based on the same Checkstyle ruleset.
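The version-fallback step can be sketched like this (hypothetical helper names; `run_checkstyle` stands in for actually invoking a given Checkstyle version on the project):

```python
def find_working_version(versions_newest_first, run_checkstyle):
    """Try Checkstyle versions from newest to oldest until one parses
    the project's ruleset and runs successfully; return that version,
    or None if every available version fails."""
    for version in versions_newest_first:
        if run_checkstyle(version):
            return version
    return None
```

A project for which this returns `None` is discarded from the dataset, since no available Checkstyle version can process its ruleset.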
Then, each selected commit is checked out, and a sanity check is performed on the pom.xml file of the commit. If it points to a suppression file, the commit is discarded, because we want violations that happened in practice and our lightweight approach does not resolve paths. Otherwise, Checkstyle version x is executed on the files of the project. If the commit under analysis is the first one to be analyzed, Checkstyle is executed on all the files of the project. Otherwise, Checkstyle is executed only on the files changed in the commit, to avoid duplicate violations in the dataset. Moreover, Java files in folders named test or resources are ignored, since we want violations that happened in the main source code. Then, after executing Checkstyle, if at least one Checkstyle violation is raised, the violating Java files and information about the violations, e.g., Checkstyle violation types and locations, are saved.
This process was executed for all 2,143 projects in our list. At the end of the process, we removed any Java files duplicated across commits, comparing by file content. Then, we selected the files containing a single, formatting-related Checkstyle violation. We performed this selection to accurately evaluate the repairs produced by Styler and the other tools. Finally, we kept only the projects for which these criteria yielded at least 20 Checkstyle formatting violations. By applying this systematic reproduction and selection process, we obtained a dataset containing 27,058 Checkstyle violations spread over 105 projects. We used one project, and the 267 violations found in it, to calibrate Styler (see Section 4.4.1), and the other 104 projects with 26,791 violations for the actual evaluation.

Setup and execution of the systems
We gave the dataset of violations as input to Styler and the three systems under comparison to evaluate their repairability. In this section, we present the setup of the systems, which includes the calibration of Styler, the adaptations performed in Naturalize and CodeBuff, and how the four systems were executed.

Styler calibration
To calibrate Styler, i.e., the Styler_random- and Styler_3grams-based models, we performed an exploratory study by training LSTM models with different configurations. The configurations combine values for key parameters: the model attention type (general or mlp), the number of layers (one or two) and units (256 or 512) for the model encoder/decoder, and the word embedding size (256 or 512). For each configuration, the training was performed for a maximum of 20k iterations with a batch size of 32, and a model was saved at iterations 5k, 10k, 15k, and 20k. This means that, in the end, we obtained 64 models (2 attention types × 2 numbers of layers × 2 numbers of units × 2 embedding sizes × 4 training iterations) per training data generation protocol (i.e., Styler_random and Styler_3grams). These models were created for one open-source project contained in our dataset (see Section 4.3), which was randomly selected from the top-5 projects with the most diverse set of violated formatting rules. The project was given as input to Styler, which produced training data by injecting Checkstyle violations into violation-free files found in the project (see Section 3.5). For each protocol, 10k violations were injected. This data was used to train the LSTM models, with 9k violations used for training and 1k for validation. Once the 64 models per protocol were created, we executed Styler with each of them on the real violations found in the project, so that we could test the models and choose the best configurations. Then, for each protocol, we picked the configuration of the model that repaired violations in the most balanced way in terms of Checkstyle rules.
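The calibration grid can be enumerated as follows (a sketch; the variable names are ours, not OpenNMT-py's actual option names):

```python
from itertools import product

# The exploratory grid described above: 2x2x2x2 hyperparameter
# configurations, each saved at four checkpoints, giving 64 models
# per training data generation protocol.
attention_types = ["general", "mlp"]
num_layers = [1, 2]
num_units = [256, 512]
embed_sizes = [256, 512]
checkpoints = [5000, 10000, 15000, 20000]

configs = list(product(attention_types, num_layers, num_units,
                       embed_sizes, checkpoints))
assert len(configs) == 64  # 2 * 2 * 2 * 2 * 4
```

Each tuple in `configs` identifies one saved model to evaluate on the calibration project's real violations.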
The best Styler_random-based model used the mlp attention type, one layer, 256 units, an embedding size of 512, and 5k training iterations; the best Styler_3grams-based model used the same number of layers, embedding size, and training iterations, but the general attention type and 512 units. These are the configurations we used for training the models in the experiments presented in Section 5.
For prediction, the beam search creates x = 5 potential repairs per model. As for the violation encoding (see Section 3.6), we set k = 6. Recall that this parameter defines the token window before and after the violation (i.e., the context surrounding the violation). It must be large enough to contain important information and, at the same time, small enough to still allow learning and prediction; it was set based on meta-optimization.
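The context window controlled by k can be sketched as follows (a simplified illustration of the windowing only, ignoring Styler's actual token encoding):

```python
def violation_window(tokens, violation_index, k=6):
    """Return the k tokens before and after the violating token
    (plus the token itself), clamped at the file boundaries."""
    start = max(0, violation_index - k)
    end = min(len(tokens), violation_index + k + 1)
    return tokens[start:end]
```

With k = 6, a violation far from the file boundaries yields a window of 13 tokens; near a boundary, the window is simply truncated.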

Naturalize and CodeBuff adaptation
To use Naturalize, we had to slightly modify it. Naturalize recommends multiple fixes, so we take the first one as the repair for a given violation. In addition, we changed Naturalize to only work for indentation, excluding fixes regarding naming conventions (which are out of the scope of this paper). To run CodeBuff, we give it the required configuration, including the number of spaces for indentation. This value depends on the project given as input to CodeBuff. Thus, before running CodeBuff on a project, we count the most frequent indentation size found in the violation-free files of the project and provide it to CodeBuff.
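The indentation-size heuristic used to configure CodeBuff can be sketched as follows (our reconstruction, not the study's actual script; the fallback default of 4 is an assumption):

```python
from collections import Counter

def dominant_indent(lines):
    """Guess a project's indentation size as the most frequent positive
    difference in leading spaces between consecutive non-empty lines."""
    widths = [len(l) - len(l.lstrip(" ")) for l in lines if l.strip()]
    steps = Counter(b - a for a, b in zip(widths, widths[1:]) if b > a)
    return steps.most_common(1)[0][0] if steps else 4  # assumed default
```

In practice this would be run over all violation-free files of a project and the counts aggregated before picking the most frequent step.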

Execution of the systems
The four systems were executed to repair the 26,791 violations found in the 104 projects contained in the real violation dataset. The machine-learning-based systems (Styler, Naturalize, and CodeBuff) require a corpus of violation-free files for training. Therefore, for each subject project, we selected, as training seeds, all violation-free Java files from the first commit, or any subsequent one, that uses the same Checkstyle ruleset used to collect the real violations. To ensure consistency in our experiment, all three machine-learning-based systems were trained to repair a given project using the same corpus of violation-free files from the project.
Styler requires other input for training. Recall that its training process includes a step for creating the actual training data (see Figure 3-step A), which is based on the corpus of violation-free files. For each protocol, we set Algorithm 1 to create 10,000 files per project, with a maximum duration of three hours. The resulting files with violations were split for learning and validation in a balanced way according to the violation types, considering 90% for learning and 10% for validation.
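The balanced 90/10 split described above can be sketched as follows (a hypothetical helper, assuming the generated samples are (violation_type, file) pairs):

```python
import random
from collections import defaultdict

def split_by_type(samples, train_frac=0.9, seed=0):
    """Split samples into training and validation sets, balanced per
    violation type: each type contributes train_frac of its instances
    to training and the rest to validation."""
    by_type = defaultdict(list)
    for vtype, item in samples:
        by_type[vtype].append(item)
    rng = random.Random(seed)
    train, valid = [], []
    for vtype, items in by_type.items():
        rng.shuffle(items)
        cut = int(len(items) * train_frac)
        train += [(vtype, i) for i in items[:cut]]
        valid += [(vtype, i) for i in items[cut:]]
    return train, valid
```

Splitting per type, rather than globally, prevents rare violation types from ending up entirely in one of the two sets.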
Finally, to run CheckStyle-IDEA on each subject project, we first loaded the violating Java files and the checkstyle.xml file contained in the project in IntelliJ. Then, we imported the Checkstyle ruleset (Settings > Editor > Code Style > Import Scheme > Checkstyle Configuration) and simply called the function "Reformat Code" from the IDE.

Evaluation results
We present and discuss the results for our six research questions in this section.

Overall repairability (RQ #1)
To investigate the overall repairability of Styler and the other three systems on the 26,791 Checkstyle violations, we categorized the repair attempts per status, as shown in Table 4. There are two groups of statuses: repaired and not repaired. The repaired violations are either fully repaired, i.e., no violation is raised after the repair attempt, or partially repaired, i.e., the violation no longer exists in the source code but new violations were introduced. For the sake of clarity, it is worth mentioning that only the full repairs are used for the other five research questions. The group of violations that were not repaired includes the cases where the resulting source code file still contains only the same violation, or the same violation plus new ones, or is broken, meaning that the file cannot be parsed by javalang after the repair attempt.
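The status classification can be sketched as follows (a reconstruction of Table 4's categories; the function and status names are our assumptions):

```python
def repair_status(before, after, parseable=True):
    """Classify a repair attempt from the sets of violations reported
    by the checker before and after the attempt."""
    if not parseable:
        return "broken"
    if not after:
        return "fully repaired"
    if before.isdisjoint(after):
        return "partially repaired"  # original gone, new ones introduced
    if after == before:
        return "same violation"
    return "same + new violations"
```

Each violation is represented here as a hashable (location, rule) pair so that set comparison captures "same violation still present".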
Styler fully repaired 41% of the violations, while CS-IDEA repaired 50%, which is the greatest overall repairability among the four considered tools. Naturalize and CodeBuff repaired fewer violations (15% and 20%, respectively). To check if there is a significant difference between Styler and the other tools regarding the full repairs, we used the McNemar test. Table 5 shows the contingency tables given as input to the test. We found p-value < 0.00001 for all three tests. Considering α = 0.05, this means that Styler and each of the other tools have statistically significantly different proportions of errors on our dataset of violations. Note that the p-values were not adjusted, since they are too small for the adjustment to have any impact.
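The McNemar statistic itself is easy to compute from the discordant cells of a 2x2 contingency table (a sketch using the standard continuity correction; the actual tests were presumably run with a statistics package):

```python
def mcnemar_statistic(b, c):
    """McNemar chi-squared statistic with continuity correction.
    b: violations only the first tool repaired;
    c: violations only the second tool repaired (the discordant cells).
    A large statistic means the two tools' error proportions differ."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

The statistic is then compared against the chi-squared distribution with one degree of freedom to obtain the p-value.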
Considering the numbers presented in Table 4 other than the proportions of fully-repaired violations, we noticed that CS-IDEA and Styler are the most reliable tools in the sense of delivering to an end-user either a repaired source code or, in the worst-case scenario, the code with the same violation. It is not the same case of Naturalize and CodeBuff, which had higher rates of delivering broken source code. They were, however, designed for a different goal and do not take into account the Checkstyle ruleset of the project like Styler and CS-IDEA do. Yet, they are relevant for our experiment since they are the state of the art of machine-learning-based code formatters. Our results show the need for specialized, focused tools to repair Checkstyle violations.
In addition, we observed that some violation types, i.e., violations of different Checkstyle rules, occur much more frequently than others in our dataset (see Table 3). This might bias the results presented in Table 4. Because of that, we normalized the data by sub-sampling the most frequent violation types. In this way, we obtained a sub-sample of violations that contains the same number of instances for all violation types. We ignored the less frequent ones to avoid using too few instances. For that, we calculated the median of the distribution of the violation types, which is 274, and used it as the minimum number of instances for including Checkstyle rules in the analysis. Then, we randomly selected 274 violations of each included rule. In the end, the analysis comprises half of the rules (13) and 3,562 violations. The normalized results are presented in Table 6. All the tools are impacted positively in terms of fully-repaired violations. However, the normalized results present a different ranking of the tools' performance, where Styler outperforms CS-IDEA. CS-IDEA is the tool most negatively impacted by the normalization, because its full-repair rate increases by only 4%, while the other three tools considerably increase their repairability, by 7%-20%. This suggests that CS-IDEA performs better than the other tools on violation types that are frequent in our dataset, which is investigated in more detail in the next section, for answering RQ #2. Finally, we also performed the McNemar test on the normalized results, as shown in Table 7. Considering α = 0.05, the results show that Styler and each of the other tools have statistically significantly different proportions of errors on the sub-sample too.

Styler repaired 41% (11,008/26,791) of the Checkstyle formatting violations found in the wild. Styler outperformed the machine learning systems Naturalize and CodeBuff.
CS-IDEA outperformed Styler on our entire dataset of violations, with a repairability of 50%. However, this is not the case when we consider a sub-sample of the dataset by normalizing the number of instances according to violation types, which suggests that CS-IDEA performed better than the other tools on violations of frequent types in our dataset. In addition, note that CS-IDEA depends on the code formatter of the IntelliJ IDE, whereas Styler's approach is fully automated and hence more appropriate for handling new and configurable rules.
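The sub-sampling normalization used in this analysis can be sketched as follows (a simplified reconstruction; the dictionary field name is an assumption):

```python
import random
from collections import Counter
from statistics import median

def normalize_by_type(violations, seed=0):
    """Sub-sample the dataset so that every sufficiently frequent
    violation type has the same number of instances: the median type
    frequency serves as both the inclusion threshold and the sample
    size per type."""
    counts = Counter(v["type"] for v in violations)
    n = int(median(counts.values()))
    rng = random.Random(seed)
    sample = []
    for vtype, count in counts.items():
        if count >= n:
            of_type = [v for v in violations if v["type"] == vtype]
            sample += rng.sample(of_type, n)
    return sample
```

On the paper's dataset, the median frequency is 274, so 13 rules pass the threshold and the sub-sample contains 13 × 274 = 3,562 violations.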

Violation-type-based repairability (RQ #2)
To answer RQ #2, we investigated the extent to which Styler and the other three systems repair different Checkstyle violation types, i.e., violations of different Checkstyle rules. Figure 6 shows the Checkstyle violations fully repaired by the systems per violation type in a heatmap. The color scale is from black to white, where black represents 0% of fully-repaired violations and white represents 100% (i.e., the lighter, the better).
Styler and Naturalize repaired violations of 24 of the 25 Checkstyle rules, which is the highest rule coverage among the four tools. CS-IDEA and CodeBuff fixed violations of 21 rules. Surprisingly, Naturalize produced fixes for more violation types than CS-IDEA, even though, given its different goals, it does not consider the Checkstyle ruleset of projects. CodeBuff performed relatively well considering that, like Naturalize, it does not target Checkstyle violations. These facts suggest that our idea of employing a machine learning approach for repairing format checker violations is promising.
To that extent, automation is still valuable. Moreover, Styler is able to repair these violation types without requiring engineering effort to write rule-specific repair code. Finally, we observed that all tools performed poorly on violations of the most frequent type in our dataset, i.e., RegexpSingleline.

RQ #2: To what extent does Styler repair different violation types, compared to other systems? Styler and Naturalize repaired violations of a greater diversity of Checkstyle rules (24/25) than the other tools (CS-IDEA and CodeBuff: 21). Styler performed well at fixing violations related to horizontal whitespace between Java tokens. For some rules, Styler fixed all violations while CS-IDEA did not fix any; for other rules, CS-IDEA had a much higher repairability than Styler, suggesting that the two tools can be considered complementary in practice. Finally, we confirmed that CS-IDEA outperformed the other tools on the five most frequent violation types in our dataset.

Unsuccessful repair cases (RQ #3)
Styler repaired violations of 24/25 Checkstyle rules, but it did not perform well for some rules, as shown in RQ #2. To understand the cases in which Styler does not successfully generate repairs, we manually analyzed violations of the Checkstyle rules for which Styler repaired less than 50% of the violations. The analysis was ad-hoc: for each rule, both repaired and non-repaired violations were investigated so that patterns in the non-repaired violations or their contexts could be identified. We present the unsuccessful repair cases we found below.
Styler encodes a violation according to the source code position returned by the format checker, in this case Checkstyle. In some cases, however, this position is not where a fix should be applied. For instance, for a violation of the type OneStatementPerLine, a line break should be added at column 19 or 20 of line 42, just after the first statement; Checkstyle, however, returns column 31, which is the end of the second statement. In such a case, Styler tried to repair the violation at an inappropriate location.
Several non-repaired violations were inside comments. For instance, we found comment lines exceeding the maximum number of characters, therefore triggering violations of the type LineLength. We also found tab characters inside comments, triggering FileTabCharacter violations. Such violations also occur inside strings. Styler does not handle cases in which comments or strings should be modified. This is a limitation of Styler due to its tokenization: comments and strings are tokenized as a single token, i.e., Styler does not take into account the separation of words within them.
Moreover, we found several occurrences of a case in which Styler repaired a given violation but then another existing one, which was not previously reported by Checkstyle, was triggered. This case of only one violation being reported when multiple exist in a file is recurrent and happens with FileTabCharacter violations. We were not aware of that when we built the real violation dataset, but when a file contains more than one tab character, Checkstyle reports only the first instance. In some cases, Styler repaired the first instance, but the next one was then raised by Checkstyle. Even though Styler repaired the originally reported violation, it was not counted as repaired in our study. Note that one criterion for selecting files containing violations when building the dataset was the existence of a single violation in them (see Section 4.3). We made this decision to guarantee that we could automatically check whether a given violation was fixed. In this case with FileTabCharacter violations, however, we could not check that precisely.
Finally, we observed that RegexpSingleline violations are the most frequent ones in our dataset and are poorly handled not only by Styler but by all tools (see the last column of Figure 6). When analyzing the violations related to this rule and the other regex rules, we found out that many of them are not related to formatting. Examples are violations related to missing, wrong, or duplicated license headers and to the usage of specific patterns, such as a tag in Javadoc, that are forbidden in some projects. Since these violations are not about formatting, they are not in the targeted violation types of Styler and the other tools. However, these violations are very frequent in our dataset and, consequently, the repairability of the tools for such regex rules is impacted. For instance, our dataset contains 8,678 RegexpSingleline violations, and 8,102 (93%) of them are non-formatting violations. The overall repairability results for that rule, as presented in Figure 6, are 3.2% for Styler, 5.8% for CS-IDEA, 1.4% for Naturalize, and 2.6% for CodeBuff. Adjusting the repairability results by considering only the 576 RegexpSingleline violations that are about formatting, we found out that Styler, CS-IDEA, Naturalize, and CodeBuff repaired 45.1%, 86.8%, 21%, and 38.9% of the RegexpSingleline violations, respectively.

RQ #3: What are the cases in which Styler fails to generate a correct repair? Styler does not generate a correct repair when Checkstyle returns a source code position other than the one to be modified and when the violation is inside comments or strings. Moreover, Styler produces repairs that make Checkstyle raise non-originally-reported violations.
These cases in which Styler failed to generate a correct repair relate to some limitations of Styler, which can be further addressed in new studies. Finally, we found out that most of the violations of regex rules are not about formatting, which explains why the repairability of the tools is low for these rules.

Size of the repairs (RQ #4)
One aspect of repair quality is the size of the diff between the source code with a formatting violation and the repaired source code. There might be different repairs for the same violation that pass all Checkstyle rules, but the one with the smaller diff size would be preferable for being the least disrupting for the developers. In the context of a pull request on GitHub, a smaller diff is usually considered as easier to review and merge (Dias, 2020).
To answer RQ #4, we calculated the diff size, in number of lines, of the repairs generated by Styler, CS-IDEA, Naturalize, and CodeBuff. Figure 7 shows the distributions of diff size per tool. We observed that the distributions of the repairs generated by Styler and Naturalize have the smallest medians, equal to one and three changed lines, respectively. Yet, they suffer from a few bad cases (the right-hand part of the distributions), mainly Naturalize. CS-IDEA and CodeBuff produced larger repairs, with medians equal to nine and 42, respectively. In the worst cases, they produced several repairs with more than 200 changed lines, which can be seen from the fact that their 95th percentiles are not shown in Figure 7. The 95th percentile of Styler, on the other hand, is three. We performed the Wilcoxon rank-sum test to verify whether the distribution of diff sizes obtained by Styler is significantly different from those of the other tools. We found p-value < 0.00001 when testing Styler against each of the other tools. Considering α = 0.05, we rejected the null hypothesis, which means that the distribution of Styler is significantly different from the others.
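The diff-size metric can be computed with a unified diff (a sketch of the metric, not the exact script used in the study):

```python
import difflib

def diff_size(original, repaired):
    """Number of changed lines between the original and the repaired
    file, counted as added/removed lines in a unified diff (header
    lines '---' and '+++' are excluded)."""
    diff = difflib.unified_diff(original, repaired, lineterm="")
    return sum(1 for l in diff
               if (l.startswith("+") or l.startswith("-"))
               and not l.startswith(("+++", "---")))
```

By this metric, a single reformatted line counts as two changed lines (one removed, one added), so the reported medians are conservative.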
RQ #4: What is the size of the repairs generated by Styler, compared to other systems? The size of the repairs produced by Styler is usually small. Styler had the smallest median repair size of one changed line, followed by Naturalize, with a median size of three lines. CS-IDEA and CodeBuff produced larger repairs. The ability to produce small diffs is an important property for code review and pull-request-based development, hence our results show that Styler can be realistically used in a modern software development context.

Performance (RQ #5)
To investigate if Styler can be used in practice, we measured the time Styler spent on the real violation dataset. Table 8 shows the minimum, median, average, and maximum time spent on the 104 projects, split over the different steps of the Styler workflow. For training data generation, Styler took at least 15 minutes and up to six hours, which is the maximum execution time allowed by our experimental setup (see Section 4.4.3). The median time for training data generation was 45 minutes. Tokenizing the training data took around two minutes on average, and training the models took at most 14 minutes. The entire training process of Styler (data generation + tokenization + model training) took around one and a half hours on average. This is acceptable, since training is only meant to happen when the coding conventions used in a project change (i.e., the Checkstyle ruleset file). Once Styler is trained for a given project, it takes on average two seconds to predict a repair, which is fast enough to be used in IDEs or in continuous integration environments.

RQ #6: How do the two training data generation techniques of Styler contribute to its repairability?
The LSTM model based on the Styler_3grams protocol is more effective than the model based on the Styler_random protocol. Even though the latter exclusively contributes 9% of the fixes to the overall repairability of Styler, in a real-world scenario one could use only the former to reduce the time for training Styler.

Discussion
We discuss, in this section, machine learning versus rule-based approaches, as well as the threats to the validity of our study.

Machine learning versus rule-based approaches
Styler employs a machine-learning-based approach for repairing formatting convention violations. An alternative would be a rule-based approach, in which there would be at least one code transformation per format checker rule. However, engineering a transformation for every single linter rule is time-consuming. Beyond being costly, this might even be impractical for highly configurable linters such as Checkstyle, because the rule-based repair system would need different transformations for the same linter rule due to its configurable properties. In contrast, a machine learning approach does not require costly human engineering: it is able to infer transformations for a diverse set of linter rules. Our experiments have validated this property in the context of formatting violations raised by Checkstyle. However, since our approach is far from perfect and does not work well for certain rules, one avenue for future research is the development of a rule-based system for simple, non-highly-configurable rules, complementary to Styler.

Threats to validity
Styler generates training data for learning how to repair violations based on the Checkstyle ruleset file contained in a given project. This means that Styler assumes that all formatting rules contained in such a file are valid. In practice, however, developers might ignore the violations of certain rules. Our experiment does not take this scenario into account, thus we do not claim that all the fixes produced by Styler are necessarily relevant for developers. The real violation dataset contains Checkstyle violations mined from GitHub repositories. It is to be noted that it does not cover all existing Checkstyle formatting rules. Moreover, the dataset might not be representative of the real distribution of the 25 rules in the real world. Consequently, future research is needed to strengthen the validity of our study.
At the end of the violation collection process for building the dataset, we removed duplicate Java files according to file contents. However, there might still exist duplicate violations in our dataset. A file containing a violation might have changed, but that change might not be related to the line where the violation exists. Therefore, the same violation would exist in both versions of the file, but since the overall file contents are different, they are both kept in our dataset, which might create noise in it.
Another threat related to the creation of the dataset is that, when selecting violating files, we chose only the ones containing a single Checkstyle violation. We performed this selection so that we could accurately check if the violation was correctly repaired by the tools. Files containing more than one violation would make it hard to automatically check the correctness of repairs because once a violation is repaired, the location of the other ones in the file could be different. Therefore, our results are based on single-violation files, and future investigations on multiple-violation files are needed.
Finally, to compare the quality of the repairs produced by Styler with those produced by the other three tools, we measured the size, in lines, of the diff between the original and repaired program versions. However, diff size is only one dimension for comparing the tools, and it only approximates the developer's perception of formatting repairs. User studies, such as proposing formatting repairs to developers, are interesting future experiments to further investigate the practical value of this research.
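A line-level diff-size metric of this kind can be computed with a standard unified diff, counting added and removed lines (a sketch; the exact diff tool and counting convention used in the evaluation may differ):

```python
import difflib

def diff_size(original: str, repaired: str) -> int:
    """Number of added/removed lines in a unified diff between the
    original and repaired versions of a file."""
    diff = difflib.unified_diff(original.splitlines(),
                                repaired.splitlines(), lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

before = "int x=1;\nint y = 2;\n"
after = "int x = 1;\nint y = 2;\n"
print(diff_size(before, after))  # 2: one line removed, one line added
```

Under this convention, a one-line formatting fix has size 2 (the old line plus its replacement), so smaller values indicate less intrusive repairs.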

Related work
Styler aims to repair formatting violations raised by linters. Linters are a kind of automated static analysis tool. In this section, we first present works on the usage of static analysis tools. Then, we present systems that share similar goals with Styler, namely systems that target linter violation repair and systems that fix source code formatting. Finally, we present works related to Styler in terms of the technique used, i.e., machine learning, for repairing compiler errors and behavioral bugs.

The usage of automated static analysis tools
Static analysis tools have been the subject of investigation in recent research. Zampetti et al. (2017) investigated their usage in 20 popular Java open-source projects hosted on GitHub that use Travis CI to support CI activities. They first found out that the projects use seven static analysis tools (Checkstyle, FindBugs (Ayewah et al., 2008), PMD, License Gradle Plugin, Apache Rat, Clirr, and jDepend), with Checkstyle being the most used one. Regarding the integration of static analysis tools in CI pipelines, they found out that build breakages due to those tools are mainly related to adherence to coding conventions, while breakages related to likely bugs or vulnerabilities occur less frequently. Zampetti et al. (2017) discuss that some tools are sometimes configured to just produce warnings without breaking the build, possibly because of the high number of false positives. Vassallo et al. (2018) investigated the usage of static analysis tools from the perspective of the development context in which these tools are used. To that end, they surveyed 42 developers and interviewed 11 industrial experts who integrate static analysis tools in their workflow. They found out that static analysis tools are used in three main development contexts: the local environment, code review, and continuous integration. Moreover, they also found out that developers consider different warning types depending on the context, e.g., when performing code review they mainly look at style conventions and code redundancies. Marcilio et al. (2019) focused on one specific static analysis tool, SonarQube. Through an online survey with 18 developers from different organizations, they found out that most respondents agree that the issues reported by static analysis tools are relevant for improving the design and implementation of software.

Linter violation repair and code formatters
Linter violation repair. There are some tools to fix violations of rules checked by linters. Considering academic systems, there are Phoenix (Bavishi et al., 2019), which repairs violations of rules checked by FindBugs (Ayewah et al., 2008), and Getafix (Bader et al., 2019), which focuses on rules checked by Infer (Calcagno et al., 2015) and Error Prone (Aftandilian et al., 2012). These tools learn fix patterns by mining past human-written fixes for linter violations. Another tool is SpongeBugs (Marcilio et al., 2020), which repairs violations of rules checked by the two well-known static analyzers SonarJava and SpotBugs using fixed repair templates. Styler shares with these tools the goal of generating patches for linter violations. However, while the mentioned tools focus on rules related to bugs and code smells, Styler focuses on formatting. In addition, there is C-3PR (Carvalho et al., 2020), which does not generate patches itself but proposes fixes generated by linter violation repair tools through pull requests on GitHub.
Beyond those academic systems, there are other tools that repair violations found by linters. Related to formatting rules, there is, for instance, ESLint, a linter for JavaScript that also includes automated fixes for a subset of the violations it raises.
Code formatters. A way to enforce formatting conventions lies in code formatters (also known as pretty-printers). In Section 4.2, we described Naturalize (Allamanis et al., 2014) and CodeBuff (Parr and Vinju, 2016). Naturalize recommends fixes for coding conventions related to naming and formatting in Java programs, and CodeBuff infers formatting rules for any language, given its grammar. Similar to the idea behind CodeBuff, Reiss (2007) had previously experimented with different learning algorithms and feature set variations to learn the style of a given corpus so that it could be applied to arbitrary code. More recently, Markovtsev et al. (2019) presented Style-Analyzer, which helps developers to fix code formatting during code reviews. Style-Analyzer mines the formatting style of the git repository under analysis and expresses the found format patterns as compact human-readable rules. Then, it suggests style inconsistency fixes in the form of code review comments.
Beyond those academic systems, there are code formatters such as google-java-format, which reformats source code according to the Google Java Style Guide. However, these formatters are usually not configurable or require manual tweaking, which is a tedious process for developers. This is a problem because not all developers are ready to follow a single style convention. Styler, on the other hand, is generic and automatically captures the conventions used in a project to fix formatting violations.
Finally, there is the CheckStyle-IDEA plugin for IntelliJ (CheckStyle-IDEA, 2021), with which we compared Styler. CheckStyle-IDEA provides both real-time and on-demand scanning of Java files with Checkstyle from within the IDE. It also uses the Checkstyle ruleset of projects to configure the formatter available in IntelliJ, making it possible to repair Checkstyle formatting violations. However, it fails to repair violations of a great number of Checkstyle rules, as shown in RQ #2, and creates large repairs, as shown in RQ #4.

Learning for repairing compiler errors and behavioral bugs
Learning for repairing compiler errors. There are related works in the area of automatic repair of compiler errors. In this case, the compiler syntax rules are the equivalent of the formatting rules. There, recurrent neural networks and token abstraction have been used to fix syntactic errors (Bhatia et al., 2018). In DeepFix, Gupta et al. (2017) use a language model for repairing syntactic compilation errors in C programs. Out of 6,971 erroneous C programs, DeepFix was able to completely repair 27% and partially repair 19% of the programs. Later, Ahmed et al. (2018) proposed TRACER, which outperformed DeepFix, repairing 44% of the programs. Subsequent work confirmed the effectiveness of LSTMs over n-grams and of token abstraction for single-token compilation errors. These approaches do not target formatting violations, which are the target of Styler.
Learning for repairing behavioral bugs. As with compiler errors, there are also learning-based systems for repairing behavioral bugs, i.e., those that, for instance, break test cases. Tufano et al. (2018) investigated the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. They mined millions of buggy and patched program versions from the history of GitHub repositories and abstracted them to train an Encoder-Decoder model. The model was able to fix hundreds of unique buggy methods in the wild. Chen et al. (2019) proposed SequenceR, a program repair tool based on sequence-to-sequence learning focused on one-line fixes. In an experiment with Defects4J (Just et al., 2014), SequenceR was shown to be able to learn how to repair behavioral bugs by generating patches that pass all tests. Styler and SequenceR share the same encoding idea for formatting violations and bugs, respectively.

Conclusion
In this paper, we presented Styler, which implements a novel approach to repair formatting violations raised by format checkers. Styler creates a corpus of violations, learns from it, and predicts fixes for new violations using machine learning. Currently, its implementation supports Checkstyle, a popular linter for Java programs. Our experimental results on 26,791 real Checkstyle violations showed that Styler repairs real violations from a diverse set of Checkstyle rules and performs better at fixing violations related to horizontal whitespace between Java tokens than at fixing violations related to tabulations and line length. Moreover, Styler produces smaller repairs than the compared systems, and its prediction time is low, which suggests that it can be used in development environments such as IDEs. Finally, we identified cases in which Styler fails to generate correct repairs, e.g., for Checkstyle violations inside comments or strings. These findings can guide improvements in Styler and help researchers and developers to understand Styler's limitations.
There are several interesting avenues for future research. First, improvements to the violation injection protocols for creating training data could improve the representativeness of seeded formatting violations, which might increase the repairability of Styler. Second, user studies can be conducted, where repairs predicted by Styler are proposed to developers through, for instance, pull requests on GitHub. This type of study would bring practical insights into the potential of Styler. Third, Styler could be integrated into development environments, such as IDEs and social coding sites, to support the mentioned user studies and possibly to let developers use Styler directly. Fourth, other linters could be plugged into Styler so that it can be applied to projects that use linters other than Checkstyle. Fifth, since Styler does not work well for certain rules, a rule-based system for simple, non-highly-configurable rules could be developed to complement Styler. Finally, the overall idea behind Styler could be tried out to repair other linter violations beyond purely formatting ones.