A preliminary study on the adequacy of static analysis warnings with respect to code smell prediction

Code smells are poor implementation choices applied during software evolution that can affect source code maintainability. While several heuristic-based approaches have been proposed in the past, machine learning solutions have recently gained attention since they may potentially address some limitations of state-of-the-art approaches. Unfortunately, however, machine learning-based code smell detectors still suffer from low accuracy. In this paper, we aim at advancing the knowledge in the field by investigating the role of static analysis warnings as features of machine learning models for the detection of three code smell types. We first verify the potential contribution given by these features. Then, we build code smell prediction models exploiting the most relevant features coming from the first analysis. The main finding of the study is that the warnings given by the considered tools drastically increase the performance of code smell prediction models with respect to what has been reported by previous research in the field.


INTRODUCTION
During software maintenance and evolution, developers continuously modify source code to fix defects, enhance existing functionalities, or adapt the system to new environments [22]. In such a context, the need to deliver the system in a timely manner often leads developers to set aside good design and implementation solutions and apply modifications that potentially cause the introduction of the so-called technical debt [8]: this is a metaphor introduced to explain, in more practical terms, the compromise between delivering fast and producing high-quality code. One of the most relevant forms of technical debt is represented by code smells [16], i.e., symptoms of the presence of sub-optimal implementation solutions. Complex classes or overly long methods are just two examples of code smells that often arise in practice [31]. Previous research has shown that code smells hinder program comprehensibility [1], increase source code change- and fault-proneness [19,31], and increase maintenance effort [38]. These reasons have inspired the research effort around the definition of automatic solutions to detect code smells in source code [9]. While a number of heuristic-based techniques, relying on different types of software metrics, have been devised (e.g., [28,30,33]), a recent trend is represented by the use of machine learning approaches [4]. In particular, machine learning has the potential to address some common limitations of heuristic-based approaches: (1) the subjectivity with which their output is interpreted by developers [14,29], (2) the need to define thresholds for the detection [15], and (3) the low agreement among them [13]. Indeed, machine learning may be exploited to combine multiple metrics, learning which code smell instances developers consider relevant without the specification of any threshold [4].
Despite this potential, however, machine learning models for code smell detection still have poor performance [11], especially due to (1) the limited contribution given by the features investigated so far [35] and (2) the limited amount of code smell instances available to appropriately train a machine learner [34].
In this paper, we focus on the first problem: we aim at advancing the state of the art in machine learning for code smell detection by focusing on the contribution that the warnings of automated static analysis tools give to the classification capabilities. The motivation behind our study is twofold. On the one hand, static analysis tools provide indications about the quality of source code [44], hence being potentially useful to characterize code smell instances. On the other hand, their usage in practice is threatened by the high number of false positives they output [18]: new instruments able to incorporate static analysis warnings within smarter solutions may therefore represent an interesting use case to make static analysis tools more useful in practice.
Driven by these motivations, we first investigate the potential contribution given by the individual types of warnings output by three static analysis tools, i.e., Checkstyle, FindBugs, and PMD, to the prediction of three code smell types, i.e., God Class, Spaghetti Code, and Complex Class. Then, we use the most relevant features coming from the first analysis to build and assess the capabilities of machine learning models when detecting the three considered smells.
The study highlights promising results: models built using the warnings of individual static analysis tools score between 55% and 91% in terms of F-Measure. The warning types that contribute the most to the performance of the learners depend on the specific code smell considered.

RESEARCH METHODOLOGY
In our study, we defined the following research questions (RQs):

RQ1. Which warning categories contribute the most to the prediction of code smells?

RQ2. To what extent can static analysis warnings output by different tools predict the presence of code smells?
More specifically, RQ1 represents a preliminary research question in which we aim at quantifying whether and to what extent each warning category of the considered tools is relevant for the task of code smell prediction. In RQ2, instead, we assess the actual capabilities of a machine learner built using the relevant features coming from the previous research question when predicting the presence of code smells in source code; to this aim, we create individual models, i.e., one for each static analyzer considered. It is important to point out that we aim at providing preliminary insights into the adequacy of static analysis warnings for code smell detection: a larger-scale analysis is part of our future research agenda.

Selected Tools.
To detect static analysis warnings, we selected three tools, namely Checkstyle, FindBugs, and PMD. The selection of these tools is driven by recent findings showing that they are among the static analysis tools most employed in practice by developers [24,42,43].

Checkstyle. Checkstyle is an open-source developer tool that evaluates Java code against a given coding standard, which is configured through a set of "checks". These checks are classified under 14 different categories, are configured according to the coding standard preference, and are grouped under two severity levels: error and warning. More information regarding the standard checks can be found on the Checkstyle web site.

FindBugs. FindBugs is another commonly used static analysis tool for evaluating Java code, more precisely Java bytecode. The analysis is based on detecting "bug patterns", which arise for various reasons. Such bugs are classified under 9 different categories, and the severity of each issue is ranked from 1 to 20: ranks 1-4 form the "scariest" group, ranks 5-9 the "scary" group, ranks 10-14 the "troubling" group, and ranks 15-20 the "of concern" group.

PMD. PMD is an open-source tool that provides different standard rule sets for major languages, which can be customized by the users, if necessary. PMD categorizes the rules according to five priority levels (from P1 "Change absolutely required" to P5 "Change highly optional"). Rule priority guidelines for default and custom-made rules can be found in the PMD project documentation.

Selected Code Smells. The study considers three class-level code smell types:

• God Class. This smell generally appears when a class is large, poorly cohesive, and has a number of dependencies with other data classes of the system [16].

• Spaghetti Code. Instances of this code smell arise when a class does not properly use Object-Oriented programming principles (e.g., inheritance), declares at least one long method with no parameters, and uses instance variables [5].

• Complex Class. As the name suggests, instances of this smell affect classes that have high cyclomatic complexity [27] and that, therefore, may make the testing of those classes harder [16].

The selection of these smells was driven by two main observations. Firstly, previous studies have connected them to an increase of change- and fault-proneness of source code [6,19,31] as well as maintenance effort [38]. Secondly, these smells are highly relevant for developers who, indeed, often recognize them as harmful for the evolvability of software projects [29,39,46,47].

Data Collection
The data collection phase aimed at gathering information related to the independent and dependent variables of our study. These concern the collection of static analysis warnings from the selected analyzers, which represent the features to be used in the machine learner, and the labeling of code smell instances, namely the identification of real code smells affecting the considered systems.

Collecting Static Analysis Tool Warnings.
This step differs depending on the static analysis tool considered, as each of them requires a different process to be executed.
Checkstyle. The JAR file for the Checkstyle analysis was downloaded directly from Checkstyle's website in order to run the analysis from the command line. The executable JAR file used in this case was checkstyle-8.30-all.jar. In addition to the JAR executable, Checkstyle offers two different types of rule sets for the analysis. For each of the rule sets, the configuration file was downloaded directly from Checkstyle's website. In order to start the analysis, the file checkstyle-8.30-all.jar and the configuration file in question were saved in the directory where all the projects resided.
FindBugs. FindBugs 3.0.1 was installed by running brew install findbugs in the command line. Once installed, the GUI was launched by running spotbugs. From the GUI, the analysis was executed through File → New Project. The main specifications were the "Classpath for analysis (jar, ear, war, zip, or directory)" and "Source directories (optional; used when browsing found bugs)", where the project directory and the project JAR file were added, respectively. Once the classpath and source directories were identified, the analysis was started by clicking Analyze in the GUI. When the analysis finished, the results were saved through File → Save as using the XML file format.
PMD. PMD 6.23.0 was downloaded from GitHub as a zip file. After unzipping, the analysis was run by specifying several parameters: project directory, export file format, rule set, and export file name. PMD offers 32 different types of rule sets for Java; all 32 rule sets were used during the configuration of the analysis.
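Once a report is produced, its warnings can be aggregated per file as a preliminary step toward feature extraction. The following Python sketch (not part of the study's tooling) parses a simplified PMD-style XML report and counts violations per ruleset for each file; the sample report and its attributes are illustrative, and real PMD reports additionally carry an XML namespace and further attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Minimal PMD-style XML report (illustrative; real reports include an XML
# namespace and more attributes on each element).
SAMPLE_REPORT = """<pmd>
  <file name="org/example/Foo.java">
    <violation beginline="10" rule="GodClass" ruleset="Design">...</violation>
    <violation beginline="42" rule="EmptyCatchBlock" ruleset="Error Prone">...</violation>
  </file>
  <file name="org/example/Bar.java">
    <violation beginline="3" rule="ExcessiveMethodLength" ruleset="Design">...</violation>
  </file>
</pmd>"""

def count_warnings_per_file(xml_text):
    """Return {file name -> Counter mapping ruleset -> number of violations}."""
    root = ET.fromstring(xml_text)
    counts = defaultdict(Counter)
    for file_elem in root.iter("file"):
        name = file_elem.get("name")
        for violation in file_elem.iter("violation"):
            counts[name][violation.get("ruleset")] += 1
    return counts

counts = count_warnings_per_file(SAMPLE_REPORT)
```

The same aggregation shape applies to the XML outputs of Checkstyle and FindBugs, with different element and attribute names.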
Using these procedures, we ran the three static analysis tools against the source code of the considered systems. At the end of the analysis, these tools extracted a total of 60,904, 4,707, and 179,020 warnings for Checkstyle, FindBugs, and PMD, respectively.

Labeling Code Smell Instances. While previous work typically labeled smelly instances using the output of automated detectors [2,20,26], recent findings showed that such a procedure could threaten the reliability of the dependent variable and, as a consequence, of the entire machine learning model [10]. Hence, in our study we preferred a different solution, namely considering manually-validated code smell instances. In particular, for all the systems considered, there exists a publicly available dataset reporting actual code smell instances [32], which has also been used in more recent studies evaluating the performance of machine learning models for code smell detection [31,34,35]. Table 2 reports the distribution of each code smell in the dataset.

Data Analysis
In this section, we report the methodological steps conducted to address our research questions.

RQ1. Contribution of Static Analysis Warnings in Code Smell Prediction.
In the first RQ, we assessed the extent to which the various warning categories of the considered static analysis tools can potentially impact the performance of a machine learning-based code smell detector. To this aim, we employed an information gain algorithm [37], and particularly the Gain Ratio Feature Evaluation technique, to establish a ranking of the features according to their importance for the predictions done by the different models. Given a set of features F = {f_1, ..., f_n} belonging to the model M, the Gain Ratio Feature Evaluation computes the difference, in terms of Shannon entropy, between the model including the feature f_i and the model that does not include f_i as an independent variable. The higher the difference obtained by a feature f_i, the higher its value for the model. The outcome of the algorithm is a ranked list, where the features providing the highest gain are placed at the top. This ranking was used to address RQ1.
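For intuition, the gain ratio of a discrete feature can be computed as its information gain normalized by its split information. The sketch below is a minimal, self-contained illustration of this idea on a toy dataset; it is not the implementation used in the study, and the feature values and labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature, normalized by its split info."""
    total = len(labels)
    by_value = defaultdict(list)
    for v, y in zip(feature_values, labels):
        by_value[v].append(y)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum((len(part) / total) * entropy(part)
                    for part in by_value.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: a discretized warning-count feature vs. smelly/clean labels.
feature = ["high", "high", "low", "low"]
labels = ["smelly", "smelly", "clean", "clean"]
ratio = gain_ratio(feature, labels)  # feature perfectly separates the classes
```

A perfectly predictive feature, as in the toy example, obtains a gain ratio of 1; a feature carrying no information about the labels obtains 0.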

RQ2. The Role of Static Analysis Warnings in Code Smell Prediction.
Once we had investigated which warning categories relate the most to the presence of code smells, in RQ2 we proceeded with the definition of machine learning models. Specifically, we defined a feature for each warning type raised by the tools, where each feature contains the number of violations of that type identified in a class. For instance, suppose that for a class C Checkstyle identifies seven violations of the warning type called "Bad Practices": the machine learner is fed with the integer value "7" for the feature "Bad Practices" computed on the class C.
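The feature construction described above can be sketched as follows; the class names and warning types are illustrative, not taken from the study's dataset.

```python
from collections import Counter

# Raw tool output: (class name, warning type) pairs, as parsed from the reports.
warnings = [
    ("ClassA", "Bad Practices"), ("ClassA", "Bad Practices"),
    ("ClassA", "Design"), ("ClassB", "Design"),
]
# One feature per warning type raised by the tool.
warning_types = ["Bad Practices", "Design", "Performance"]

def to_feature_vector(class_name, warnings, warning_types):
    """Count, for one class, the violations of each warning type."""
    counts = Counter(w for c, w in warnings if c == class_name)
    return [counts[w] for w in warning_types]

vec_a = to_feature_vector("ClassA", warnings, warning_types)
```

Each class thus yields one fixed-length integer vector, which is what the learners consume; warning types never raised for a class simply count zero.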
The dependent variable is, instead, given by the presence/absence of a certain code smell. This implies the construction of three models for each tool, i.e., for God Class, Spaghetti Code, and Complex Class, respectively. Overall, we therefore built nine models per project, one for each code smell/static analysis tool pair.
As for the supervised learning algorithm, the literature in the field still lacks a comprehensive analysis of which algorithm works better in the context of code smell detection [4]. For this reason, we experimented with multiple classifiers, such as J48, Random Forest, Naive Bayes, Support Vector Machine, and JRip. When training these algorithms, we followed the recommendations provided by previous research [4,40] to define a pipeline dealing with some common issues in machine learning modeling. In particular, we exploited the output of the Information Gain algorithm (used in the context of RQ1) to discard irrelevant features that can bias the interpretation of the models [40]: we did that by excluding the features not providing any information gain. We also configured the hyper-parameters of the considered machine learners using the MultiSearch algorithm, which implements a multi-dimensional search of the hyper-parameter space to identify the best configuration of the model based on the input data. Finally, we considered the problem of data balancing: it has been recently explored in the context of code smell prediction [34], and the reported findings showed that data balancing may or may not be useful to improve the performance of a model. Hence, before deciding whether to apply data balancing, we benchmarked (i) Class Balancer, an oversampling approach; (ii) Resample, an undersampling method; (iii) SMOTE, an approach that oversamples the minority class by adding synthetic instances; and (iv) NoBalance, namely the application of no balancing method.
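To illustrate the balancing step, the sketch below applies the simplest oversampling strategy, random duplication of minority-class instances until the classes are even; the filters benchmarked in the study implement more refined variants of this idea, and the data here is invented for the example.

```python
import random

def oversample_minority(X, y, seed=42):
    """Duplicate random minority-class instances until all classes
    have as many instances as the majority class."""
    rng = random.Random(seed)
    counts = {c: y.count(c) for c in set(y)}
    target = max(counts.values())
    X_bal, y_bal = list(X), list(y)
    for c, n in counts.items():
        pool = [x for x, label in zip(X, y) if label == c]
        for _ in range(target - n):
            X_bal.append(rng.choice(pool))
            y_bal.append(c)
    return X_bal, y_bal

# Typical code smell data: smelly instances are the rare class.
X = [[1], [2], [3], [4], [5]]
y = ["smelly", "clean", "clean", "clean", "clean"]
X_bal, y_bal = oversample_minority(X, y)
```

The resampled training set then contains as many smelly as non-smelly instances, preventing the learner from trivially favoring the majority class.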
After training the models, we proceeded with the evaluation of their performance. We applied 10-fold cross-validation: with this strategy, the dataset was divided into 10 parts respecting the proportion between smelly and non-smelly elements. Then, we trained the models ten times, each time using 9/10 of the data and retaining the remaining fold for testing purposes; in this way, we allowed each fold to be the test set exactly once. For each test fold, we evaluated the models by computing a number of performance metrics, namely precision, recall, F-Measure, AUC-ROC, and Matthews Correlation Coefficient (MCC).
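All of the listed metrics except AUC-ROC (which requires ranked prediction scores rather than hard labels) can be derived from a per-fold binary confusion matrix; a minimal sketch, with a confusion matrix invented for the example:

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F-Measure, and MCC from a binary confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # MCC is robust to class imbalance: it uses all four cells of the matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}

# Example fold: 8 smells found, 2 false alarms, 2 missed, 88 true negatives.
m = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```

In a smell-detection setting, where smelly classes are rare, MCC is a useful complement to F-Measure precisely because it also accounts for true negatives.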

RESULTS AND DISCUSSION
In the following, we discuss the results of our research questions. When analyzing the most powerful features of Checkstyle and PMD, we noticed that source code design-related features are consistently at the top of the ranked list for all the considered code smells. This is, for instance, the case of the Regexp warnings given by Checkstyle for Complex Class or the Design metrics output by PMD for Spaghetti Code. The most relevant warnings also seem to be strongly related to specific code smells: as an example, the complexity of regular expressions might strongly affect the likelihood of having a Complex Class smell; similarly, design-related issues are the aspects that most characterize Spaghetti Code. In other words, from this analysis we could delineate a relation between the most relevant features output by Checkstyle and PMD and the specific code smells considered in this paper.
A different discussion applies to FindBugs: in this case, the most powerful metrics mostly relate to Performance or Security, which cover different code issues than code smells do. As such, we expect this static analysis tool to have lower performance when applied to code smell detection.
Finally, it is worth noting that in some cases the information gain of the considered features seems to be low, e.g., the Error Prone warning category of PMD in the case of God Class. On the one hand, this may imply a low capability of the features when employed within a machine learning model. On the other hand, it may also be the case that such little information is already enough to characterize and predict the existence of code smell instances. The next section addresses this point further.
3.2 RQ2. Assessing the models built using static analysis tools alone.
Table 4 reports the performance of the models built using the warnings given by Checkstyle, FindBugs, and PMD, respectively. Due to space limitations, we only discuss the overall results obtained with the best configuration of the models, namely the one using Random Forest as classifier and Class Balancer as data balancing algorithm. The results for the other models are available in our online appendix [25]. We can immediately point out that models built using the warnings of static analysis tools drastically improve over the capabilities of code smell prediction models previously reported in the literature [11,35]. As an example, Pecorelli et al. [35] reported that models built using code metrics of the Chidamber-Kemerer suite [7] perform worse than a constant classifier that always considers an instance as non-smelly. Instead, our findings show that it is possible to achieve high classification performance by relying on different sets of metrics that cover various aspects of source code quality. While the values of F-Measure and AUC-ROC vary depending on the specific model built, they range from 55% to 91% and from 78% to 98%, respectively.
The lowest performance was obtained by the model built using the output of FindBugs to predict the presence of Spaghetti Code.

THREATS TO VALIDITY
Construct Validity. This threat concerns the relationship between theory and observation due to possible measurement errors. The selected static analysis tools are among the most reliable and most adopted by developers [42]. Nevertheless, we cannot exclude the presence of false positives or false negatives in the detected warnings; further analyses of these aspects are part of our future research agenda. As for code smells, we employed a manually-validated oracle, hence avoiding possible issues due to the presence of false positives and negatives.
Internal Validity.This threat concerns internal factors related to the study that might have affected the results.When assessing the role of static analysis tools for code smell detection, we considered three tools to increase our knowledge on the matter.Yet, we recognize that other tools might consider different, more powerful warnings that may affect the performance of the learners.Also in this case, further analyses are part of our future research agenda.
External Validity.As for the generalizability of the results, our study should be considered as a preliminary investigation on five open-source software projects with different scope and characteristics.We plan to conduct a larger scale analysis as future work.
Conclusion Validity. This threat concerns the relationship between the treatment and the outcome. We adopted different machine learning techniques to reduce the bias due to the low prediction power that a single classifier could have. We also addressed possible issues due to multi-collinearity, missing hyper-parameter configuration, and data unbalance. We recognize, however, that other statistical or machine learning techniques might have yielded similar or even better accuracy than the techniques we used.

RELATED WORK
The use of machine learning techniques for code smell detection has recently gained attention, as shown by the number of publications in recent years. A complete overview of the research done in the field is available in the survey by Azeem et al. [4].
While machine learning was originally applied to detect individual code smell types, e.g., [20,21,45], some effort has recently been made to generalize its usage. Arcelli Fontana et al. applied machine learning techniques to detect multiple code smell types [2], estimate their harmfulness [2], and compute their intensity [3], showing the potential usefulness of these techniques. Pecorelli et al. [36] investigated the adoption of machine learning to classify code smells according to their perceived criticality. Nonetheless, Di Nucci et al. [10] reported that the composition of the training data can notably influence the performance of machine learning-based code smell detection methods: in particular, this is due to the small number of actual smelly instances that can be retrieved in a software system, which does not allow a learner to properly characterize code smells [34]. In addition, the features exploited so far (e.g., the CK metrics [7]) are not able to properly describe code smells, so these techniques do not perform better than simpler constant baselines [35]. The works discussed above represent the main motivation leading to our study. Indeed, we aimed at advancing the state of the art by understanding the value of the warnings of static analysis tools as features of a machine learning-based code smell detector.
On a different note, a few works have applied machine learning techniques to analyze static analysis warnings and, particularly, to evaluate the change- and fault-proneness of SonarQube violations [12,17,41]. Tollin et al. [17] analyzed whether the warnings given by the tool are associated with classes having higher change-proneness, confirming the relation. Falessi et al. [12] analyzed 106 SonarQube violations in an industrial project: the results showed that 20% of the faults could have been prevented had these violations been removed. Lenarduzzi et al. [41] assessed the fault-proneness of SonarQube violations on 21 open-source systems, showing that violations classified as "bugs" hardly ever lead to a failure. In another work, Lenarduzzi et al. [23] showed that technical debt cannot be predicted using standard software metrics. Our work is complementary to those discussed above, since our goal is to exploit the outcome of different static analysis tools in order to improve the accuracy of code smell detection.

CONCLUSION
In this paper, we assessed the adequacy of static analysis warnings in the context of code smell prediction. We started by analyzing the contribution given by each warning type to the prediction of three code smell types. Then, we measured the performance of machine learning models that use static analysis warnings as features to identify the presence of code smells. To sum up, in this paper we provide: (1) an investigation into the role of static analysis warnings for machine learning-based code smell detection, (2) an empirical study of how static analysis warnings contribute to the accuracy of existing machine learning approaches for code smell detection, and (3) an online appendix [25] reporting all data and scripts used to conduct our study.
Our future research agenda includes a larger scale evaluation of the devised models as well as the definition of a combined model able to exploit warnings coming from different static analysis tools to improve the overall code smell identification performance.

Table 1 :
Software systems considered in the study.

Table 2 :
Descriptive statistics about the number of code smell instances.

Table 3 :
Information Gain of our independent variables for each static analysis tool.
3.1 RQ1. Investigating the contribution of warning types.

Table 3
summarizes the information gain values obtained by the metrics composing the nine models built in our study. The first thing to notice is that, depending on the code smell type, the warning types have a different weight: this practically means that a machine learner for code smell identification should exploit different features depending on the target code smell, rather than relying on a unique set of metrics to detect them all.

Table 4 :
Results reporting the performance of the models built with the warnings generated by the three static analysis tools.
Based on the results obtained in RQ1, we can reason on the motivations behind this result. Unlike the other static analysis tools, FindBugs has the specific goal of identifying bug patterns rather than more generic design problems: despite containing a number of warning types analyzing the overall quality of a class, it often looks at individual lines of code, trying to spot the existence of possible implementation errors. For example, this is the case of "Method call passes null to a non-null parameter", a type of warning that validates the exchange of information between methods. The granularity of these warnings is finer than that of the other tools, hence limiting the ability of the model to capture the overall quality of a class affected by Spaghetti Code. Hence, we may conclude that the class-level warning types of FindBugs are not enough to identify code smells; this is also the tool performing worst when considering the other code smells.