Effects of measurements on correlations of software code metrics
Abstract
Context
Software metrics play a significant role in many areas of the software lifecycle, including forecasting defects and predicting maintenance effort, cost, and other qualities through predictive analysis. Many studies have found code metrics correlated with each other at such a high level that the correlated metrics are considered redundant, implying it is enough to keep track of a single metric from a list of highly correlated metrics.
Objective
Software is developed incrementally over time. Traditionally, code metrics are measured cumulatively, as a cumulative or running sum. When a code metric is measured from individual revisions or commits without consolidating values from past revisions, reflecting the natural development of the software, this study identifies such a measure as organic. Density and average are two other ways of measuring metrics. This empirical study focuses on whether measurement types influence correlations of code metrics.
Method
To investigate the objective, this empirical study collected 24 code metrics, classified into four categories according to their measurement types, from 11,874 software revisions (i.e., commits) of 21 open source projects from eight well-known organizations. Kendall’s τB is used for computing correlations. To determine whether there is a significant difference between cumulative and organic metrics, the Mann-Whitney U test, the Wilcoxon signed-rank test, and the paired-samples sign test are performed.
Results
The cumulative metrics are found to be highly correlated with each other, with an average coefficient of 0.79. For the corresponding organic metrics, it is 0.49. When individual correlation coefficients of these two measurement types are compared, correlations between organic metrics are found to be significantly lower (p < 0.01) than those between cumulative metrics. Our results indicate that the cumulative nature of metrics makes them highly correlated, implying cumulative measurement is a major source of collinearity between cumulative metrics. Another interesting observation is that correlations between metrics from different categories are weak.
Conclusions
Results of this study reveal that measurement types may have a significant impact on the correlations of code metrics and that transforming metrics into a different type can yield metrics with low collinearity. These findings provide a simple understanding of how transforming features to a different measurement type can produce new non-collinear input features for predictive models.
Keywords
Software code metrics · Measurement effects on correlations · Collinearity · Software engineering · Cumulative measurement

1 Introduction
The exponential growth of software size (Deshpande and Riehle 2008) is bringing in many challenges related to maintainability, release planning, and other software qualities. Thus, a natural demand to predict external product quality factors to foresee the future state of software has been observed. Maintainability is related to the size, complexity and documentation of software (Coleman et al. 1994). Size and complexity metrics are common among other metrics to predict software maintainability (Riaz et al. 2009). Growing software size and complexity have made it increasingly difficult to select features to be implemented in the next product release and have challenged existing assumptions and approaches for release planning (Jantunen et al. 2011).
Validating software metrics has gained importance as predicting external software qualities becomes increasingly necessary for managing future revisions of software. Researchers have proposed many validation criteria for software metrics over the last 40 years; e.g., a list of 47 criteria is reported in a systematic literature study by Meneely et al. (2013), one of which is non-collinearity. Collinearity (also known as multicollinearity) exists between two independent features if they are linearly related. Since prediction models are often multivariate, i.e., use more than one independent feature or metric, it is important that there is no significant collinearity among the independent features. Collinearity causes two major problems (Meloun et al. 2002). First, it makes a model less useful, as the individual effects of the independent features on a dependent feature can no longer be isolated. Second, extrapolation is most likely to be highly erroneous. Thus, El Emam and Schneidewind (2000) and Dormann et al. (2013) suggested diagnosing collinearity among the independent features for a proper interpretation of regression models.
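To make the collinearity diagnosis concrete: a common measure is the variance inflation factor (VIF), defined as VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. The following minimal sketch on hypothetical data (not part of this study's tooling) illustrates the idea with NumPy:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of feature matrix X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by regressing
    column j on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical example: loc2 is nearly a linear function of loc1, so
# both receive a high VIF; the independent column does not.
rng = np.random.default_rng(0)
loc1 = rng.normal(size=200)
loc2 = loc1 * 2 + rng.normal(scale=0.05, size=200)
indep = rng.normal(size=200)
vifs = vif(np.column_stack([loc1, loc2, indep]))
```

A VIF above roughly 5–10 is a conventional warning sign that a feature is collinear with the others.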
Cumulative: This indicates the traditional and most common way software code metrics are measured: as a cumulative or running sum. Here, by a revision we mean a commit, a single entry of source code in a repository. For example, if the numbers of total Lines Of Code (LOC) written in a project’s first three revisions are 50, 30, and 30 consecutively, the corresponding cumulative measures of ncloc for the revisions would be 50, 80, and 110.
Density: This measures how prevalent a property is per standard unit of an artifact. Generally, the unit of density is a ratio. Within the context of code, we consider 100 LOC as a unit, so the measurement unit becomes a percentage. Under this consideration, such a metric can take a value from 0 to 100. For example, the metric comment_lines_density measures lines of comments per 100 lines of code.
Average: This is the mean value of a measure with respect to artifacts related to a specific type. An example of such a metric is file_complexity which measures the mean complexity per file.
Organic: A metric that measures an artifact from a single revision or two consecutive revisions, without being influenced by any other revisions in the repository, is organic. We have introduced the term organic because this measure is not affected by the entire unbounded list of preceding revisions, unlike the cumulative measure. An organic metric can measure purely from a single revision; e.g., new_lines measures the lines of code (that are not comments) specific to a single revision. It can be zero if no new code is added in a revision; however, it cannot be negative (unlike a code churn measure). An organic metric can also measure a single revision relative to its one preceding revision. Since in this case it reflects a change or delta compared to the preceding revision, it can be positive, negative, or zero.
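A toy sketch of how these measurement types relate (illustrative values only, not the tool's implementation):

```python
from itertools import accumulate

# Hypothetical per-revision new lines of code: the organic measure
new_lines = [50, 30, 30]

# Cumulative measure: running sum over all preceding revisions
ncloc = list(accumulate(new_lines))  # 50, 80, 110

# A density measure is a ratio per 100 LOC, e.g. comment density;
# hypothetical comment-line counts at each revision state:
comment_lines = [10, 12, 22]
comment_lines_density = [
    c / (n + c) * 100 for c, n in zip(comment_lines, ncloc)
]
```

The cumulative value at each revision depends on every preceding revision, while each organic value depends on at most the one revision before it.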
The core idea of this study was developed while following up a previous study (Mamun et al. 2017) in which we focused on the domain-level correlation of metrics from four domains: size, complexity, documentation, and duplications. In the follow-up study, we explored correlations at the metric level and observed that the organic metrics consistently have lower correlations. Based on this observation, we initiated and designed this study by grouping the code metrics based on how they are measured.
RQ: How do measurement types affect correlations of code metrics?
The complete revision histories of the selected projects have been analyzed using a static analysis tool to generate code metrics. The code metrics are then mined from the database for analysis. Before performing data analysis, the data is explored using various visual and theoretical statistical tools. Based on the nature of the data, we selected Kendall’s τB (a nonparametric method for correlation) for all selected projects. The motivation for selecting Kendall’s τB is discussed in Section 4. Correlation coefficients are divided into different sets based on their strength and level of significance. Based on the results up to this point, we transformed all cumulative metrics into organic metrics and ran statistical tests to determine whether there is a significant difference in correlation between these two sets.
Results of this study indicate how correlations of code metrics are influenced by their measurement types, i.e., the way they are measured. We can see whether there is a difference between intra-category correlations of metrics from the same category and inter-category correlations of metrics from different categories. Based on the data analysis, we also report whether there is a significant difference between intra-category correlations of cumulative metrics and intra-category correlations of organic metrics. These understandings are fundamental because they can reveal whether high collinearity between code metrics is due to their measurement types. Such knowledge can be helpful in making an informed decision when selecting code metrics as features for predictive models.
In the following sections of this paper, we first discuss the methodology, including the design of this study, data collection procedures, the nature of the collected data, and data processing. Based on the nature of the data observed in Section 3.3, Section 4 (data analysis method) presents a comparative discussion of applicable correlation methods and the pros and cons of different measures to aggregate results from the data. Section 5 shows results and implications. Based on some results, this study performed an additional test. Retaining the actual workflow of this study, we have put the design and execution of this test in Section 5.4. This section also includes discussion, limitations, and validity threats of this study. Finally, Section 6 summarizes the conclusions of this study.
2 Related Work
Software code metrics are generally known to be highly correlated, as many studies have reported high correlation among various code metrics. A recent systematic literature review from 2016 (Landman et al. 2016) presents a summary of 33 articles reporting correlations between McCabe’s cyclomatic complexity (McCabe 1976) and LOC (lines of code). Henry and Selig (1990) reported correlations of five code metrics: LOC, three of Halstead’s software-science metrics (N, V, and E), and McCabe’s cyclomatic complexity. They worked with code written in the Pascal language and observed three notably high correlations (values in parentheses are correlation coefficients): between Halstead N and Halstead V (0.989), between LOC and Halstead N (0.893), and between LOC and Halstead V (0.885). Henry et al. (1981) compared three complexity metrics: McCabe’s cyclomatic complexity, Halstead’s effort, and Kafura’s information flow. Taking the UNIX operating system as a subject, they found McCabe’s cyclomatic complexity and Halstead’s effort highly correlated, while Kafura’s information flow was found to be independent. On NASA’s open dataset, Tashtoush et al. (2014) studied cyclomatic complexity, Halstead complexity, and LOC metrics. They found a strong correlation between cyclomatic complexity and Halstead’s complexity, similar to the study by Henry et al. (1981). LOC is observed to be highly correlated with both of these complexity metrics. Jay et al. (2009), in a comprehensive study, also explored the relationship between McCabe’s cyclomatic complexity and LOC. They worked with 1.2 million C, C++, and Java source files randomly selected from the SourceForge code repository. They reported that cyclomatic complexity and LOC practically have a perfect linear relationship irrespective of programming languages, programmers, code paradigms, and software processes.
Toward comparing four internal code metrics (McCabe’s cyclomatic complexity, Halstead volume, LOC, and number of comments), Meulen and Revilla (2007) used 59 specifications, each containing between 111 and 11,495 small (up to 40 KB file size) C/C++ programs. They observed strong correlations between LOC, Halstead volume, and cyclomatic complexity. A recent study by Landman et al. (2016) on extensive Java and C corpora (17.6 million Java methods and 6.3 million C functions) finds no linear correlation between cyclomatic complexity and LOC strong enough for them to be considered redundant. This finding contradicts many earlier studies, including Henry et al. (1981), Tashtoush et al. (2014), Saini et al. (2015), Jay et al. (2009), and Meulen and Revilla (2007).
The studies discussed here mostly cover McCabe’s cyclomatic complexity, Halstead’s metrics, and LOC, investigating correlations between them and showing different results. However, they do not address whether the measurement types of the studied metrics affect the strength of correlations, which is the primary difference between these studies and ours. Rather than taking the usual path of checking correlations of code metrics, we focus on whether the construction of code metrics (i.e., how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding collinearity of code metrics.
Zhou et al. (2009) have reported that size metrics have confounding effects on the associations between object-oriented metrics and change-proneness. In a revisited study, Gil and Lalouche (2017) reported similar results about the confounding effect of the size metric. Zhou et al. (2009) have elaborately explained the confounding effect and models to identify it in areas like health sciences and epidemiological research. Gil and Lalouche (2017) used normalization as a way to remove the confounding effect. While they mentioned obtaining lower correlation coefficients for normalized metrics, they have not explicitly reported the overall difference between correlations coming from the intra-cumulative and the intra-normalized measures. They also did not report whether a significant statistical difference exists between the two. This is understandable, as their primary focus is the validity of metrics. Our focus, in contrast, is solely on understanding the effects of measurements on the correlations of code metrics. We want to understand how much of the collinearity comes from the types of measures and how much of it exists naturally.
There have been studies toward understanding the distributions of software metrics. For example, Wheeldon and Counsell (2003), Concas et al. (2007), and Louridas et al. (2008) have investigated whether power law distributions are present among software metrics. They have reported that various software metrics follow different distributions with long fat tails. Louridas et al. (2008) have also reported correlations among eight software metrics, including LOC and number of methods. They reported a high correlation between LOC, number of methods (NOM), and the out-degree of classes. Baxter et al. (2006) reported a similar study but, unlike Wheeldon and Counsell (2003), observed some metrics that do not follow power laws. They opined that their use of a more extensive corpus compared to Wheeldon and Counsell (2003) is the reason for the difference. In addition to looking at the distributions of metrics, Ferreira et al. (2012) have attempted to establish thresholds or reference values for six object-oriented metrics. We have also looked at the statistical properties of the studied metrics, including their distributions. However, we have done this as part of our methodology to find appropriate statistical methods, and it is not the main focus of this study.
Chidamber et al. (1998) have investigated six Chidamber and Kemerer (CK) metrics and reported high collinearity among three of them: coupling between objects (CBO), response for a class (RFC), and NOM. Succi et al. (2005) have studied to what extent collinearity is present in CK metrics. They suggested avoiding the RFC metric entirely as an input feature for predictive models due to its high collinearity with other CK metrics. Given the problems of collinearity, Shihab et al. (2010) have proposed an approach to select metrics that are less collinear from a set of metrics. These studies have mentioned collinearity as a problem and reported collinearity among software metrics or proposed methods to select metrics with low collinearity. However, they have not investigated it from the perspective of measurement types influencing collinearity.
3 Methodology
We have designed this empirical study following the guidelines of Runeson and Höst (2008) on designing, planning, and conducting case studies. This study is explorative, with the intent to find insights about relations between code metrics with different measurement types. We have designed the study to minimize bias and validity threats and to maximize the reliability of the results, which involves project selection, data extraction, data cleaning, exploring the nature of the data, selecting appropriate statistical analysis methods based on it, and being conservative when selecting and instrumenting statistical analyses.
Data sources for this research are open source software projects, more specifically, open source Java projects on GitHub. Java is among the top three most frequently used project languages on GitHub. Since the extracted data is quantitative, the analysis methods used in this study are quantitative. We have followed a third-degree data collection method described by Lethbridge et al. (2005). First, the case and the context of the study are defined, followed by data sources and criteria for data collection. Assumptions for statistical methods are thoroughly checked, which involves exploring the nature of the data and cleaning the data as necessary. Regardless of the measurement types, the extracted data is non-normal to the extent that a meaningful transformation is not possible. Thus, we have used nonparametric statistical methods for analyzing data in this study.
3.1 Project Selection
GitHub’s search functionality was used to find candidate projects. However, due to the limited capabilities of GitHub’s search functionality, it was not possible to perform a compound query that would fulfill all our criteria. Project selection was not randomized, as we wanted to ensure that selected projects meet specific criteria (e.g., minimum LOC, minimum commits) and come from well-known development organizations that would not raise obvious validity questions, e.g., “the project is unrepresentative because it is a classroom project by a novice programmer.” Thus, finding projects from reputed organizations was exploratory. We started by screening projects from the 14 organizations listed in GitHub’s open source organizations showcase^{1}. We then explored whether other well-known organizations were also hosting their projects on GitHub without being in the showcase, e.g., Apache. For each organization, we made queries to find Java projects. As we wanted to minimize blocking effects coming from various languages, we decided to stick with a single programming language. We selected Java as it is a top-ranked programming language on GitHub.
Crawford et al. (2002) presented various methods for classifying software projects. We take a more straightforward approach to make sure that our selected projects are representative regarding size. A study^{2} on the dataset of the International Software Benchmarking Standards Group (ISBSG) classified software projects based on “Rule’s Relative Size Scale”. Measurements of this study are based on IFPUG, MkII, and COSMIC, which are also translated into equivalent LOC. The combined distribution of all projects shows that more than 93% of the projects are between S (small) and L (large) size, where S is estimated as 5,300 and L as 150,000 LOC. We roughly followed this finding and selected projects such that project sizes are approximately uniformly distributed within the range of S to L. We have projects ranging from 4,059 LOC to 155,260 LOC, measured on the latest revision of each project. Sizes of the projects are determined with the cloc tool^{3}, using a bash script to extract total LOC and Java LOC. We selected 21 GitHub projects from eight software organizations where Java is tagged as the project language.
Table 1 Overview of the selected projects

| Organization | Project name | Contributors (persons) | Analyzed revisions (commits) | Total revisions (commits) | Java code (LOC) | Total code (LOC) | Latest commit (SHA) | Project duration (months) |
|---|---|---|---|---|---|---|---|---|
| Microsoft | oauth2-useragent | 6 | 171 | 336 | 2,976 | 4,059 | d5ddee2 | 12 |
| | Git-Credential-Manager-for-Mac-and-Linux | 5 | 141 | 679 | 4,986 | 6,169 | 5fcb321 | 13 |
| | malmo | 12 | 295 | 814 | 14,221 | 33,212 | 8a1bc7d | 5 |
| | thrifty | 5 | 242 | 299 | 43,948 | 44,324 | 7f8c12a | 12 |
| | vso-intellij | 22 | 305 | 1,191 | 64,457 | 65,502 | 682cb12 | 12 |
| Twitter | ambrose | 18 | 167 | 640 | 4,879 | 10,138 | da7bcb9 | 48 |
| | cloudhopper-smpp | 15 | 94 | 150 | 12,342 | 12,452 | 193d1c4 | 57 |
| | elephant-bird | 52 | 449 | 1,361 | 22,675 | 26,087 | 87efd8c | 76 |
| Netflix | Fenzo | 10 | 98 | 192 | 11,174 | 12,360 | 4b446e3 | 20 |
| | ribbon | 30 | 223 | 785 | 22,299 | 22,634 | 4522f71 | 46 |
| | astyanax | 51 | 549 | 991 | 55,167 | 55,680 | 4324ba7 | 55 |
| Square | dagger | 36 | 306 | 699 | 8,894 | 10,927 | 9888337 | 46 |
| | retrofit | 103 | 776 | 1,378 | 12,908 | 14,329 | 32ace08 | 72 |
| | picasso | 73 | 518 | 931 | 9,960 | 11,094 | a1ba906 | 42 |
| Esri | solutions-geoevent-java | 7 | 218 | 535 | 34,981 | 60,752 | f4f872e | 38 |
| | geometry-api-java | 14 | 100 | 174 | 76,114 | 76,391 | 3858559 | 43 |
| Shopify | nokogiri | 105 | 1,788 | 3,513 | 26,398 | 61,129 | e2821be | 75 |
| SAP | cloud-sfsf-benefits-ext | 9 | 52 | 165 | 3,029 | 5,638 | 984de61 | 24 |
| Apache | kafka | 236 | 2,302 | 2,863 | 88,920 | 155,260 | 8f2e0a5 | 64 |
| | zookeeper | 9 | 1,474 | 1,474 | 72,894 | 137,466 | f6349d1 | 109 |
| | zeppelin | 158 | 1,606 | 2,707 | 64,878 | 100,551 | ba2b90c | 39 |
| Total | | 976 | 11,874 | 21,877 | 658,100 | 926,154 | | 907 |
| Mean | | 46 | 565 | 1,042 | 31,338 | 44,103 | | 43 |
3.2 Data Collection and Metrics Selection
We used SonarQube^{4} to analyze revisions of the selected projects. Kazman et al. (2015) mentioned SonarQube as the de facto source code analysis tool for automatically calculating technical debt. It has gained popularity in recent years, and Janes et al. (2017) mentioned SonarQube as the de facto industrial tool for software quality assurance. The tool is based on the SQALE methodology (Letouzey and Ilkiewicz 2012). We used SonarQube version 6.1 and SonarScanner version 2.8.
We ran SonarQube on each revision available in the master branch of a project. Since we ignore sub-branches, the number of analyzed revisions is less than the number of total revisions reported in Table 1. Sub-branches are eventually merged into the master branch, which means we lose nothing except some granularity of the data.
Analyzing 11,874 software revisions needs to be automated. Python scripts are used to automate the process of traversing commits or revisions on the master branch of a project’s Git repository and running the SonarQube tool on each commit. SonarQube provides web services covering a range of functionalities, including mining analysis results and software metrics. We observed that some metrics, such as new_lines, are shown on SonarQube’s web interface but cannot be mined through the web services. We later found that SonarQube computes some metrics only for the latest software revision and removes them automatically.
Since we did not find any option to stop this auto-deletion, we added triggers and additional tables to SonarQube’s SQL database to recover the deleted records. In total, 47 metrics were mined from the database, classified into six major domains: size, documentation, complexity, duplications, issues, and maintainability. The classification is based on what the metrics measure.
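The per-commit traversal can be sketched roughly as follows. This is a hypothetical outline only (the study's actual scripts are in the published dataset), and the sonar-scanner property shown is an assumption; the sketch builds the command sequence without executing it:

```python
import subprocess

def list_master_commits(repo_dir):
    """Return commit SHAs on the master branch, oldest first,
    following only the first parent (i.e., ignoring sub-branches)."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--reverse", "--first-parent", "master"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

def analysis_commands(shas, repo_dir="."):
    """Build the per-revision command sequence: check out the commit,
    then run the scanner on the working tree."""
    cmds = []
    for sha in shas:
        cmds.append(["git", "-C", repo_dir, "checkout", "--force", sha])
        cmds.append(["sonar-scanner", f"-Dsonar.projectBaseDir={repo_dir}"])
    return cmds

# Dry run on placeholder SHAs: two commands per revision
cmds = analysis_commands(["a1b2c3", "d4e5f6"])
```

In a real run, each command list would be passed to subprocess.run in order, waiting for the scanner to finish before checking out the next revision.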
Table 2 Classification of the selected code metrics according to how they measure

| Category | Metric name | Description | Value type |
|---|---|---|---|
| Cumulative | ncloc | Number of physical lines of code that are not comments (lines containing only space, tab, and carriage return are ignored) | Integer |
| | classes | Number of classes (including nested classes, interfaces, enums, and annotations) | Integer |
| | files | Number of files | Integer |
| | directories | Number of directories | Integer |
| | functions | Number of methods | Integer |
| | statements | Number of statements according to the Java language specification | Integer |
| | public_api | Number of public classes + number of public functions + number of public properties | Integer |
| | comment_lines | Number of lines containing either comments or commented-out code (empty comment lines and comment lines containing only special characters are ignored) | Integer |
| | public_undocumented_api | Public API without a comment header: public undocumented classes, functions, and variables | Integer |
| | complexity | Cyclomatic complexity (else, default, and finally keywords are ignored) | Integer |
| | complexity_in_classes | Cyclomatic complexity in classes | Integer |
| | complexity_in_functions | Cyclomatic complexity in functions | Integer |
| | duplicated_lines | Number of duplicated lines | Integer |
| | duplicated_blocks | Number of duplicated blocks (to count a block, at least 10 successive duplicated statements are needed; indentation and string literals are ignored) | Integer |
| | duplicated_files | Number of duplicated files | Integer |
| Density | comment_lines_density | comment_lines_density = comment_lines / (ncloc + comment_lines) * 100 | Percent |
| | public_documented_api_density | public_documented_api_density = (public_api - public_undocumented_api) / public_api * 100 | Percent |
| | duplicated_lines_density | duplicated_lines_density = duplicated_lines / ncloc * 100 | Percent |
| Average | file_complexity | Average complexity by file: file_complexity = complexity / files | Float |
| | class_complexity | Average complexity by class: class_complexity = complexity_in_classes / classes | Float |
| | function_complexity | Average complexity by function: function_complexity = complexity_in_functions / functions | Float |
| Organic | new_lines | Number of new physical lines of code that are not comments | Integer |
| | new_duplicated_lines | Number of new physical lines of code that are duplicated | Integer |
| | new_duplicated_blocks | Number of new blocks of code that are duplicated | Integer |
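As a sanity check of the density and average formulas listed above, with hypothetical metric values for one revision:

```python
# Hypothetical raw (cumulative) metric values for a single revision
ncloc = 8000
comment_lines = 2000
public_api = 150
public_undocumented_api = 30
duplicated_lines = 400
complexity = 1200
files = 100

# Density metrics (percent), following the definitions in the table
comment_lines_density = comment_lines / (ncloc + comment_lines) * 100
public_documented_api_density = (public_api - public_undocumented_api) / public_api * 100
duplicated_lines_density = duplicated_lines / ncloc * 100

# Average metric
file_complexity = complexity / files
```

Note that density and average metrics are bounded ratios derived from the cumulative counts, which is why they behave differently from the raw running sums.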
Table 3 Metric data representing five revisions of a project
The SQL code to instrument the SonarQube database, the Python code to automatically analyze commits of a Git project with SonarQube, the MySQL code to retrieve the desired data from the database, and the collected data used for this study are published as a public dataset^{6}.
3.3 Exploring Nature of Data
Among the different probability distribution functions, a normal distribution is often anticipated by researchers because of its relationship with the central limit theorem. Statistical methods are based on assumptions. After collecting data, and before deciding on the type of statistical methods, researchers need to investigate the nature of the collected data. A crucial part is to check the distribution of the data. If the distribution is normal, parametric statistical tests are considered. If the data is non-normal and a meaningful transformation is not possible, nonparametric tests are considered.
Table 4 Considered methods for the normality test

| | Graphical methods | Numerical methods |
|---|---|---|
| Descriptive | Histogram, box plot | Skewness, kurtosis |
| Theory-driven | Q-Q plot | Shapiro-Wilk test, Kolmogorov-Smirnov test (Lilliefors test) |
A histogram, i.e., a frequency distribution, is considered a useful graphical test when the number of observations is large. It is particularly helpful because it can capture the shape of the distribution, given that the bin size is appropriately chosen. If data is far from a normal distribution, a single look at the histogram tells us that the data is non-normal. It also gives a rough understanding of the overall nature of the distribution, such as skewness, kurtosis, or the type of distribution (e.g., bi- or multimodal). Box plots are useful for identifying outliers and comparing the distribution of the quartiles. A normal Q-Q plot (quantile-quantile plot) is a graphical comparison of two distributions. If data is normally distributed, the data points in the normal Q-Q plot approximately follow a straight line. It also helps us understand the skewness and tails of the distribution. The graphical methods help to understand the overall nature of the data quickly, but they do not provide objective criteria (Park 2009).
There are different numerical methods to evaluate whether data is normally distributed. Skewness and kurtosis are commonly used descriptive tools for this purpose. For a perfectly normal distribution, the statistics for these two analyses should be zero. Since this does not happen in practice, we calculate the z-score by dividing the statistic by its standard error. However, determining normality from the z-score is not straightforward either. Kim (2013) discussed how the sample size can affect the z-score. Field (2009) and Kim (2013) suggested considering different criteria for skewness and kurtosis based on the sample size. For a sample size less than 50, the absolute z-score for either of these methods should be below 1.96 (corresponding to an α of 0.05); for a sample size less than 200, below 3.29 (corresponding to an α of 0.001). However, for a sample size of 200 or more, it is more meaningful to inspect the graphical methods and look at the skewness and kurtosis statistics instead of evaluating their significance, i.e., the z-score.
From the analytical numerical methods, we consider Shapiro-Wilk and Kolmogorov-Smirnov for the normality test. Shapiro-Wilk works better with sample sizes up to 2000, and Kolmogorov-Smirnov works with large sample sizes (Park 2009). The literature has reported different maximum values for these tests. For example, a sample size over 300 might produce an unreliable result for these two tests, as observed by Kim (2013), and a range of 30 to 1000 is suggested by Hair (2006). We have sample sizes from 52 to 2,302 for our metrics. For large sample sizes, numerical methods can be unreliable; thus, Field (2009) suggested using the graphical methods alongside the numerical methods to make an informed decision about the nature of the data. We have computed all these tests in the statistical software package SPSS. It can be noted that SPSS reports excess kurtosis, i.e., it subtracts three from the kurtosis value.
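The numerical checks above can be sketched with SciPy on synthetic data standing in for one metric. The standard-error formulas sqrt(6/n) and sqrt(24/n) are common approximations, and the plain Kolmogorov-Smirnov p-value is only approximate when the normal parameters are estimated from the sample (the Lilliefors correction addresses this):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in for a heavily right-skewed organic metric
metric = rng.exponential(scale=100, size=1473)
n = metric.size

# Descriptive checks: skewness and excess kurtosis with z-scores,
# using the approximate standard errors sqrt(6/n) and sqrt(24/n)
skew_z = stats.skew(metric) / np.sqrt(6 / n)
kurt_z = stats.kurtosis(metric) / np.sqrt(24 / n)  # excess kurtosis, as in SPSS

# Theory-driven tests: Shapiro-Wilk, and Kolmogorov-Smirnov against
# a normal distribution fitted to the sample
sw_stat, sw_p = stats.shapiro(metric)
ks_stat, ks_p = stats.kstest(metric, "norm", args=(metric.mean(), metric.std()))
```

For data this skewed, all four checks reject normality decisively, matching the pattern in the tables below for metrics such as new_lines.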
Table 5 Descriptive statistics for four example metrics from the Apache zookeeper project

| Category | Metric | N | Minimum | Maximum | Mean (statistic) | Mean (std. error) | Std. dev. |
|---|---|---|---|---|---|---|---|
| Cumulative | ncloc | 1473 | 10,804 | 73,009 | 44,192.4 | 485.3 | 18,626.9 |
| Density | comment_lines_density | 1473 | 9.2 | 13.9 | 12.5 | 0.0 | 1.2 |
| Average | file_complexity | 1473 | 22.0 | 33.9 | 25.6 | 0.1 | 2.7 |
| Organic | new_lines | 1473 | 0 | 19,055 | 117.5 | 16.3 | 627.3 |
The distributions and their properties depicted by the graphical methods deviate so much from normality that we could have omitted the additional tests. However, we performed them as part of the design of the study. Since we have varying sample sizes, we gained some insights about the tests; however, such observations are beyond the scope and focus of this study and thus are not reported here.
Table 6 Skewness and kurtosis checks for four example metrics from the Apache zookeeper project

| Metric | Skewness (statistic) | Skewness (std. error) | Skewness (z-value) | Kurtosis (statistic) | Kurtosis (std. error) | Kurtosis (z-value) |
|---|---|---|---|---|---|---|
| ncloc | −0.24 | 0.06 | −3.71 | −1.13 | 0.13 | −8.87 |
| comment_lines_density | −1.30 | 0.06 | −20.40 | 0.91 | 0.13 | 7.11 |
| file_complexity | 2.00 | 0.06 | 31.39 | 3.15 | 0.13 | 24.68 |
| new_lines | 20.83 | 0.06 | 326.63 | 580.39 | 0.13 | 4,554.62 |
Table 7 Shapiro-Wilk and Kolmogorov-Smirnov tests for four example metrics from the Apache zookeeper project

| Metric | Kolmogorov-Smirnov^{∗} (statistic) | df | Sig. | Shapiro-Wilk (statistic) | df | Sig. |
|---|---|---|---|---|---|---|
| ncloc | .098 | 1473 | .000 | .939 | 1473 | .000 |
| comment_lines_density | .181 | 1473 | .000 | .842 | 1473 | .000 |
| file_complexity | .258 | 1473 | .000 | .693 | 1473 | .000 |
| new_lines | .426 | 1473 | .000 | .150 | 1473 | .000 |
Like these four metrics from the Apache zookeeper project, metrics from the other measurement categories and from other projects are also observed to be non-normal. The high degree of non-normality and the different distributions among the metrics make it impractical to transform these metrics. Thus, we are left with the option of performing nonparametric statistical analysis.
3.4 Data Processing
Besides the statistical analysis to understand the nature of the extracted data, we performed manual inspections to check the data for possible anomalies, especially at the boundaries, i.e., at the beginning and the end of the revision data.
Data snippet from the first revision of project astyanax showing null value for new_lines
ncloc  new_lines  classes  files  directories
8409  NULL  170  157  12 
Data snippet from the first three revisions of project ribbon showing a high value for new_lines
ncloc  new_lines  classes  files  directories
6111  8461  80  61  9 
6118  42  80  61  9 
5981  15  79  60  9 
There might be several reasons why some projects start with a high ncloc from the very first revision, as seen in Table 9. A project may not have tracked its code base through version control from the beginning, or it may have started from an existing code base, possibly because it is an extension of another project with a separate version control history. It is also observed that new_lines has a higher value than ncloc in the first revision of a few projects. We removed the data of the first revisions in such cases. Since the number of removed revisions is insignificant compared to the total number of revisions, this is unlikely to have a major impact on this study.
4 Data Analysis Method
Since the collected data is not normally distributed, nonparametric statistical methods are appropriate for this research. Spearman’s ρ and Kendall’s τ correlation coefficients are two well-known nonparametric measures for assessing relationships between ranked data. We carefully investigated the nature of the collected data and the properties of these two measures. Kendall’s τ is based on the number of concordant and discordant pairs of the ranked data.
Calculating Spearman’s ρ by hand is much simpler than calculating Kendall’s τ because it can be done pairwise without depending on the rest of the data, while computing Kendall’s τ by hand is only feasible for small sample sizes because each data pair requires exploring the remaining data. Xu et al. (2013) reported that the time complexity of Spearman’s ρ is O(n ∗ log(n)) while for Kendall’s τ it is O(n^{2}). In general, Spearman’s ρ results in a higher correlation coefficient than Kendall’s τ, while the latter is generally known to be more robust and to have an intuitive interpretation.
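The contrast between the two coefficients can be sketched with SciPy on a small, invented ranked sample containing one extreme value; with no ties, τ_{b} reduces to (C − D)/(n(n − 1)/2), where C and D are the numbers of concordant and discordant pairs.

```python
# Kendall's tau-b versus Spearman's rho on a small invented sample with
# one extreme value. With no ties, tau-b equals (C - D) / (n*(n-1)/2)
# for C concordant and D discordant pairs.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])  # one outlying revision value
y = np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 9])

tau, p_tau = stats.kendalltau(x, y)
rho, p_rho = stats.spearmanr(x, y)

# Only ranks matter, so the extreme value does not distort either measure,
# and Spearman's rho comes out higher than Kendall's tau.
print(round(tau, 3), round(rho, 3))  # 0.778 0.939
```

Here there are 40 concordant and 5 discordant pairs, so τ = 35/45 ≈ 0.778, while ρ ≈ 0.939, illustrating the document's point that Spearman's ρ tends to report higher coefficients for the same data.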
Studies from different disciplines have investigated the appropriateness of Spearman’s ρ and Kendall’s τ concerning various factors. Both measures are invariant concerning increasing monotone transformations (Kendall and Gibbons 1990). Moreover, being nonparametric methods, both measures are robust against impulsive noise (Shevlyakov and Vilchevski 2002; Croux and Dehon 2010). Croux and Dehon (2010) studied the robustness of Spearman’s ρ and Kendall’s τ through their influence function and grosserror sensitivities. Even though it is commonly known that both of these measures are robust enough to handle outliers, this study found that Kendall’s τ is more robust to handle outliers and statistically slightly more efficient than Spearman’s ρ. In a more recent study, Xu et al. (2013) investigated the applicability of Spearman’s ρ and Kendall’s τ based on different requirements. Some of their key findings report Kendall’s τ as the desired measure when the sample size is large, and there is impulsive noise in the data. Their results are based on unbiased estimations of Spearman’s ρ and Kendall’s τ correlation coefficients.
In light of the above discussion, we think Kendall’s τ is more appropriate for software projects. We have observed outliers in many projects, and all projects have some revisions with high data values indicating outliers. This can naturally happen to any software project when an existing codebase is added to a new project. Outliers in software revision data have been observed and their underlying reasons discussed in earlier research (Aggarwal 2013; Schroeder et al. 2016). The robust nature of Kendall’s τ handles such data points better than Spearman’s ρ.
4.1 Landscape of Correlation Coefficients (τ_{b})
Grouping and labeling τ_{b}
Statistical significance  Negative τ_{b} value  Positive τ_{b} value  Label 

α = 0.05  \(-1.0 \leqslant \tau _{b} \leqslant -0.9\)  \(0.9 \leqslant \tau _{b} \leqslant 1.0\)  Very strong τ_{b} 
\(-0.9 < \tau _{b} \leqslant -0.7\)  \(0.7 \leqslant \tau _{b} < 0.9\)  Strong τ_{b}  
\(-0.7 < \tau _{b} \leqslant -0.4\)  \(0.4 \leqslant \tau _{b} < 0.7\)  Moderate τ_{b}  
\(-0.4 < \tau _{b} \leqslant 0.0\)  \(0 \leqslant \tau _{b} < 0.4\)  Weak τ_{b} 
However, it is not enough to look at the τ_{b} values alone; we must also consider their statistical significance, given by the p-value. If the p-value is greater than a chosen significance level, we cannot reject the null hypothesis, meaning we do not have enough evidence to conclude that the corresponding τ_{b} is any different from τ_{b} = 0. For example, at a p-value of 0.05, up to 5% of the τ_{b} values may indicate correlation by chance even though there is no real correlation between the underlying metrics. This study considers α = 0.05 for any τ_{b} to be statistically significant.
4.2 Aggregating Correlation Coefficients (τ_{b})
Now that we have 21 correlation matrices of size 24×24, we want to aggregate meaningful data from them. Let M = {m_{1},m_{2},m_{3}...m_{n}} be the set of all metrics, where n is the total number of metrics, and let P = {p_{1},p_{2},p_{3}...p_{q}} be the set of all projects, where q is the total number of projects.
Aggregating correlation coefficients (τ_{b}) based on the set of all correlation coefficients (\(S\tau _{b_{all}}\))
The simplest way of aggregating τ_{b} is to sum up the τ_{b} values for each possible pair of metrics within M over each project within P. This results in a correlation matrix of size 24×24, i.e., the same size as a correlation matrix from a single project. The advantage of this method is that we can compute a single aggregated matrix where each cell is the sum of the τ_{b} values from the corresponding cells of the correlation matrices generated from the projects. Such an aggregated matrix can also be transformed into a weighted average matrix by dividing each cell (containing the sum of all τ_{b} for a particular pair of metrics) by q.
However, there is a fundamental problem with this method: it combines all τ_{b} values irrespective of their corresponding p-values. If a p-value is greater than α, then we cannot reject the null hypothesis. This means we only consider a τ_{b} valid when its corresponding p-value is less than or equal to α. If this is not checked, there are two implications. First, the result will be wrong due to the inclusion of non-significant τ_{b}, i.e., τ_{b} from the set \(S\tau _{b_{nsig}}\); second, we will not have any idea how big the set \(S\tau _{b_{nsig}}\) is compared to \(S\tau _{b_{sig}}\).
Aggregating results based on \(S\tau _{b_{all}}\) is not problematic when \(S\tau _{b_{all}} = S\tau _{b_{sig}}\), meaning there is no τ_{b} with a non-significant p-value; however, this assumption should be checked for validity whenever it is made. For example, the recent research on the correlation of code metrics by Gil and Lalouche (2017) has not reported anything about considering α, i.e., the significance level for the p-value, and thus nothing about the existence of τ_{b} with non-significant p-values. In this case, the readers cannot know the size of \(S\tau _{b_{nsig}}\). If an assumption regarding the non-existence of non-significant p-values is made, then it should be documented and validated. In this research, to aggregate results, we have excluded any value from the set \(S\tau _{b_{nsig}}\). To calculate the sample mean, τ_{b} values from \(S\tau _{b_{nsig}}\) are considered as zero.
Aggregating correlation coefficients (τ_{b}) based on the set of all significant correlation coefficients (\(S\tau _{b_{sig}}\))
Let \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) be the correlation coefficient for two metrics m_{1} and m_{2} in project p_{i}. We compute the mean for \(S\tau _{b_{sig}}\) by the following equation:
$$ \overline{\tau}_{b_{(m_{1}, m_{2})}} = \frac{1}{\nu} \sum\limits_{i=1}^{q} \tau_{b_{(m_{1}, m_{2}, p_{i})}} \mid \tau_{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau_{b_{sig}} $$(1)
Since (1) is calculated based on \(S\tau _{b_{sig}}\), ν is the total count of \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) values from all projects that belong to \(S\tau _{b_{sig}}\). To compute the sample mean, we only have to replace ν by q in (1). Since we are unsure of the τ_{b} values within the set \(S\tau _{b_{nsig}}\) due to their non-significant p-values, as a conservative measure we take τ_{b} from the set \(S\tau _{b_{nsig}}\) as zero. Thus, the \(\tau _{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau _{b_{sig}}\) condition in (1) still holds for the sample mean. Like \(S\tau _{b_{all}}\)-based aggregation, the advantage of aggregating τ_{b} from the set \(S\tau _{b_{sig}}\) is that we can report the results in a single matrix without the mentioned problems of \(S\tau _{b_{all}}\)-based aggregation. However, from such results we cannot determine whether \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) comes from a large or a small value of ν. In other words, we are unable to tell how representative \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) is among the selected projects. Since we can calculate the sample mean from \(S\tau _{b_{sig}}\), we can also compute the variance and standard deviation to see the overall spread of τ_{b} within the samples. However, the sample mean and standard deviation of τ_{b} from \(S\tau _{b_{sig}}\) do not necessarily tell us about the distribution of τ_{b} across the four strength levels in Table 10. The standard deviation indicates the variability, but not what that variability looks like at the different strength levels of τ_{b}.
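A minimal sketch of this aggregation for one metric pair, with invented (τ_{b}, p-value) tuples: only significant coefficients enter the sum, ν counts them, and the sample mean instead divides by q, treating non-significant τ_{b} as zero.

```python
# Sketch of the aggregation in (1) for one metric pair (m1, m2): keep
# only tau_b values with significant p-values, then average. The
# (tau_b, p_value) tuples are invented for illustration.
ALPHA = 0.05
per_project = [  # one (tau_b, p_value) per project, q = 4 here
    (0.81, 0.001), (0.75, 0.020), (0.10, 0.300), (0.62, 0.004),
]
q = len(per_project)

sig = [t for t, p in per_project if p <= ALPHA]
nu = len(sig)  # size of the significant set for this pair

mean_sig = sum(sig) / nu    # mean over S_tau_sig only
sample_mean = sum(sig) / q  # non-significant tau_b counted as zero

print(nu, round(mean_sig, 3), round(sample_mean, 3))  # 3 0.727 0.545
```

The gap between 0.727 and 0.545 illustrates why the sample mean is the more conservative of the two measures.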
Aggregating correlation coefficients (τ_{b}) based on the sets \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), \(S\tau _{b_{w}}\)
These four sets partition τ_{b} based on their strengths according to Table 10, i.e., \(S\tau _{b_{vs}}\) ∪ \(S\tau _{b_{s}}\) ∪ \(S\tau _{b_{m}}\) ∪ \(S\tau _{b_{w}}\) = \(S\tau _{b_{sig}}\). We can consider two measures from these sets, as listed below. These two proposed measures, count of τ_{b} and sum of τ_{b}, provide additional insights about correlation coefficients compared to the common statistical measures sample mean and standard deviation. In this report, when we mention the term ‘mean,’ we indicate it for a certain set, and when data from all sets are considered, we write ‘sample mean.’ For example, when we say ‘correlation between metrics m_{1} and m_{2} results in τ_{b}’ or ‘correlation between metrics m_{1} and m_{2} results in very strong/strong/moderate/weak τ_{b},’ we refer to τ_{b} from the sample mean table.
Count of τ_{b} based on their level of strength: We get this measure by simply counting the number of τ_{b} within a set. This simple measure gives a direct answer to the question, “how many projects report a certain correlation between two metrics at a certain level of strength?” Since we have 21 projects, the maximum value of count of τ_{b} is 21 and the minimum is zero. Looking at particular values of count of τ_{b} for two metrics from the sets \(S\tau _{b_{nsig}}\), \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), and \(S\tau _{b_{w}}\), we can understand the distribution of τ_{b} among the different sets and gain a better understanding of the nature of relationships between metrics.
Sum of τ_{b} based on their level of strength: Instead of taking counts, this measure adds all τ_{b} within a set. Since the highest value of a single τ_{b} is 1.0, the maximum value of sum of τ_{b} is also 21, as for count of τ_{b}. However, sum of τ_{b} does not tell us how many τ_{b} values contribute to the sum, which we get from count of τ_{b}. Thus, these two measures complement each other.
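The two measures can be sketched as follows, bucketing invented τ_{b} values by the strength labels of Table 10 (absolute values are used so that negative coefficients fall into the same labels).

```python
# Sketch of 'count of tau_b' and 'sum of tau_b': bucket significant
# coefficients by the strength labels of Table 10 (invented values).
def strength(tau):
    a = abs(tau)  # negative tau_b maps to the same label as positive
    if a >= 0.9:
        return "very strong"
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    return "weak"

taus = [0.95, 0.81, 0.75, 0.62, 0.31, -0.45]
count, total = {}, {}
for t in taus:
    lvl = strength(t)
    count[lvl] = count.get(lvl, 0) + 1          # count of tau_b
    total[lvl] = round(total.get(lvl, 0.0) + t, 2)  # sum of tau_b

print(count)  # how many coefficients reach each level
print(total)  # how much tau_b mass each level carries
```

Note how the moderate bucket's sum (0.62 − 0.45 = 0.17) hides that two coefficients contribute to it, which is exactly why count and sum are reported side by side.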
4.3 Missing Correlation Coefficients (τ_{b})
Correlation cannot be computed if either or both of the variables have constant values or missing values. In such cases, the computation returns NaN (not a number) for both τ_{b} and the p-value. The set S_{NaN} contains such data. S_{NaN} is kept disjoint from \(S\tau _{b_{all}}\) in Fig. 6 because both the τ_{b} value needed to determine the level of strength and the p-value needed to determine significance are missing. Even though S_{NaN} does not help us with correlation, it tells us about the nature of specific metrics.
Labeling different sections of our matrices for easy referencing
When reporting this case study, we have tried to maintain the actual flow of how this research was carried out. Based on some interesting observations regarding inter-category correlations of 15 cumulative and three organic metrics, this case study further tested the hypothesis:
“The median difference between correlations of metrics from the cumulative and organic_{t} categories equals zero.”
This required deriving a set of 15 organic metrics denoted as organic_{t} corresponding to the 15 cumulative metrics. The specifics of designing, performing and results of the test are elaborated in Section 5.4.
5 Results
Before going into the results and discussion regarding significant correlation coefficients (i.e., τ_{b} with significant p-values) from the sets within \(S\tau _{b_{sig}}\), we report what the correlation coefficients outside \(S\tau _{b_{sig}}\) look like, to illustrate the data that does not contribute to the main result.
5.1 Missing and Non-significant Correlation Coefficients (τ_{b})
First, we look at the small set S_{NaN} in Fig. 6, reporting cases where computing the correlation was not successful and resulted in null values (i.e., NaN) for τ_{b} and the p-value for specific pairs of metrics from M, as reported in Table 19. The missing τ_{b} related to the metric directories, as seen in this table, come from two projects, malmo and geometryapijava. In both projects, the metric directories has a value (8 for malmo and 3 for geometryapijava) that remained unchanged throughout the project. The rest of the missing τ_{b} come from the project cloud, and all six affected metrics belong to the duplications category: duplicated_lines, duplicated_blocks, duplicated_files, duplicated_lines_density, new_duplicated_lines, and new_duplicated_blocks. Since the cloud project has no duplication-related issues during the analyzed revisions, all six metrics have the value 0.
If at least one metric from a pair contains a constant value (i.e., a metric having the same value for all revisions), it is not possible to perform correlation on that pair of metrics, and this results in NaN values for τ_{b} and pvalue. We observed the directories metric and metrics related to duplications have fewer levels (meaning variations) compared to other metrics.
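This behavior can be reproduced directly: SciPy's kendalltau yields NaN for both the coefficient and the p-value when one variable is constant (the values below are illustrative, echoing the unchanging directories metric).

```python
# kendalltau returns NaN for both tau_b and the p-value when one
# variable is constant, which is how the S_NaN cases arise
# (illustrative values for an unchanging 'directories' metric).
import math
import numpy as np
from scipy import stats

directories = np.array([8, 8, 8, 8, 8])      # constant across revisions
ncloc = np.array([100, 140, 150, 180, 200])

tau, p = stats.kendalltau(directories, ncloc)
print(math.isnan(tau), math.isnan(p))  # True True
```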
Now, we move to the set \(S\tau _{b_{nsig}}\) as reported in Table 17. Horizontal bars in this table, and in all other similar tables reporting the count and sum measures, are graphical representations of the corresponding cell values. The maximum value for a cell corresponding to two metrics (e.g., the cell between ncloc and classes) is q, which is 21, the number of projects. Cells on the principal diagonal are omitted and thus kept blank. The rightmost column reporting the ‘total count’ or ‘total sum’ of a row has a maximum possible value of 483 (calculated from n × q − q = 24 × 21 − 21). The horizontal bars are drawn based on the maximum possible value a cell can contain (i.e., 21 for regular cells between metrics and 483 for ‘total’ cells), not on the maximum value present in a table.
A single glance at Table 17 gives the impression that the organic category is different from all other categories. A more careful look reveals three groups: cumulative and organic having the lowest and highest counts of non-significant τ_{b} values, respectively, and density and average with counts in between. We can also see this from the rightmost column ‘total count’. Since ‘total count’ comes from all four metric categories, we show the category-wise mean of Table 17 in Table 12.
Categorywise mean of count of nonsignificant correlation coefficients (τ_{b}) from Table 17

Correlating metrics from different categories results in more non-significant correlation coefficients than correlating metrics within a category.

The organic category is quite different from the other three categories, producing many non-significant τ_{b} for intra-category correlations.

The organic category has the least number of non-significant τ_{b} for inter-category correlations, followed by cumulative, average, and density. Even though the mean value for the cumulative category is quite low (0.27) in this context, the contrast between inter- and intra-category is clearly less noticeable than for the organic category.
5.2 Overall Distribution of Correlation Coefficients (τ_{b})
Based on the labeling of τ_{b} in Table 10, we have reported two sets of measurements. They are ‘count of τ_{b}’ (Tables 23, 24, 25, and 26) and ‘sum of τ_{b}’ (Tables 27, 28, 29, and 30) at different levels.
Both Figs. 7 and 8 are equally scaled vertically for better comparison. Between these figures, the ‘very strong’ level shows the least difference and the ‘weak’ level the highest difference among the four strength levels of τ_{b}. While Fig. 7 gives an overview of how representative the different sets of τ_{b} are, Fig. 8 shows the actual sum of τ_{b}. Since the τ_{b} values in \(S\tau _{b_{nsig}}\) are not meaningful and the τ_{b} values in S_{NaN} are missing, they are not included in Fig. 8.
In Fig. 8, we can see how the red bars, representing ‘very strong τ_{b}’ within the cumulative metrics (metrics ncloc to duplicated_files), dominate compared to the other levels. The metrics directories, duplicated_lines, duplicated_blocks, and duplicated_files are comparatively weaker in terms of ‘very strong τ_{b}’ than the other cumulative metrics. However, all cumulative metrics score much higher than metrics from the other measurement categories, and metrics from the organic category score the lowest among all metrics and categories. These two figures represent data from all correlations, e.g., the horizontal bar for ncloc comes from the correlation coefficients of ncloc with all other metrics. Thus, from these figures, we cannot determine whether there is any difference between τ_{b} values resulting from intra-category and inter-category metric correlations. We study this in more detail next.
5.3 Significant Correlation Coefficients
Sample mean of significant correlation coefficients (τ_{b}) from the set \(S\tau _{b_{all}}\) (set of all τ_{b}). The three grayscale cell colors indicate three levels of τ_{b}. Cells with red text indicate weak τ_{b}
First, we want to look at intra-category relations among the metrics. Table 13 shows the sample mean of τ_{b} corresponding to the set \(S\tau _{b_{all}}\), and Table 21 in the Appendix shows the mean of τ_{b} within \(S\tau _{b_{sig}}\). Here, we reiterate that we have considered any non-significant τ_{b} from the set \(S\tau _{b_{all}}\) as 0.
Perfect Correlations
We are interested in the perfect correlations (i.e., τ_{b} = 1.0) reported in Tables 13 and 21 between the metric pairs (complexity, complexity_in_classes), (complexity, complexity_in_functions), and (complexity_in_classes, complexity_in_functions), since this indicates that both metrics within each pair measure exactly the same aspect. It can be noted that the τ_{b} for these relations are not exactly 1 as reported in Tables 13 and 21. This happens because we report τ_{b} up to two decimal points, so anything greater than or equal to 0.995 is reported as 1. To understand these relations, we counted the perfect correlation coefficients between all metrics from all projects, reported in Table 20, where we see that five relations have perfect τ_{b}. In 20 projects, the correlation between complexity and complexity_in_classes is perfect. So it is evident that the metrics complexity and complexity_in_classes measure the same aspect and using one of them is sufficient. Since our data comes from Java source code, this is not a surprise because, in Java, code does not reside outside classes.
For the relation between complexity and complexity_in_functions, perfect correlations exist in nine projects. Since we found a perfect correlation in the sample mean of τ_{b} in Table 13, the remaining 12 τ_{b} for this relation must be very strong. We have also calculated the ‘sum of significant τ_{b}’, reported in Table 22. The ‘sum of significant τ_{b}’ for (complexity, complexity_in_functions) is 20.98 out of 21. This indicates that complexity and complexity_in_functions also measure the same aspect with a negligible difference.
Even though the results in Table 20 do not show any perfect correlation between the metrics ncloc, functions, statements, complexity, classes, files, and public_api, we see τ_{b} greater than 0.9, i.e., at the very strong level, for all these relations in the sample mean in Table 13. Considering any relation with a τ_{b} level greater than or equal to 0.9 redundant, we consider all seven of these measures from the cumulative category redundant. Under the same consideration, public_api is redundant to public_undocumented_api.
For the relations between new_duplicated_lines and new_duplicated_blocks and between duplicated_blocks and duplicated_files, perfect correlations are found in only one project. The ‘sum of significant τ_{b}’ for these two pairs is reported in Table 22 as 19.65 and 17.87, respectively. Thus, we cannot say right away that the metrics within these two pairs are redundant.
5.3.1 IntraCategory Correlations
Intracategory correlations indicate correlations within the sections CC, DD, AA, and OO where all correlations within a section are coming from metrics measured similarly.
Section CC
Categorywise mean of sample mean of significant correlation coefficients (τ_{b}) from Table 13
Since the sample mean (i.e., from \(S\tau _{b_{all}}\)) reported in Table 13 is a more conservative measure than the mean of \(S\tau _{b_{sig}}\) in Table 21, we expect equal or higher τ_{b} values in Table 21. From the data, we see very small or no differences for most of the mentioned metrics due to the small number of non-significant and missing τ_{b} in section CC.

Metrics complexity, complexity_in_classes, and complexity_in_functions measure exactly the same aspect.

Metrics ncloc, functions, statements, complexity, classes, files, and public_api, together with the three complexity metrics mentioned above, are all redundant.

Metrics public_api and public_undocumented_api are redundant.

Not a single correlation out of the 105 correlations within the cumulative metrics has a weak correlation coefficient.
Sections DD, AA, OO
As Table 14 shows, sections DD, AA, and OO have on average weak to moderate correlations, compared to strong correlations in section CC. The correlations among all three metrics (comment_lines_density, public_documented_api_density, and duplicated_lines_density) from the density category in section DD have weak τ_{b}. In section AA, the correlation between file_complexity and class_complexity from the average category has a strong τ_{b} of 0.71; the other two correlations (involving function_complexity) in this section have moderate τ_{b}. In section OO, the correlation between new_duplicated_lines and new_duplicated_blocks is 0.94, and based on our 0.9 limit, these two metrics are redundant. The correlations of these two metrics with new_lines are weak.
Categorywise mean of standard deviations of significant correlation coefficients (τ_{b}) from Table 18
Intra-category correlations among cumulative metrics are much higher than those among density, average, and organic metrics. The category-wise mean of the sample mean of τ_{b} for the cumulative category is 0.79, almost at the very strong level, whereas the average and organic categories score 0.54 and 0.55, respectively. In addition, even though we do not see a single weak-level correlation in CC, more than half (5 out of 9) of the correlations in sections DD, AA, and OO have weak τ_{b}.

Metrics new_duplicated_lines and new_duplicated_blocks (in section OO) are redundant.

All correlations among density metrics are weak.

Intracategory correlations for density, average, and organic metrics result in lower τ_{b} compared to cumulative metrics.
5.3.2 InterCategory Correlations
Inter-category correlation happens between metrics that are measured differently. We want to see whether there exists any noticeable difference between inter-category and intra-category correlations.
Inter-category correlations are available in six sections: CD, CA, DA, CO, DO, and AO. These sections reflect all possible correlations, a total of 162, between metrics from the four categories. All these correlations result in weak τ_{b} except three that result in moderate τ_{b} (values 0.56, 0.54, and 0.52), as shown in Table 13. Looking at the category-wise mean in Table 14, we see that sections CO, DO, and AO, i.e., the inter-category correlations of organic metrics with all other categories, are the lowest. Interestingly, the standard deviations are also lower for these three sections related to the organic category (in Tables 18 and 15), meaning that when organic metrics are correlated with metrics from other categories, the variability of τ_{b} is low. When we look at the count of τ_{b} (Tables 23, 24, 25, and 26) and sum of τ_{b} (Tables 27, 28, 29, and 30), we see that sections CO, DO, and AO have values only in Tables 26 and 30, which report weak τ_{b}; in the remaining tables these three sections are zero. For the other three sections (CD, CA, and DA), the ‘count of τ_{b}’ and ‘sum of τ_{b}’ measures are present at all four levels and are most concentrated at the moderate level.
When we look at the difference between intra- and inter-category correlations of metrics, it is clear that they are different. The grand mean of the category-wise means (see Table 14) of intra-category correlations is 0.49 (from (0.79 + 0.09 + 0.54 + 0.55)/4), whereas for inter-category correlations it is 0.06 (from (0.08 + 0.24 + 0.01 + 0.03 − 0.01 + 0)/6).
The observation that correlation between metrics from different categories results in overall weak correlation is important because this tells us that metrics from different categories have low collinearity. Thus, software code metrics from different categories can be used together as features in models for prediction, forecast, etc.

Inter-category correlations of metrics are different from intra-category correlations, resulting in much lower correlation coefficients.

Correlation of metrics from different categories results in overall weak correlation coefficients. Thus, code metrics from different categories are observed to have low correlation coefficients and are non-collinear.
Overall, we have observed that cumulative metrics are different because the intracategory correlations of cumulative metrics result in higher τ_{b} values compared to correlations of metrics within other categories.
5.4 Cumulative vs. Organic Metrics
The progression of software development can be tracked in different ways. Traditionally, we track software through cumulative metrics. Cumulative metrics are intuitive in the sense that they give an overall idea of the state of the system. The same thing, i.e., tracking the progression of software, can also be done organically. Version control systems like Git track revisions by saving the delta of each revision. Taking the delta between consecutive software revisions, we can still calculate the total by linear addition. If we have either a cumulative or an organic measure, we can calculate one from the other.
Null hypothesisH_{0}: The median difference between correlations of metrics from cumulative and organic_{t} categories equals to zero.
5.4.1 Transformed Organic Metrics
We have transformed all 15 cumulative metrics into organic_{t} metrics by taking the difference between consecutive revisions for each metric. We name these new metrics by adding an underscore before the cumulative counterpart metric from which they are transformed, e.g., ncloc is transformed into _ncloc. It should be noted that the three existing metrics from the organic category do not take negative values: if ncloc decreases, new_lines holds a zero value because there are no new lines. However, organic_{t} metrics can hold negative values, reflecting a reduction of a metric’s measure between consecutive revisions.
5.4.2 Designing and Executing the Test
After computing the organic_{t} metrics, we went through the same procedure as for the other metrics in this study, i.e., checking the nature of the data. We found organic_{t} metrics to be similar to organic metrics in terms of kurtosis and heavy tails. However, for skewness, organic_{t} metrics are not as extremely skewed as organic metrics because organic_{t} metrics can take negative values. Overall, organic_{t} metrics are non-normal.
Sample mean values for significant correlation coefficients (τ_{b}) of all metrics including transformed organic metrics. Three grayscale cell colors indicate three levels of τ_{b}. Cells with red text indicate weak τ_{b}. A dot (.) in a cell indicates zero value
5.4.3 Results from the Test
In Tables 16, 31, and 32, we see that organic_{t} metrics are similar to organic metrics in terms of both intra- and inter-category correlations.
Now we would like to see whether organic_{t} metrics are able to reproduce the perfect correlations produced by the cumulative metrics. Following the referencing style of Table 11, we refer to the section containing τ_{b} from intra-category correlations of organic_{t} metrics as O_{t}O_{t}. The intra-category correlations of cumulative metrics (in section CC of Table 20) have a total of 46 counts of perfect correlations across six correlations. For organic_{t} metrics, we see (in section O_{t}O_{t} of Table 32) that the 24 perfect correlations for four of these correlations are exactly reproduced. However, for the remaining two correlations ((complexity, complexity_in_functions) and (complexity_in_classes, complexity_in_functions)), 12 out of 22 perfect correlations are produced. So we look at the sample mean of these two correlations (see section O_{t}O_{t} of Table 16) and find a value of 0.994 for both; 0.994 is so close to the maximum possible τ_{b} value of 1 that we can consider it a perfect correlation. Thus, both cumulative and organic_{t} metrics are equally able to detect perfect correlations among metrics if any exist.
At this point, we would like to focus on testing the null hypothesis. The mean of section O_{t}O_{t} of Table 16 is 0.49, which is much lower than the value of 0.79 for the cumulative metrics (in section CC of Table 14). However, without performing a statistical test, we cannot accept or reject our null hypothesis.
Earlier our data was the code metrics, but now it is the sample mean of Kendall’s τ. So we need to check the distributions of the data to be tested, i.e., sections CC and O_{t}O_{t} of Table 16. We have checked descriptive statistics, skewness, kurtosis, and histograms, and also performed the Shapiro-Wilk test on these two data sets, and found the data non-normal. Thus, we have to choose nonparametric tests.
We can consider the data as paired because a τ_{b} value in section CC (say, for ncloc) and the corresponding τ_{b} value from section O_{t}O_{t} (i.e., for _ncloc) both measure a similar aspect, only measured differently. It can also be argued that ncloc and _ncloc are two different measures and cannot be considered paired. Accommodating both arguments, we execute two tests to see whether there is any significant difference. Taking our two data sets as independent samples, we perform the Mann-Whitney U test, and taking them as dependent samples, we perform the Wilcoxon signed-rank test. After checking the assumptions of these tests, we remove two τ_{b} from section CC and the two corresponding τ_{b} from section O_{t}O_{t}: since the correlations between the metrics complexity, complexity_in_classes, and complexity_in_functions in section CC and the corresponding organic_{t} measures in section O_{t}O_{t} are identified as perfectly correlated in both sections, we keep only one of them and remove the other two so that no dependency remains within the data. The Wilcoxon signed-rank test assumes that the differences between paired data are approximately symmetrically distributed, e.g., along the quartiles when plotted as a one-dimensional boxplot. This assumption was not met for our data. Since this assumption is checked visually, we decided to also include a paired-samples sign test, which has no such assumption. We report the results from all three tests here.
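The three tests can be sketched with SciPy on two small invented samples of τ_{b} values; SciPy has no dedicated paired-samples sign test, so it is assembled here from an exact binomial test on the signs of the paired differences.

```python
# Sketch of the three tests above on two small invented samples of
# tau_b values (a CC-like and an OtOt-like sample; not the study's data).
import numpy as np
from scipy import stats

cc = np.array([0.95, 0.91, 0.88, 0.93, 0.90, 0.97, 0.89, 0.92])
otot = np.array([0.55, 0.48, 0.60, 0.42, 0.52, 0.58, 0.44, 0.50])

u, p_u = stats.mannwhitneyu(cc, otot)  # treats the samples as independent
w, p_w = stats.wilcoxon(cc, otot)      # treats the samples as paired

# Paired-samples sign test: count positive differences and test them
# against a fair coin with an exact binomial test.
diff = cc - otot
pos = int(np.sum(diff > 0))
n = int(np.sum(diff != 0))
p_sign = stats.binomtest(pos, n, 0.5).pvalue

print(p_u < 0.05, p_w < 0.05, p_sign < 0.05)  # True True True
```

With complete separation between the samples, all three tests agree, mirroring the kind of outcome the section reports for the cumulative versus organic_{t} comparison.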
5.4.4 Implications of the Test Results
The finding that there is a significant median difference between the correlations of the cumulative and the organic metrics implies that the organic metrics are a set of measures that is collectively different from the cumulative metrics. Therefore, organic metrics can be considered a new set of features whose characteristics, taken as a whole, differ from those of cumulative metrics.
The intra-category correlations of cumulative metrics are much higher than those of their equivalent sets of transformed organic metrics. Since the correlations between cumulative metrics are high, there is high collinearity among these metrics, and only a single metric from a group of highly correlated metrics can be considered a valid input feature for a predictive model. The high collinearity among software code metrics is not new information. However, the knowledge that transforming cumulative metrics into organic ones can significantly reduce this collinearity is new. Since this transformation does not alter the original footprint of how the software evolved, it is expected to be free of the side effects of normalization. Because organic metrics exhibit lower collinearity, obtaining multiple valid input features from them becomes possible.
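The effect of cumulation on correlation can be reproduced with a toy simulation (hypothetical per-commit values, not the study's data): two genuinely unrelated per-commit series have a near-zero Kendall τ, but their running sums are almost perfectly correlated simply because both are monotonically increasing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-commit ("organic") values of two unrelated metrics.
organic_a = rng.poisson(20, size=500)   # e.g. lines added per commit
organic_b = rng.poisson(5, size=500)    # e.g. comment lines per commit

# The conventional ("cumulative") measurement is the running sum.
cumulative_a = organic_a.cumsum()
cumulative_b = organic_b.cumsum()

# scipy.stats.kendalltau computes the tie-adjusted tau-b variant.
tau_organic, _ = stats.kendalltau(organic_a, organic_b)
tau_cumulative, _ = stats.kendalltau(cumulative_a, cumulative_b)

print(round(tau_organic, 2))     # near 0: no inherent relationship
print(round(tau_cumulative, 2))  # near 1: correlation from cumulation alone
```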
The inter-category correlations of metrics from different categories are observed to be consistently low. This information can make feature engineering more systematic. First, it gives a simple and intuitive understanding of non-collinearity based on measurement types. Second, it implies that transforming a feature into a different measurement type can produce a feature that is non-collinear with the original one.
5.5 Discussion
Software engineering researchers have observed high collinearity among software code metrics. However, we did not have explicit knowledge of whether this high collinearity is due to the inherent nature of the code metrics or due to how we measure them. This study compares the correlation coefficients of a set of 15 cumulative metrics with those of their corresponding organic metrics and reveals that a large portion of the collinearity among cumulative metrics results from the cumulative way of measurement. Since organic metrics are free from the effects of cumulation and represent the natural evolution of software, we believe the correlation coefficients among the organic metrics represent the inherent collinearity among code metrics. Taking the difference between the intra-category correlation coefficients of these two categories, we can determine the collinearity added by cumulative measurement.
High collinearity within a group of input features weakens them as a whole: we obtain only a few valid input features from such a set because features with high collinearity are removed during the validation process. Since organic metrics have lower collinearity among themselves, we can possibly obtain more valid input features from them. Moreover, since the collinearity between cumulative and organic metrics is very low, we can combine them to obtain even more valid input features. This low collinearity also means that, even though we can calculate a set of organic metrics corresponding to a set of cumulative metrics, the two sets are mutually exclusive as features. Therefore, theoretically, we do not have to restrict our choice of metrics to either of these two categories.
Interestingly, when we add the density and average categories to the scenario described above for cumulative and organic metrics, non-collinearity still holds between metrics from different measurement categories. It can be noted that the unit of measurement of the density metrics is percentage. Cumulative and organic metrics share the same value type (i.e., integer), yet the metrics in the two categories are measured differently.
The findings that organic metrics are collectively different from their corresponding cumulative metrics (i.e., inter-category), the understanding of the effect of cumulation on collinearity among metrics (i.e., intra-category), and the general observation that metrics from different measurement categories yield overall weak correlation coefficients are significant for researchers and practitioners in software engineering because they help us better understand the nature of code metrics. The findings open the possibility of new research and of revisiting existing research in this field. Based on our results, we theoretically have more valid input features across measurement categories; in practice, however, research needs to be conducted to determine which predictors (i.e., input metrics) are good for which targets (i.e., qualities we want to predict or estimate). It could be the case that specific metrics from a category work better in combination with other metrics to predict a particular quality attribute. Researchers can also try to find new measurement categories and their properties. This study has investigated some code metrics; other code metrics not covered here can also be studied. These results can give practitioners more insight into software metrics. Quality managers can reprioritize the metrics that a project keeps track of. Tool developers can rethink the metrics their tools support. For example, SonarQube automatically removes new_duplicated_lines, new_lines, new_duplicated_blocks, and some other similar organic metrics. Based on the findings of this study, they can decide to give users the option to keep track of such metrics. Software engineers working with predictive analysis have more options when choosing input features for their models. For example, suppose complexity (of type cumulative) is identified as an input feature for a predictive model. The possibility of incorporating complexity-related metrics from other measurement categories can then be explored, and complexity per file (of type average) and complexity per commit or per specific time duration (of type organic) can be considered as additional input features.
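To make the four measurement categories concrete, the sketch below derives a cumulative, organic, average, and density variant from the same hypothetical per-commit data. All names and numbers are illustrative, and the density formula mirrors a common comment-density definition rather than any specific tool's implementation.

```python
import numpy as np

# Hypothetical per-commit measurements of a small project (illustrative).
complexity_added = np.array([3, 0, 7, 2, 5])   # complexity per commit
files_in_project = np.array([10, 10, 12, 12, 13])
comment_lines = np.array([4, 1, 9, 3, 6])      # comment lines per commit
ncloc = np.array([30, 8, 50, 20, 40])          # non-comment lines per commit

organic = complexity_added               # per-revision value, no consolidation
cumulative = complexity_added.cumsum()   # running sum over the history
average = cumulative / files_in_project  # e.g. complexity per file

# Density as a percentage, e.g. comment density over the accumulated code
# (assumed formula: comments / (comments + code) * 100).
density = 100 * comment_lines.cumsum() / (comment_lines.cumsum()
                                          + ncloc.cumsum())
```

The organic series preserves the shape of the project's natural evolution, while the cumulative series consolidates it; the average and density variants rescale the consolidated values by a size measure.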
5.6 Threats to Validity and Limitations
5.6.1 Internal Validity
Today software is built with various languages, and it is very common for a project to use code from several of them. Still, a project is usually designated with a specific language indicating the major portion of its code. Programming languages use different constructs, and measures of code metrics may vary dramatically due to language differences. To eliminate this effect, we chose to focus on a single programming language, Java. It can be noted that Java projects rank among the top three on GitHub.
We considered other factors when selecting projects, such as project type, project size in LOC, number of revisions, and number of developers. While selecting the projects, we tried to combine these factors so that we have a good representation across them. Besides, we looked at the numbers of issues and pull requests: we tried to select projects with reported issues and pull requests because these are signs of active involvement of users and developers.
Tools measuring software code metrics are not perfect. Different tools implement measures differently even though they claim to measure the same aspect of code (Lincke et al. 2008). This study selected one of the most widely used measurement tools for software quality, and we observed very strong correlations between the cumulative metrics, similar to the major studies. However, selecting a popular tool and obtaining observations similar to the major studies do not entirely remove this threat. This is a general issue for any study of this kind, and we are aware of it.
We mentioned earlier that we only checked the revisions on the master branch of a Git repository. This affects the granularity of the collected data. Based on the findings of this study, we know that reducing the cumulative effect reduces the correlation between metrics; thus, avoiding the partial cumulation of data caused by Git branches could possibly make the results of this study even stronger. Therefore, we do not consider this a major threat to our results.
5.6.2 External Validity
There are millions of software projects hosted on GitHub. Generalizing results to such a large population is a key validity threat for any study, and we likewise identified generalizability as a considerable threat. Careful project selection is one of the most important ways to minimize this threat, so we tried to select projects from well-known organizations. For example, we included projects from Apache, a pioneering organization in the open source community, from Microsoft, and from other organizations with diverse portfolios. On the positive side, we found highly significant results from the different tests. For example, the Mann-Whitney U test in Fig. 9a, the Wilcoxon signed-rank test in Fig. 10a, and the paired-samples sign test in Fig. 10b have significance values of 4.4e−18, 1.8e−18, and 1.5e−23, respectively.
To safeguard internal validity, we chose to restrict our focus to Java source code. This choice, however, may affect external validity: do the results hold for source code written in other programming languages? Since the measurement types discussed in this study are independent of the programming language, we consider this threat not highly significant. However, more research can be done to be certain.
6 Conclusions
This empirical research investigates whether the measurement types of software code metrics have an effect on their correlations. By collecting and analyzing 24 code metrics from 11,874 revisions of 21 open source Java projects, we have found that measurement types do have an effect on correlations. Analysis of the data shows that 10 out of the 15 cumulatively measured metrics are redundant based on our criterion that two metrics with a correlation coefficient of 0.9 or above are redundant. When the cumulative effect is removed by transforming these 15 metrics into the organic type, only three of them are identified as redundant. These three metrics are identified as perfectly correlated (i.e., they measure exactly the same aspect) in both categories, implying that if metrics are truly correlated, the correlations of their organic measures are able to identify it. In addition, our analysis shows that organic metrics result in significantly lower intra-category correlation coefficients than cumulative metrics. Furthermore, while some software metrics are so closely related that their high correlation coefficients justify considering them redundant, many of the higher correlation coefficient values are due to measuring these metrics cumulatively. In other words, we should not be surprised to see higher correlation coefficients for cumulative metrics, and we should be aware that measuring metrics by their natural development, i.e., organically, results in much lower correlation coefficients.
Another interesting finding is that correlations between metrics from different categories yield overall weak correlation coefficients. This finding is important because metrics from the cumulative, density, average, and organic categories can be combined as features for predictive models. From another viewpoint, this could improve the process of feature engineering by providing the information that transforming a feature into a different measurement type produces a new feature that is non-collinear with the original feature.
We have discussed why Kendall's τ version B is a better fit for software projects. We have also discussed the landscape of correlation coefficients, outlining possible sets considering both the correlation coefficient and the p-value, which can be helpful for this type of study.
This study has attempted to reveal the fundamental relationships between measurement types and correlations of software code metrics. More evidence is required to generalize and extend this knowledge. Thus, replicated studies can be conducted considering various metrics, measurement tools, programming languages, and project types.
Acknowledgments
The authors would like to thank Dr. Aila Särkkä for her comments on some of the statistical analyses performed in this study, and the anonymous reviewers for their valuable comments and suggestions, which have significantly improved the clarity of this paper.
References
Aggarwal CC (2013) Outlier Analysis. Springer Publishing Company, Incorporated, Berlin
Baxter G, Frean M, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of Java software. In: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, ACM, New York, OOPSLA '06, pp 397–412. https://doi.org/10.1145/1167473.1167507
Chidamber SR, Darcy DP, Kemerer CF (1998) Managerial use of metrics for object-oriented software: an exploratory analysis. IEEE Trans Softw Eng 24(8):629–639
Coleman D, Ash D, Lowther B, Oman P (1994) Using metrics to evaluate software system maintainability. Computer 27(8):44–49. https://doi.org/10.1109/2.303623
Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10):687–708. https://doi.org/10.1109/TSE.2007.1019
Crawford L, Hobbs JB, Turner JR (2002) Investigation of potential classification systems for projects. In: Proceedings of the 2nd PMI Research Conference, Project Management Institute, Seattle, Washington, pp 181–190
Croux C, Dehon C (2010) Influence functions of the Spearman and Kendall correlation measures. Statistical Methods & Applications 19(4):497–515. https://doi.org/10.1007/s102600100142z
Deshpande A, Riehle D (2008) The total growth of open source. In: Open Source Development, Communities and Quality. Springer, Boston, pp 197–209. https://doi.org/10.1007/9780387096841_16
Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, Marquéz JRG, Gruber B, Lafourcade B, Leitão PJ, Münkemüller T, McClean C, Osborne PE, Reineking B, Schröder B, Skidmore AK, Zurell D, Lautenbach S (2013) Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36(1):27–46. https://doi.org/10.1111/j.16000587.2012.07348.x
El Emam K, Schneidewind NF (2000) Methodology for validating software product metrics. Technical Report NCR/ERC1076, National Research Council of Canada, Ottawa, Ontario, Canada. http://nick.adjective.com/work/ElEmam2000.pdf
Ferreira KAM, Bigonha MAS, Bigonha RS, Mendes LFO, Almeida HC (2012) Identifying thresholds for object-oriented software metrics. J Syst Softw 85(2):244–257. https://doi.org/10.1016/j.jss.2011.05.044
Field A (2009) Discovering Statistics Using SPSS. SAGE Publications
Gil Y, Lalouche G (2017) On the correlation between size and metric validity. Empir Softw Eng 22(5):2585–2611. https://doi.org/10.1007/s1066401795135
Hair J (2006) Multivariate Data Analysis. Pearson international edition, Pearson Prentice Hall. https://books.google.se/books?id=WESxQgAACAAJ
Henry S, Selig C (1990) Predicting source-code complexity at the design stage. IEEE Softw 7(2):36–44. https://doi.org/10.1109/52.50772
Henry S, Kafura D, Harris K (1981) On the relationships among three software metrics. In: Proceedings of the 1981 ACM Workshop/Symposium on Measurement and Evaluation of Software Quality, ACM, New York, pp 81–88. https://doi.org/10.1145/800003.807911
Janes A, Lenarduzzi V, Stan AC (2017) A continuous software quality monitoring approach for small and medium enterprises. In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, ACM, New York, ICPE '17 Companion, pp 97–100. https://doi.org/10.1145/3053600.3053618
Jantunen S, Lehtola L, Gause DC, Dumdum UR, Barnes RJ (2011) The challenge of release planning. In: 2011 Fifth International Workshop on Software Product Management (IWSPM), pp 36–45. https://doi.org/10.1109/IWSPM.2011.6046202
Jay G, Hale JE, Smith RK, Hale D, Kraft NA, Ward C (2009) Cyclomatic complexity and lines of code: empirical evidence of a stable linear relationship. J Softw Eng Appl 02(03):137. https://doi.org/10.4236/jsea.2009.23020
Kazman R, Cai Y, Mo R, Feng Q, Xiao L, Haziyev S, Fedak V, Shapochka A (2015) A case study in locating the architectural roots of technical debt. In: Proceedings of the 37th International Conference on Software Engineering, vol 2. IEEE Press, Piscataway, ICSE '15, pp 179–188. http://dl.acm.org/citation.cfm?id=2819009.2819037
Kendall MG, Gibbons JD (1990) Rank Correlation Methods, 5th edn. Oxford University Press, London
Kim HY (2013) Statistical notes for clinical researchers: assessing normal distribution (2) using skewness and kurtosis. Restorative Dentistry & Endodontics 38(1):52–54. https://doi.org/10.5395/rde.2013.38.1.52
Landman D, Serebrenik A, Bouwers E, Vinju JJ (2016) Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions. Journal of Software: Evolution and Process 28(7):589–618. https://doi.org/10.1002/smr.1760
Lethbridge TC, Sim SE, Singer J (2005) Studying software engineers: data collection techniques for software field studies. Empir Softw Eng 10(3):311–341. https://doi.org/10.1007/s106640051290x
Letouzey J, Ilkiewicz M (2012) Managing technical debt with the SQALE method. IEEE Softw 29(6):44–51. https://doi.org/10.1109/MS.2012.129
Lincke R, Lundberg J, Löwe W (2008) Comparing software metrics tools. In: Proceedings of the 2008 International Symposium on Software Testing and Analysis, ACM, New York, ISSTA '08, pp 131–142. https://doi.org/10.1145/1390630.1390648
Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodol 18(1):2:1–2:26. https://doi.org/10.1145/1391984.1391986
Mamun MAA, Berger C, Hansson J (2017) Correlations of software code metrics: an empirical study. In: Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement, ACM, New York, IWSM Mensura '17, pp 255–266. https://doi.org/10.1145/3143434.3143445
McCabe TJ (1976) A complexity measure. IEEE Trans Softw Eng SE-2(4):308–320. https://doi.org/10.1109/TSE.1976.233837
Meloun M, Militký J, Hill M, Brereton RG (2002) Crucial problems in regression modelling and their solutions. Analyst 127(4):433–450. https://doi.org/10.1039/B110779H
Meneely A, Smith B, Williams L (2013) Validating software metrics: a spectrum of philosophies. ACM Trans Softw Eng Methodol 21(4):24:1–24:28. https://doi.org/10.1145/2377656.2377661
Meulen MJPvd, Revilla MA (2007) Correlations between internal software metrics and software dependability in a large population of small C/C++ programs. In: The 18th IEEE International Symposium on Software Reliability (ISSRE '07), pp 203–208. https://doi.org/10.1109/ISSRE.2007.12
Park HM (2009) Univariate analysis and normality test using SAS, Stata, and SPSS. Technical Working Paper, The University Information Technology Services (UITS) Center for Statistical and Mathematical Computing, Indiana University. https://scholarworks.iu.edu/dspace/handle/2022/19742
Riaz M, Mendes E, Tempero E (2009) A systematic review of software maintainability prediction and metrics. In: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, IEEE Computer Society, Washington, ESEM '09, pp 367–377. https://doi.org/10.1109/ESEM.2009.5314233
Runeson P, Höst M (2008) Guidelines for conducting and reporting case study research in software engineering. Empir Softw Eng 14(2):131–164. https://doi.org/10.1007/s1066400891028
Saini S, Sharma S, Singh R (2015) Better utilization of correlation between metrics using Principal Component Analysis (PCA). In: 2015 Annual IEEE India Conference (INDICON), pp 1–6. https://doi.org/10.1109/INDICON.2015.7443299
Schroeder J, Berger C, Staron M, Herpel T, Knauss A (2016) Unveiling anomalies and their impact on software quality in model-based automotive software revisions with software metrics and domain experts. In: Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, ACM, New York, pp 154–164. https://doi.org/10.1145/2931037.2931060
Shevlyakov GL, Vilchevski NO (2002) Robustness in Data Analysis: Criteria and Methods. Modern Probability and Statistics, VSP BV
Shihab E, Jiang ZM, Ibrahim WM, Adams B, Hassan AE (2010) Understanding the impact of code and process metrics on post-release defects: a case study on the Eclipse project. In: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, p 4
Succi G, Pedrycz W, Djokic S, Zuliani P, Russo B (2005) An empirical exploration of the distributions of the Chidamber and Kemerer object-oriented metrics suite. Empir Softw Eng 10(1):81–104. https://doi.org/10.1023/B:EMSE.0000048324.12188.a2
Tashtoush Y, Al-Maolegi M, Arkok B (2014) The correlation among software complexity metrics with case study. arXiv:1408.4523
Taylor R (1990) Interpretation of the correlation coefficient: a basic review. Journal of Diagnostic Medical Sonography 6(1):35–39
Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: Proceedings Third IEEE International Workshop on Source Code Analysis and Manipulation, pp 45–54. https://doi.org/10.1109/SCAM.2003.1238030
Xu W, Hou Y, Hung Y, Zou Y (2013) A comparative analysis of Spearman's rho and Kendall's tau in normal and contaminated normal models. Signal Process 93(1):261–276. https://doi.org/10.1016/j.sigpro.2012.08.005
Zhou Y, Leung H, Xu B (2009) Examining the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. IEEE Trans Softw Eng 35(5):607–623. https://doi.org/10.1109/TSE.2009.32
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.