Abstract
Context
Software metrics play a significant role in many areas of the software lifecycle, including forecasting defects and predicting maintenance effort, cost, etc. through predictive analysis. Many studies have found code metrics to be so highly correlated with each other that they are considered redundant, which implies that it is enough to keep track of a single metric from a list of highly correlated metrics.
Objective
Software is developed incrementally over a period. Traditionally, code metrics are measured cumulatively as cumulative sum or running sum. When a code metric is measured based on the values from individual revisions or commits without consolidating values from past revisions, indicating the natural development of software, this study identifies such a type of measure as organic. Density and average are two other ways of measuring metrics. This empirical study focuses on whether measurement types influence correlations of code metrics.
Method
To investigate the objective, this empirical study has collected 24 code metrics classified into four categories, according to the measurement types of the metrics, from 11,874 software revisions (i.e., commits) of 21 open source projects from eight well-known organizations. Kendall’s τB is used for computing correlations. To determine whether there is a significant difference between cumulative and organic metrics, the Mann-Whitney U test, the Wilcoxon signed-rank test, and the paired-samples sign test are performed.
Results
The cumulative metrics are found to be highly correlated to each other with an average coefficient of 0.79. For corresponding organic metrics, it is 0.49. When individual correlation coefficients between these two measure types are compared, correlations between organic metrics are found to be significantly lower (with p < 0.01) than cumulative metrics. Our results indicate that the cumulative nature of metrics makes them highly correlated, implying cumulative measurement is a major source of collinearity between cumulative metrics. Another interesting observation is that correlations between metrics from different categories are weak.
Conclusions
Results of this study reveal that measurement types may have a significant impact on the correlations of code metrics and that transforming metrics into a different type can give us metrics with low collinearity. These findings provide a simple understanding of how feature transformation to a different measurement type can produce new non-collinear input features for predictive models.
1 Introduction
The exponential growth of software size (Deshpande and Riehle 2008) is bringing in many challenges related to maintainability, release planning, and other software qualities. Thus, a natural demand to predict external product quality factors to foresee the future state of software has been observed. Maintainability is related to the size, complexity and documentation of software (Coleman et al. 1994). Size and complexity metrics are common among other metrics to predict software maintainability (Riaz et al. 2009). Growing software size and complexity have made it increasingly difficult to select features to be implemented in the next product release and have challenged existing assumptions and approaches for release planning (Jantunen et al. 2011).
Validating software metrics has gained importance as predicting external software qualities is becoming more demanding day by day in order to manage future revisions of software. Researchers have proposed many validation criteria for software metrics over the last 40 years, e.g., a list of 47 criteria is reported in a systematic literature study by Meneely et al. (2013), one of which is non-collinearity. Collinearity (also known as multicollinearity) exists between two independent features if they are linearly related. Since prediction models are often multivariate, i.e., use more than one independent feature or metric, it is important that there is no significant collinearity among the independent features. Collinearity results in two major problems (Meloun et al. 2002). First, it makes a model less useful as individual effects of the independent features on a dependent feature can no longer be isolated. Second, extrapolation is likely to be highly erroneous. Thus, El Emam and Schneidewind (2000) and Dormann et al. (2013) suggested diagnosing collinearity among the independent features for a proper interpretation of regression models.
Many studies have explored correlations between various software metrics such as McCabe’s cyclomatic complexity (Landman et al. 2016; Henry and Selig 1990; Henry et al. 1981; Tashtoush et al. 2014; Jay et al. 2009; Meulen and Revilla 2007), lines of code (Landman et al. 2016; Henry and Selig 1990; Tashtoush et al. 2014; Jay et al. 2009; Meulen and Revilla 2007), Halstead’s metrics (Henry and Selig 1990; Henry et al. 1981; Tashtoush et al. 2014; Meulen and Revilla 2007), Kafura’s information flow (Henry et al. 1981), number of comments (Meulen and Revilla 2007), etc. Most of these studies have observed that code metrics are highly correlated. However, they do not address whether the measurement types of metrics affect their correlations, which is the primary difference between these studies and our study. Rather than taking the usual way of checking correlations of code metrics, we focus on whether the construction of code metrics (meaning how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding collinearity of code metrics. A description of the measurement types used in this study is given below.
Cumulative: This is the traditional and most common way of measuring software code metrics: as a cumulative sum or running sum over revisions. By a revision, we mean a commit, which is a single entry of source code in a repository. For example, if the number of Lines Of Code (LOC) written for a project’s first three revisions is 50, 30, and 30 consecutively, the corresponding cumulative measures of ncloc for the revisions would be 50, 80, and 110.
Density: This measure tells us how much of a measure is present per standard unit of an artifact. Generally, a density is a ratio. Within the context of code, if we consider 100 LOC as the unit, the measurement unit becomes a percentage, and such a metric can take a value from 0 to 100. For example, the metric comment_lines_density measures lines of comments per 100 lines of code.
Average: This is the mean value of a measure with respect to artifacts related to a specific type. An example of such a metric is file_complexity which measures the mean complexity per file.
Organic: A metric that measures artifacts from a single revision or two consecutive revisions without being influenced by any other revisions in the repository is organic. We have introduced the term organic because, unlike the cumulative measure, this measure is not affected by the entire list of unbounded preceding revisions. An organic metric can measure purely from a single revision, e.g., new_lines measures the lines of code (that are not comments) specific to a single revision. It can be zero if no new code is added in a revision; however, it cannot be negative (like a code churn measure). An organic metric can also measure a single revision relative to its one preceding revision. Since in this case it reflects a change or delta compared to the preceding revision, it can be positive, negative, or zero.
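The relationship between the measurement types can be sketched in a few lines of Python. The per-revision LOC values (50, 30, 30) come from the cumulative example above; the function names and the figures passed to `density` and `average` are hypothetical and chosen only for illustration. What exactly counts as the "whole" of a density metric is defined per metric and is not assumed here.

```python
from itertools import accumulate


def cumulative(organic_values):
    """Running sum: the value at revision i includes all earlier revisions."""
    return list(accumulate(organic_values))


def density(part, whole):
    """Share of `part` per 100 units of `whole`, i.e. a percentage."""
    return 100.0 * part / whole if whole else 0.0


def average(total, count):
    """Mean of a measure per artifact, e.g. complexity per file."""
    return total / count if count else 0.0


# Per-revision (organic) LOC values from the example in the text.
loc_per_revision = [50, 30, 30]
print(cumulative(loc_per_revision))  # [50, 80, 110]

# Hypothetical figures, for illustration only.
print(density(20, 100))  # 20.0
print(average(120, 8))   # 15.0
```

Note that the density and average values are computed from totals, so they inherit whatever cumulative totals they are built on.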
The core idea of this study was developed while following a previous study (Mamun et al. 2017) where we focused on the domain-level correlation of metrics from four domains that are size, complexity, documentation, and duplications. In the follow-up study, we explored correlations at the metric-level and observed that the organic metrics consistently have lower correlations. Based on this observation from the follow-up study, we initiated and designed this study by grouping the code metrics based on how they are measured.
Due to the problems of collinearity when building predictive models, many studies have investigated how different metrics are correlated with each other. However, to our knowledge, no study has investigated the impact of measurement types on the correlations of software code metrics. This knowledge is fundamental to understand the metrics better. With a goal to understand the relationship between measurement types and correlations of software code metrics, this study has the following research question.
RQ: How do measurement types affect correlations of code metrics?
This study has selected 21 open source projects from eight organizations and analyzed the source code of a total of 11,874 revisions from all projects to extract code metrics. We have mined 24 software metrics classified into four categories: cumulative, density, average, and organic.
The complete revision histories of the selected projects have been analyzed using a static analysis tool to generate code metrics. The code metrics are then mined from the database for analysis. Before performing data analysis, data is explored using various visual and theoretical statistical tools. Based on the nature of the data, we selected Kendall’s τB (a non-parametric method for correlation) for all selected projects. Motivations for selecting Kendall’s τB are discussed in Section 4. Correlation coefficients are divided into different sets based on their level of strength and level of significance. Based on the results up to this point, we transformed all cumulative metrics into organic metrics and ran statistical tests to determine whether there is a significant difference in correlation between these two sets.
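To make the correlation measure concrete, the following is a minimal pure-Python sketch of Kendall’s τB with the usual tie correction. It is not the analysis code of this study, and the two five-value series are contrived rather than taken from the dataset. They illustrate one mechanism by which cumulation can inflate correlations: when all per-revision values are positive, every cumulative series increases strictly, so any two of them are perfectly rank-correlated regardless of how the underlying organic values relate.

```python
from itertools import accumulate, combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Each factor counts the pairs that are not tied in one of the variables.
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Contrived organic series that move in opposite directions.
organic_a = [1, 2, 3, 4, 5]
organic_b = [5, 4, 3, 2, 1]
print(kendall_tau_b(organic_a, organic_b))  # -1.0

# Their running sums both increase strictly, so the ranks agree completely.
cumulative_a = list(accumulate(organic_a))
cumulative_b = list(accumulate(organic_b))
print(kendall_tau_b(cumulative_a, cumulative_b))  # 1.0
```

For real data sizes, a library routine with an O(n log n) algorithm would be preferable to this quadratic scan.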
Results of this study indicate how correlations of code metrics are influenced by their measurement types, i.e., the way they are measured. We can see whether there is a difference between intra-category correlations of metrics from the same category and inter-category correlations of metrics from different categories. Based on the data analysis, we will also report whether there is a significant difference between intra-category correlations of cumulative metrics and intra-category correlations of organic metrics. These understandings are fundamental because they can reveal whether high collinearity between code metrics is due to their measurement types. Such knowledge can be helpful in making an informed decision while selecting code metrics as features for predictive models.
In the following sections of this paper, we first discuss the methodology including design of this study, data collection procedures, nature of the collected data and data processing. Based on the nature of data observed in Section 3.3, Section 4 (data analysis method), presents a comparative discussion of applicable correlation methods and pros and cons of different measures to aggregate results from the data. Section 5 shows results and implications. Based on some results, this study performed an additional test. Retaining the actual work flow of this study, we have put the design and execution of this test in Section 5.4. This section also includes discussion, limitations and validity threats to this study. Finally, Section 6 summarizes the conclusions of this study.
2 Related Work
Software code metrics are generally known to be highly correlated as many studies have reported high correlation among various code metrics. A recent systematic literature review from 2016 (Landman et al. 2016) presents a summary of 33 articles reporting correlations between McCabe’s cyclomatic complexity (McCabe 1976) and LOC (lines of code). Henry and Selig (1990) reported correlations of five code metrics (LOC, three Halstead’s software-science metrics (N, V, and E), and McCabe’s cyclomatic complexity). They worked with code written in the Pascal language and observed three notably high correlations (the values in parentheses indicate the correlation coefficients): Halstead N - Halstead V (0.989), LOC - Halstead N (0.893), and LOC - Halstead V (0.885). Henry et al. (1981) compared three complexity metrics: McCabe’s cyclomatic complexity, Halstead’s effort, and Kafura’s information flow. Taking the UNIX operating system as a subject, they found McCabe’s cyclomatic complexity and Halstead’s effort highly correlated while Kafura’s information flow is found to be independent. On NASA’s open dataset, Tashtoush et al. (2014) studied cyclomatic complexity, Halstead complexity, and LOC metrics. They found a strong correlation between cyclomatic complexity and Halstead’s complexity similar to the study by Henry et al. (1981). LOC is observed to be highly correlated with both of these complexity metrics. Jay et al. (2009), in a comprehensive study, also explored the relationship between McCabe’s cyclomatic complexity and LOC. They worked with 1.2 million C, C++ and Java source files randomly selected from the SourceForge code repository. They reported that cyclomatic complexity and LOC practically have a perfect linear relationship irrespective of programming languages, programmers, code paradigms, and software processes.
Toward comparing four internal code metrics (McCabe’s cyclomatic complexity, Halstead volume, LOC, and number of comments), Meulen and Revilla (2007) used 59 specifications each containing between 111 and 11,495 small (up to 40KB file size) C/C++ programs. They observed strong correlations between LOC, Halstead volume, and cyclomatic complexity. A recent study by Landman et al. (2016) on extensive Java and C corpora (17.6 million Java methods and 6.3 million C functions) finds no linear correlation between cyclomatic complexity and LOC strong enough for them to be considered redundant. This finding contradicts many earlier studies including (Henry et al. 1981; Tashtoush et al. 2014; Saini et al. 2015; Jay et al. 2009; Meulen and Revilla 2007).
The studies discussed here mostly cover McCabe’s cyclomatic complexity, Halstead’s metrics, and LOC, investigating correlations between them and showing different results. However, they do not address whether the measurement types of the studied metrics affect the strength of correlations, which is the primary difference between these studies and our study. Rather than taking the usual way of checking correlations of code metrics, we focus on whether the construction of code metrics (meaning how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding collinearity of code metrics.
Zhou et al. (2009) have reported that size metrics have confounding effects on the associations between object-oriented metrics and change-proneness. In a revisited study, Gil and Lalouche (2017) reported similar results about the confounding of the size metric. Zhou et al. (2009) have elaborately explained the confounding effect and models to identify them in areas like health sciences and epidemiological research. Gil and Lalouche (2017) used normalization as a way to remove the confounding effect. While they mentioned having lower correlation coefficients for normalized metrics, they have not explicitly reported the overall difference between correlations coming from the intra-cumulative and the intra-normalized measures. They also did not report whether there exists a statistically significant difference between the two. This is understandable, as their primary focus is on the validity of metrics. Our focus, in contrast, is solely on understanding the effects of measurements on the correlations of code metrics. We want to understand how much of the collinearity comes from the types of measures and how much of it exists naturally.
There have been studies toward understanding the distributions of software metrics. For example, Wheeldon and Counsell (2003), Concas et al. (2007), and Louridas et al. (2008) have investigated whether power law distributions are present among software metrics. They have reported that various software metrics follow different distributions with long fat tails. Louridas et al. (2008) have also reported correlations among eight software metrics including LOC and number of methods. They reported a high correlation between LOC, number of methods (NOM), and out degree of classes. Baxter et al. (2006) reported a similar study; however, unlike Wheeldon and Counsell (2003), they observed some metrics that do not follow the power laws. They attributed the difference to their use of a more extensive corpus compared to Wheeldon and Counsell (2003). In addition to looking at the distributions of metrics, Ferreira et al. (2012) have attempted to establish thresholds or reference values for six object-oriented metrics. We have also looked at the statistical properties of the studied metrics including their distributions. However, we have done this as part of our methodology to find appropriate statistical methods, and this is not the main focus of this study.
Chidamber et al. (1998) have investigated six Chidamber and Kemerer (CK) metrics and reported high collinearity among three of them, which are coupling between objects (CBO), response for a class (RFC), and NOM. Succi et al. (2005) have studied to what extent collinearity is present in CK metrics. They suggested completely avoiding the RFC metric as an input feature for predictive models due to its high collinearity with other CK metrics. Given the problems of collinearity, Shihab et al. (2010) have proposed an approach to select metrics that are less collinear from a set of metrics. These studies have mentioned collinearity as a problem and reported collinearity among software metrics or proposed methods to select metrics with low collinearity. However, they have not investigated from the perspective of measurement types influencing collinearity.
3 Methodology
We have designed this empirical study following the guideline of Runeson and Höst (2008) on designing, planning, and conducting case studies. This study is explorative with the intent to find insights about relations between code metrics with different measurement types. We have designed the study to minimize bias and validity threats and maximize the reliability of the results, which involves project selection, data extraction, data cleaning, exploring the nature of the data, selecting appropriate statistical analysis methods based on the nature of the data, and being conservative when selecting and instrumenting statistical analysis.
Data sources for this research are open source software projects, more specifically, open source Java projects on GitHub. Java is among the top three most frequently used project languages on GitHub. Since the extracted data is quantitative, the analysis methods used in this study are quantitative. We have followed a third-degree data collection method described by Lethbridge et al. (2005). First, the case and the context of the study are defined, followed by data sources and criteria for data collection. Assumptions for statistical methods are thoroughly checked, which involves exploration of the nature of the data and cleaning of the data as necessary. Regardless of the measurement types, the extracted data is non-normal to the extent that meaningful transformation is not possible. Thus, we have used non-parametric statistical methods for analyzing data in this study.
3.1 Project Selection
GitHub’s search functionality was used to find candidate projects. However, due to the limited capabilities of GitHub’s search functionality, it was not possible to perform a compound query that would fulfill all our criteria. Project selection was not randomized as we wanted to ensure that the selected projects meet specific criteria (e.g., minimum LOC, minimum commits, etc.) and come from well-known development organizations that would not raise obvious validity questions, e.g., “project is unrepresentative because it is a classroom project by a novice programmer.” Thus, finding projects from reputed organizations was exploratory. We started by screening projects from the 14 organizations listed in GitHub’s open source organizations showcase^{Footnote 1}. We then explored whether other well-known organizations known to us also host their projects on GitHub but are not in the showcase, e.g., Apache. For each organization, we made queries to find Java projects. As we wanted to minimize blocking effects coming from various languages, we decided to stick with a single programming language. We selected Java as it is a top-ranked programming language on GitHub.
Crawford et al. (2002) presented various methods for classifying software projects. We take a more straightforward approach to make sure that our selected projects are representative regarding size. A study^{Footnote 2} on the dataset of the International Software Benchmarking Standards Group (ISBSG) classified software projects based on “Rule’s Relative Size Scale”. Measurements of this study are based on IFPUG MkII and COSMIC, which are also translated into equivalent LOC. The combined distribution of all projects shows that more than 93% of the projects are between S (small) and L (large) size, where S is estimated as 5300 and L as 150,000 LOC. We roughly followed this finding and selected projects so that project sizes are approximately uniformly distributed within roughly the range of S to L. We have projects ranging from 4059 LOC to 155,260 LOC, indicating the code size of the latest revision of the projects. Sizes of the projects are determined with the cloc tool^{Footnote 3} using a bash script to extract total LOC and Java LOC. We selected 21 GitHub projects from eight software organizations where Java is tagged as the project language.
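As an illustration of the size extraction step, the sketch below reads total LOC and Java LOC from a report produced by cloc’s JSON output mode (`cloc --json`). The study itself used a bash script; this Python version, the function name, and the sample report are assumptions made for illustration, and the sample contains hypothetical numbers trimmed to the fields that are used.

```python
import json


def loc_from_cloc_json(report_text):
    """Return (total_loc, java_loc) from the text of a `cloc --json` report."""
    report = json.loads(report_text)
    total_loc = report.get("SUM", {}).get("code", 0)
    java_loc = report.get("Java", {}).get("code", 0)
    return total_loc, java_loc


# Hypothetical report, trimmed to the fields used above.
sample_report = """{
  "header": {"n_files": 3},
  "Java": {"nFiles": 2, "blank": 10, "comment": 5, "code": 120},
  "XML":  {"nFiles": 1, "blank": 0,  "comment": 1, "code": 30},
  "SUM":  {"nFiles": 3, "blank": 10, "comment": 6, "code": 150}
}"""
print(loc_from_cloc_json(sample_report))  # (150, 120)
```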
An overview of the selected projects is given in Table 1. In this table, ‘analyzed revisions’ indicates all the commits from which we collected data and which are available exclusively in the master branch of the Git repositories. ‘Total revisions’ indicates all commits available in the Git repository including the branches. Even though the selected projects are classified as Java projects, they have source code from other languages too. Thus, ‘total code’ indicates the amount of all lines of code and ‘Java code’ indicates only the lines of Java source code. The table field ‘latest commit’ points to the HEAD of a Git repository at the time we downloaded it. The time durations of the projects are presented in the box-and-whisker plot in Fig. 1, showing durations of projects from five months to 109 months with a median of 43 months. About 33% of the projects are within the 4th quartile, ranging from 57 to 109 months.
3.2 Data Collection and Metrics Selection
We used SonarQube^{Footnote 4} to analyze revisions of the selected projects. Kazman et al. (2015) mentioned SonarQube as the de facto source code analysis tool for automatically calculating technical debt. It has gained popularity in recent years, and Janes et al. (2017) mentioned SonarQube as the de facto industrial tool for Software Quality Assurance. This tool is based on the SQALE methodology (Letouzey and Ilkiewicz 2012). We used SonarQube version 6.1 and SonarScanner version 2.8.
We ran SonarQube on each revision available in the master branch of a project. Since we ignore sub-branches, the number of analyzed revisions is less than the number of total revisions as reported in Table 1. Sub-branches are eventually merged with the master branch, which means we do not lose anything except the granularity of data.
Analyzing 11,874 software revisions needs to be automated. Python scripts are used to automate the process of traversing commits or revisions on the master branch of a project’s Git repository and running the SonarQube tool on each commit. SonarQube provides web services covering a range of functionalities including mining analysis results and software metrics. We observed that some of the metrics, such as new_lines, are shown on SonarQube’s web interface, but they cannot be mined through the web services. We later found that SonarQube computes some metrics only for the latest software revision and removes them automatically.
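The traversal described above can be sketched as follows. This is not the published script of this study: the function names are hypothetical, the use of `git rev-list --first-parent` to stay on the master branch is an assumption, and the scanner is invoked with only a few standard analysis properties (`sonar.projectKey`, `sonar.projectVersion`, `sonar.sources`). A real run would also need a reachable SonarQube server and any project-specific configuration.

```python
import subprocess


def first_parent_commits(repo_dir, branch="master"):
    """Return commit hashes on `branch`, oldest first, skipping side branches."""
    result = subprocess.run(
        ["git", "rev-list", "--first-parent", "--reverse", branch],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def scanner_command(project_key, commit_hash):
    """Build a sonar-scanner invocation that labels the analysis with the commit."""
    return [
        "sonar-scanner",
        f"-Dsonar.projectKey={project_key}",
        f"-Dsonar.projectVersion={commit_hash}",
        "-Dsonar.sources=.",
    ]


def analyze_history(repo_dir, project_key):
    """Check out each commit in turn and run the scanner on that snapshot."""
    for commit_hash in first_parent_commits(repo_dir):
        subprocess.run(["git", "checkout", "--quiet", commit_hash],
                       cwd=repo_dir, check=True)
        subprocess.run(scanner_command(project_key, commit_hash),
                       cwd=repo_dir, check=True)
```

Separating command construction from execution keeps the part that can be checked without Git or SonarQube installed small and explicit.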
Since we did not find any option to stop the auto-deletion, we added triggers and additional tables into SonarQube’s SQL database to recover the deleted records. In total, 47 metrics were mined from the database classified into six major domains, namely size, documentation, complexity, duplications, issues, and maintainability. The classification is based on what the metrics measure.
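The idea of recovering automatically deleted records with a trigger can be sketched with Python’s built-in `sqlite3` module. The study instrumented SonarQube’s MySQL database; the table names, columns, and the inserted row below are hypothetical stand-ins, and SQLite is used only so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for a table whose rows the tool purges.
    CREATE TABLE measures (revision TEXT, metric TEXT, value REAL);
    -- Additional table that keeps a copy of every purged row.
    CREATE TABLE measures_archive (revision TEXT, metric TEXT, value REAL);

    -- Copy each row into the archive just before it is deleted.
    CREATE TRIGGER keep_deleted_measures
    BEFORE DELETE ON measures
    BEGIN
        INSERT INTO measures_archive (revision, metric, value)
        VALUES (OLD.revision, OLD.metric, OLD.value);
    END;
""")

conn.execute("INSERT INTO measures VALUES ('r1', 'new_lines', 12)")
conn.execute("DELETE FROM measures")  # simulates the automatic purge

archived = conn.execute("SELECT * FROM measures_archive").fetchall()
print(archived)  # [('r1', 'new_lines', 12.0)]
```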
In our earlier study (Mamun et al. 2017), we used SonarQube’s classification of metrics and explored domain-level and metric-to-domain-level relationships. From the results of the metric-to-domain-level relationships in that study, we had the indication that metrics that measure artifacts based on individual values from each revision (i.e., organic metrics) result in lower overall correlation. Since metrics such as new_lines of type organic are inherently different from metrics such as ncloc of type cumulative concerning how they measure artifacts, it was understandable. However, as we started the follow-up study exploring the metric-level correlations, we observed that organic metrics have much lower correlations compared to other types of metrics. This observation influenced us to rethink how the metrics should be grouped for comparison. So the criteria to group the metrics changed from the earlier “what they measure” to “how they measure” artifacts. We looked at the metrics classified into four domains (i.e., size, complexity, documentation, and duplications) based on “what they measure” by SonarQube. Reviewing them, we identified 24 metrics of four measurement types that are cumulative, density, average, and organic. Table 2 shows the selected metrics classified into these four measurement types along with a short description and value type, taken from the MySQL database of SonarQube 6.1 and the metric definitions page.^{Footnote 5}
Table 3 shows metrics data corresponding to five revisions or commits of a project. In this table, each row represents a software revision. For project malmo, we have analyzed 295 revisions; thus, we have 295 data rows from this project with a structure similar to Table 3. Data used for this study is measured at the project-level. For example, for ncloc, all lines of Java code in the entire project are counted; for classes, all classes within the scope of a project are counted. In Table 3, the value of ncloc for the whole project is 9573 for revision 88. In the next revision (i.e., 89), ncloc becomes 9590, indicating an increase of 17 lines of code. However, the corresponding new_lines metric for this revision is 25, indicating the actual number of lines of code added in this revision disregarding possible changes or deletions of the code. All the metrics are calculated by SonarQube according to the description in Table 2. Among the four categories, density and average metrics are derived metrics, meaning they are measured based on the cumulative metrics. Equations for constructing density and average metrics are given in Table 2.
SQL code to instrument the SonarQube database, Python code to automatically analyze commits of a Git project with SonarQube, MySQL code to retrieve the desired data from the database, and the collected data used for this study are published as a public dataset^{Footnote 6}.
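The difference between a net change and an organic count in the example above can be worked through in code. The helper name `to_organic` is hypothetical, and taking first differences is shown only as one plausible way of turning a cumulative series into per-revision values; it is not claimed to be the exact transformation used in this study.

```python
def to_organic(cumulative_values):
    """First differences of a cumulative series: net change per revision."""
    return [later - earlier
            for earlier, later in zip(cumulative_values, cumulative_values[1:])]


ncloc = [9573, 9590]  # revisions 88 and 89, as quoted from Table 3
new_lines = 25        # new_lines reported for revision 89

net_change = to_organic(ncloc)[0]
print(net_change)              # 17
print(new_lines - net_change)  # 8
```

The gap of 8 lines is the part of the revision’s activity that the net figure hides, for example code that was removed or replaced while new lines were added.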
3.3 Exploring Nature of Data
Among different probability distributions, the normal distribution is the one researchers most often anticipate because of its relationship with the central limit theorem. Statistical methods are based on assumptions. After collecting data, before deciding on the type of statistical methods, researchers need to investigate the nature of the collected data. A crucial part is to check the distribution of the data. If the distribution is normal, parametric statistical tests are considered. If the data is non-normal and a meaningful transformation is not possible, non-parametric tests are considered.
There is no straightforward way of determining whether a particular dataset is normally distributed. Sample size is a significant factor in statistical tests for normality. There are graphical and numerical methods, where each of them can be either descriptive or theoretical (Park 2009). Since our selected projects have a varying number of revisions, we have chosen a combination of tests appropriate to our data as shown in Table 4.
A histogram, a frequency distribution, is considered to be a useful graphical test when the number of observations is large. It is particularly helpful because it can capture the shape of the distribution given that the bin size is appropriately chosen. If the data is far from a normal distribution, a single look at the histogram tells us that the data is non-normal. It also gives a rough understanding of the overall nature of the distribution, such as skewness, kurtosis, or the type of distribution, such as bimodal or multimodal. Box plots are useful for identifying outliers and comparing the distribution of the quartiles. A normal Q-Q plot (quantile-quantile plot) is a graphical representation of the comparison of two distributions. If the data is normally distributed, the data points in the normal Q-Q plot approximately follow a straight line. It also helps us to understand the skewness and tails of the distribution. The graphical methods help to understand the overall nature of the data quickly, but they do not provide objective criteria (Park 2009).
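The construction behind a normal Q-Q plot can be sketched with the Python standard library. The function name, the plotting positions (i − 0.5)/n, and the sample values are assumptions for illustration; statistical packages may use slightly different plotting positions.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair each sorted observation with its expected standard-normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    standard_normal = NormalDist()
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n)
                   for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical right-skewed sample, e.g. lines added per revision.
sample = [0, 0, 1, 1, 2, 3, 5, 8, 40, 200]
for theoretical_q, observed in normal_qq_points(sample):
    print(f"{theoretical_q:6.2f}  {observed}")
```

Plotting the observed values against the theoretical quantiles gives the Q-Q plot; for a sample like this one, the largest observations rise far above any straight line through the bulk of the points, which is the visual signature of a long right tail.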
There are different numerical methods to evaluate whether data is normally distributed or not. Skewness and kurtosis are commonly used descriptive tools for this purpose. For a perfectly normal distribution, the statistics for these two analyses should be zero. Since this does not happen in practice, we calculate the z-score by dividing the statistic by its standard error. However, determining normality from the z-score is not straightforward either. Kim (2013) discussed how the sample size can affect the z-score. Field (2009) and Kim (2013) suggested considering different criteria for skewness and kurtosis based on the sample size. For a sample size less than 50, the threshold for the absolute z-score of either statistic is 1.96 (corresponding to an α of 0.05); for a sample size less than 200, it is an absolute z-score of 3.29 (corresponding to an α of 0.001). However, for a sample size of 200 or more, it is more meaningful to inspect the graphical methods and look at the skewness and kurtosis statistics instead of evaluating the significance, i.e., the z-score.
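The z-score computation described above can be sketched as follows. The formulas are the bias-adjusted sample skewness and excess kurtosis with their usual standard errors, as commonly reported by statistical packages; whether they reproduce a particular package’s output digit for digit is an assumption, and the function name and sample values are hypothetical.

```python
from math import sqrt


def skew_kurtosis_z(sample):
    """Skewness, excess kurtosis, and their z-scores (statistic / standard error)."""
    n = len(sample)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(sample) / n
    deviations = [value - mean for value in sample]
    variance = sum(d * d for d in deviations) / (n - 1)  # sample variance
    if variance == 0:
        raise ValueError("sample has zero variance")

    skewness = (n / ((n - 1) * (n - 2))
                * sum(d ** 3 for d in deviations) / variance ** 1.5)
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
                * sum(d ** 4 for d in deviations) / variance ** 2
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))  # excess kurtosis

    se_skewness = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurtosis = 2 * se_skewness * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return {
        "skewness": skewness,
        "kurtosis": kurtosis,
        "z_skewness": skewness / se_skewness,
        "z_kurtosis": kurtosis / se_kurtosis,
    }


# Hypothetical right-skewed sample, for illustration only.
print(skew_kurtosis_z([0, 0, 1, 1, 2, 3, 5, 8, 40, 200]))
```

The absolute z-scores returned here are the values to compare against the sample-size-dependent thresholds discussed in the text.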
Among analytical numerical methods, we consider the Shapiro-Wilk and Kolmogorov-Smirnov tests for normality. Shapiro-Wilk works better with sample sizes up to 2000, and Kolmogorov-Smirnov works with large sample sizes (Park 2009). The literature has reported different maximum values for these tests. For example, Kim (2013) observed that a sample size over 300 might produce an unreliable result for these two tests, and Hair (2006) suggested a range of 30 to 1000. We have sample sizes from 52 to 2302 for our metrics. For large sample sizes, numerical methods can be unreliable; thus, Field (2009) suggested using the graphical methods besides the numerical methods to make an informed decision about the nature of the data. We have computed all these tests in the statistical software package SPSS. Note that SPSS reports Kurtosis − 3 for kurtosis, i.e., it subtracts three from the kurtosis value.
Now, we present data of four metrics (ncloc, comment_lines_density, new_lines, and file_complexity), each from a different measurement category, from the Apache zookeeper project. Although the values of these metrics vary from project to project, we present them here so that readers have a rough idea of what sample data from a project might look like. Table 5 shows the descriptive statistics for four metrics from four measurement types. The sample size is the same, i.e., 1473 for all these metrics, and the minimum, maximum, mean, and standard deviation values come from the whole sample. For example, among 1473 samples of new_lines, the minimum value is 0 (i.e., a revision that adds no new lines of code) and the maximum value is 19,055 (i.e., a revision that adds 19,055 new lines of code). Based on these statistics, it is evident that ncloc is entirely different from new_lines. On the other hand, the two metrics from the density and average categories are quite similar to each other but different from the cumulative and organic metrics. The most notable number in this table is the very large standard deviation of 18,629.9 for ncloc. The cumulative way of measurement and the large sample size are the key reasons for such a high standard deviation. Other cumulative metrics in our dataset show a similar effect of cumulation. These descriptive statistics give us a quick overall idea about the nature of the data, e.g., distribution and dispersion.
The histogram, box plot, and normal Q-Q plot for each of these four metrics are presented in Figs. 2, 3, 4, and 5. The organic metric new_lines (see Fig. 2) has a very high peak (in the histogram), a considerable number of outliers (in the box plot), and a positively-skewed plot (normal Q-Q). The other metrics from the organic category have similar properties. Because of the very high peak in the distribution, the quartiles in the box plot are not even visible. The file_complexity metric of type average (see Fig. 4) has a bimodal distribution in the histogram. The outliers in the box plot and the disconnected observed values in the normal Q-Q plot are due to the bimodal distribution of the file_complexity metric. It is interesting to see in the Q-Q plot that both distributions for file_complexity are skewed in opposite directions. Many metrics in the average and density categories, and even in the cumulative category, have bimodal distributions, and some have multimodal distributions. The distributions of the average and density metrics appear closer to normal than the distributions of the cumulative metrics. However, all of them are still far from a normal distribution. If we compare them all, organic metrics are distinctly different from the metrics of other categories. When specific types of plots are compared, e.g., all histograms of metrics from a category, plots of the organic metrics are visually more consistent with each other than the plots of metrics from other categories. On the other hand, when plots are compared across categories, distributions of metrics from the density and average categories are more similar to each other than to the others.
The distributions depicted by the graphical methods deviate so much from normality that we could have omitted additional tests. However, we perform them as part of the design of the study. Since our sample sizes vary, we gained some insights about the tests themselves. Such observations are beyond the scope and focus of this study and are therefore not reported here.
Skewness and kurtosis results are presented in Table 6. It is interesting that, even though we have a considerable sample size, none of the metrics indicates a normal distribution in the z-scores in this table: the absolute z-scores are greater than 3.29, so the null hypothesis of normality is rejected. In contrast to these four metrics, ncloc and file_complexity from the Apache kafka project, which has the largest sample size of 2302, have z-scores within the limit of normality, but we still reject normality for them because the graphical tests do not show signs of a normal distribution.
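The z-score check described above can be sketched as follows. This is a minimal illustration, not the authors' SPSS procedure: it assumes the commonly used bias-adjusted sample skewness and excess kurtosis together with their standard errors, and the function name `skew_kurt_z` and the sample data are hypothetical. Only the 3.29 cut-off comes from the text.

```python
import math

def skew_kurt_z(values):
    """Return (z_skewness, z_kurtosis) for a sample.

    Assumption: bias-adjusted skewness and excess kurtosis divided by
    their standard errors, as commonly reported by statistics packages.
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("constant sample has no defined skewness")
    g1 = m3 / m2 ** 1.5          # population-style skewness
    g2 = m4 / m2 ** 2 - 3.0      # excess kurtosis (kurtosis - 3)
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    se_skew = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

# Hypothetical organic-style data: most revisions add nothing, a few add a lot.
sample = [0] * 95 + [1000] * 5
z_skew, z_kurt = skew_kurt_z(sample)
print(abs(z_skew) > 3.29, abs(z_kurt) > 3.29)
```

A sample is flagged as non-normal here when either absolute z-score exceeds 3.29, which mirrors the criterion stated in the paragraph above.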
We considered the Shapiro-Wilk and Kolmogorov-Smirnov tests reliable up to a sample size of 2000. Results from these tests are shown in Table 7 and give a very strong indication (i.e., the significance values are less than the considered α = 0.05) of non-normal data.
Like these four metrics from the Apache zookeeper project, other metrics from these measurement categories and from other projects are also observed to be non-normal. The high degree of non-normality and the different distributions among the metrics make it impractical to apply transformations to these metrics. Thus, we are left with the option of using nonparametric statistical analysis methods.
3.4 Data Processing
Besides the statistical analysis to understand the nature of the extracted data, we performed manual inspections to check the data for possible anomalies, especially at the boundaries, i.e., at the beginning and the end of the revision data.
We observed that some organic metrics for some projects have either NaN (not a number) or an unusually high value for the first analyzed revision. Examples of such observations are presented in Tables 8 and 9.
There might be several reasons why some projects start with a high ncloc value from the beginning, as in the case shown in Table 9. The project may not have tracked its code base through version control from the beginning, or the project may have started with an existing code base, possibly because it is an extension of another project with a separate version control. We also observed that new_lines has a higher value than ncloc in the first revision of a few projects. We removed the data related to the first revisions in such cases. Since the number of removed revisions is very small compared to the total number of revisions, it is unlikely that this has a major impact on this study.
4 Data Analysis Method
Since the collected data is not normally distributed, nonparametric statistical methods are appropriate for this research. Spearman’s ρ correlation coefficient and Kendall’s τ correlation coefficient are two well-known nonparametric measures to assess relationships between ranked data. We carefully investigated the nature of the collected data and the properties of these two measures. Kendall’s τ is based on the number of concordant and discordant pairs in the ranked data.
Calculating Spearman’s ρ by hand is much simpler than calculating Kendall’s τ because it can be done pairwise without depending on the rest of the data. Computing Kendall’s τ by hand is only feasible for small sample sizes because each data pair must be compared against the remaining data. Xu et al. (2013) reported that the time complexity of Spearman’s ρ is O(n ∗ log(n)) and that of Kendall’s τ is O(n^{2}). In general, Spearman’s ρ results in a higher correlation coefficient than Kendall’s τ, while the latter is generally known to be more robust and has an intuitive interpretation.
Studies from different disciplines have investigated the appropriateness of Spearman’s ρ and Kendall’s τ concerning various factors. Both measures are invariant under increasing monotone transformations (Kendall and Gibbons 1990). Moreover, being nonparametric methods, both measures are robust against impulsive noise (Shevlyakov and Vilchevski 2002; Croux and Dehon 2010). Croux and Dehon (2010) studied the robustness of Spearman’s ρ and Kendall’s τ through their influence functions and gross-error sensitivities. Even though it is commonly known that both of these measures are robust enough to handle outliers, that study found that Kendall’s τ is more robust to outliers and statistically slightly more efficient than Spearman’s ρ. In a more recent study, Xu et al. (2013) investigated the applicability of Spearman’s ρ and Kendall’s τ based on different requirements. Some of their key findings report Kendall’s τ as the desired measure when the sample size is large and there is impulsive noise in the data. Their results are based on unbiased estimations of Spearman’s ρ and Kendall’s τ correlation coefficients.
In light of the above discussion, we consider Kendall’s τ more appropriate for software projects. We have observed outliers in many projects, and all projects have some revisions with high data values indicating outliers. This can naturally happen in any software project when an existing code base is added to a new project. Outliers in software revision data have been observed, and their underlying reasons discussed, in earlier research (Aggarwal 2013; Schroeder et al. 2016). The robust nature of Kendall’s τ handles such data points better than Spearman’s ρ.
Kendall’s τ has three versions: τA, τB, and τC. Both τA and τB are suitable for square-shaped data, meaning data with the same variables on both rows and columns. τC is used for rectangular-shaped data tables with rows and columns of different sizes. A more important difference between τA and τB is that τB can handle tied ranks while otherwise having the same characteristics as τA. Since our collected data has tied ranks, we have used and measured τB according to the following equation^{Footnote 7}:
$$ \tau_{b} = \frac{P - Q}{\sqrt{(P + Q + m1_{0})(P + Q + m2_{0})}} $$
In this equation, m1 and m2 are the two metrics for which we are checking the correlation. We denote the correlation coefficient of Kendall’s τ as τ_{b}. P and Q are the numbers of concordant and discordant pairs, respectively. m1_{0} is the number of ties only in m1, and m2_{0} is the number of ties only in m2. The possible values of τ_{b} are − 1 ≤ τ_{b} ≤ + 1, where − 1 indicates perfect negative correlation, zero indicates no correlation, and + 1 indicates perfect positive correlation. Since we have 21 projects and 24 metrics in each project, we have calculated 21 correlation matrices of size 24×24. All these correlation matrices are symmetric, meaning they are mirrored along the principal diagonal. τ_{b} for the diagonal elements is always + 1, indicating a perfect positive correlation of a metric with itself. The computation of Kendall’s τ also yields the statistical significance (p-value) of the correlation coefficient (τ_{b}).
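The definition in terms of concordant pairs, discordant pairs, and ties can be made concrete with a small sketch. It is a direct O(n^{2}) pair count written for illustration, not the SPSS implementation used in the study; the function name `kendall_tau_b` is hypothetical, and returning NaN for a constant input is an assumption that matches the missing-value behaviour described later in the paper.

```python
import math

def kendall_tau_b(m1, m2):
    """Kendall's tau-b by brute-force comparison of all pairs (O(n^2))."""
    if len(m1) != len(m2):
        raise ValueError("both metrics need the same number of revisions")
    concordant = discordant = ties_m1_only = ties_m2_only = 0
    n = len(m1)
    for i in range(n):
        for j in range(i + 1, n):
            d1 = m1[i] - m1[j]
            d2 = m2[i] - m2[j]
            if d1 == 0 and d2 == 0:
                continue                 # tied in both metrics: ignored
            if d1 == 0:
                ties_m1_only += 1        # tie only in m1
            elif d2 == 0:
                ties_m2_only += 1        # tie only in m2
            elif d1 * d2 > 0:
                concordant += 1
            else:
                discordant += 1
    base = concordant + discordant
    denominator = math.sqrt((base + ties_m1_only) * (base + ties_m2_only))
    if denominator == 0:
        return float("nan")              # at least one metric is constant
    return (concordant - discordant) / denominator

print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))  # four concordant pairs, one tie in each metric
```

The nested loop makes the quadratic cost visible: every revision is compared against every later revision, which is why hand computation is only feasible for small samples.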
4.1 Landscape of Correlation Coefficients (τ _{b})
Before aggregating results, we need to understand τ_{b} and define concrete boundaries for the interpretation of τ_{b} at different levels. τ_{b} indicates the strength of a correlation and has the range − 1 ≤ τ_{b} ≤ + 1. As the value of τ_{b} approaches 0, it indicates less correlation between two metrics. As the value of τ_{b} approaches the boundaries, i.e., − 1 or + 1, it indicates a higher correlation between two metrics. Researchers have used different ranges to label the strengths of correlation coefficients (Taylor 1990). This paper labels τ_{b} according to Table 10.
However, it is not enough to look at the τ_{b} values alone; we must also look at their statistical significance, which is given by the p-value. If the p-value is greater than a chosen significance level, we cannot reject the null hypothesis, meaning we do not have enough evidence that the corresponding τ_{b} is any different from τ_{b} = 0. For example, at a significance level of 0.05, there is a 5% probability that a τ_{b} value indicates a correlation purely by chance even though there is no real correlation between the underlying metrics. This study uses α = 0.05 for any τ_{b} to be considered statistically significant.
We now focus on the landscape of τ_{b} depicted in Fig. 6. The outer circle is the set of all τ_{b}, represented by \(S\tau _{b_{all}}\), containing τ_{b} from the correlation matrices of all projects. Similarly, the set of all significant τ_{b} is represented by \(S\tau _{b_{sig}}\), containing all τ_{b} for which the corresponding p-value satisfies the specified significance level of 0.05. The set of very strong τ_{b} is represented by \(S\tau _{b_{vs}}\), the set of strong τ_{b} by \(S\tau _{b_{s}}\), the set of moderate τ_{b} by \(S\tau _{b_{m}}\), and the set of weak τ_{b} by \(S\tau _{b_{w}}\). These sets are determined from the τ_{b} values based on the levels defined in Table 10. We can also define the set of all nonsignificant τ_{b} as \(S\tau _{b_{nsig}}\), where \(S\tau _{b_{nsig}}\) = \(S\tau _{b_{all}} \setminus S\tau _{b_{sig}}\).
4.2 Aggregating Correlation Coefficients (τ _{b})
Now that we have 21 matrices of size 24×24, we want to aggregate meaningful data from them. Let M = {m_{1},m_{2},m_{3}...m_{n}} be the set of all metrics, where n is the total number of metrics, and let P = {p_{1},p_{2},p_{3}...p_{q}} be the set of all projects, where q is the total number of projects.
Now we discuss several ways to aggregate τ_{b} from all the projects.
Aggregating correlation coefficients (τ_{b}) based on the set of all correlation coefficients (\(S\tau _{b_{all}}\))
The simplest way of aggregating τ_{b} is by summing up all τ_{b} values for each possible pair of metrics within M over each project within P. This results in a correlation matrix of size 24×24, i.e., the same size as a correlation matrix from a single project. The advantage of this method is that we can compute a single aggregated matrix where each cell is the sum of the τ_{b} values from the corresponding cells of the correlation matrices generated from the projects. Such an aggregated matrix can also be transformed into an (equally weighted) average matrix by dividing each cell (containing the sum of all τ_{b} for a particular pair of metrics) by q.
However, there is a fundamental problem with this method. It combines all τ_{b} values irrespective of their corresponding p-value. If the p-value is greater than α, we cannot reject the null hypothesis. This means we only consider a τ_{b} valid when its corresponding p-value is less than or equal to α. If this is not checked, there are two implications. First, a result will be wrong due to the inclusion of nonsignificant τ_{b}, i.e., τ_{b} from the set \(S\tau _{b_{nsig}}\); second, we will not have any idea how big the set \(S\tau _{b_{nsig}}\) is compared to \(S\tau _{b_{sig}}\).
Aggregating results based on \(S\tau _{b_{all}}\) is not problematic when \(S\tau _{b_{all}} = S\tau _{b_{sig}}\), meaning there is no τ_{b} with a nonsignificant p-value; this is an assumption that should be checked for validity whenever it is made. For example, the recent research on the correlation of code metrics by Gil and Lalouche (2017) has not reported anything about α, i.e., the significance level for the p-value, and thus nothing about the existence of τ_{b} with a nonsignificant p-value. In this case, the readers cannot know the size of \(S\tau _{b_{nsig}}\). If the nonexistence of nonsignificant p-values is assumed, then that should be documented and validated. In this research, to aggregate results, we have avoided any value from the set \(S\tau _{b_{nsig}}\). To calculate the sample mean, τ_{b} values from \(S\tau _{b_{nsig}}\) are considered as zero.
Aggregating correlation coefficients (τ_{b}) based on the set of all significant correlation coefficients (\(S\tau _{b_{sig}}\))
Let \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) be the correlation coefficient for two metrics m_{1} and m_{2} in project p_{i}. We compute the mean for \(S\tau _{b_{sig}}\) by the following equation:
$$ \overline{\tau}_{b_{(m_{1}, m_{2})}} = \frac{1}{\nu} \sum\limits_{i=1}^{q} \tau_{b_{(m_{1}, m_{2}, p_{i})}} \mid \tau_{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau_{b_{sig}} $$ (1)
Since (1) is calculated based on \(S\tau _{b_{sig}}\), ν is the total count of significant \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) values from all projects. To compute the sample mean, we only have to replace ν by q in (1). Since we are unsure of the τ_{b} values within the set \(S\tau _{b_{nsig}}\) due to their nonsignificant p-values, as a conservative measure we take τ_{b} from the set \(S\tau _{b_{nsig}}\) as zero. Thus, the \(\tau _{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau _{b_{sig}}\) part in (1) still holds for the sample mean.
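The two means described for (1) can be illustrated for a single pair of metrics. This is a sketch under stated assumptions: the function name `aggregate_pair` and the input values are hypothetical, and counting NaN results among the q projects with a contribution of zero is an assumption, since the text only states this explicitly for nonsignificant values.

```python
import math

def aggregate_pair(results, alpha=0.05):
    """Aggregate (tau_b, p_value) results of one metric pair over all projects.

    Returns (mean over significant values, sample mean, nu), where nu is the
    number of significant coefficients and the sample mean divides by q.
    """
    significant = [
        tau for tau, p in results
        if not math.isnan(tau) and not math.isnan(p) and p <= alpha
    ]
    nu = len(significant)
    q = len(results)
    total = sum(significant)            # nonsignificant values contribute zero
    mean_sig = total / nu if nu else float("nan")
    sample_mean = total / q if q else float("nan")
    return mean_sig, sample_mean, nu

# Hypothetical results from three projects; the third is not significant.
mean_sig, sample_mean, nu = aggregate_pair([(0.9, 0.001), (0.7, 0.01), (0.2, 0.30)])
print(mean_sig, sample_mean, nu)
```

Because nonsignificant values are treated as zero, the sample mean can never exceed the mean over significant values for non-negative coefficients, which is why the text calls it the more conservative measure.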
As with \(S\tau _{b_{all}}\), the advantage of aggregating τ_{b} from the set \(S\tau _{b_{sig}}\) is that we can report the results using a single matrix, without the problems mentioned for \(S\tau _{b_{all}}\)-based aggregation. However, from such results, we are not able to determine whether \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) comes from a large or small value of ν. In other words, we are unable to tell how representative \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) is among the selected projects. Since we can calculate the sample mean from \(S\tau _{b_{sig}}\), we can also compute the variance and standard deviation to see the overall spread of τ_{b} within the samples. However, the sample mean and standard deviation of τ_{b} from \(S\tau _{b_{sig}}\) do not necessarily tell us about the distribution of τ_{b} over the four strength levels in Table 10. The standard deviation gives us an indication of the variability, but we have no way of knowing what the variability looks like with respect to the different strength levels of τ_{b}.
Aggregating correlation coefficients (τ_{b}) based on the sets \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), \(S\tau _{b_{w}}\)
These four sets represent τ_{b} based on their strengths according to Table 10. Together, these sets make up \(S\tau _{b_{sig}}\), i.e., \(S\tau _{b_{vs}}\) ∪ \(S\tau _{b_{s}}\) ∪ \(S\tau _{b_{m}}\) ∪ \(S\tau _{b_{w}}\) = \(S\tau _{b_{sig}}\). We can consider two measures from these sets as listed below.
Count of τ_{b} based on their level of strength: We get this measure by simply counting the number of τ_{b} within a set. This simple measure gives us a direct answer to the question, “how many projects report a certain correlation between two metrics at a certain level of strength?” Since we have 21 projects, the maximum value of count of τ_{b} is 21 and the minimum is zero. Looking at particular values of count of τ_{b} for two metrics from the sets \(S\tau _{b_{nsig}}\), \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), and \(S\tau _{b_{w}}\), we can understand the distribution of τ_{b} among the different sets and thereby better understand the nature of the relationships between metrics.
Sum of τ_{b} based on their level of strength: Instead of taking counts, this measure adds all τ_{b} within the set. Since the highest value of a single τ_{b} is 1.0, the maximum value for sum of τ_{b} is also 21, similar to the measure count of τ_{b}. However, sum of τ_{b} does not tell us how many τ_{b} values contribute to the sum, which we can get from count of τ_{b}. Thus, these two measures are complementary.
These two proposed measures, count of τ_{b} and sum of τ_{b}, provide additional insights about correlation coefficients compared to the popular statistical measures sample mean and standard deviation. In this report, when we use the term ‘mean,’ we refer to a certain set, and when data from all sets are considered, we write ‘sample mean.’ For example, when we say ‘correlation between metrics m_{1} and m_{2} results in τ_{b}’ or ‘correlation between metrics m_{1} and m_{2} results in very strong/strong/moderate/weak τ_{b}’, we refer to τ_{b} from the sample mean table.
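The count and sum measures can be sketched for one pair of metrics as follows. The level boundaries in `LEVELS` are hypothetical placeholders for Table 10, which is not reproduced in this section; only the 0.9 bound for ‘very strong’ is stated in the text. The function names and the input data are likewise illustrative assumptions.

```python
import math

# Hypothetical lower bounds on |tau_b| per level. Only 0.9 ("very strong")
# is taken from the text; the other bounds stand in for Table 10.
LEVELS = (("very strong", 0.9), ("strong", 0.7), ("moderate", 0.4), ("weak", 0.0))

def classify(tau, p, alpha=0.05):
    """Assign one (tau_b, p_value) result to NaN, nonsignificant, or a level."""
    if math.isnan(tau) or math.isnan(p):
        return "NaN"
    if p > alpha:
        return "nonsignificant"
    for name, lower in LEVELS[:-1]:
        if abs(tau) >= lower:
            return name
    return "weak"

def count_and_sum(results, alpha=0.05):
    """Return (counts per set, sums of tau_b per strength level)."""
    level_names = [name for name, _ in LEVELS]
    counts = dict.fromkeys(level_names + ["nonsignificant", "NaN"], 0)
    sums = dict.fromkeys(level_names, 0.0)
    for tau, p in results:
        key = classify(tau, p, alpha)
        counts[key] += 1
        if key in sums:
            sums[key] += tau
    return counts, sums

# Hypothetical results for one metric pair from six projects.
results = [(0.95, 0.001), (0.92, 0.001), (0.75, 0.01),
           (0.30, 0.04), (0.50, 0.20), (float("nan"), float("nan"))]
counts, sums = count_and_sum(results)
print(counts)
print(sums)
```

Reading both outputs together shows why the two measures complement each other: the count says how many projects fall into a level, and the sum says how much correlation those projects contribute.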
4.3 Missing Correlation Coefficients (τ _{b})
Correlation cannot be computed if either or both of the variables have constant values or missing values. In such a case, the computation of correlation returns NaN (not a number) for both τ_{b} and the p-value. The set S_{NaN} contains such data. S_{NaN} is kept disjoint from \(S\tau _{b_{all}}\) in Fig. 6, because the τ_{b} value needed to determine the level of strength is missing, and the p-value needed to determine the level of significance is missing as well. Even though S_{NaN} does not help us with correlation, it tells us about the nature of specific metrics.
When discussing and reporting results in the following section, we have to point to different sections of correlation matrices or derived matrices from them. To simplify the referencing, we label different sections of such matrices as shown in Table 11.
When reporting this case study, we have tried to maintain the actual flow of how this research was carried out. Based on some interesting observations about inter-category correlations of 15 cumulative and three organic metrics, this case study further tested the hypothesis:
“The median difference between correlations of metrics from the cumulative and organic_{t} categories equals zero.”
This required deriving a set of 15 organic metrics, denoted organic_{t}, corresponding to the 15 cumulative metrics. The specifics of the design, execution, and results of the test are elaborated in Section 5.4.
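One of the tests named in the abstract for this hypothesis, the paired-samples sign test, is simple enough to sketch directly. The following is a minimal exact two-sided version written for illustration; the function name `sign_test_p_value` and the paired values are hypothetical, and it is not the procedure or data reported in Section 5.4.

```python
from math import comb

def sign_test_p_value(cumulative, organic_t):
    """Exact two-sided p-value of the paired-samples sign test.

    Null hypothesis: the median of the paired differences is zero.
    Pairs with a zero difference are dropped, as is conventional.
    """
    if len(cumulative) != len(organic_t):
        raise ValueError("samples must be paired")
    diffs = [c - o for c, o in zip(cumulative, organic_t) if c != o]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(1 for d in diffs if d > 0)
    k = min(positives, n - positives)
    # Probability of k or fewer of the rarer sign under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical paired tau_b values: nine pairs higher for cumulative, one lower.
cumulative = [0.9] * 9 + [0.3]
organic_t = [0.5] * 10
print(sign_test_p_value(cumulative, organic_t) < 0.05)
```

The sign test uses only the direction of each difference, so it makes fewer assumptions than the Wilcoxon signed rank test at the cost of statistical power.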
5 Results
Before going into the results and discussion regarding the significant correlation coefficients (i.e., τ_{b} with a significant p-value) from the sets within \(S\tau _{b_{sig}}\), we report what the correlation coefficients outside \(S\tau _{b_{sig}}\) look like, to illustrate the data that does not contribute to the main result.
5.1 Missing and Nonsignificant Correlation Coefficients (τ _{b})
First, we look at the small set S_{NaN} in Fig. 6, which covers cases where computing the correlation was not successful and resulted in null values (i.e., NaN) for τ_{b} and the p-value for specific pairs of metrics from M, as reported in Table 19. The missing τ_{b} values related to the metric directories in this table come from two projects, malmo and geometryapijava. In both projects, the metric directories has a value (8 for malmo and 3 for geometryapijava) that remained unchanged throughout the project. The rest of the missing τ_{b} values come from the project cloud, where all six affected metrics belong to the duplications category: duplicated_lines, duplicated_blocks, duplicated_files, duplicated_lines_density, new_duplicated_lines, and new_duplicated_blocks. Since the cloud project has no duplication-related issues during the analyzed revisions, all six metrics have the value 0.
If at least one metric from a pair contains a constant value (i.e., a metric having the same value for all revisions), it is not possible to compute the correlation for that pair of metrics, and this results in NaN values for τ_{b} and the p-value. We observed that the directories metric and the metrics related to duplications have fewer levels (meaning variations) than other metrics.
Now, we move to the set \(S\tau _{b_{nsig}}\) as reported in Table 17. Horizontal bars in this table and all other similar tables reporting the count and sum measures are graphical representations of the corresponding cell values. The maximum value for a cell corresponding to two metrics (e.g., the cell between ncloc and classes) is q, which is 21, the number of projects. Cells in the principal diagonal are omitted and thus kept blank. The rightmost column reporting the ‘total count’ or ‘total sum’ of a row may have a maximum possible value of 483 (calculated from n·q − q). The horizontal bars are drawn based on the maximum possible value a cell can contain (i.e., 21 for regular cells between metrics and 483 for cells indicating ‘total’) and not on the maximum available value in a table.
A single glance at Table 17 gives us the impression that the organic category is different from all other categories. A more careful look tells us that there are three groups: cumulative and organic, with the lowest and highest counts of nonsignificant τ_{b} values respectively, and density and average, with counts of nonsignificant τ_{b} values in between. We can also see this from the rightmost column ‘total count’. Since ‘total count’ comes from all four metric categories, we show the category-wise mean of Table 17 in Table 12.
We are interested in the numbers in the principal diagonal of Table 12, which indicate measures coming from correlations between metrics within a category, i.e., intra-category correlations. We see that section OO has the least amount of nonsignificant τ_{b}, followed by sections CC, AA, and DD. Metrics within the organic category (section OO) are interesting as they have the least amount of nonsignificant τ_{b} among themselves; however, organic scored highest when correlated with other categories. Another observation is that the mean value of ‘count of nonsignificant τ_{b}’ within a category is always smaller than the mean value of ‘count of nonsignificant τ_{b}’ between categories. Since we cannot draw any conclusion about whether τ_{b} is strong or weak from τ_{b} with a nonsignificant p-value, it is better to have fewer nonsignificant values in the context of statistical analysis.
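As a quick check of the row maximum quoted above, using n = 24 metrics and q = 21 projects from the text, each metric is paired with the n − 1 other metrics in each of the q projects:

$$ n \cdot q - q = (n - 1) \cdot q = 23 \cdot 21 = 483 $$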
Key Observations:

Correlating metrics from different categories results in more nonsignificant correlation coefficients compared to correlating metrics within a category.

Category organic is quite different from the other three categories in producing a lot of nonsignificant τ_{b} for inter-category correlations.

Category organic has the least amount of nonsignificant τ_{b}, followed by cumulative, average, and density, for intra-category correlations. Even though the mean value for the cumulative category is quite low (0.27) in this context, the contrast between inter- and intra-category correlations is clearly less noticeable than for the organic category.
5.2 Overall Distribution of Correlation Coefficients (τ _{b})
Based on the labeling of τ_{b} in Table 10, we have reported two sets of measurements. They are ‘count of τ_{b}’ (Tables 23, 24, 25, and 26) and ‘sum of τ_{b}’ (Tables 27, 28, 29, and 30) at different levels.
Figure 7 is constructed based on the rightmost column ‘total count’ from Tables 17, 19, 23, 24, 25, and 26. Similarly, Fig. 8 is constructed from the absolute values of the rightmost column ‘total sum’ from Tables 27, 28, 29, and 30. Since the ‘total count’ and ‘total sum’ columns in these tables are counts and sums of the corresponding measures of a metric’s correlation with respect to all other metrics, we get an overall distribution of τ_{b} for the metrics from these columns.
Figs. 7 and 8 use the same vertical scale for better comparison. Between these figures, the level ‘very strong’ has the least difference and the level ‘weak’ has the highest difference among the four strength levels of τ_{b}. While Fig. 7 gives us an overview of how representative the different sets of τ_{b} are, Fig. 8 shows the actual sum of τ_{b}. Since the τ_{b} value is meaningless for \(S\tau _{b_{nsig}}\) and missing for S_{NaN}, these two sets are not included in Fig. 8.
In Fig. 8, we can see how the red bars, representing ‘very strong τ_{b}’ within the cumulative metrics (metrics ncloc to duplicated_files), dominate the other levels. The metrics directories, duplicated_lines, duplicated_blocks, and duplicated_files are weaker in terms of ‘very strong τ_{b}’ than the other cumulative metrics. However, all cumulative metrics have scored much higher than metrics from other measurement categories. Metrics from the organic category have scored lowest among all metrics and categories. These two figures represent data from all correlations, e.g., the horizontal bar for ncloc comes from the correlation coefficients of ncloc with all other metrics. Thus, from these figures, we cannot determine whether there is any difference between the τ_{b} values resulting from intra-category and inter-category metric correlations. We study this in more detail next.
5.3 Significant Correlation Coefficients
All τ_{b} values reported in this subsection passed the significance level α = 0.05, which we will not mention any further for the rest of this subsection. When reporting τ_{b}, by default, we refer to the sample mean as reported in Table 13.
First, we want to look at intra-category relations among the metrics. Table 13 shows the sample mean of τ_{b} corresponding to the set \(S\tau _{b_{all}}\), and Table 21 in the Appendix shows the mean of τ_{b} within \(S\tau _{b_{sig}}\). Here, we want to reiterate that we have considered any nonsignificant τ_{b} from the set \(S\tau _{b_{all}}\) as 0.
Perfect Correlations
We are interested in the perfect correlations (i.e., τ_{b} = 1.0) reported in Tables 13 and 21 between the metric pairs (complexity, complexity_in_classes), (complexity, complexity_in_functions), and (complexity_in_classes, complexity_in_functions), since this is an indication that both metrics within each of these three pairs measure exactly the same aspect. Note that τ_{b} for these relations is not exactly 1, although it is reported as such in Tables 13 and 21. This happens because we have reported τ_{b} to two decimal places, so anything greater than or equal to 0.995 is reported as 1. To understand these relations, we counted the perfect correlation coefficients between all metrics from all projects, as reported in Table 20, where we see that five relations have perfect τ_{b}. In 20 projects, the correlation between complexity and complexity_in_classes is perfect. So it is evident that the metrics complexity and complexity_in_classes measure the same aspect and that using one of them is sufficient. Since our data comes from Java source code, this is not a surprise because, in Java, code does not reside outside classes.
For the relation between complexity and complexity_in_functions, there are perfect correlations in nine projects. Since we found a perfect correlation in the sample mean of τ_{b} in Table 13, the remaining 12 τ_{b} values for this relation must be very strong. We have also calculated the ‘sum of significant τ_{b}’ reported in Table 22. The ‘sum of significant τ_{b}’ for (complexity, complexity_in_functions) is 20.98 out of 21. This indicates that complexity and complexity_in_functions also measure the same aspect with a negligible difference.
Even though the results in Table 20 do not show any perfect correlation between the metrics ncloc, functions, statements, complexity, classes, files, and public_api, we see τ_{b} greater than 0.9, i.e., at the very strong level, for all relations in the sample mean in Table 13. Treating any relation with a τ_{b} level greater than or equal to 0.9 as redundant, we consider all these seven measures from the cumulative category redundant. Under the same consideration, public_api is redundant to public_undocumented_api.
For the relations between new_duplicated_lines and new_duplicated_blocks and between duplicated_blocks and duplicated_files, perfect correlations are found in only one project. The ‘sum of significant τ_{b}’ values for these two pairs are reported in Table 22 as 19.65 and 17.87, respectively. Thus, we cannot say right away that the metrics within these two pairs are redundant.
5.3.1 Intra-Category Correlations
Intra-category correlations are correlations within the sections CC, DD, AA, and OO, where all correlations within a section come from metrics measured similarly.
Section CC
The sample mean of \(S\tau _{b_{all}}\) in Table 13 shows that metrics within the cumulative category (i.e., section CC) are very highly correlated with each other, which we can see from the category-wise mean of the sample mean values in Table 14. Here, section CC has a mean value of 0.79 for τ_{b}, which indicates a strong correlation coefficient between any metrics for any projects. For a more detailed picture, we look at the sample mean values in Table 13, where we can see that the metrics are mainly divided into three parts based on the strength of correlations. The first part consists of nine redundant metrics (from ncloc to public_api), which we have discussed earlier, i.e., due to ‘very strong’ τ_{b}, any of these metrics can explain the variability of the other metrics to a high degree. The second part has three metrics, public_undocumented_api, comment_lines, and directories, that are mostly within the τ_{b} level ‘strong’, except that public_undocumented_api is redundant to public_api (as discussed in the preceding section), and the correlation between public_undocumented_api and directories is ‘moderate’. The third part has duplicated_lines, duplicated_blocks, and duplicated_files, which are all ‘strongly’ correlated with each other but are correlated at a ‘moderate’ level with all the metrics from the other two parts.
Since the sample mean (i.e., from \(S\tau _{b_{all}}\)) reported in Table 13 is a more conservative measure than the mean of \(S\tau _{b_{sig}}\) in Table 21, we expect an equal or better result (higher τ_{b} value) in Table 21. From the data, we see a very small or no difference for most of the mentioned metrics due to the small number of nonsignificant and missing τ_{b} values in section CC.
Key Observations:

Metrics complexity, complexity_in_classes, and complexity_in_functions measure exactly the same aspect.

Metrics ncloc, functions, statements, complexity, classes, files, and public_api, together with the complexity metrics mentioned above, are all redundant.

Metrics public_api and public_undocumented_api are redundant.

Not a single correlation out of 105 total correlations within the cumulative metrics has a weak correlation coefficient.
Sections DD, AA, OO
As Table 14 shows, sections DD, AA, and OO have on average weak to moderate correlations compared to the strong correlations in section CC. Correlations among all three metrics (comment_lines_density, public_documented_api_density, and duplicated_lines_density) from the density category in section DD have weak τ_{b}. For section AA, the correlation between file_complexity and class_complexity from the average category has a strong τ_{b} of 0.71. The other two correlations (including function_complexity) for this section have moderate τ_{b}. In section OO, the correlation between new_duplicated_lines and new_duplicated_blocks is 0.94 and, based on our 0.9 limit, these two metrics are redundant. Correlations of these two metrics with new_lines are weak.
All four sections (related to intra-category correlations) have fewer nonsignificant τ_{b} values (in Table 12) than the other sections, and section OO has no nonsignificant τ_{b}. Section OO also has the lowest category-wise mean of standard deviations of τ_{b}, as reported in Table 15. The individual records of standard deviations in Table 18 show that redundant metrics within section CC have lower standard deviations than other metrics from the same category. Density metrics in section DD have the highest standard deviation, and we have to look into the ‘count of τ_{b}’ (Tables 23, 24, 25 and 26) and ‘sum of τ_{b}’ (Tables 27, 28, 29 and 30) tables to fully understand how τ_{b} values are distributed in this category.
Intra-category correlations among cumulative metrics are much higher than among density, average, and organic metrics. For cumulative, the category-wise mean of the sample mean of τ_{b} is 0.79, which is at the strong level and close to the very strong level. The average and organic categories, on the other hand, score 0.54 and 0.55, respectively, which corresponds to the moderate level. In addition, even though we do not see a single weak-level correlation in CC, more than half (5 out of 9) of the correlations in sections DD, AA, and OO have weak τ_{b}.
Key Observations:

Metrics new_duplicated_lines and new_duplicated_blocks (in section OO) are redundant.

All correlations among density metrics are weak.

Intracategory correlations for density, average, and organic metrics result in lower τ_{b} compared to cumulative metrics.
5.3.2 Inter-Category Correlations
Intercategory correlations are correlations between metrics that are measured differently. We want to see whether there exists any noticeable difference between intercategory and intracategory correlations.
Intercategory correlations are available in the six sections CD, CA, DA, CO, DO, and AO. These sections reflect all possible correlations, a total of 162, between metrics from all four categories. All these correlations result in weak τ_{b}, except three correlations that result in moderate τ_{b} (values 0.56, 0.54, and 0.52), as shown in Table 13. Looking at the categorywise mean in Table 14, we see that sections CO, DO, and AO, i.e., intercategory correlations of organic metrics with all other categories, are the lowest. Interestingly, standard deviations are also lower for these three sections related to the organic category (in Tables 18 and 15), meaning that when organic metrics are correlated with metrics from other categories, the variability of τ_{b} is low. When we look at the count of τ_{b} (Tables 23, 24, 25, and 26) and the sum of τ_{b} (Tables 27, 28, 29, and 30), we see that sections CO, DO, and AO have values only in Tables 26 and 30, which report weak τ_{b}. In the remaining tables these three sections have zero. For the other three sections (CD, CA, and DA), we see that the ‘count of τ_{b}’ and ‘sum of τ_{b}’ measures are available at all four levels and are most concentrated at the moderate level.
When we look at the difference between intra- and intercategory correlations of metrics, it is clear that they are different. The grand mean of the categorywise means (see Table 14) of intracategory correlations is 0.49 (from (0.79 + 0.09 + 0.54 + 0.55)/4); for intercategory correlations, it is 0.06 (from (0.08 + 0.24 + 0.01 + 0.03 − 0.01 + 0)/6).
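The two grand means can be checked with a few lines of arithmetic. The sketch below only re-adds the categorywise means quoted in the paragraph above; the variable names are illustrative and the values are kept in the order in which the text lists them.

```python
# Categorywise means of the sample-mean tau_b values, as quoted in the text
# (Table 14). The lists keep the order used in the text's formulas.
intra_means = [0.79, 0.09, 0.54, 0.55]               # four intracategory sections
inter_means = [0.08, 0.24, 0.01, 0.03, -0.01, 0.0]   # six intercategory sections

grand_mean_intra = sum(intra_means) / len(intra_means)
grand_mean_inter = sum(inter_means) / len(inter_means)

print(f"intracategory grand mean: {grand_mean_intra:.2f}")  # 0.49
print(f"intercategory grand mean: {grand_mean_inter:.2f}")  # 0.06
```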
The observation that correlation between metrics from different categories results in overall weak correlation is important because this tells us that metrics from different categories have low collinearity. Thus, software code metrics from different categories can be used together as features in models for prediction, forecast, etc.
Key Observations:

Intercategory correlations of metrics differ from intracategory correlations by resulting in much lower correlation coefficients.

Correlations of metrics from different categories result in overall weak correlation coefficients. Thus, code metrics from different categories are observed to be noncollinear.
Overall, we have observed that cumulative metrics are different because the intracategory correlations of cumulative metrics result in higher τ_{b} values compared to correlations of metrics within other categories.
5.4 Cumulative vs. Organic Metrics
The progression of software development can be tracked in different ways. Traditionally, we keep track of software through cumulative metrics. Cumulative metrics are intuitive in the sense that they give us an overall idea of the state of the system. The same thing, i.e., tracking the progression of software, can also be done in the organic way. Version control systems such as Git let us obtain the delta introduced by each revision. Taking the delta between each pair of consecutive software revisions, we can still calculate the total by linear addition. Hence, if we have either a cumulative or an organic measure, we can calculate one from the other.
This raises the question of whether the higher τ_{b} values among cumulative metrics are due to the cumulative way of measurement or simply because the underlying metrics are more correlated to each other. To test this, we have to transform all the cumulative metrics in this study into organic metrics (which we denote as organic_{t}), then compute correlations and check whether there is a significant difference. Note that in Table 2, the three organic metrics are different from the 15 cumulative metrics. Thus, we need to create a new set of organic metrics equivalent to the cumulative metrics. Since we have identified a few perfectly correlated cumulative metrics, they can act as our point of validation: perfectly correlated metrics should always be perfectly correlated no matter how they are measured. To check the difference between the cumulative and organic_{t} categories, we have considered the following hypothesis:
Null hypothesis H_{0}: The median difference between correlations of metrics from the cumulative and organic_{t} categories equals zero.
5.4.1 Transformed Organic Metrics
We have transformed all 15 cumulative metrics into organic_{t} metrics by taking the difference between consecutive revisions for each metric. We name these new metrics by adding an underscore before the name of the cumulative metric from which they are transformed, e.g., ncloc is transformed into _ncloc. It should be noted that the existing three metrics from the organic category do not take negative values. That is, if ncloc decreases, new_lines will hold a zero value because there are no new lines. Organic_{t} metrics, however, can hold negative values, reflecting a reduction of a metric’s measure between consecutive revisions.
5.4.2 Designing and Executing the Test
After computing the organic_{t} metrics, we went through the same procedure as we did for the other metrics in this study, i.e., checking the nature of the data. We found organic_{t} metrics to be similar to organic metrics in terms of kurtosis and short tails. In terms of skewness, organic_{t} metrics are not as extremely left skewed as organic metrics because organic_{t} metrics can take negative values. Overall, however, organic_{t} metrics are nonnormal.
We performed Kendall’s τ for all 39 metrics and derived tables similar to the tables mentioned earlier in Section 5. The sample mean of the significant correlation coefficients is presented in full in Table 16. However, in the interest of space, we only reported newly derived table sections for standard deviation, count of nonsignificant τ_{b}, and count of perfect correlations in Tables 31 and 32.
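To make the comparison concrete, the sketch below computes Kendall's τ_{b} by direct pair counting and applies it to a deliberately extreme synthetic case: two series of independent, strictly positive per-revision increments and their running sums. The data are random and hypothetical, not taken from the study, and the plain O(n²) implementation is for illustration only; in practice a library routine such as scipy.stats.kendalltau would normally be used.

```python
import math
import random
from itertools import accumulate

def kendall_tau_b(x, y):
    """Kendall's tau-b by direct pair counting, with the usual tie correction."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if (dx > 0) == (dy > 0):
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

random.seed(0)
# Two unrelated hypothetical metrics: independent positive increments per revision.
organic_a = [random.randint(1, 50) for _ in range(200)]
organic_b = [random.randint(1, 50) for _ in range(200)]
cumulative_a = list(accumulate(organic_a))
cumulative_b = list(accumulate(organic_b))

tau_cumulative = kendall_tau_b(cumulative_a, cumulative_b)
tau_organic = kendall_tau_b(organic_a, organic_b)
print(f"cumulative: {tau_cumulative:.2f}, organic: {tau_organic:.2f}")
```

Because both running sums only ever increase, every pair of revisions is concordant and τ_{b} is 1 for the cumulative series, even though the increments themselves are unrelated. Real metrics can also decrease, so the effect in practice is weaker than in this toy case.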
5.4.3 Results from the Test
In Tables 16, 31, and 32, we see that organic_{t} metrics are similar to organic metrics both in terms of intra and intercategory correlations.
We now examine whether organic_{t} metrics are able to reproduce the perfect correlations produced by the cumulative metrics. Following the referencing style of Table 11, we refer to the section containing τ_{b} from intracategory correlations of organic_{t} metrics as O_{t}O_{t}. Intracategory correlations of cumulative metrics (in section CC of Table 20) have a total of 46 perfect correlations across six correlations. For organic_{t} metrics, we see (in section O_{t}O_{t} of Table 32) that the 24 perfect correlations of four of these correlations are reproduced exactly. However, for the remaining two correlations ((complexity, complexity_in_functions) and (complexity_in_classes, complexity_in_functions)), 12 out of 22 perfect correlations are reproduced. We therefore look at the sample mean of these two correlations (see section O_{t}O_{t} of Table 16) and find a value of 0.994 for both; this value is so close to the maximum possible τ_{b} value of 1 that we can consider it a perfect correlation. Thus, we see that cumulative and organic_{t} metrics are equally able to detect perfect correlations among metrics, if any exist.
At this point, we focus on testing the null hypothesis. The mean of section O_{t}O_{t} of Table 16 is 0.49, which is much lower than the value of 0.79 for the cumulative metrics (in section CC of Table 14). However, without performing a statistical test, we cannot decide whether to reject our null hypothesis.
Earlier our data was the code metrics, but now it is the sample mean of the Kendall’s τ. So now we need to check the distributions of data to be tested, i.e., section CC and O_{t}O_{t} of Table 16. We have checked descriptive statistics, skewness, kurtosis, histograms, and also performed the ShapiroWilk test on these two data sets and found the data nonnormal. Thus, we have to choose nonparametric tests.
We can consider the data as paired because a τ_{b} value in section CC (say, one involving ncloc) and the corresponding τ_{b} value in section O_{t}O_{t} (i.e., the one involving _ncloc) both measure the same aspect, only measured differently. It can also be argued, however, that ncloc and _ncloc are two different measures and cannot be considered paired. Taking both arguments into account, we execute two tests to see whether there is any significant difference. Treating our two data sets as independent samples, we perform the ‘Mann-Whitney U Test’, and treating them as dependent samples, we perform the ‘Wilcoxon Signed Ranks Test’. After checking the assumptions of these tests, we remove two τ_{b} values from section CC and the two corresponding τ_{b} values from section O_{t}O_{t}. Since the correlations between the metrics complexity, complexity_in_classes, and complexity_in_functions in section CC, and between the corresponding organic_{t} measures in section O_{t}O_{t}, are identified as perfectly correlated in both sections, we keep only one of them and remove the other two so that the dependency among these data points is no longer present. The ‘Wilcoxon Signed Ranks Test’ also assumes that the differences between paired data are approximately equally distributed along the quartiles when plotted as a one-dimensional boxplot. This assumption was not met for our data. Since it is checked visually, we decided to include a ‘Paired-Samples Sign Test’ as well, which does not have such an assumption. We report the results from all three tests here.
We have a total of 103 pairs. From the test ranks, we see that for both the ‘Wilcoxon Signed Ranks Test’ and the ‘Paired-Samples Sign Test’ the count of negative differences is 0, the count of positive differences is 102, and the count of ties is one, taking CC as the first and O_{t}O_{t} as the second variable. Even without looking at the significance levels, these numbers tell us that the correlations in the organic_{t} category are lower than those in the cumulative category. From Figs. 9 and 10, we see that all three tests show a p-value less than 0.01 (i.e., considering α = 0.01). Thus, we can reject the null hypothesis and accept the alternative hypothesis, i.e., the median difference between correlations of metrics from the cumulative and organic_{t} categories is not equal to zero.
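The reported counts (102 positive differences, 0 negative, 1 tie) are enough to sketch how extreme the paired tests must be. The calculation below is an illustration under stated assumptions, not a reproduction of the statistical package used: it assumes ties are dropped, a continuity-corrected normal approximation for the sign test, and a normal approximation without tie correction for the Wilcoxon test. The function and variable names are ours.

```python
import math

def two_sided_normal_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

positives, negatives, ties = 102, 0, 1   # counts reported in the text
n = positives + negatives                # assumption: the tie is dropped

# Sign test, exact: two-sided binomial tail probability with p = 0.5.
k = min(positives, negatives)
p_sign_exact = min(1.0, 2 * sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n)

# Sign test, normal approximation with continuity correction (assumption).
z_sign = (abs(positives - negatives) - 1) / math.sqrt(n)
p_sign_normal = two_sided_normal_p(z_sign)

# Wilcoxon signed-rank, normal approximation. With no negative differences,
# the negative rank sum is 0 whatever the individual ranks are.
w_minus = 0
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no correction for tied ranks
z_wilcoxon = (w_minus - mean_w) / sd_w
p_wilcoxon = two_sided_normal_p(z_wilcoxon)

print(f"sign test (exact):      {p_sign_exact:.1e}")
print(f"sign test (normal):     {p_sign_normal:.1e}")
print(f"Wilcoxon (normal):      {p_wilcoxon:.1e}")
```

Under these assumptions the normal-approximation p-values come out at roughly 1.5e−23 for the sign test and 1.8e−18 for the Wilcoxon test, which is consistent with the significance values quoted in Section 5.6.2, and the exact sign-test p-value is smaller still. All of them are far below α = 0.01.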
5.4.4 Implications of the Test Results
The finding that there exists a significant median difference between correlations of metrics from the cumulative and organic categories implies that organic metrics are a set of measures that are collectively different from the cumulative metrics. Therefore, organic metrics can be considered a new set of features holding different characteristics than cumulative metrics as a whole.
The intracategory correlations of cumulative metrics are much higher than those of their equivalent sets of transformed organic metrics. Since correlations between cumulative metrics are high, there exists high collinearity among these metrics, and only a single metric from a group of highly correlated metrics can be considered a valid input feature for a predictive model. The high collinearity among software code metrics is not new information. However, the knowledge that transforming cumulative metrics into organic metrics can significantly reduce the collinearity is new. Since this transformation does not alter the original footprint of how software evolves, it is expected to be free of the side effects of normalization. Since organic metrics have lower collinearity, the chance of obtaining multiple valid input features from them is higher.
The intercategory correlations of metrics from different categories are observed to be consistently low (weak in all but three cases, which are moderate). This information can improve feature engineering by making the process more systematic. First, it gives a simple and intuitive understanding of noncollinearity based on measurement types. Second, it implies that transforming a feature into a different measurement type can produce a feature that is noncollinear with the original feature.
5.5 Discussion
Software engineering researchers have observed high collinearity among software code metrics. However, we did not have explicit knowledge of whether the high collinearity is due to the inherent nature of the code metrics or due to how we measure them. This study compares the correlation coefficients of a set of 15 cumulative metrics with the correlation coefficients of their corresponding organic metrics and reveals that a large portion of the collinearity among cumulative metrics results from the cumulative way of measurement. Since organic metrics are free from the effects of cumulation and represent the natural evolution of software, we think that the correlation coefficients among the organic metrics represent the inherent collinearity among code metrics. Taking the difference between the intracategory correlation coefficients of metrics from these two categories, we can determine the collinearity added by cumulative measurement.
High collinearity among a group of input features makes them weaker as a whole. We can get only a few valid input features from such a set because, during the validation process, features with high collinearity are removed. Since organic metrics have lower collinearity among themselves, we can possibly get more valid input features from them. Moreover, since the collinearity between cumulative and organic metrics is very low, we can combine them to have even more valid input features. The low collinearity between cumulative and organic metrics also means that, even though we can calculate a set of organic metrics corresponding to a set of cumulative metrics, the two sets are not redundant with each other. Therefore, theoretically, we do not have to restrict our choice of metrics to either of these two categories.
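One simple way to remove highly collinear features is a greedy filter over pairwise τ_{b} values. The sketch below is only an illustration of that idea, not the validation procedure of any particular modelling workflow. It uses the 0.9 redundancy limit and the 0.94 coefficient from section OO; the 0.2 entries are hypothetical placeholders standing in for the ‘weak’ correlations whose exact values are not given in the text, and the function name is ours.

```python
def drop_redundant(features, tau, threshold=0.9):
    """Greedily keep features whose |tau_b| with every kept feature is below threshold.

    `tau` maps frozenset({a, b}) to the pairwise coefficient; missing pairs are
    treated as 0.0 (an assumption made for this sketch).
    """
    kept = []
    for feature in features:
        if all(abs(tau.get(frozenset((feature, k)), 0.0)) < threshold for k in kept):
            kept.append(feature)
    return kept

features = ["new_duplicated_lines", "new_duplicated_blocks", "new_lines"]
tau = {
    # Value reported in the text for section OO.
    frozenset(("new_duplicated_lines", "new_duplicated_blocks")): 0.94,
    # Hypothetical placeholders for the correlations described only as 'weak'.
    frozenset(("new_duplicated_lines", "new_lines")): 0.2,
    frozenset(("new_duplicated_blocks", "new_lines")): 0.2,
}

selected = drop_redundant(features, tau)
print(selected)  # ['new_duplicated_lines', 'new_lines']
```

The result depends on the order in which features are visited: whichever member of a redundant pair comes first is the one that survives, so a real workflow would need an explicit rule for choosing the representative.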
It is interesting that when we add the density and average categories to the scenario described above for cumulative and organic metrics, noncollinearity still holds between metrics from different measurement categories. It can be noted that the unit of measurement of the density metrics is percentage. While cumulative and organic metrics have the same value type (i.e., integer), they are measured differently in the two categories.
The findings that organic metrics are collectively different from their corresponding cumulative metrics (i.e., intercategory), the understanding of the effect of cumulation on collinearity among metrics (i.e., intracategory), and the general observation that metrics from different measurement categories yield overall weak correlation coefficients are significant for researchers and practitioners in software engineering because they help us better understand the nature of code metrics.
The findings open the possibility of new research and of revisiting existing research in this field. Based on our results, we theoretically have more valid input features from different measurement categories; in practice, however, research needs to be conducted to determine which predictors (i.e., input metrics) are good for which targets (i.e., qualities that we want to predict or estimate). It could be the case that specific metrics from one category work better in combination with other metrics to predict a particular quality attribute. Researchers can also try to find new measurement categories and their properties. This study has investigated some code metrics; other code metrics not covered here can also be studied.
These results can help practitioners by giving them more insight into software metrics. Quality managers can reprioritize the metrics that a project keeps track of. Tool developers can rethink the metrics their tools support. For example, SonarQube automatically removes new_duplicated_lines, new_lines, new_duplicated_blocks, and some other similar organic metrics. Based on the findings of this study, its developers can decide to give users the option to keep track of such metrics. Software engineers working with predictive analysis have more options when choosing input features for their models. Suppose, for example, that complexity (of type cumulative) is identified as an input feature for a predictive model. At this point, the possibility of incorporating complexity-related metrics from different measurement categories can be explored, and complexity per file (of type average) and complexity per commit or per specific time period (of type organic) can be considered as input features.
5.6 Threats to Validity and Limitations
5.6.1 Internal Validity
Today, software is built with various languages, and it is very common for a project to use code written in different languages. Still, a project is usually designated with a specific language that indicates the major portion of its code. Programming languages use different constructs, and measures of code metrics may vary dramatically due to language differences. To eliminate this effect, we chose to focus on a single programming language, Java. It can be noted that Java is among the top three languages on GitHub.
We considered some other factors when selecting projects, such as project type, project size in LOC, number of revisions, and number of developers. While selecting the projects, we tried to combine these factors so that they are well represented. Besides, we looked at the number of issues and pull requests. We tried to select projects that have reported issues and pull requests because these are signs of active involvement of users and developers.
Tools measuring software code metrics are not perfect. Different tools implement measures differently even though they claim to measure the same aspect of code (Lincke et al. 2008). This study has selected one of the most widely used measurement tools for software quality, and we have observed very strong correlations between the cumulative metrics, similar to the major studies. However, selecting a popular tool and making observations similar to the major studies do not entirely remove this threat. This is a general issue for any study like this, and we are aware of it.
We mentioned earlier that we only checked the revisions from the master branch of a Git repository. This affects the granularity of the collected data. Based on the findings in this study, we know that reducing the cumulative effect reduces the correlation between metrics; thus, avoiding partial cumulation of data due to Git branches could possibly make the results of this study stronger. Therefore, we do not consider this a major threat to our results.
5.6.2 External Validity
There are millions of software projects hosted on GitHub. Generalizing results to such a large population is a key validity threat for any study, and we also identified generalizability as a considerable threat to validity. The selection of projects is one of the most important means of minimizing this threat. We tried to carefully select projects from well-known organizations to mitigate this risk. For example, we have included projects from Apache, a pioneering organization in the open source community, Microsoft, and other organizations with diverse portfolios. On the positive side, we have found highly significant results from different tests. For example, the Mann-Whitney U Test in Fig. 9a, the Wilcoxon Signed Ranks Test in Fig. 10a, and the Paired-Samples Sign Test in Fig. 10b have significance values of 4.4e−18, 1.8e−18, and 1.5e−23, respectively.
To safeguard the internal validity, we chose to restrict our focus to Java source code. This choice, however, affects the external validity by raising the question ‘do the results hold for source code written in other programming languages?’. Since the measurement types discussed in this study are independent of the programming language, such a threat is not highly significant, in our opinion. However, more research can be done to be certain.
6 Conclusions
This empirical research investigates whether measurement types of software code metrics have an effect on their correlations. By collecting and analyzing 24 code metrics from 11,874 revisions of 21 open source Java projects, we have found that measurement types have an effect on correlations. Analysis of the data shows that 10 out of the 15 metrics that are measured cumulatively are redundant, based on our criterion that two metrics with a correlation coefficient of 0.9 or above are redundant. When the cumulative effect of these metrics is removed by transforming these 15 metrics into type organic, only three of them are identified as redundant. These three metrics are identified as perfectly correlated (i.e., they measure exactly the same aspect) in both categories, implying that if metrics are truly correlated, correlations of their organic measures are able to identify it. In addition, our analysis shows that organic metrics result in significantly lower correlation coefficients than cumulative metrics for intracategory correlations. Furthermore, while some software metrics are closely related to each other, resulting in correlation coefficients high enough for them to be considered redundant, many of the higher correlation coefficient values are due to measuring these metrics cumulatively. In other words, we should not be surprised to see higher correlation coefficients for cumulative metrics, and we should be aware that measuring metrics by their natural development, i.e., organically, results in much lower correlation coefficients.
Another interesting finding is that correlations between metrics from different categories yield overall weak correlation coefficients. This finding is important because it means that metrics from the cumulative, density, average, and organic categories can be combined as features for predictive models. From another viewpoint, this could improve the process of feature engineering by providing the information that transforming a feature into a different measurement type produces a new feature that is noncollinear with the original feature.
We have discussed why Kendall’s τ version B is better suited to software projects. We have also discussed the landscape of correlation coefficients, outlining possible sets that consider both the correlation coefficient and the p-value, which can be helpful for this type of study.
This study has attempted to reveal the fundamental relationships between measurement types and correlations of software code metrics. More evidence is required to generalize and extend this knowledge. Thus, replicated studies can be conducted considering various metrics, measurement tools, programming languages, and project types.
References
Aggarwal CC (2013) Outlier Analysis. Springer Publishing Company, Incorporated, Berlin
Baxter G, Frean M, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: Proceedings of the 21st Annual ACM SIGPLAN Conference on Objectoriented Programming Systems, Languages, and Applications, ACM, New York, OOPSLA ’06, pp 397–412. https://doi.org/10.1145/1167473.1167507
Chidamber SR, Darcy DP, Kemerer CF (1998) Managerial use of metrics for objectoriented software: an exploratory analysis. IEEE Trans Softw Eng 24(8):629–639
Coleman D, Ash D, Lowther B, Oman P (1994) Using metrics to evaluate software system maintainability. Computer 27(8):44–49. https://doi.org/10.1109/2.303623
Concas G, Marchesi M, Pinna S, Serra N (2007) PowerLaws in a large objectOriented software system. IEEE Trans Softw Eng 33(10):687–708. https://doi.org/10.1109/TSE.2007.1019
Crawford L, Hobbs JB, Turner JR (2002) Investigation of potential classification systems for projects. In: Proceedings of the 2nd PMI Research Conference, Project Management Institute, Seattle, Washington, pp 181–190
Croux C, Dehon C (2010) Influence functions of the spearman and kendall correlation measures. Statistical Methods & Applications 19(4):497–515. https://doi.org/10.1007/s102600100142z
Deshpande A, Riehle D (2008) The total growth of open source. In: Open Source Development, Communities and Quality. https://doi.org/10.1007/9780387096841_16. Springer, Boston, pp 197–209
Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, Marquéz JRG, Gruber B, Lafourcade B, Leitão PJ, Münkemüller T, McClean C, Osborne PE, Reineking B, Schröder B, Skidmore AK, Zurell D, Lautenbach S (2013) Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36(1):27–46. https://doi.org/10.1111/j.16000587.2012.07348.x
El Emam K, Schneidewind NF (2000) Methodology for Validating Software Product Metrics. Technical Report NCR/ERC1076, National Research Council of Canada, Ottawa, Ontario, Canada, http://nick.adjective.com/work/ElEmam2000.pdf
Ferreira KAM, Bigonha MAS, Bigonha RS, Mendes LFO, Almeida HC (2012) Identifying thresholds for objectoriented software metrics. J Syst Softw 85(2):244–257. https://doi.org/10.1016/j.jss.2011.05.044. http://www.sciencedirect.com/science/article/pii/S0164121211001385
Field A (2009) Discovering Statistics Using SPSS. SAGE Publications, googleBooksID: a6FLF1YOqtsC
Gil Y, Lalouche G (2017) On the correlation between size and metric validity. Empir Softw Eng 22(5):2585–2611. https://doi.org/10.1007/s1066401795135
Hair J (2006) Multivariate Data Analysis. Pearson international edition, Pearson Prentice Hall. https://books.google.se/books?id=WESxQgAACAAJ
Henry S, Selig C (1990) Predicting sourcecode complexity at the design stage. IEEE Softw 7(2):36–44. https://doi.org/10.1109/52.50772
Henry S, Kafura D, Harris K (1981) On the relationships among three software metrics. In: Proceedings of the 1981 ACM Workshop/Symposium on Measurement and Evaluation of Software Quality. https://doi.org/10.1145/800003.807911. ACM, New York, pp 81–88
Janes A, Lenarduzzi V, Stan AC (2017) A continuous software quality monitoring approach for small and medium enterprises. In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, ACM, New York, ICPE ’17 Companion, pp 97–100. https://doi.org/10.1145/3053600.3053618
Jantunen S, Lehtola L, Gause DC, Dumdum UR, Barnes RJ (2011) The challenge of release planning. In: 2011 Fifth International Workshop on Software Product Management (IWSPM), pp 36–45. https://doi.org/10.1109/IWSPM.2011.6046202
Jay G, Hale JE, Smith RK, Hale D, Kraft NA, Ward C (2009) Cyclomatic complexity and lines of code: empirical evidence of a stable linear relationship. J Softw Eng Appl 02(03):137. https://doi.org/10.4236/jsea.2009.23020
Kazman R, Cai Y, Mo R, Feng Q, Xiao L, Haziyev S, Fedak V, Shapochka A (2015) A case study in locating the architectural roots of technical Debt. In: Proceedings of the 37th International Conference on Software Engineering, vol 2. IEEE Press, Piscataway, ICSE ’15, pp 179–188. http://dl.acm.org/citation.cfm?id=2819009.2819037
Kendall MG, Gibbons JD (1990) Rank Correlation Methods, 5th edn. Oxford University Press, London
Kim HY (2013) Statistical notes for clinical researchers: assessing normal distribution (2) using skewness and kurtosis. Restorative Dentistry & Endodontics 38(1):52–54. https://doi.org/10.5395/rde.2013.38.1.52. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591587/
Landman D, Serebrenik A, Bouwers E, Vinju JJ (2016) Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions. Journal of Software: Evolution and Process 28(7):589–618. https://doi.org/10.1002/smr.1760. http://onlinelibrary.wiley.com/doi/10.1002/smr.1760/abstract
Lethbridge TC, Sim SE, Singer J (2005) Studying software engineers: data collection techniques for software field studies. Empir Softw Eng 10(3):311–341. https://doi.org/10.1007/s106640051290x
Letouzey J, Ilkiewicz M (2012) Managing technical debt with the sqale method. IEEE Softw 29(6):44–51. https://doi.org/10.1109/MS.2012.129
Lincke R, Lundberg J, Löwe W (2008) Comparing software metrics tools. In: Proceedings of the 2008 International Symposium on Software Testing and Analysis, ACM, New York, ISSTA ’08, pp 131–142. https://doi.org/10.1145/1390630.1390648
Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodol 18(1):2:1–2:26. https://doi.org/10.1145/1391984.1391986
Mamun MAA, Berger C, Hansson J (2017) Correlations of software code metrics: an empirical study. In: Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement, ACM, New York, IWSM Mensura ’17, pp 255–266. https://doi.org/10.1145/3143434.3143445
McCabe TJ (1976) A complexity measure. IEEE Trans Softw Eng SE2(4):308–320. https://doi.org/10.1109/TSE.1976.233837
Meloun M, Militký J, Hill M, Brereton RG (2002) Crucial problems in regression modelling and their solutions. Analyst 127(4):433–450. https://doi.org/10.1039/B110779H. https://pubs.rsc.org/en/content/articlelanding/2002/an/b110779h
Meneely A, Smith B, Williams L (2013) Validating software metrics: a spectrum of philosophies. ACM Trans Softw Eng Methodol 21(4):24:1–24:28. https://doi.org/10.1145/2377656.2377661
Meulen MJPvd, Revilla MA (2007) Correlations between internal software metrics and software dependability in a large population of small C/C++ programs. In: The 18th IEEE International Symposium on Software Reliability (ISSRE ’07), pp 203–208. https://doi.org/10.1109/ISSRE.2007.12
Park HM (2009) Univariate analysis and normalitytest using sas, stata, and spss Technical Working Paper The University Information Technology Services (UITS) Center for Statistical and Mathematical Computing, Indiana University. https://scholarworks.iu.edu/dspace/handle/2022/19742
Riaz M, Mendes E, Tempero E (2009) A systematic review of software maintainability prediction and metrics. In: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, IEEE Computer Society, Washington, ESEM ’09, pp 367–377. https://doi.org/10.1109/ESEM.2009.5314233
Runeson P, Höst M (2008) Guidelines for conducting and reporting case study research in software engineering. Empir Softw Eng 14(2):131–164. https://doi.org/10.1007/s1066400891028
Saini S, Sharma S, Singh R (2015) Better utilization of correlation between metrics using Principal Component Analysis (PCA). In: 2015 Annual IEEE India Conference (INDICON), pp 1–6. https://doi.org/10.1109/INDICON.2015.7443299
Schroeder J, Berger C, Staron M, Herpel T, Knauss A (2016) Unveiling anomalies and their impact on software quality in model-based automotive software revisions with software metrics and domain experts. In: Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016. ACM, New York, pp 154–164. https://doi.org/10.1145/2931037.2931060
Shevlyakov GL, Vilchevski NO (2002) Robustness in Data Analysis: criteria and methods, Modern Probability and Statistics. VSP BV
Shihab E, Jiang ZM, Ibrahim WM, Adams B, Hassan AE (2010) Understanding the impact of code and process metrics on post-release defects: a case study on the Eclipse project. In: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, p 4
Succi G, Pedrycz W, Djokic S, Zuliani P, Russo B (2005) An empirical exploration of the distributions of the Chidamber and Kemerer object-oriented metrics suite. Empir Softw Eng 10(1):81–104. https://doi.org/10.1023/B:EMSE.0000048324.12188.a2
Tashtoush Y, Al-Maolegi M, Arkok B (2014) The Correlation among Software Complexity Metrics with Case Study. arXiv:1408.4523
Taylor R (1990) Interpretation of the correlation coefficient: a basic review. Journal of Diagnostic Medical Sonography 6(1):35–39
Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: Proceedings Third IEEE International Workshop on Source Code Analysis and Manipulation, pp 45–54. https://doi.org/10.1109/SCAM.2003.1238030
Xu W, Hou Y, Hung Y, Zou Y (2013) A comparative analysis of Spearman’s rho and Kendall’s tau in normal and contaminated normal models. Signal Process 93(1):261–276. https://doi.org/10.1016/j.sigpro.2012.08.005. http://www.sciencedirect.com/science/article/pii/S0165168412002721
Zhou Y, Leung H, Xu B (2009) Examining the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. IEEE Trans Softw Eng 35(5):607–623. https://doi.org/10.1109/TSE.2009.32
Acknowledgments
The authors would like to thank Dr. Aila Särkkä for her comments on some of the statistical analyses performed in this study, and the anonymous reviewers for their valuable comments and suggestions, which have significantly improved the clarity of this paper.
Additional information
Communicated by: Martin Shepperd
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
This appendix contains the tables that were left out of the main text to improve readability.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Mamun, M.A.A., Berger, C. & Hansson, J. Effects of measurements on correlations of software code metrics. Empir Software Eng 24, 2764–2818 (2019). https://doi.org/10.1007/s10664-019-09714-9
DOI: https://doi.org/10.1007/s10664-019-09714-9
Keywords
 Software code metrics
 Measurement effects on correlations
 Collinearity
 Software engineering
 Cumulative measurement