1 Introduction

The exponential growth of software size (Deshpande and Riehle 2008) brings many challenges related to maintainability, release planning, and other software qualities. Consequently, there is a natural demand to predict external product quality factors in order to foresee the future state of software. Maintainability is related to the size, complexity, and documentation of software (Coleman et al. 1994). Size and complexity metrics are commonly used, among other metrics, to predict software maintainability (Riaz et al. 2009). Growing software size and complexity have made it increasingly difficult to select features to be implemented in the next product release and have challenged existing assumptions and approaches for release planning (Jantunen et al. 2011).

Validating software metrics has gained importance as predicting external software qualities becomes ever more important for managing future revisions of software. Researchers have proposed many validation criteria for software metrics over the last 40 years; for example, a list of 47 criteria is reported in a systematic literature study by Meneely et al. (2013), one of which is non-collinearity. Collinearity (also known as multicollinearity) exists between two independent features if they are linearly related. Since prediction models are often multivariate, i.e., use more than one independent feature or metric, it is important that there is no significant collinearity among the independent features. Collinearity results in two major problems (Meloun et al. 2002). First, it makes a model less useful because the individual effects of the independent features on a dependent feature can no longer be isolated. Second, extrapolation is likely to be highly erroneous. Thus, El Emam and Schneidewind (2000) and Dormann et al. (2013) suggested diagnosing collinearity among the independent features for a proper interpretation of regression models.

Many studies have explored correlations between various software metrics such as McCabe’s cyclomatic complexity (Landman et al. 2016; Henry and Selig 1990; Henry et al. 1981; Tashtoush et al. 2014; Jay et al. 2009; Meulen and Revilla 2007), lines of code (Landman et al. 2016; Henry and Selig 1990; Tashtoush et al. 2014; Jay et al. 2009; Meulen and Revilla 2007), Halstead’s metrics (Henry and Selig 1990; Henry et al. 1981; Tashtoush et al. 2014; Meulen and Revilla 2007), Kafura’s information flow (Henry et al. 1981), and number of comments (Meulen and Revilla 2007). Most of these studies have observed that code metrics are highly correlated. However, they do not address whether the measurement types of metrics affect their correlations, which is the primary difference between these studies and our study. Rather than taking the usual route of checking correlations of code metrics, we focus on whether the construction of code metrics (i.e., how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding the collinearity of code metrics. A description of the measurement types used in this study is given below.

  • Cumulative: This is the traditional and most common way software code metrics are measured: as a cumulative or running sum. Here, by a revision, we mean a commit, which is a single entry of source code in a repository. For example, if the numbers of total Lines Of Code (LOC) written in a project’s first three revisions are 50, 30, and 30 respectively, the corresponding cumulative measures of ncloc for these revisions are 50, 80, and 110.

  • Density: This measures how prevalent a property is per standard unit of an artifact. Generally, a density is a ratio. In the context of code, we consider 100 LOC as the unit, so the measurement becomes a percentage. Under this convention, such a metric can take values from 0 to 100. For example, the metric comment_lines_density measures lines of comments per 100 lines of code.

  • Average: This is the mean value of a measure with respect to artifacts of a specific type. An example of such a metric is file_complexity, which measures the mean complexity per file.

  • Organic: A metric that measures an artifact from a single revision or from two consecutive revisions, without being influenced by any other revisions in the repository, is organic. We introduce the term organic because, unlike a cumulative measure, such a measure is not affected by the unbounded list of preceding revisions. An organic metric can measure purely from a single revision, e.g., new_lines measures the lines of code (that are not comments) specific to a single revision. It can be zero if no new code is added in a revision; however, it cannot be negative (like a code churn measure). An organic metric can also measure a single revision relative to its one preceding revision; since in this case it reflects a change or delta compared to the preceding revision, it can be positive, negative, or zero. (A short code sketch after this list illustrates the four measurement types.)
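To make the four measurement types concrete, the following small Python sketch computes one metric of each type for a toy three-revision project. The formulas for the derived metrics are illustrative only; the exact definitions used in this study are those listed in Table 2.

```python
# Illustrative sketch (not SonarQube's implementation) of the four
# measurement types for a toy project with three revisions.
new_lines  = [50, 30, 30]    # organic: new lines of code per revision
comments   = [10, 14, 21]    # comment lines present at each revision
files      = [5, 6, 8]       # number of files at each revision
complexity = [20, 26, 40]    # total complexity at each revision

# Cumulative: running sum of the per-revision (organic) values, e.g. ncloc.
ncloc, total = [], 0
for added in new_lines:
    total += added
    ncloc.append(total)                          # -> [50, 80, 110]

# Density: comment lines per 100 lines of code, bounded to [0, 100]
# (one plausible formula; the exact denominator follows Table 2).
comment_lines_density = [100 * c / (n + c) for c, n in zip(comments, ncloc)]

# Average: mean complexity per file at each revision.
file_complexity = [cx / f for cx, f in zip(complexity, files)]

print(ncloc, comment_lines_density, file_complexity, new_lines)
```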

The core idea of this study was developed while following up a previous study (Mamun et al. 2017) in which we focused on the domain-level correlation of metrics from four domains, namely size, complexity, documentation, and duplications. In the follow-up study, we explored correlations at the metric level and observed that the organic metrics consistently have lower correlations. Based on this observation, we initiated and designed the present study by grouping the code metrics according to how they are measured.

Due to the problems of collinearity when building predictive models, many studies have investigated how different metrics are correlated with each other. However, to our knowledge, no study has investigated the impact of measurement types on the correlations of software code metrics. This knowledge is fundamental to a better understanding of the metrics. With the goal of understanding the relationship between measurement types and correlations of software code metrics, this study addresses the following research question.

  • RQ: How do measurement types affect the correlations of code metrics?

This study selected 21 open source projects from eight organizations and analyzed the source code of a total of 11,874 revisions across all projects to extract code metrics. We mined 24 software metrics classified into four categories: cumulative, density, average, and organic.

The complete revision histories of the selected projects were analyzed using a static analysis tool to generate code metrics. The code metrics were then mined from the tool’s database for analysis. Before performing data analysis, the data was explored using various visual and theoretical statistical tools. Based on the nature of the data, we selected Kendall’s τ-B (a non-parametric correlation method) for all selected projects; the motivation for selecting Kendall’s τ-B is discussed in Section 4. Correlation coefficients are divided into different sets based on their level of strength and level of significance. Based on the results up to this point, we transformed all cumulative metrics into organic metrics and ran statistical tests to determine whether there is a significant difference in correlation between these two sets.

Results of this study indicate how the correlations of code metrics are influenced by their measurement types, i.e., the way they are measured. We can see whether intra-category correlations (between metrics of the same category) differ from inter-category correlations (between metrics of different categories). Based on the data analysis, we also report whether there is a significant difference between intra-category correlations of cumulative metrics and intra-category correlations of organic metrics. These understandings are fundamental because they can reveal whether high collinearity between code metrics is due to their measurement types. Such knowledge can help in making an informed decision when selecting code metrics as features for predictive models.

In the following sections of this paper, we first discuss the methodology, including the design of the study, data collection procedures, the nature of the collected data, and data processing. Based on the nature of the data observed in Section 3.3, Section 4 (data analysis method) presents a comparative discussion of applicable correlation methods and the pros and cons of different measures for aggregating results from the data. Section 5 shows results and implications. Based on some of these results, this study performed an additional test; retaining the actual workflow of the study, we have placed the design and execution of this test in Section 5.4. Section 5 also includes the discussion, limitations, and validity threats of this study. Finally, Section 6 summarizes the conclusions.

2 Related Work

Software code metrics are generally known to be highly correlated, as many studies have reported high correlations among various code metrics. A systematic literature review from 2016 (Landman et al. 2016) summarizes 33 articles reporting correlations between McCabe’s cyclomatic complexity (McCabe 1976) and LOC (lines of code). Henry and Selig (1990) reported correlations of five code metrics (LOC, three of Halstead’s software-science metrics (N, V, and E), and McCabe’s cyclomatic complexity). They worked with code written in Pascal and observed three notably high correlations (the values in parentheses indicate the correlation coefficients): Halstead N - Halstead V (0.989), LOC - Halstead N (0.893), and LOC - Halstead V (0.885). Henry et al. (1981) compared three complexity metrics: McCabe’s cyclomatic complexity, Halstead’s effort, and Kafura’s information flow. Taking the UNIX operating system as a subject, they found McCabe’s cyclomatic complexity and Halstead’s effort to be highly correlated, while Kafura’s information flow was found to be independent. On NASA’s open dataset, Tashtoush et al. (2014) studied cyclomatic complexity, Halstead complexity, and LOC. Similar to Henry et al. (1981), they found a strong correlation between cyclomatic complexity and Halstead’s complexity; LOC was observed to be highly correlated with both of these complexity metrics. Jay et al. (2009), in a comprehensive study, also explored the relationship between McCabe’s cyclomatic complexity and LOC. They worked with 1.2 million C, C++, and Java source files randomly selected from the SourceForge code repository and reported that cyclomatic complexity and LOC have a practically perfect linear relationship irrespective of programming language, programmer, code paradigm, and software process. Comparing four internal code metrics (McCabe’s cyclomatic complexity, Halstead volume, LOC, and number of comments), Meulen and Revilla (2007) used 59 specifications, each containing between 111 and 11,495 small (up to 40 KB file size) C/C++ programs. They observed strong correlations between LOC, Halstead volume, and cyclomatic complexity. A recent study by Landman et al. (2016) on extensive Java and C corpora (17.6 million Java methods and 6.3 million C functions) found no linear correlation between cyclomatic complexity and LOC strong enough for the metrics to be considered redundant. This finding contradicts many earlier studies, including Henry et al. (1981), Tashtoush et al. (2014), Saini et al. (2015), Jay et al. (2009), and Meulen and Revilla (2007).

The studies discussed here mostly cover McCabe’s cyclomatic complexity, Halstead’s metrics, and LOC, investigating correlations between them and reporting different results. However, they do not address whether the measurement types of the studied metrics affect the strength of correlations, which is the primary difference between these studies and ours. Rather than taking the usual route of checking correlations of code metrics, we focus on whether the construction of code metrics (i.e., how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding the collinearity of code metrics.

Zhou et al. (2009) reported that size metrics have confounding effects on the associations between object-oriented metrics and change-proneness. In a revisited study, Gil and Lalouche (2017) reported similar results about the confounding effect of the size metric. Zhou et al. (2009) elaborately explained the confounding effect and models to identify it in areas such as health sciences and epidemiological research. Gil and Lalouche (2017) used normalization as a way to remove the confounding effect. While they mentioned lower correlation coefficients for the normalized metrics, they did not explicitly report the overall difference between correlations coming from the intra-cumulative and the intra-normalized measures, nor did they report whether there is a statistically significant difference between the two. This is understandable, as their primary focus is on the validity of metrics. Our focus, in contrast, is solely on understanding the effects of measurement types on the correlations of code metrics. We want to understand how much of the collinearity comes from the types of measures and how much of it exists naturally.

There have been studies toward understanding the distributions of software metrics. For example, Wheeldon and Counsell (2003), Concas et al. (2007), and Louridas et al. (2008) have investigated whether power-law distributions are present among software metrics. They reported that various software metrics follow different distributions with long, fat tails. Louridas et al. (2008) also reported correlations among eight software metrics, including LOC and number of methods, and found a high correlation between LOC, number of methods (NOM), and the out-degree of classes. Baxter et al. (2006) reported a similar study but, unlike Wheeldon and Counsell (2003), observed some metrics that do not follow power laws. They opined that their use of a more extensive corpus compared to Wheeldon and Counsell (2003) is the reason for the difference. In addition to looking at the distributions of metrics, Ferreira et al. (2012) attempted to establish thresholds or reference values for six object-oriented metrics. We have also looked at the statistical properties of the studied metrics, including their distributions. However, we have done this as part of our methodology to find appropriate statistical methods, and it is not the main focus of this study.

Chidamber et al. (1998) investigated six Chidamber and Kemerer (CK) metrics and reported high collinearity among three of them: coupling between objects (CBO), response for a class (RFC), and NOM. Succi et al. (2005) studied to what extent collinearity is present in the CK metrics. They suggested avoiding the RFC metric entirely as an input feature for predictive models due to its high collinearity with the other CK metrics. Given the problems of collinearity, Shihab et al. (2010) proposed an approach to select metrics that are less collinear from a set of metrics. These studies mention collinearity as a problem and report collinearity among software metrics or propose methods to select metrics with low collinearity. However, they have not investigated the perspective of measurement types influencing collinearity.

3 Methodology

We have designed this empirical study following the guidelines of Runeson and Höst (2008) on designing, planning, and conducting case studies. This study is explorative, with the intent of finding insights about the relations between code metrics with different measurement types. We designed the study to minimize bias and validity threats and to maximize the reliability of the results, which involves project selection, data extraction, data cleaning, exploring the nature of the data, selecting appropriate statistical analysis methods based on the nature of the data, and being conservative when selecting and instrumenting statistical analyses.

Data sources for this research are open source software projects, more specifically, open source Java projects on GitHub. Java is among the top three most frequently used project languages on GitHub. Since the extracted data is quantitative, the analysis methods used in this study are quantitative. We followed the third-degree data collection method described by Lethbridge et al. (2005). First, the case and the context of the study are defined, followed by data sources and criteria for data collection. Assumptions of the statistical methods are thoroughly checked, which involves exploring the nature of the data and cleaning the data as necessary. Regardless of the measurement type, the extracted data is non-normal to the extent that a meaningful transformation is not possible. Thus, we use non-parametric statistical methods to analyze the data in this study.

3.1 Project Selection

GitHub’s search functionality was used to find candidate projects. However, due to the limited capabilities of GitHub’s search functionality, it was not possible to perform a compound query that would fulfill all our criteria. Project selection was not randomized because we wanted to ensure that the selected projects meet specific criteria (e.g., minimum LOC, minimum commits) and come from well-known development organizations, so as not to raise obvious validity questions, e.g., “the project is unrepresentative because it is a classroom project by a novice programmer.” Thus, finding projects from reputed organizations was exploratory. We started by screening projects from the 14 organizations listed in GitHub’s open source organizations showcase. We then explored whether other well-known organizations that we were aware of, e.g., Apache, host their projects on GitHub but are not in the showcase. For each organization, we queried for Java projects. As we wanted to minimize blocking effects from using multiple languages, we decided to stick with a single programming language; we selected Java as it is a top-ranked programming language on GitHub.

Crawford et al. (2002) presented various methods for classifying software projects. We take a more straightforward approach to ensure that our selected projects are representative with regard to size. A study on the dataset of the International Software Benchmarking Standards Group (ISBSG) classified software projects based on “Rule’s Relative Size Scale”. Its measurements are based on IFPUG MkII and COSMIC, which are also translated into equivalent LOC. The combined distribution of all projects shows that more than 93% of the projects are between S (small) and L (large), where S is estimated as 5300 LOC and L as 150,000 LOC. We roughly followed this finding and selected projects so that project sizes are approximately uniformly distributed within the range of S to L. Our projects range from 4059 LOC to 155,260 LOC, measured on the latest revision of each project. Project sizes were determined with the cloc tool using a bash script that extracts total LOC and Java LOC; a small sketch of this step is shown below. We selected 21 GitHub projects from eight software organizations where Java is tagged as the project language.
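The original size measurement was done with a bash script around cloc; the following Python sketch is an illustrative equivalent, assuming cloc is installed and supports JSON output. The function name and the local path are hypothetical.

```python
# Minimal sketch (assuming cloc with --json output): extract total LOC and
# Java LOC for the working tree of a cloned project.
import json
import subprocess

def project_size(path):
    """Return (total LOC, Java LOC) for the source tree at `path`."""
    out = subprocess.run(["cloc", "--json", path],
                         capture_output=True, text=True, check=True)
    report = json.loads(out.stdout)
    total_loc = report["SUM"]["code"]                 # all languages combined
    java_loc = report.get("Java", {}).get("code", 0)  # Java only
    return total_loc, java_loc

print(project_size("zookeeper"))   # hypothetical local clone
```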

An overview of the selected projects is given in Table 1. In this table, ‘analyzed revisions’ indicates all the commits from which we collected data; these come exclusively from the master branch of the Git repositories. ‘Total revisions’ indicates all commits available in the Git repository, including branches. Even though the selected projects are classified as Java projects, they also contain source code in other languages; thus, ‘total code’ indicates the number of lines of code in all languages and ‘Java code’ indicates only the lines of Java source code. The table field ‘latest commit’ points to the HEAD of a Git repository at the time we downloaded it. The durations of the projects are presented in the whiskers box plot in Fig. 1, showing project durations from five months to 109 months with a median of 43 months. About 33% of the projects are within the fourth quartile, ranging from 57 to 109 months.

Fig. 1: Whiskers box plot showing durations of the selected projects

Table 1 Overview of the selected projects

3.2 Data Collection and Metrics Selection

We used SonarQube to analyze revisions of the selected projects. Kazman et al. (2015) mentioned SonarQube as the de-facto source code analysis tool for automatically calculating technical debt. It has gained popularity in recent years, and Janes et al. (2017) mentioned SonarQube as the de-facto industrial tool for software quality assurance. The tool is based on the SQALE methodology (Letouzey and Ilkiewicz 2012). We used SonarQube version 6.1 and Sonar-Scanner version 2.8.

We ran SonarQube on each revision available in the master branch of a project. Since we ignore sub-branches, the number of analyzed revisions is less than the number of total revisions reported in Table 1. Sub-branches are eventually merged into the master branch, which means we do not lose anything except the granularity of data.

Analyzing 11,874 software revisions needs to be automated. Python scripts were used to automate the process of traversing the commits (revisions) on the master branch of a project’s Git repository and running the SonarQube analysis on each commit; a simplified sketch of this automation is shown below. SonarQube provides web services covering a range of functionalities, including mining analysis results and software metrics. We observed that some metrics, such as new_lines, are visible on SonarQube’s web interface but cannot be mined through the web services. We later found that SonarQube computes some metrics only for the latest software revision and then removes them automatically.
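The actual scripts are available in the public dataset referenced later; the sketch below only illustrates the idea. It walks the master branch oldest to newest and invokes sonar-scanner on every revision; the scanner properties shown are illustrative, and server URL and authentication settings are omitted.

```python
# Simplified sketch: analyze every master-branch commit of a Git repository
# with sonar-scanner, in chronological order. Not the study's actual script.
import subprocess

def analyze_master_branch(repo_path, project_key):
    # List master-branch commits oldest-first.
    log = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--reverse", "master"],
        capture_output=True, text=True, check=True)
    for sha in log.stdout.split():
        # Check out the revision (detached HEAD) and analyze it.
        subprocess.run(["git", "-C", repo_path, "checkout", "--force", sha],
                       check=True)
        subprocess.run(
            ["sonar-scanner",
             f"-Dsonar.projectKey={project_key}",
             f"-Dsonar.projectBaseDir={repo_path}",
             "-Dsonar.sources=."],
            check=True)

analyze_master_branch("malmo", "malmo")   # hypothetical usage
```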

Since we did not find any option to stop the auto-deletion, we added triggers and additional tables to SonarQube’s SQL database to recover the deleted records. In total, 47 metrics were mined from the database, classified into six major domains: size, documentation, complexity, duplications, issues, and maintainability. The classification is based on what the metrics measure.

In our earlier study (Mamun et al. 2017), we used SonarQube’s classification of metrics and explored domain-level and metric-to-domain-level relationships. From the results of the metric-to-domain-level relationships in that study, we had an indication that metrics that measure artifacts based on individual values from each revision (i.e., organic metrics) result in a lower overall correlation. Since metrics such as new_lines (organic) are inherently different from metrics such as ncloc (cumulative) with respect to how they measure artifacts, this was understandable. However, as we started the follow-up study exploring metric-level correlations, we observed that organic metrics have much lower correlations than other types of metrics. This observation led us to rethink how the metrics should be grouped for comparison, so the grouping criterion changed from “what they measure” to “how they measure” artifacts. We looked at the metrics classified by SonarQube into four domains (size, complexity, documentation, and duplications) based on “what they measure”. Reviewing them, we identified 24 metrics of four measurement types: cumulative, density, average, and organic. Table 2 shows the selected metrics classified into these four measurement types along with a short description and value type, taken from the MySQL database of SonarQube 6.1 and the metric definitions page.

Table 2 Classification of the selected code metrics according to “How They Measure”

Table 3 shows metric data corresponding to five revisions (commits) of a project. In this table, each row represents a software revision. For the project malmo, we analyzed 295 revisions; thus, we have 295 data rows from this project with the same structure as Table 3. Data used for this study is measured at the project level. For example, for ncloc, all lines of Java code in the entire project are counted; for classes, all classes within the scope of the project are counted. In Table 3, the value of ncloc for the whole project is 9573 for revision 88. In the next revision (89), ncloc becomes 9590, an increase of 17 lines of code. However, the corresponding new_lines metric for this revision is 25, indicating the actual number of lines of code added in this revision, disregarding possible changes or deletions of code. All the metrics are calculated by SonarQube according to the descriptions in Table 2. Among the four categories, density and average metrics are derived metrics, meaning they are computed from the cumulative metrics. Equations for constructing the density and average metrics are given in Table 2.

Table 3 Metric data representing five revisions of a project

The SQL code to instrument the SonarQube database, the Python code to automatically analyze the commits of a Git project with SonarQube, the MySQL code to retrieve the desired data from the database, and the collected data used for this study are published as a public dataset.

3.3 Exploring Nature of Data

Among the various probability distributions, the normal distribution is the one most anticipated by researchers due to its relationship with the central limit theorem. Statistical methods are based on assumptions; after collecting data and before deciding on the type of statistical methods, researchers need to investigate the nature of the collected data. A crucial part of this is checking the distribution of the data. If the distribution is normal, parametric statistical tests are considered. If the data is non-normal and a meaningful transformation is not possible, non-parametric tests are considered.

There is no straightforward way of determining whether particular data is normally distributed. Sample size plays a significant role in statistical tests for normality. There are graphical and numerical methods, each of which can be either descriptive or theoretical (Park 2009). Since our selected projects have varying numbers of revisions, we have chosen a combination of tests appropriate to our data, as shown in Table 4.

Table 4 Considered methods for normality test

A histogram, a frequency distribution, is considered a useful graphical test when the number of observations is large. It is particularly helpful because it can capture the shape of the distribution, provided the bin size is appropriately chosen. If the data is far from a normal distribution, a single look at the histogram tells us that the data is non-normal. It also gives a rough understanding of the overall nature of the distribution, such as skewness, kurtosis, or the type of distribution (e.g., bi- or multi-modal). Box plots are useful for identifying outliers and comparing the distribution of the quartiles. A normal Q-Q plot (quantile-quantile plot) is a graphical comparison of two distributions; if the data is normally distributed, the data points in the normal Q-Q plot approximately follow a straight line. It also helps us understand the skewness and tails of the distribution. The graphical methods help us understand the overall nature of the data quickly, but they do not provide objective criteria (Park 2009).
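For readers who want to reproduce these three graphical checks, a rough sketch using matplotlib and scipy could look as follows; the study itself used SPSS, and the function name and bin count here are illustrative.

```python
# Sketch: histogram, box plot, and normal Q-Q plot for one metric column.
import matplotlib.pyplot as plt
from scipy import stats

def distribution_plots(values, metric_name):
    fig, (ax_hist, ax_box, ax_qq) = plt.subplots(1, 3, figsize=(12, 4))
    ax_hist.hist(values, bins=30)                     # frequency distribution
    ax_hist.set_title(f"Histogram of {metric_name}")
    ax_box.boxplot(values)                            # quartiles and outliers
    ax_box.set_title(f"Box plot of {metric_name}")
    stats.probplot(values, dist="norm", plot=ax_qq)   # normal Q-Q plot
    ax_qq.set_title(f"Normal Q-Q plot of {metric_name}")
    fig.tight_layout()
    plt.show()
```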

There are different numerical methods to evaluate whether data is normally distributed. Skewness and kurtosis are commonly used descriptive tools for this purpose. For a perfectly normal distribution, the statistics for these two analyses should be zero. Since this does not happen in practice, we calculate a z-score by dividing the statistic by its standard error. However, determining normality from the z-score is not straightforward either. Kim (2013) discussed how the sample size can affect the z-score. Field (2009) and Kim (2013) suggested considering different criteria for skewness and kurtosis based on the sample size: for a sample size below 50, the absolute z-score for either statistic should not exceed 1.96 (corresponding to an α of 0.05); for a sample size below 200, it should not exceed 3.29 (corresponding to an α of 0.001). For a sample size of 200 or more, it is more meaningful to inspect the graphical methods and the skewness and kurtosis statistics themselves instead of evaluating their significance, i.e., the z-score.

From the analytical numerical methods, we consider Shapiro-Wilk and Kolmogorov-Smirnov for the normality test. Shapiro-Wilk works better with sample sizes up to 2000, and Kolmogorov-Smirnov works with large sample sizes (Park 2009). The literature reports different maximum sample sizes for these tests: for example, Kim (2013) observed that sample sizes over 300 might produce unreliable results for these two tests, and Hair (2006) suggested a range of 30 to 1000. Our metrics have sample sizes from 52 to 2302. For large sample sizes, numerical methods can be unreliable; thus, Field (2009) suggested using the graphical methods alongside the numerical methods to make an informed decision about the nature of the data. We computed all these tests in the statistical software package SPSS. Note that SPSS reports excess kurtosis, i.e., it subtracts three from the kurtosis value.
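The following minimal sketch (scipy assumed; the study used SPSS) shows the corresponding numerical checks for one metric column. The standard-error approximations sqrt(6/n) and sqrt(24/n) are the usual large-sample formulas, and the Kolmogorov-Smirnov call tests against a standard normal, which only loosely approximates SPSS’s Lilliefors-corrected procedure.

```python
# Sketch: numerical normality checks for one metric column.
import numpy as np
from scipy import stats

def normality_summary(values):
    x = np.asarray(values, dtype=float)
    n = len(x)
    skew_z = stats.skew(x) / np.sqrt(6.0 / n)        # skewness z-score (approx.)
    kurt_z = stats.kurtosis(x) / np.sqrt(24.0 / n)   # excess kurtosis ("kurtosis - 3")
    sw_stat, sw_p = stats.shapiro(x)                 # Shapiro-Wilk
    ks_stat, ks_p = stats.kstest(stats.zscore(x), "norm")  # Kolmogorov-Smirnov
    return {"skew_z": skew_z, "kurtosis_z": kurt_z,
            "shapiro_p": sw_p, "ks_p": ks_p}
```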

We now present data for four metrics (ncloc, comment_lines_density, new_lines, and file_complexity), each from a different measurement category, from the Apache zookeeper project. Even though the measures for these metrics vary per project, we present them here so the reader has a rough idea of what sample data from a project looks like. Table 5 shows the descriptive statistics for these four metrics. The sample size is the same (1473) for all of them, and the minimum, maximum, mean, and standard deviation values come from the whole sample. For example, among the 1473 samples of new_lines, the minimum value is 0 (a revision that adds no new lines of code) and the maximum value is 19,055 (a revision that adds 19,055 new lines of code). Based on these statistics, it is evident that ncloc is entirely different from new_lines. On the other hand, the two metrics from the density and average categories are quite similar to each other but different from the cumulative and organic metrics. The most notable number in this table is the very large standard deviation of 18,629.9 for ncloc. The cumulative way of measurement and the large sample size are the key reasons for such a high standard deviation; the other cumulative metrics in our dataset show a similar effect of cumulation. These descriptive statistics give us a quick overall idea about the nature of the data, e.g., its distribution and dispersion.

Table 5 Descriptive statistics for four example metrics from the Apache zookeeper project

The histogram, box plot, and normal Q-Q plot for each of these four metrics are presented in Figs. 2, 3, 4, and 5. The organic metric new_lines (see Fig. 5) has a very high peak (in the histogram), a considerable number of outliers (in the box plot), and a positively skewed plot (normal Q-Q). The other metrics from the organic category have similar properties. Because of the very high peak in the distribution, the quartiles in the box plot are not even visible. The file_complexity metric of type average (see Fig. 4) has a bimodal distribution in the histogram. The outliers in the box plot and the disconnected observed values in the normal Q-Q plot are due to the bimodal distribution of the file_complexity metric. It is interesting to see in the Q-Q plot that both distributions of file_complexity are skewed in opposite directions. Many metrics in the average and density categories, and even in the cumulative category, have bimodal distributions, and some have multimodal distributions. The distributions of the average and density metrics appear more normal than the distributions of the cumulative metrics. However, all of them are still far from a normal distribution. Comparing them all, the organic metrics are distinctly different from the metrics of the other categories. When specific types of plots are compared, e.g., all histograms of the metrics from one category, the plots of the organic metrics are visually more consistent with each other than the plots of metrics from other categories. On the other hand, when plots are compared across categories, the distributions of metrics from the density and average categories are observed to be more similar to each other than the others.

Fig. 2: Histogram, box plot and normal Q-Q plot for the ncloc metric (cumulative type) from the Apache zookeeper project

Fig. 3: Histogram, box plot and normal Q-Q plot for the comment_lines_density metric (density type) from the Apache zookeeper project

Fig. 4: Histogram, box plot and normal Q-Q plot for the file_complexity metric (average type) from the Apache zookeeper project

Fig. 5: Histogram, box plot and normal Q-Q plot for the new_lines metric (organic type) from the Apache zookeeper project

The distributions and their properties depicted by the graphical methods deviate so much from normality that we could have omitted additional tests. However, we performed them as part of the study design. Since our sample sizes vary, we gained some insights about the tests themselves; such observations are, however, beyond the scope and focus of this study and are not reported here.

The skewness and kurtosis results are presented in Table 6. Interestingly, even though we have a considerable sample size, none of the metrics indicate a normal distribution based on the z-scores in this table: the data is non-normal because the absolute z-scores are greater than 3.29, thus rejecting the null hypothesis. By contrast, ncloc and file_complexity from the Apache kafka project, which has the largest sample size (2302), have z-scores within the limits of normality; we nevertheless reject normality for them because the graphical methods show no signs of a normal distribution.

Table 6 Skewness and Kurtosis check for four example metrics from the Apache zookeeper project

We consider the Shapiro-Wilk and Kolmogorov-Smirnov tests reliable up to a sample size of 2000. Results from these tests are shown in Table 7, giving a very strong indication of non-normal data (the significance values are less than the considered α = 0.05).

Table 7 Shapiro-Wilk and Kolmogorov-Smirnov tests for four example metrics from the Apache zookeeper project

Like these four metrics from the Apache zookeeper project, the other metrics from these measurement categories and from the other projects are also observed to be non-normal. The high degree of non-normality and the differing distributions among the metrics make it impractical to transform these metrics. Thus, we are left with the option of applying non-parametric statistical analysis methods.

3.4 Data Processing

Besides the statistical analysis to understand the nature of the extracted data, we performed manual inspections to check the data for possible anomalies, especially at the boundaries, i.e., at the beginning and the end of the revision data.

We observed that, for some projects, some organic metrics have either NaN (not a number) or unusually high values for the first analyzed revision. Examples of such observations are presented in Tables 8 and 9.

Table 8 Data snippet from the first revision of project astyanax showing null value for new_lines
Table 9 Data snippet from the first three revisions of project ribbon showing a high value for new_lines

There may be several reasons why some projects start with a high ncloc from the beginning, as in the case in Table 9. It can be due to a project not tracking its code base in version control from the beginning, or the project starting with an existing code base, possibly because it is an extension of another project with separate version control. We also observed that new_lines has a higher value than ncloc in the first revision of a few projects. We removed the data of the first revision in such cases. Since the number of removed revisions is very small compared to the total number of revisions, it is unlikely that this has a major impact on this study.
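A minimal sketch of this boundary cleaning is shown below, assuming each project’s revisions are loaded into a pandas DataFrame ordered by commit time; the column names follow the metric names used above, and the function name is hypothetical.

```python
# Sketch: drop a project's first revision when its organic metrics are
# missing or implausible (e.g., new_lines larger than ncloc).
import pandas as pd

def clean_first_revision(df):
    """df: one project's revisions, ordered by commit time."""
    first = df.iloc[0]
    if pd.isna(first["new_lines"]) or first["new_lines"] > first["ncloc"]:
        return df.iloc[1:]
    return df
```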

4 Data Analysis Method

Since the collected data is not normally distributed, non-parametric statistical methods are appropriate for this research. Spearman’s ρ correlation coefficient and Kendall’s τ correlation coefficient are two well-known non-parametric measures for assessing relationships between ranked data. We carefully investigated the nature of the collected data and the properties of these two measures. Kendall’s τ is based on the number of concordant and discordant pairs of the ranked data.

Calculating Spearman’s ρ by hand is much simpler than calculating Kendall’s τ because it can be done pair-wise without depending on the rest of the data, while computing Kendall’s τ by hand is only feasible for small samples because each data pair requires exploring the remaining data. Xu et al. (2013) reported that the time complexity of Spearman’s ρ is O(n log n) and that of Kendall’s τ is O(n²). In general, Spearman’s ρ results in a higher correlation coefficient than Kendall’s τ, while the latter is generally known to be more robust and to have a more intuitive interpretation.

Studies from different disciplines have investigated the appropriateness of Spearman’s ρ and Kendall’s τ with respect to various factors. Both measures are invariant under increasing monotone transformations (Kendall and Gibbons 1990). Moreover, being non-parametric, both measures are robust against impulsive noise (Shevlyakov and Vilchevski 2002; Croux and Dehon 2010). Croux and Dehon (2010) studied the robustness of Spearman’s ρ and Kendall’s τ through their influence functions and gross-error sensitivities. Even though it is commonly known that both measures are robust enough to handle outliers, they found that Kendall’s τ is more robust to outliers and statistically slightly more efficient than Spearman’s ρ. In a more recent study, Xu et al. (2013) investigated the applicability of Spearman’s ρ and Kendall’s τ under different requirements. Among their key findings, Kendall’s τ is the preferred measure when the sample size is large and there is impulsive noise in the data. Their results are based on unbiased estimations of the Spearman’s ρ and Kendall’s τ correlation coefficients.

In light of the above discussion, we consider Kendall’s τ more appropriate for software projects. We have observed outliers in many projects, and all projects have some revisions with high values indicating outliers. This can naturally happen to any software project, e.g., when an existing code base is added to a new project. Outliers in software revision data, and their underlying reasons, have been discussed in earlier research (Aggarwal 2013; Schroeder et al. 2016). The robust nature of Kendall’s τ handles such data points better than Spearman’s ρ.

Kendall’s τ has three versions: τ-A, τ-B, and τ-C. Both τ-A and τ-B are suitable for square-shaped data, i.e., data with the same variables on both rows and columns, while τ-C is used for rectangular data tables with different numbers of rows and columns. A more important difference between τ-A and τ-B is that τ-B can handle tied ranks while otherwise having the same characteristics as τ-A. Since our collected data has tied ranks, we have used τ-B, computed according to the following equation:

$$ \tau_{b}=\frac{P-Q}{\sqrt{(P+Q+ m1_{0})(P+Q+m2_{0})}} $$

In this equation, m1 and m2 are the two metrics whose correlation we are checking; we denote Kendall’s τ correlation coefficient as τb. P and Q are the numbers of concordant and discordant pairs, respectively. \(m1_{0}\) is the number of ties only in m1, and \(m2_{0}\) is the number of ties only in m2. The possible values of τb satisfy − 1 ≤ τb ≤ + 1, where -1 indicates a perfect negative correlation, zero indicates no correlation, and + 1 indicates a perfect positive correlation. Since we have 21 projects and 24 metrics per project, we calculated 21 correlation matrices of size 24×24. All these correlation matrices are symmetric, i.e., mirrored along the principal diagonal, and the τb for the diagonal elements are always + 1, indicating a perfect positive correlation of a metric with itself. Kendall’s τ also yields a statistical significance (p-value) for each correlation coefficient (τb).
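As a concrete illustration, a per-project τ-B correlation matrix together with the p-values can be computed with scipy’s kendalltau, which implements τ-B and handles ties; this sketch is illustrative rather than the study’s actual code.

```python
# Sketch: 24x24 Kendall tau-B and p-value matrices for one project's
# revision data (one row per revision, one column per metric).
import pandas as pd
from scipy.stats import kendalltau

def correlation_matrices(df, metrics):
    tau = pd.DataFrame(index=metrics, columns=metrics, dtype=float)
    pval = pd.DataFrame(index=metrics, columns=metrics, dtype=float)
    for m1 in metrics:
        for m2 in metrics:
            # tau-B with tie handling; constant columns yield NaN (see S_NaN).
            t, p = kendalltau(df[m1], df[m2])
            tau.loc[m1, m2], pval.loc[m1, m2] = t, p
    return tau, pval
```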

4.1 Landscape of Correlation Coefficients (τb)

Before aggregating results, we need to understand τb and define concrete boundaries for interpreting τb at different levels. τb indicates the strength of a correlation and has the range − 1 ≤ τb ≤ + 1. As the value of τb approaches 0, it indicates less correlation between two metrics; as it approaches the boundaries, i.e., -1 or + 1, it indicates a higher correlation. Researchers have used different ranges to label the strengths of correlation coefficients (Taylor 1990). This paper labels τb according to Table 10.

Table 10 Grouping and labeling τb

However, it is not enough to look at the τb values alone; we must also consider their statistical significance, given by the p-value. If the p-value is greater than a chosen significance level, we cannot reject the null hypothesis, meaning we do not have enough evidence to claim that the corresponding τb differs from τb = 0. For example, with a significance level of 0.05, up to 5% of the τb values could indicate a correlation by chance even though there is no real correlation between the underlying metrics. This study considers α = 0.05 for a τb to be statistically significant.
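The following sketch shows how each τb can be labeled by significance and strength. The strength boundaries used here are illustrative placeholders, since the exact cut-offs are those given in Table 10.

```python
# Sketch: classify a single tau-B by significance and strength.
import math

def label_tau(tau, p, alpha=0.05):
    if math.isnan(tau) or math.isnan(p):
        return "missing"            # belongs to the S_NaN set
    if p > alpha:
        return "non-significant"    # belongs to S_tau_b_nsig
    t = abs(tau)
    if t >= 0.8:                    # placeholder boundaries, not Table 10
        return "very strong"
    if t >= 0.6:
        return "strong"
    if t >= 0.4:
        return "moderate"
    return "weak"
```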

We now focus on the landscape of τb depicted in Fig. 6. The outer circle is the set of all τb, represented by \(S\tau _{b_{all}}\), containing the τb from the correlation matrices of all projects. Similarly, the set of all significant τb is represented by \(S\tau _{b_{sig}}\), containing all τb for which the corresponding p-value satisfies the specified significance level of 0.05. The set of very strong τb is represented by \(S\tau _{b_{vs}}\), the set of strong τb by \(S\tau _{b_{s}}\), the set of moderate τb by \(S\tau _{b_{m}}\), and the set of weak τb by \(S\tau _{b_{w}}\). These sets are determined from the τb values according to the levels defined in Table 10. We can also define the set of all non-significant τb as \(S\tau _{b_{nsig}}\), where \(S\tau _{b_{nsig}}\) = \(S\tau _{b_{all}} \setminus S\tau _{b_{sig}}\).

Fig. 6: Sets of correlation coefficients (τb). Text inside the circles indicates measures computed from τb

4.2 Aggregating Correlation Coefficients (τb)

Now that we have 21 correlation matrices of size 24×24, we want to aggregate meaningful data from them. Let M = {m1, m2, m3, ..., mn} be the set of all metrics, where n is the total number of metrics, and let P = {p1, p2, p3, ..., pq} be the set of all projects, where q is the total number of projects.

Now we discuss several ways to aggregate τb from all the projects.

  • Aggregating correlation coefficients (τb) based on the set of all correlation coefficients (\(S\tau _{b_{all}}\))

    The simplest way of aggregating τb is by summing up the τb values for each possible pair of metrics within M over each project within P. This results in a correlation matrix of size 24×24, i.e., the same size as a correlation matrix from a single project. The advantage of this method is that we can compute a single aggregated matrix where each cell is the sum of the τb values from the corresponding cells of the correlation matrices generated from the projects. Such an aggregated matrix can also be transformed into a weighted-average matrix by dividing each cell (containing the sum of all τb for a particular pair of metrics) by q.

    However, there is a fundamental problem with this method: it combines all τb values irrespective of their corresponding p-values. If the p-value is greater than α, we cannot reject the null hypothesis; this means we only consider a τb valid when its corresponding p-value is less than or equal to α. If this is not checked, there are two implications. First, the result will be distorted by the inclusion of non-significant τb, i.e., τb from the set \(S\tau _{b_{nsig}}\); second, we will not know how big the set \(S\tau _{b_{nsig}}\) is compared to \(S\tau _{b_{sig}}\).

    Aggregating results based on \(S\tau _{b_{all}}\) is unproblematic only when \(S\tau _{b_{all}} = S\tau _{b_{sig}}\), i.e., when there is no τb with a non-significant p-value; if this is assumed, the assumption should be checked for validity. For example, the recent research on the correlation of code metrics by Gil and Lalouche (2017) does not report anything about considering α, i.e., a significance level for the p-value, and thus nothing about the existence of τb with non-significant p-values; readers therefore cannot know the size of \(S\tau _{b_{nsig}}\). If the non-existence of non-significant p-values is assumed, that assumption should be documented and validated. In this research, to aggregate results, we have avoided any value from the set \(S\tau _{b_{nsig}}\); to calculate the sample mean, τb values from \(S\tau _{b_{nsig}}\) are treated as zero.

  • Aggregating correlation coefficients (τb) based on the set of all significant correlation coefficients (\(S\tau _{b_{sig}}\))

    Let \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) be the correlation coefficient for two metrics m1 and m2 in project pi. We compute the mean over \(S\tau _{b_{sig}}\) with the following equation:

    $$ \overline{\tau}_{b_{(m_{1}, m_{2})}} = \frac{1}{\nu} \sum\limits_{i=1}^{q} \tau_{b_{(m_{1}, m_{2}, p_{i})}} \;\Big|\; \tau_{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau_{b_{sig}} $$
    (1)

    Since (1) is calculated based on \(S\tau _{b_{sig}}\), ν is the number of \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) values from all projects that belong to \(S\tau _{b_{sig}}\). To compute the sample mean, we only have to replace ν by q in (1). Since we are unsure of the τb values within the set \(S\tau _{b_{nsig}}\) due to their non-significant p-values, as a conservative measure we take τb from \(S\tau _{b_{nsig}}\) as zero; thus, the \(\tau _{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau _{b_{sig}}\) condition in (1) still holds for the sample mean.

    As with \(S\tau _{b_{all}}\), the advantage of aggregating τb from the set \(S\tau _{b_{sig}}\) is that we can report the results in a single matrix, without the problems of the \(S\tau _{b_{all}}\)-based aggregation mentioned above. However, from such results we cannot determine whether \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) comes from a large or a small value of ν; in other words, we cannot tell how representative \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) is among the selected projects. Since we can calculate the sample mean from \(S\tau _{b_{sig}}\), we can also compute the variance and standard deviation to see the overall spread of τb within the samples. However, the sample mean and standard deviation of τb from \(S\tau _{b_{sig}}\) do not necessarily tell us about the distribution of τb within the four strength levels in Table 10. The standard deviation gives an indication of the variability, but we have no way of knowing what the variability looks like across the different strength levels of τb.

  • Aggregating correlation coefficients (τb) based on the sets \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), \(S\tau _{b_{w}}\)

    These four sets represent τb based on their strengths according to Table 10, and together they make up \(S\tau _{b_{sig}}\), i.e., \(S\tau _{b_{vs}} \cup S\tau _{b_{s}} \cup S\tau _{b_{m}} \cup S\tau _{b_{w}} = S\tau _{b_{sig}}\). We consider two measures based on these sets, as listed below.

    • Count of τb based on level of strength: We get this measure by simply counting the number of τb within a set. This simple measure gives a direct answer to the question, “How many projects report a correlation between two metrics at a certain level of strength?” Since we have 21 projects, the maximum value of count of τb is 21 and the minimum is zero. Looking at the values of count of τb for two metrics across the sets \(S\tau _{b_{nsig}}\), \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), and \(S\tau _{b_{w}}\), we can understand the distribution of τb among the different sets and thereby better understand the nature of the relationship between the metrics.

    • Sum of τb based on level of strength: Instead of counting, this measure adds all τb within a set. Since the highest value of a single τb is 1.0, the maximum value of sum of τb is also 21, as for count of τb. However, sum of τb does not tell us how many τb values contribute to the sum, which we can get from count of τb. Thus, the two measures complement each other.

    These two proposed measures, count of τb and sum of τb, provide additional insights about the correlation coefficients compared to the common statistical measures of sample mean and standard deviation. In this report, when we use the term ‘mean,’ we mean the mean for a certain set, and when data from all sets is considered, we write ‘sample mean.’ For example, when we say ‘correlation between metrics m1 and m2 results in τb’ or ‘correlation between metrics m1 and m2 results in a very strong/strong/moderate/weak τb,’ we refer to τb from the sample mean table. (A small code sketch after this list illustrates these aggregation measures.)
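The sketch below illustrates the aggregation measures defined above for a single metric pair across all projects (numpy assumed). Non-significant τb values are treated as zero in the sample mean, as in the text; the 0.6 strength boundary is an illustrative placeholder, not the cut-off of Table 10.

```python
# Sketch: aggregate the tau-B values of one metric pair over all projects.
import numpy as np

def aggregate_pair(taus, pvals, alpha=0.05, strong=0.6):
    taus = np.asarray(taus, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    sig = ~np.isnan(pvals) & (pvals <= alpha)          # membership in S_sig
    mean_sig = taus[sig].mean() if sig.any() else float("nan")
    sample_mean = np.where(sig, taus, 0.0).mean()      # non-significant -> 0
    strong_vals = taus[sig][np.abs(taus[sig]) >= strong]
    return {"mean_over_sig": mean_sig,
            "sample_mean": sample_mean,
            "count_strong": int(strong_vals.size),     # "count of tau_b"
            "sum_strong": float(strong_vals.sum())}    # "sum of tau_b"
```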

4.3 Missing Correlation Coefficients (τb)

Correlation cannot be computed if either or both of the variables have constant or missing values. In such a case, the computation returns NaN (not a number) for both τb and the p-value. The set SNaN contains such data. SNaN is kept disjoint from \(S\tau _{b_{all}}\) in Fig. 6 because both the τb value needed to determine the level of strength and the p-value needed to determine significance are missing. Even though SNaN does not help us with correlation, it tells us something about the nature of specific metrics.

When discussing and reporting results in the following section, we have to point to different sections of the correlation matrices or of matrices derived from them. To simplify referencing, we label the different sections of such matrices as shown in Table 11.

Table 11 Labeling different sections of our matrices for easy referencing

When reporting this case study, we have tried to maintain the actual flow of how the research was carried out. Based on some interesting observations about the inter-category correlations between the 15 cumulative and three organic metrics, this case study further tested the following hypothesis:

“The median difference between correlations of metrics from the cumulative and organic_t categories equals zero.”

This required deriving a set of 15 organic metrics, denoted organic_t, corresponding to the 15 cumulative metrics. The specifics of designing and performing the test, and its results, are elaborated in Section 5.4. A small sketch of one plausible derivation is shown below.
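For illustration only, one plausible way to derive such organic_t metrics is to take per-revision differences of the cumulative metrics; whether this matches the exact transformation described in Section 5.4 is an assumption of this sketch, and the function and suffix names are hypothetical.

```python
# Sketch: derive organic_t counterparts of cumulative metrics by
# first-differencing consecutive revisions of one project.
import pandas as pd

def to_organic(df, cumulative_metrics):
    organic = df[cumulative_metrics].diff()           # delta to previous revision
    organic.iloc[0] = df[cumulative_metrics].iloc[0]  # first revision: its own value
    return organic.add_suffix("_organic_t")
```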

5 Results

Before going into the results and discussion regarding the significant correlation coefficients (i.e., τb with significant p-values) from the sets within \(S\tau _{b_{sig}}\), we report what the correlation coefficients outside \(S\tau _{b_{sig}}\) look like, to illustrate the data that does not contribute to the main result.

5.1 Missing and Non-significant Correlation Coefficients (τb)

First, we look at the small set SNaN in Fig. 6, reporting the cases where computing a correlation was not successful and resulted in null values (i.e., NaN) for τb and the p-value for specific pairs of metrics from M, as reported in Table 19. The missing τb related to the metric directories seen in this table come from two projects, malmo and geometry-api-java. In both projects, the metric directories has a value (8 for malmo and 3 for geometry-api-java) that remained unchanged throughout the project. The rest of the missing τb come from the project cloud, and all six metrics involved belong to the duplications category: duplicated_lines, duplicated_blocks, duplicated_files, duplicated_lines_density, new_duplicated_lines, and new_duplicated_blocks. Since the cloud project has no duplication-related issues during the analyzed revisions, all six metrics have the value 0.

If at least one metric of a pair has a constant value (i.e., the same value for all revisions), it is not possible to compute a correlation for that pair, and this results in NaN values for τb and the p-value. We observed that the directories metric and the duplications-related metrics have fewer levels (i.e., less variation) than the other metrics.

Now, we move to the set \(S\tau _{b_{nsig}}\), reported in Table 17. The horizontal bars in this table, and in all other similar tables reporting the count and sum measures, are graphical representations of the corresponding cell values. The maximum value for a cell corresponding to two metrics (e.g., the cell between ncloc and classes) is q, which is 21, the number of projects. Cells in the principal diagonal are omitted and thus kept blank. The rightmost column, reporting the ‘total count’ or ‘total sum’ of a row, has a maximum possible value of 483 (calculated as (n − 1) × q = 23 × 21). The horizontal bars are drawn relative to the maximum possible value a cell can contain (i.e., 21 for regular cells between metrics and 483 for cells indicating ‘total’), not relative to the maximum value present in a table.

A first glance at Table 17 gives the impression that the organic category is different from all other categories. A more careful look shows three groups: cumulative and organic have the lowest and highest counts of non-significant τb values, respectively, and density and average have counts of non-significant τb values in between. This can also be seen from the rightmost column, ‘total count’. Since ‘total count’ combines all four metric categories, we show the category-wise means of Table 17 in Table 12.

We are particularly interested in the numbers on the principal diagonal of Table 12, which represent measures coming from correlations between metrics within a category, i.e., intra-category correlations. Section OO has the least non-significant τb, followed by sections CC, AA, and DD. The metrics within the organic category (section OO) are interesting: they have the least non-significant τb among themselves, yet organic scores highest when correlated with other categories. Another observation is that the mean ‘count of non-significant τb’ within a category is always smaller than the mean ‘count of non-significant τb’ between categories. Since we cannot conclude whether a τb is strong or weak when its p-value is non-significant, it is better to have fewer non-significant values in the context of statistical analysis.

Table 12 Category-wise mean of count of non-significant correlation coefficients (τb) from Table 17

Key Observations:

  • Correlating metrics from different categories results in more non-significant correlation coefficients compared to correlating metrics within a category.

  • Category organic is quite different from the other three categories, producing many non-significant τb for inter-category correlations.

  • Category organic has the least non-significant τb, followed by cumulative, average, and density, for intra-category correlations. Even though the mean value for the cumulative category is quite low (0.27) in this context, the contrast between its inter- and intra-category values is clearly less noticeable than for the organic category.

5.2 Overall Distribution of Correlation Coefficients (τb)

Based on the labeling of τb in Table 10, we have reported two sets of measurements: ‘count of τb’ (Tables 23, 24, 25, and 26) and ‘sum of τb’ (Tables 27, 28, 29, and 30) at the different levels.

Figure 7 is constructed from the rightmost column ‘total count’ of Tables 17, 19, 23, 24, 25, and 26. Similarly, Fig. 8 is constructed from the absolute values of the rightmost column ‘total sum’ of Tables 27, 28, 29, and 30. Since the ‘total count’ and ‘total sum’ columns in these tables are counts and sums of a metric’s correlation measures with respect to all other metrics, these columns give an overall distribution of τb for the metrics.

Fig. 7: The overall distribution of count of τb (correlation coefficients) at different levels with respect to all metrics

Fig. 8: The overall distribution of sum of τb (correlation coefficients) at different levels with respect to all metrics

Both Figs. 7 and 8 use the same vertical scale for easier comparison. Between the two figures, the level 'very strong' shows the smallest difference and the level 'weak' the largest difference among the four strength levels of τb. While Fig. 7 gives an overview of how the different sets of τb are represented, Fig. 8 shows the actual sums of τb. Since the τb value is meaningless for \(S\tau _{b_{nsig}}\) and missing for \(S\tau _{b_{NaN}}\), these sets are not included in Fig. 8.

In Fig. 8, we can see how the red bars, representing 'very strong' τb within the cumulative metrics (metrics ncloc to duplicated_files), dominate over the other levels. Metrics directories, duplicated_lines, duplicated_blocks, and duplicated_files are comparatively weaker in terms of 'very strong' τb than the other cumulative metrics. However, all cumulative metrics score much higher than metrics from the other measurement categories. Metrics from the organic category score lowest among all metrics and categories. These two figures represent data from all correlations, e.g., the horizontal bar for ncloc comes from the correlation coefficients of ncloc with all other metrics. Thus, from these figures, we cannot determine whether there is any difference between τb values resulting from intra-category and inter-category metric correlations. We study this in more detail next.

5.3 Significant Correlation Coefficients

All τb reported in this subsection passed the significance test at α = 0.05; we do not mention this again in the rest of the subsection. When reporting τb, by default, we refer to the sample mean as reported in Table 13.

Table 13 Sample mean of significant correlation coefficients (τb) from the set \(S\tau _{b_{all}}\) (set of all τb). The three gray-scale cell colors indicate three levels of τb. Cells with red text indicate weak τb

First, we want to look at intra-category relations among the metrics. Table 13 shows the sample mean of τb corresponding to the set \(S\tau _{b_{all}}\), and Table 21 in the Appendix shows the mean of τb within the \(S\tau _{b_{sig}}\). Here, we want to reiterate that we have considered any non-significant τb from the set \(S\tau _{b_{all}}\) as 0.
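As an illustration of how the sample mean in Table 13 is obtained from the set \(S\tau _{b_{all}}\), the following sketch computes τb per project with SciPy (whose kendalltau implements the τb variant), replaces non-significant coefficients with 0, and averages over the projects. It assumes each project's revision history is available as a pandas DataFrame with one column per metric; the function names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau

ALPHA = 0.05

def project_tau(df: pd.DataFrame, m1: str, m2: str) -> float:
    """tau_b between two metric columns of one project's revision history;
    non-significant coefficients are treated as 0, as in S_tau_b_all."""
    tau, p = kendalltau(df[m1], df[m2])   # tau-b is scipy's default variant
    if np.isnan(tau):                     # missing tau_b (e.g., a constant column)
        return np.nan
    return tau if p < ALPHA else 0.0

def sample_mean_tau(projects: list, m1: str, m2: str) -> float:
    """Sample mean of tau_b over all 21 projects (missing values excluded)."""
    taus = [project_tau(df, m1, m2) for df in projects]
    return float(np.nanmean(taus))
```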

Perfect Correlations

We are interested in the perfect correlations (i.e., τb = 1.0) reported in Tables 13 and 21 between the metric pairs (complexity, complexity_in_classes), (complexity, complexity_in_functions), and (complexity_in_classes, complexity_in_functions), since this indicates that the two metrics within each pair measure exactly the same aspect. It should be noted that the τb values for these relations are not necessarily exactly 1 even though Tables 13 and 21 report them as 1: we report τb to two decimal places, so anything greater than or equal to 0.995 is shown as 1. To understand these relations, we counted the perfect correlation coefficients between all metrics over all projects, reported in Table 20, where we see that five relations have perfect τb. In 20 projects the correlation between complexity and complexity_in_classes is perfect. It is thus evident that the metrics complexity and complexity_in_classes measure the same aspect and using one of them is sufficient. Since our data comes from Java source code, this is not a surprise because, in Java, code does not reside outside classes.

For the relation between complexity and complexity_in_functions, perfect correlations exist in nine projects. Since we found a perfect correlation in the sample mean of τb in Table 13, the remaining 12 τb values for this relation must be very strong. We have also calculated the 'sum of significant τb' reported in Table 22. The 'sum of significant τb' for (complexity, complexity_in_functions) is 20.98 out of 21. This indicates that complexity and complexity_in_functions also measure the same aspect, with a negligible difference.

Even though the results in Table 20 do not show any perfect correlations between the metrics ncloc, functions, statements, complexity, classes, files, and public_api, we see τb greater than 0.9, i.e., at the very strong level, for all these relations in the sample mean in Table 13. Considering any relation with τb greater than or equal to 0.9 as redundant, we consider all these seven measures from the cumulative category redundant. Under the same consideration, public_api is redundant to public_undocumented_api.
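The redundancy criterion above can be expressed mechanically. The sketch below, assuming a symmetric pandas DataFrame of sample-mean τb values, flags every metric pair at or above the 0.9 threshold; it is an illustration of the criterion, not the exact script used in this study.

```python
import pandas as pd

def redundant_pairs(mean_tau: pd.DataFrame, threshold: float = 0.9):
    """Flag metric pairs whose sample-mean tau_b meets the redundancy
    threshold of 0.9 used in this study.  mean_tau is a symmetric
    DataFrame of sample-mean tau_b values (metrics x metrics)."""
    pairs = []
    metrics = list(mean_tau.columns)
    for i, a in enumerate(metrics):
        for b in metrics[i + 1:]:
            if mean_tau.loc[a, b] >= threshold:
                pairs.append((a, b, round(float(mean_tau.loc[a, b]), 2)))
    return pairs
```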

For the relations between new_duplicated_lines and new_duplicated_blocks and between duplicated_blocks and duplicated_files, perfect correlations are found in only one project each. The 'sum of significant τb' for these two pairs is reported in Table 22 as 19.65 and 17.87, respectively. Thus, we cannot say right away that the metrics within these two pairs are redundant.

5.3.1 Intra-Category Correlations

Intra-category correlations are the correlations within the sections CC, DD, AA, and OO, where all correlations within a section come from metrics measured in the same way.

Section CC

The sample mean of \(S\tau _{b_{all}}\) in Table 13 shows that metrics within the cumulative category (i.e., section CC) are very highly correlated with each other, which we can also see from the category-wise mean of the sample mean values in Table 14. Section CC has a mean τb of 0.79, which indicates a strong correlation between any two cumulative metrics for any project. For a more detailed picture, we look at the sample mean values in Table 13, where the metrics mainly divide into three parts based on the strength of their correlations. The first part consists of the nine redundant metrics (from ncloc to public_api) discussed earlier: due to 'very strong' τb, any of these metrics can explain the variability of the others to a high degree. The second part has three metrics, public_undocumented_api, comment_lines, and directories, that are mostly at the 'strong' τb level, except that public_undocumented_api is redundant to public_api (as discussed in the preceding section) and the correlation between public_undocumented_api and directories is 'medium'. The third part has duplicated_lines, duplicated_blocks, and duplicated_files, which are all 'strongly' correlated to each other but correlated only at a 'medium' level with the metrics from the other two parts.

Table 14 Category-wise mean of sample mean of significant correlation coefficients (τb) from Table 13

Since the sample mean (i.e., from \(S\tau _{b_{all}}\)) reported in Table 13 is a more conservative measure than the mean of \(S\tau _{b_{sig}}\) in Table 21, we expect equal or higher τb values in Table 21. From the data, we see very small or no differences for most of the mentioned metrics because section CC has few non-significant and missing τb values.

Key Observations:

  • Metrics complexity, complexity_in_classes, and complexity_in_functions measure exactly the same aspect.

  • Metrics ncloc, functions, statements, complexity, classes, files, and public_api, together with complexity_in_classes and complexity_in_functions mentioned above, are all redundant.

  • Metrics public_api and public_undocumented_api are redundant.

  • Not a single correlation out of 105 total correlations within the cumulative metrics has a weak correlation coefficient.

Sections DD, AA, OO

As Table 14 shows, sections DD, AA, and OO have, on average, weak to moderate correlations compared to the strong correlations in section CC. The correlations among all three metrics (comment_lines_density, public_documented_api_density, and duplicated_lines_density) from the density category in section DD have weak τb. For section AA, the correlation between file_complexity and class_complexity from the average category has a strong τb of 0.71; the other two correlations (involving function_complexity) in this section have moderate τb. In section OO, the correlation between new_duplicated_lines and new_duplicated_blocks is 0.94; based on our 0.9 threshold, these two metrics are redundant. The correlations of these two metrics with new_lines are weak.

All four sections related to intra-category correlations have fewer non-significant τb values (in Table 12) compared to the other sections, and section OO has no non-significant τb at all. Section OO also has the lowest category-wise mean of standard deviations of τb, as reported in Table 15. The individual standard deviations in Table 18 show that the redundant metrics within section CC have lower standard deviations than the other metrics from the same category. Density metrics in section DD have the highest standard deviation, and we have to look into the 'count of τb' (Tables 23, 24, 25, and 26) and 'sum of τb' (Tables 27, 28, 29, and 30) tables to fully understand how the τb values are distributed in this category.

Table 15 Category-wise mean of standard deviations of significant correlation coefficients (τb) from Table 18

Intra-category correlations among cumulative metrics are much higher than those among density, average, and organic metrics. Even though the cumulative, average, and organic categories are all at the strong τb level, for cumulative the category-wise mean of the sample mean of τb is 0.79, i.e., almost at the very strong level, whereas the average and organic categories score 0.54 and 0.55, respectively. In addition, while we do not see a single weak correlation in section CC, more than half (5 out of 9) of the correlations in sections DD, AA, and OO have weak τb.

Key Observations:

  • Metrics new_duplicated_lines and new_duplicated_blocks (in section OO) are redundant.

  • All correlations among density metrics are weak.

  • Intra-category correlations for density, average, and organic metrics result in lower τb compared to cumulative metrics.

5.3.2 Inter-Category Correlations

Inter-category correlation involves metrics that are measured differently. We want to see whether there is any noticeable difference between inter-category and intra-category correlations.

Inter-category correlations are found in the six sections CD, CA, DA, CO, DO, and AO. These sections cover all possible correlations, a total of 162, between metrics from different categories. All these correlations result in weak τb except three that result in moderate τb (values 0.56, 0.54, and 0.52), as shown in Table 13. Looking at the category-wise mean in Table 14, we see that sections CO, DO, and AO, i.e., the inter-category correlations of organic metrics with all other categories, are the lowest. Interestingly, the standard deviations are also lower for these three sections related to the organic category (Tables 18 and 15), meaning that when organic metrics are correlated with metrics from other categories, the variability of τb is low. When we look at the count of τb (Tables 23, 24, 25, and 26) and the sum of τb (Tables 27, 28, 29, and 30), we see that sections CO, DO, and AO have values only in Tables 26 and 30, which report weak τb; in the remaining tables these three sections have zero. For the other three sections (CD, CA, and DA), the 'count of τb' and 'sum of τb' measures are present at all four levels and are most concentrated at the moderate level.

When we look at the difference between intra- and inter-category correlations of metrics, it is clear that they differ. The grand mean of the category-wise means (see Table 14) of the intra-category correlations is 0.49 (from (0.79 + 0.09 + 0.54 + 0.55)/4), whereas for the inter-category correlations it is 0.06 (from (0.08 + 0.24 + 0.01 + 0.03 − 0.01 + 0)/6).

The observation that correlating metrics from different categories results in overall weak correlations is important because it tells us that metrics from different categories have low collinearity. Thus, software code metrics from different categories can be used together as features in models for prediction, forecasting, etc.

Key Observations:

  • Inter-category correlations of metrics differ from intra-category correlations by resulting in much lower correlation coefficients.

  • Correlating metrics from different categories results in overall weak correlation coefficients; thus, code metrics from different categories are observed to be non-collinear.

Overall, we have observed that cumulative metrics are different because the intra-category correlations of cumulative metrics result in higher τb values compared to correlations of metrics within other categories.

5.4 Cumulative vs. Organic Metrics

The progression of software development can be tracked in different ways. Traditionally, we track software through cumulative metrics. Cumulative metrics are intuitive in the sense that they give us an overall idea of the state of the system. The same thing, i.e., tracking the progression of software, can also be done the organic way. Version control systems like Git keep track of revisions by saving the delta of each revision. Taking the deltas of consecutive software revisions, we can still calculate the total by linear addition. Hence, given either a cumulative or an organic measure, we can calculate one from the other.
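A minimal numerical sketch of this equivalence, with made-up revision totals rather than data from the studied projects, shows that the two representations are interchangeable:

```python
import numpy as np

cumulative = np.array([120, 155, 150, 190])        # hypothetical totals per revision
organic = np.diff(cumulative, prepend=0)           # per-revision deltas: [120, 35, -5, 40]
assert np.array_equal(np.cumsum(organic), cumulative)  # deltas sum back to the cumulative series
```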

The question now is whether the higher τb values among cumulative metrics occur because of the cumulative way of measurement or simply because the cumulative metrics are inherently more correlated with each other. To test this, we transform all the cumulative metrics in this study into organic metrics (which we denote as organict), then compute correlations and check whether there is a significant difference. Note that, in Table 2, the three organic metrics are different from the 15 cumulative metrics; thus, we need to create a new set of organic metrics equivalent to the cumulative metrics. Since we have identified a few perfectly correlated cumulative metrics, they can act as our point of validation: perfectly correlated metrics should remain perfectly correlated no matter how they are measured. To check the difference between the cumulative and organict categories, we consider the following hypothesis:

  • Null hypothesis H0: The median difference between correlations of metrics from the cumulative and organict categories equals zero.

5.4.1 Transformed Organic Metrics

We have transformed all 15 cumulative metrics into organict metrics by taking the difference between consecutive revisions for each metric. We name these new metrics by adding an underscore before the cumulative counterpart from which they are transformed, e.g., ncloc is transformed into _ncloc. It should be noted that the existing three metrics from the organic category never take negative values: if ncloc decreases, new_lines holds a zero value because there are no new lines. Organict metrics, however, can hold negative values, reflecting a reduction of a metric's measure between consecutive revisions.
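A possible implementation of this transformation, assuming a project's revision history is stored as a pandas DataFrame with one row per revision and one column per cumulative metric (the metric list below is abbreviated), is sketched here:

```python
import pandas as pd

CUMULATIVE_METRICS = ["ncloc", "functions", "statements", "complexity"]  # ... and the rest of the 15

def to_organic_t(history: pd.DataFrame) -> pd.DataFrame:
    """Transform cumulative metrics of one project's revision history into
    organic_t metrics: the difference between consecutive revisions.
    The first revision keeps its full value; negative deltas are allowed."""
    deltas = history[CUMULATIVE_METRICS].diff()
    deltas.iloc[0] = history[CUMULATIVE_METRICS].iloc[0]   # no previous revision to diff against
    return deltas.add_prefix("_")                          # e.g., ncloc -> _ncloc
```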

5.4.2 Designing and Executing the Test

After computing the organict metrics, we went through the same procedure as for the other metrics in this study, i.e., checking the nature of the data. We found organict metrics to be similar to organic metrics in terms of kurtosis and short tails. For skewness, however, organict metrics are not as extremely left skewed as organic metrics because organict metrics can take negative values. Overall, organict metrics are still non-normal.

We computed Kendall's τ for all 39 metrics and derived tables similar to those presented earlier in Section 5. The sample mean of the significant correlation coefficients is presented in full in Table 16. In the interest of space, however, we only report the newly derived table sections for the standard deviation, the count of non-significant τb, and the count of perfect correlations in Tables 31 and 32.

Table 16 Sample mean values for significant correlation coefficients (τb) of all metrics including transformed organic metrics. Three gray-scale cell colors indicate three levels of τb. Cells with red text indicate weak τb. A dot (.) in a cell indicates zero value

5.4.3 Results from the Test

In Tables 16, 31, and 32, we see that organict metrics are similar to organic metrics in terms of both intra- and inter-category correlations.

Now we would like to see whether organict metrics are able to reproduce the perfect correlations produced by the cumulative metrics. Following the same referencing style as in Table 11, we refer to the section containing τb from intra-category correlations of organict metrics as OtOt. The intra-category correlations of cumulative metrics (section CC of Table 20) have a total of 46 perfect correlations across six relations. For the organict metrics, we see (in section OtOt of Table 32) that 24 of these perfect correlations, across four relations, are reproduced exactly. However, for two relations ((complexity, complexity_in_functions) and (complexity_in_classes, complexity_in_functions)), only 12 out of 22 perfect correlations are reproduced. We therefore look at the sample means of these two relations (section OtOt of Table 16) and find a value of 0.994 for both; this is so close to the maximum possible τb of 1 that we can consider it a perfect correlation. Thus, both cumulative and organict metrics are equally able to detect perfect correlations among metrics if any exist.

At this point, we focus on testing the null hypothesis. The mean of section OtOt of Table 16 is 0.49, which is much lower than the value of 0.79 for the cumulative metrics (section CC of Table 14). However, without performing a statistical test, we cannot accept or reject our null hypothesis.

Earlier our data was the code metrics themselves, but now it is the sample means of Kendall's τ. So we need to check the distributions of the data to be tested, i.e., sections CC and OtOt of Table 16. We checked descriptive statistics, skewness, kurtosis, and histograms, and also performed the Shapiro-Wilk test on these two data sets, and found the data non-normal. Thus, we have to choose non-parametric tests.
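The normality checks can be reproduced along the following lines; this is a sketch using SciPy's shapiro, skew, and kurtosis functions and not the exact procedure or tool used to produce our figures.

```python
import numpy as np
from scipy.stats import shapiro, skew, kurtosis

def normality_report(values, alpha=0.05):
    """Descriptive checks used before choosing the tests: skewness,
    excess kurtosis, and the Shapiro-Wilk p-value for one data set."""
    values = np.asarray(values, dtype=float)
    stat, p = shapiro(values)
    return {
        "skewness": float(skew(values)),
        "excess_kurtosis": float(kurtosis(values)),   # Fisher definition: normal -> 0
        "shapiro_p": float(p),
        "looks_normal": p >= alpha,
    }
```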

We can consider the data as paired because a single τb value, say for ncloc in section CC, and the corresponding τb value from section OtOt (i.e., for _ncloc) measure a similar aspect, only measured differently. It can also be argued that ncloc and _ncloc are two different measures and therefore cannot be considered paired. Taking both arguments into account, we execute tests of both kinds to see whether there is any significant difference. Taking our two data sets as independent samples, we perform the 'Mann-Whitney U Test', and taking them as dependent samples, we perform the 'Wilcoxon Signed Ranks Test'. After checking the assumptions of these tests, we remove two τb values from section CC and the two corresponding τb values from section OtOt: since the correlations between the metrics complexity, complexity_in_classes, and complexity_in_functions in section CC and the corresponding organict measures in section OtOt are identified as perfectly correlated in both sections, we keep only one of these relations and remove the other two so that the assumption related to data dependency is no longer violated. The 'Wilcoxon Signed Ranks Test' assumes that the differences between paired data are approximately symmetrically distributed along the quartiles when plotted as a one-dimensional box plot; this assumption was not met for our data. Since this assumption is checked visually, we decided to also include a 'Paired-Samples Sign Test', which does not have such an assumption. We report the results from all three tests here.
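For completeness, the three tests can be run, for example, with SciPy as sketched below; the study's own statistics were produced with a separate statistics package, so the exact options and reported statistics may differ. The sign test is expressed here as a binomial test on the signs of the paired differences.

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon, binomtest

def compare_cc_vs_otot(tau_cc, tau_otot):
    """tau_cc and tau_otot: paired arrays of sample-mean tau_b values
    (one pair per metric relation) from sections CC and OtOt."""
    tau_cc, tau_otot = np.asarray(tau_cc), np.asarray(tau_otot)

    # Treating the two sets as independent samples
    u_stat, u_p = mannwhitneyu(tau_cc, tau_otot, alternative="two-sided")

    # Treating them as paired (dependent) samples
    w_stat, w_p = wilcoxon(tau_cc, tau_otot)

    # Paired-samples sign test: binomial test on the signs of the differences
    diff = tau_cc - tau_otot
    n_pos, n_neg = int((diff > 0).sum()), int((diff < 0).sum())
    sign_p = binomtest(n_pos, n_pos + n_neg, p=0.5).pvalue

    return {"mann_whitney_p": u_p, "wilcoxon_p": w_p, "sign_test_p": sign_p}
```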

We have a total of 103 pairs. From the test ranks, we see that for both the 'Wilcoxon Signed Ranks Test' and the 'Paired-Samples Sign Test', the count of negative differences is 0, the count of positive differences is 102, and the count of ties is one, taking CC as the first and OtOt as the second variable. Even without looking at the significance levels, these numbers already tell us that the organict category is lower than the cumulative category. From Figs. 9 and 10, we see that all three tests show a p-value less than 0.01 (i.e., considering α = 0.01). Thus, we can reject the null hypothesis and accept the alternative hypothesis, i.e., the median difference between correlations of metrics from the cumulative and organict categories is not equal to zero.

Fig. 9
figure 9

Statistical test considering cumulative and organict as two independent samples. Sub-figure 9a shows test statistics for ‘Mann-Whitney U Test’ and 9b graphically shows distributions of τb values for metrics within cumulative and organict categories

Fig. 10
figure 10

Statistics for different statistical tests to determine the existence of significant differences between correlations from cumulative and organict

5.4.4 Implications of the Test Results

The finding that there exists a significant median difference between the correlations of metrics from the cumulative and organic categories implies that organic metrics are a set of measures that are collectively different from the cumulative metrics. Therefore, organic metrics can be considered a new set of features holding different characteristics than cumulative metrics as a whole.

The intra-category correlations of cumulative metrics are much higher than those of their equivalent set of transformed organic metrics. Since the correlations between cumulative metrics are high, there is high collinearity among these metrics. This makes cumulative metrics collinear, and only a single metric from a group of highly correlated metrics can be considered a valid input feature for a predictive model. The high collinearity among software code metrics is not new information. However, the knowledge that transforming cumulative metrics into organic metrics can significantly reduce the collinearity is new. Since this transformation does not alter the original footprint of how the software evolved, it is expected to be free of the side effects of normalization. Since organic metrics have lower collinearity, it is possible to obtain multiple valid input features from them.

The inter-category correlations of metrics from different categories are observed to be consistently low. This information can improve feature engineering by making the process more systematic. First, it gives a simple and intuitive understanding of non-collinearity based on measurement types. Second, it implies that transforming a feature into a different measurement type can produce a feature that is non-collinear with the original feature.

5.5 Discussion

Software engineering researchers have observed high collinearity among software code metrics. However, we did not have explicit knowledge of whether the high collinearity is due to the inherent nature of the code metrics or due to how we measure them. This study compares the correlation coefficients of a set of 15 cumulative metrics with the correlation coefficients of their corresponding organic metrics and reveals that a large portion of the collinearity among cumulative metrics results from the cumulative way of measurement. Since organic metrics are free from the effects of cumulation and represent the natural evolution of software, we argue that the correlation coefficients among the organic metrics represent the inherent collinearity among code metrics. Taking the difference between the intra-category correlation coefficients of metrics from these two categories, we can determine the added collinearity due to cumulative measurement.

High collinearity among a group of input features makes them weaker as a whole: we can get only a few valid input features from such a set because, during the validation process, features with high collinearity are removed. Since organic metrics have lower collinearity among themselves, we can possibly get more valid input features from them. Moreover, since the collinearity between cumulative and organic metrics is very low, we can combine them to obtain even more valid input features. The lower collinearity between cumulative and organic metrics also means that, even though we can calculate a set of organic metrics corresponding to a set of cumulative metrics, the two sets are mutually exclusive as features. Therefore, theoretically, we do not have to restrict our choice of metrics to either of these two categories.

It is interesting that, when we add the density and average categories to the scenario described above for cumulative and organic, non-collinearity still holds between metrics from different measurement categories. It can be noted that the unit of measurement of the density metrics is a percentage, whereas cumulative and organic metrics share the same value type (i.e., integer) yet are still measured differently.

The findings that organic metrics are collectively different from their corresponding cumulative metrics (i.e., inter-category), the understanding of the effect of cumulation on the collinearity among metrics (i.e., intra-category), and the general observation that metrics from different measurement categories yield overall weak correlation coefficients are significant for researchers and practitioners in software engineering because they help us better understand the nature of code metrics. The findings open the possibility of new research and of revisiting existing research in this field. Based on our results, we theoretically have more valid input features from different measurement categories; in practice, however, research needs to be conducted to determine which predictors (i.e., input metrics) are good for which targets (i.e., qualities that we want to predict or estimate). It could be the case that specific metrics from a category work better in combination with other metrics to predict a particular quality attribute. Researchers can also try to find new measurement categories and their properties. This study has investigated some code metrics; other code metrics not covered here can also be studied.

These results can help practitioners by giving them more insight into software metrics. Quality managers can re-prioritize the metrics that a project keeps track of. Tool developers can rethink the metrics their tools support; for example, SonarQube automatically removes new_duplicated_lines, new_lines, new_duplicated_blocks, and some other similar organic metrics, and based on the findings of this study, tool developers could give users the option to keep track of such metrics. Software engineers working with predictive analysis have more options when choosing input features for their models. For example, suppose that for a predictive model, complexity (of type cumulative) is identified as an input feature. The possibility of incorporating complexity-related metrics from different measurement categories can then be explored, and complexity per file (of type average) and complexity per commit or per specific time duration (of type organic) can be considered as additional input features.

5.6 Threats to Validity and Limitations

5.6.1 Internal Validity

Software today is built with various languages, and it is very common for a project to use code from multiple languages. Still, a project is usually designated by a specific language indicating the major portion of its code. Programming languages use different constructs, and measures of code metrics may vary dramatically due to language differences. To eliminate this effect, we chose to focus on a single programming language, Java. It can be noted that Java projects are among the top three ranked languages on GitHub.

We considered some other factors when selecting projects, such as project type, project size in LOC, number of revisions, and number of developers. While selecting the projects, we tried to combine these factors so that we have a good representation with regard to each of them. In addition, we looked at the number of issues and pull requests and tried to select projects with reported issues and pull requests, because these are signs of active involvement of users and developers.

Tools measuring software code metrics are not perfect. Different tools implement measures differently even when they claim to measure the same aspect of code (Lincke et al. 2008). This study selected one of the most widely used measurement tools for software quality, and we observed very strong correlations between the cumulative metrics, in line with the major studies. However, selecting a popular tool and observing results similar to the major studies do not entirely remove this threat. This is a general issue for any study of this kind, and we are aware of it.

We mentioned earlier that we only checked the revisions from the master branch of a Git repository. This affects the granularity of the collected data. Based on the findings of this study, we know that reducing the cumulative effect reduces the correlation between metrics; thus, avoiding the partial cumulation of data due to Git branches could possibly make the results of this study even stronger. Therefore, we do not consider this a major threat to our results.

5.6.2 External Validity

There are millions of software projects hosted on GitHub. Generalizing results to such a large population is a key validity threat for any study, and we also identified generalizability as a considerable threat to validity. Project selection is one of the most important means of minimizing this threat. We tried to carefully select projects from well-known organizations to mitigate this risk; for example, we included projects from Apache, a pioneering organization in the open source community, from Microsoft, and from other organizations with diverse portfolios. On the positive side, we found highly significant results from the different tests. For example, the Mann-Whitney U Test in Fig. 9a, the Wilcoxon Signed Ranks Test in Fig. 10a, and the Paired-Samples Sign Test in Fig. 10b have significance values of 4.4e−18, 1.8e−18, and 1.5e−23, respectively.

To safeguard internal validity, we chose to restrict our focus to Java source code. This choice, however, affects external validity: do the results hold for source code written in other programming languages? Since the measurement types discussed in this study are independent of programming language, we do not consider this threat highly significant. However, more research is needed to be certain.

6 Conclusions

This empirical study investigates whether the measurement types of software code metrics have an effect on their correlations. By collecting and analyzing 24 code metrics from 11,874 revisions of 21 open source Java projects, we have found that measurement types do have an effect on correlations. Our analysis shows that 10 out of the 15 metrics that are measured cumulatively are redundant based on our criterion that two metrics with a correlation coefficient of 0.9 or above are redundant. When the cumulative effect is removed by transforming these 15 metrics into the organic type, only three of them are identified as redundant. These three metrics are identified as perfectly correlated (i.e., they measure exactly the same aspect) in both categories, implying that if metrics are truly correlated, correlations of their organic measures are able to identify it. In addition, our analysis shows that organic metrics result in significantly lower correlation coefficients than cumulative metrics for intra-category correlations. Thus, while some software metrics are so closely related that their high correlation coefficients make them redundant, many of the high correlation coefficient values are due to measuring these metrics cumulatively. In other words, we should not be surprised to see high correlation coefficients for cumulative metrics, and we should be aware that measuring metrics by their natural development, i.e., organically, results in much lower correlation coefficients.

Another interesting finding is that correlations between metrics from different categories yield overall weak correlation coefficients. This finding is important because metrics from the cumulative, density, average, and organic categories can be combined as features for predictive models. From another viewpoint, this could improve the process of feature engineering by showing that transforming a feature into a different measurement type produces a new feature that is non-collinear with the original feature.

We have discussed why Kendall's τ version B fits software projects better. We have also discussed the landscape of correlation coefficients, outlining the possible sets when considering both the correlation coefficient and the p-value, which can be helpful for this type of study.

This study has attempted to reveal the fundamental relationships between measurement types and correlations of software code metrics. More evidence is required to generalize and extend this knowledge. Thus, replicated studies can be conducted considering various metrics, measurement tools, programming languages, and project types.