1 Introduction

Software Change Impact Analysis (CIA) is an essential technique for identifying the potential ripple effects caused by software changes during software maintenance and evolution (Briand et al. 1999; Wilkie and Kitchenham 2000). CIA techniques can be typically static or dynamic (Sun et al. 2015), depending on how the information is collected to analyse its change impact. Dynamic techniques rely on information gathered during program execution to compute the change impact set while static techniques are centred around the source code, semantic information and change dependencies. Because of the many false positives and the effort required in dynamic analysis (collecting data during execution and analyzing data during execution), static techniques have gained popularity (Sun et al. 2015).

Most studies on static impact analysis have shown that certain classes, identified by patterns or metrics, are more likely to be impacted by a change and, hence, practitioners will need to invest extra effort in their future maintenance. Other studies, specifically addressed at establishing a link between coupling and co-change, have found that the set of co-changed classes was much larger compared to the set of structurally coupled classes (Oliva and Gerosa 2011, 2015; Fluri et al. 2005, Geipel and Schweitzer 2012). This implies that not all of the change dependencies are related to structural dependencies and there could be other reasons for software artefacts to be change dependent (Oliva and Gerosa 2011). High coupling between classes in an OO design can increase system complexity by introducing multiple inter-dependencies among the classes (Subramanyam and Krishnan 2003). Moreover, excessive coupling can complicate testing, make additional changes problematic and limit possibilities for reuse (Prasad and Bhadauria 2009). Software that is not flexible or tolerant to modification is usually destined to abandonment or replacement (Oliva and Gerosa 2012).

Kagdi and Maletic have estimated that there is a hidden dependency (HD) between two classes or two methods if the classes or the methods are changed at the same time in the past (Kagdi et al. 2007). As Yu et al. stated (Yu and Rajlich 2001): ‘hidden dependencies among software artefacts make both understanding and maintenance difficult’. Briand et al. showed that if developers are required to handle a large set of dependencies, they would miss a significant number of them Briand et al. (1999). Poshyvanyk and Marcus detected dependencies using information retrieval techniques (Poshyvanyk and Marcus 2006). In a similar way to HD, complex dependencies are captured by semantic information which is hard to detect by traditional program analysis techniques (Vanciu and Rajlich 2010). Some CIA tools do not discover HD, and it is the responsibility of the programmer to correctly identify and trace HD during change impact analysis (Petrenko and Rajlich 2009).

In the last few years, anew dimension has been identified as ahidden dependency, termed “semantic” coupling, that could have an influence on coupling and co-change. Simply defined, semantic coupling is ameasure of how loosely or closely related two software artefacts are, by considering the semantic information embedded in the comments and identifiers. According to Bavota et al. (2013b):

the peculiarity of the semantic coupling measure allows it to better estimate the mental model of developers than the other coupling measures. This is because, in several cases, the interactions between classes are encapsulated in the source code vocabulary (...).

In the conceptual framework for software dependency management proposed by Oliva and Gerosa (2012), semantic coupling is not considered as one of the dependencies to be measured. They state that software dependencies are the ‘primary’ subject of management, and the identification of dependencies involves capturing structural and logical dependencies. Nonetheless, the same authors claim that there is still a need to study the interplay between semantic and logical coupling in OO software as well as the interplay between structural and semantic coupling (Oliva and Gerosa 2011, 2015). They identified a small intersection between the sets of structural and logical dependencies after analyzing commits from the Apache Software Foundation repository. When directly assessing semantic coupling, researchers in the software evolution and dependency domain have demonstrated that semantic coupling metrics can outperform structural metrics in identifying classes that might be impacted by a given change request (Poshyvanyk et al. 2009). Semantic and logical coupling metrics have also been combined in change impact analysis (Kagdi et al. 2010; Lozano et al. 2014).

Researchers have suggested that frequent change coupling indicates a strong structural coupling between the corresponding modules, sub-modules, or files as well as possible shortcomings in the design of a software system (Fluri et al. 2005). However, the frequency of change couplings have not yet been studied in relation to semantic coupling. The computation of semantic coupling in studies in this domain have been done by using information retrieval techniques such as latent semantic indexing (LSI) and vector space modelling (VSM) (Poshyvanyk and Marcus 2006; Poshyvanyk et al. 2009; Kagdi et al. 2013) to analyze the corpora of OO software classes (after transforming the semantic information from source code into a text or words corpus).

In a pilot study, we observed that the extraction of word corpora can be time-consuming, especially when systems are large and many classes are involved (Ajienka and Capiluppi 2016). With the goal of identifying how the computation of the semantic coupling of classes can be improved, we statistically compared the metrics derived from analyzing the corpora of classes against an analysis of only their identifiers. Results revealed that identifier based metrics reflect the corpora based measurements. In addition, identifier based measurements were more efficient in terms of computation time, especially when analyzing large software classes (e.g., > 1000k lines of source code). It is important to further validate results derived from the pilot study with a larger sample of projects to improve generalizeability. In addition, the results will further contribute to knowledge on how to ease semantic coupling measurement in further studies that rely on semantic coupling information of classes in OO software.

Given the current state of the art in the area of software coupling, and extending our previous work, we shift the focus of the change impact analysis to the semantic link between object-oriented (OO) software classes in 79 OSS projects (written in Java). This paper examines the strength of semantic coupling (Poshyvanyk and Marcus 2006) between pairs of classes, through the evolution of various software systems, and correlates it with the likelihood of their future co-change.

Establishing whether there is an interplay between logical and semantic coupling has several applications in software engineering including:

  1. 1.

    Co-change inferred by semantic coupling: understanding the influence of semantic coupling on co-change can also help to infer the co-change frequency of software classes based on semantic coupling strengths, i.e., semantic coupling metrics can be used to directly inform practitioners about potential unplanned co-changes of classes in OO software projects.

  2. 2.

    Improving software tools to detect hidden dependencies: “hidden” dependencies not detected by software maintenance tools during change impact analysis that cause co-change would be detected with significant precision.

  3. 3.

    Minimizing historical data extraction and analysis efforts: semantic coupling metrics will be used to inform or predict the strength of the logical dependencies between classes without the need to analyze historical data of software projects, thus reducing the effort required (i.e., computation time and data storage) in the detection of logical dependencies via mining software repositories. The semantic similarity between class identifiers will also be used in the ranking of classes that might be impacted by a given change request without having to analyze software evolution or historical data, thus minimizing the effort required in change impact analysis (Kagdi et al. 2013).

This work is articulated as follows: in Section 2 we describe the definitions, research goals and steps taken to carry out this study, with the help of a worked example to show the empirical approach. Sections 3 and 4 highlight the results of our study, followed by a discussion on the importance of these findings. Section 5 highlights the threats to validity and in Section 6 we summarise the related work. Finally, our conclusions and areas for further research are presented in Section 7.

2 Research methodology

In this section, we present the definitions for the different types of coupling Section 2.1, the motivating scenario Section 2.2 and the goals of this research Section 2.3. Additionally, we highlight, with the use of worked examples, the steps performed in the methodology: data collection Section 2.4; computing the coupling types Section 2.5; evaluating their intersection Section 2.6; and performing the statistical tests Section 2.7.

2.1 OO software dependencies

A dependency is a semantic relationship that indicates that a client element may be affected by changes performed in a supplier element (Oliva and Gerosa 2011). In Sections 2.1.1 and 2.1.2, we introduce semantic and logical dependencies and discuss how they can be operationalised in an OO context.

2.1.1 Logical coupling

According to Wiese et al., “change coupling is a phenomenon associated with recurrent co-changes found in the software evolution history” (Wiese et al. 2015). Co-evolution of classes can be represented with their change, logical or evolutionary coupling (Zimmermann et al. 2003; Yu 2007) (as shown in Fig. 2). Therefore, the logical coupling of any two classes is based on their change history, and is a measure of the observation that the two classes always co-evolve or change together (Gal et al. 1998, 2003; D’Ambros et al. 2009; Wiese et al. 2015). They are commonly treated as association rules (Zimmermann et al. 2005), which means that when X 1 is changed, X 2is also changed (Oliva and Gerosa 2011). Furthermore, X1 and X2 are called the antecedent (i.e., left-hand-side, LHS) and the consequent (i.e., right-hand-side, RHS) of the rule, respectively. For example, the rule {A, B}→ C found in the sales data of a supermarket indicates that a customer who buys A and B together, is also likely to buy C (Oliva and Gerosa 2011).

Two classes change at the same time when changes in one class A are made in response to a change in another class B. Kagdi et al. (2013) state that logical coupling captures the extent to which software artefacts co-evolve and this information is derived by analysing patterns, relationships and relevant information of source code changes mined from multiple versions (of software systems) in software repositories (e.g., Subversion and Bugzilla). According to Lanza et al. (D’Ambros et al. 2006) it is useful to study logical coupling because it can reveal dependencies not revealed by analyzing the source code (Yu 2007) only. These sort of dependencies are the most troublesome and are the source of many defects in software. In this study, we adapt the methods proposed by Zimmermann et al. (2003) to represent logical dependencies.

Operationalisation

The logical dependency between classes and its degree, is evaluated in this work using the support and confidence metrics. By doing so, we evaluated the significance of the association rules between classes (Oliva and Gerosa 2011), and across the lifespan of a software project (i.e., taking all versions of the software system into consideration).

The support value counts the number of revisions where two software artifacts (i.e., classes) were changed together. In other words, the probability of finding both the antecedent and consequent in the set of revisions. For example, in Fig. 1, class A was modified in 3 transactions (where 3 is the “Transaction Count” (Yu 2007)). Out of these 3 transactions, 2 also included changes to the class C. Therefore, the support for the logical dependency AC will be 2. By its own nature, support is a symmetric metric, so the AC dependency also implies AC. The support value of a given rule determines how evident the rule is Wiese et al. (2017).

On the other hand, the confidence Footnote 1 value of a dependency link measures the degree of the logical dependency and normalizes the support value by the total number of changes of the causal class, or the antecedent of the association rule. Numerically, it is the ratio of the support count to transaction count: from Fig. 1, the confidence value for the association rule AC (which states that C depends on A) will have a high confidence value of 2/3 = 0.67. In contrast, the rule CA(which states that A depends on C) has a lower confidence value of 2/4 = 0.5. In other words, the confidence is directional, and determines the strength of the consequence of a given (directional) logical dependency. The confidence value is the strength of a given association rule (Wiese et al. 2017).

Fig. 1
figure 1

Association rule example for confidence and support metrics

Logical coupling is directional, thus AC(changes made to class A resulted in changes in C) and CA (changes in C caused changes in A) will have different meanings. As a result, the confidence for these two cause \(\xrightarrow {}\) effect rules can be different.

2.1.2 Semantic coupling

Some studies have used the term “semantic” (Poshyvanyk et al. 2009; Qusef et al. 2011, Bavota et al. 2010, 2013b, 2014a, 2014b; Kagdi et al. 2010; Gethers et al. 2012), while others have used the term “conceptual” (Gethers et al. 2012) to describe the same concept. Poshyvanyk et al. (2009) state that conceptual coupling captures the degree to which the identifiers and comments from different classes relate to each other (Qusef et al. 2011, Bavota et al. 2010, 2013b, 2014a, 2014b; Kagdi et al. 2010). Gethers et al. (2012) add a twist to the definition and state that conceptual coupling captures the extent to which domain concepts/features and software artefacts are related to each other. However, both definitions have things in common. They are limited to the underlying meanings of unstructured text in the source code of software entities (e.g., classes) and how these underlying meanings relate to each other. Furthermore, this relationship is derived in the form of metrics (-1 to 1, where 1 = high semantic coupling (Poshyvanyk and Marcus 2006)).

Identifiers used by developers for names of classes, methods, or attributes in source code or other artifacts contain important information and account for approximately half of the source code in software (Kagdi et al. 2013). These names often serve as a starting point in many program comprehension tasks. Hence, it is essential that these names clearly reflect the concepts that they are supposed to represent, as self-documenting identifiers decrease the time and effort needed to acquire a basic comprehension level for a programming task (Kagdi et al. 2013).

2.2 Motivating scenario

Figure 2 shows a simplified scenario that underlies our motivation: previous research (Yu 2007) (pictured inside the grey box) has shown that the structural coupling between classes causes them to be co-changed, and it plays an important role in the measurement of co-evolution (Yu 2007). However Yu (2007) has used correlation to infer a causal relationship and research has shown that correlation does not always mean causality (Didele 2005); there are different ways to identify causal relationships. In addition to correlation studies which investigate linear relationships, we have also investigated the presence of a directional relationships between semantic and logical coupling. Our contribution, expressed in this research, includes semantic coupling in the picture: we posit that the semantic coupling of classes leads to their co-change and logical coupling (evolutionary dependencies).

Fig. 2
figure 2

Motivating example: structural coupling \(\rightarrow \) co-evolution (Yu 2007) and semantic \(\rightarrow \) co-evolution (to be checked in this work)

2.3 Research goals

The work we present is based on the three following goals:

  1. G1:

    to establish with a larger sample of OSS projects whether the semantic coupling between classes using the class names of Java files produces comparable results to using the corpora of the classes content (i.e., the class own source code) (Ajienka and Capiluppi 2016);

  2. G2:

    to investigate how the semantic coupling strength between classes has an impact on their future co-changes;

  3. G3:

    to investigate the directionality of the relationship between logical and semantic (as motivated by Fig. 2) by identifying the proportion of logical dependencies that involve semantically related elements (“hidden dependencies” (Vanciu and Rajlich 2010)) and vice-versa.

Research questions were derived from each goal, and testable hypotheses formulated for each question, as summarised in Table 1.

Table 1 Research Goals and Questions

2.4 Empirical data collection

In the next subsections, we present how and what kind of data we collected from the repositories of the studied sample of OO software projects.

2.4.1 Selection of a sample of OSS projects

Leveraging the FlossMole project, we used its latest available data dump to determine the population of GoogleCode: a total of 2,593,222 projects are listed in the November 2012 dump.Footnote 2 Given their language descriptions, we extracted the subset of Java projects from that population, obtaining 49,459 Java projects. Each project in the subset was given a unique ID: using a 95% confidence level, and a 5% confidence interval, a random sample of 380 IDs were extracted, and linked to the Java projects’ names.

2.4.2 Storage of projects metadata and revisions

The first phase of this activity was centered on obtaining the metadata (e.g, name of developers, date and time of changes, etc.) of each project in the sample. The repository of each project was downloaded and stored, with its metadata, using the CVSAnalY set of tools.Footnote 3 The process to obtain the metadata for all the projects took around 48 hours: sleep statements were inserted in a routine not to overload the online servers, and to make sure that the latest versions of the files were downloaded. The metadata allowed us to obtain the list of revisions for each class, and for the whole project. The second phase was to get all the revisions of each project; from this we could identify the trivial projects (with < 20 revisions) and exclude these from the study. As a result, we ended up with 79 non-trivial Java open-source software projects with 117 revisions on average.

There is a chance that re-sampling to retrieve a larger sample of projects could result in a larger number of trivial projects with less than 79 left. Previous studies have also excluded a number of OSS projects after their initial sampling. Samoladas et al. (2010) and Gousios and Pinzger (2014) applied certain selection criteria to exclude projects from their initial selection. Midha and Palvia (2012) based on certain project selection criteria, reduced their initial sample from 887 to 283. Haefliger and Spaeth (Haefliger et al. 2008) reduced their selected sample of projects to 6 OSS projects with variance on their sampling criteria. The studied sample included a wide variety of software products such as office software, games, a hardware driver, and an instant messenger client and this reduced sampling bias (Stake 1995). Similarly in this study, the resulting non-trivial sample of 79 OSS projects are of different domains, sizes and levels of activity. The sample selection criteria widely used in OSS research (Cruz et al. 2006; Rainer and Gale 2005) and adopted by Haefliger and Spaeth, includes: 1) the project is under active development, allowing the tracking of its development activity Footnote 4, 2) the source code modifications of the project need to be available online, and 3) the project should have been in existence for at least a year.

Because the process of analyzing the correlation between identifier and corpora based methods of computing semantic coupling is labour-intensive Footnote 5, we focused our attention on a subset of this sample of projects when answering RQ1 (Crowston et al. 2005) while ensuring that the subset consists of projects of varying sizes representative of the overall sample.

2.5 Identifying class dependencies (RQ1)

In the following subsections, we present how the class dependencies were calculated with examples. We also present assumptions and decisions made during this task.

2.5.1 Logical coupling

For each project, we extracted the number of revisions, based on the tables built by CVSAnalY. This task was a pure SQL extraction task, so it did not pose a time issue. For all revisions, we extracted the list of pairs of classes that were co-evolving in that revision and stored this data in a .CSV file. An example of the co-evolution data is provided in Table 2, detailing an excerpt of the Java classes that co-evolve in the UrSQL project in its 4threvision. The first column shows the project name, the third and fourth columns show classes that were co-changed, through association rules.

Table 2 Co-evolution data for Project UrSQL (excerpt)

Using the arules Footnote 6 library in the R Footnote 7 environment for association rule mining (Hahsler et al. 2007), we were able to compute the Confidence metric for each pair of classes with an established logical dependency. Similar to prior research, the support and confidence thresholds have been set to 0.01 and 0.1 respectively (Kenett and Salini 2008). This is because increasing the support and confidence increases precision but lowers recall (i.e., only a small number of association rules are identified when the minimum confidence value is higher than 0.01). The number of identified co-evolving class pairs reduces based on increase in confidence and such pruning looses important information (Zimmermann et al. 2005). Oliva and Gerosa classified confidence metrics as: [0.00-0.33] low logical coupling, [0.33-0.66] medium logical coupling and [0.66-1.00] high logical coupling and identified that highly logically coupled classes suffered slightest influence from structural coupling. In addition, the arules library in the R statistical environment has been used with a high precision and minimal false positives in prior research across different disciplines (Hahsler et al. 2006; Hahsler and Hornik 2007; Kenett and Salini 2008) when mining frequent item sets from data.

2.5.2 Semantic coupling

In a previous study (Ajienka and Capiluppi 2016) described in Section 1, we compared two sentence similarity techniques (based on N-Gram Footnote 8 and Disco Word synonymFootnote 9 categories methods) against a corpus or text document cosine similarity based technique (VSM)Footnote 10 Footnote 11 for computing semantic similarity between Java classes. The study was conducted using two software projects and identified by means of Chi-squared independence tests that measuring the semantic similarity between classes using (only) their identifiers is similar to using the class corpora. This is because using identifiers was more efficient in terms of recall, and computation time (Ajienka and Capiluppi 2016). The study also identified that English based word similarity techniques such as WordNet do not perform well on software terminologies (e.g., export ⇔dump). The steps taken to compute the semantic similarity between Java classes using the three techniques is explained in detail in Ajienka and Capiluppi (2016).

In addition, the N-Gram technique is based on the edit distance and shared sub-strings of length n between sentences (Kešelj et al. 2003). An example is the semantic similarity between two class identifiers ’Ur S Q L Controller’ and ’Ur S Q L Entry’ which returns a value of 0.6 for shared sub-strings between 0 and 4. We have used n-grams of size 4 in this study based on prior information retrieval research (Mcnamee and Mayfield 2004; Kešelj et al. 2003) that shows that n = 4 increases the precision when analyzing words or terms in various languages. The N-Gram technique also performs better on text from other languages (Ajienka and Capiluppi 2016) apart from English (Mcnamee and Mayfield 2004; Kešelj et al. 2003) compared to English based text similarity methods like WordNet. The Disco technique although with low precision on non-English words has been compared to other text similarity techniques and proven to perform well when its outputs were manually checked. According to Despotakis et al. (2011) “although the precision for Disco was low, it did provide additional valuable concepts, which were approved by both experts. We also manually checked the outputs of the semantic similarity measurements to minimize the effects of false positives.”

In this study, we extend the previous study with a larger sample of OO software projects written in the Java programming language. The statistical methods applied in investigating the association between the corpus and identifier-based techniques are described in Sections 2.7.1 and 2.7.2. We also compare the techniques with three semantic dissimilarity thresholds (t = 0.1, 0.2, and 0.5) based on what has been used previously in the literature (0.195 (Kešelj et al. 2003); between 0.1 and 0.2 (Tan et al. 2000), 0.2 (Erkan and Radev 2004; Sarikaya et al. 2005); and 0.5 (Coster and Kauchak 2011; Corley and Mihalcea 2005)).

2.6 Evaluating the intersection of sets (RQ2)

Once pair-wise semantic and logical dependencies were identified and the associated coupling values were calculated, we then built a spreadsheet per project based on the data with the following columns; LHS (antecedent), RHS (consequent), semantic similarity, and confidence. Using a Shell script, we could parse the data and identify the proportion of semantic dependencies that involved non-logical dependencies (i.e., AB from the sets in Fig. 3), the proportion of logical dependencies that involved non-semantic dependencies (i.e., BA from Fig. 3) as well as the intersection set of pairs of classes that are both semantically and logically related (i.e., AB from Fig. 3).

Fig. 3
figure 3

Intersection of semantically and logically coupled classes

2.7 Statistical tests

Both RQ1 and RQ2 are linked to formal statistical testing. Below the two tests (Chi-Squared for RQ1 and Spearman for RQ2) are illustrated in the context of the analysed systems.

2.7.1 Chi-Squared (X 2) Test (RQ1)

To answer RQ1, we performed a Chi-square statistical test to discover if the similarity measures from one class identifier-based technique (say, the N-Gram) produces similar results to the corpus-based technique (VSM). For each project, we populated a 2X2 contingency table, composed of row (i.e., groups) and column (i.e., outcomes) variables. The first contingency table visible in Table 3 is a generic 2x2 contingency table, with the corpus-based outcomes (VSM) as the outcomes variable, and the identifier-based outcomes (N-Gram and Disco) as the groups variable. For the statistical test, we used three semantic dissimilarity thresholds t = 0.1, 0.2, and 0.5.

Table 3 Contingency Tables: generic (top) and populated (middle and bottom) with Identifier (either N-Gram or Disco) vs Corpus Based (VSM) techniques

If s is the semantic similarity between pairs, and using a similarity threshold t (with a lower t implying a weaker similarity), the items of the contingency table are:

  • A: pair of classes with s \(\geqslant \) t for both Corpora-based and Identifier-based techniques;

  • B: pair of classes with s<t for one technique but \(\geqslant \) t for the other;

The following are the possible outcomes observed for the threshold t – for the Identifier-based technique:

  • C: pair of classes with s \(\geqslant \) t for one technique but <t for the other;

  • D: pair of classes with s<t for both techniques.

The other two tables (middle and bottom of Table 3) report the values and results for (i) VSM as the column variable, and N-Gram as the row variable and (ii) VSM as the column variable, and Disco as the row variable for the UrSQL Project, with t = 0.1.

After populating the contingency Tables, we tested for the association between the semantic similarity measures derived from the pairs of techniques (the identifier and corpus based) using the Chi-square test method (chisq.test) in R Footnote 12. This test is used to compare categorical data. It asserts the independence of the two techniques, with a null hypothesis H 0of no association between their outcomes. We set the p-value at 0.05 as the threshold to reject the null hypothesis and compute the Chi-square tests for each project.

2.7.2 Spearman’s rank correlation ρ(RQ1 and RQ2)

This section describes the computation of Spearman’s rank correlation statistical tests for RQ1 and RQ2.

To further answer RQ1, using the semantic dissimilarity thresholds (0.1, 0.2, and 0.5) described in Section 2.5.2, we will in addition to the Chi-square test compute the linear correlation between the corpora based semantic similarity measurement technique and the identifier based techniques to verify whether the semantic coupling metrics reported by the different techniques for the same pairs of classes co-vary.

The intersection of dependency sets (from 2.6) is used to evaluate the relationship between the coupling types. All the values of the logical coupling strength (i.e., the confidence metrics) and all the values of the semantic coupling strength, are pulled together, per pair of classes, per project and along their string of revisions. Given a project, we created two vectors Footnote 13, one with the values of ‘semantic similarity‘ between classes; the other with all the values of co-change confidence between classes.

Each observation in both vectors contain the semantic coupling between two classes and the confidence of their co-evolution or logical coupling metric. Notwithstanding, the semantic coupling metric is a symmetric metric whereby the semantic coupling is the same in both directions (for example a pair of classes A and B in a software project Y, A →B will have the same semantic coupling metric as B → A). The logical coupling metric however is not symmetric. The association rule A → B measures the strength of the following observation: “when A is modified, there will always be a change in B” (Zimmermann et al. 2003). Therefore, A → B and B → A are not treated as the same association rule (the confidence metric could be different but the semantic coupling metrics is the same) and are considered as different observations in the created vectors.

Computing a linear correlation between the strengths of the semantic and logical coupling of classes will help to further answer questions such as: “What is the strength of the relationship between semantic and logical coupling of classes?”, “are classes with a high degree of semantic coupling likely to co-evolve frequently?”. Various correlation coefficients have been considered including Pearson, Kendall and Spearman. However, for Pearson’s to be valid the data has to follow a normal distribution (Yu 2007; Pagano 2001) (the mean, median and mode have to be the same) while Kendall tau is used in small sample sizes and where there are multiple values with the same score (Field 2009) and interpreted based on the probability of concordant and discordant observations. Finally, p-values derived from Kendall tau are more accurate with smaller sample sizes.

The null hypothesis H 0to be tested for RQ2 is as follows:

  • H 0: No linear relationship between the strengths of logical and semantic dependencies.

The correlation between the two vectors is evaluated using the Spearman’s rank correlation coefficient (Yu 2007). The Spearman’s metric (non-parametric) was chosen because it is unlikely that either the semantic or logical coupling values will have a normal distribution (Pagano 2001). Additionally, some classes might not be changed in all the revisions in which they are semantically coupled. Nevertheless the vectors will be of the same size or contain the same number of observations with the confidence metric in one and the semantic coupling metric in the other.

We adapt the categorization of correlation coefficients by Marcus and Poshyvanyk (2005) as follows: ([0 − 0.1] to be insignificant, [0.1 − 0.3]low, [0.3 − 0.5]moderate, [0.5 − 0.7]large, [0.7 − 0.9] very large, and [0.9 − 1]almost perfect) if the rank correlation coefficient proves to be statistically significant.

We reject the null hypothesis for all the projects studied at the 95% confidence level. In other words, if the rank correlation coefficient proves to be statistically significant at the α = 0.05level, we will reject the null hypothesis and fail to reject the alternative hypothesis H 1: There is a linear relationship between the logical coupling and semantic coupling of OO software classes. The results derived for all projects are described in Section 3.2. The α = 0.05 level was chosen as suggested in Yu’s study (Yu 2007). One of the threats to the statistical validity to their study was the selection of the significance level. In that study, they chose α = 0.1which might have resulted in a type I error - mistakenly rejecting a null hypothesis. To reduce this threat, they planned in future research to decrease the α value to 0.05 for more accuracy (which we have done herein).

3 Results

Following the methodology outlined above, this section presents the results of the three analyses, as performed on the selected projects. The aim is to answer the research questions outlined in Table 1.

3.1 RQ1. Can semantic coupling between classes be captured via class identifiers, rather than with source code corpora?

A Chi-squared test of independence was carried out to investigate the independence of the semantic coupling metrics measured using:

  1. 1.

    A corpora based technique (VSM) and

  2. 2.

    A couple of identifier based techniques (N-Gram and Disco word synonym category-based)

The measurement was done at the class level of granularity as mentioned in Section 6.1 and based on the results derived from the statistical test we could either reject or fail to reject the null hypothesis (H 0) presented in Table 1 for RQ1 : There is no association between the semantic similarity measures of the corpora and identifier based techniques.

Figure 4 shows the distribution of the p-values per studied OO software project derived from the Chi-squared test of independence. The box-plot also shows that we considered three semantic dissimilarity thresholds (t = 0.1, 0.2, 0.5) based on previous studies (Kešelj et al. 2003; Tan et al. 2000; Erkan and Radev 2004; Sarikaya et al. 2005; Coster and Kauchak 2011; Corley and Mihalcea 2005) on text similarity, whereby any class pairs with a measure below the threshold were not considered as semantically coupled.

Fig. 4
figure 4

RQ1- Chi-square association test results for class corpora (VSM) vs identifier (N-Gram, Disco word synonym category) based semantic similarity techniques (box-plot distribution of p-values for threshold t = 0.1, 0.2 and 0.5)

We reject the null hypothesis at the 95% confidence level with only a 5% error margin. In other words, we consider results as statistically significant where the p-value is below or at α = 0.05level.

Figure 4 shows that when the threshold is set to 0.1, not all the p-values fall below 0.05. Therefore, we cannot reject the null hypothesis. When the threshold is set to 0.2, a similar condition for 0.1 also applies for the VSM ⇔N-Gram tests with many outliers above the 0.05 mark. However, when the threshold is set to 0.5 all the p-values are less than or equal to 0.05 for the VSM ⇔ N-Gram tests. Yet the VSM ⇔Disco tests revealed only two outliers while the rest of the p-values are less than or below 0.05.

In addition to Chi-square independence test, we have used Spearman’s rank correlation to further verify the linear correlation between the metrics derived from the corpus based technique against the identifier based techniques. Spearman’s correlation results showed generally a statistically significant and weak correlation in at least half of the projects and between a moderate to large positive correlation (0.3 - 0.8) in another half of the projects with some outliers (negative correlation coefficients) as shown in Fig. 5. However, these negative correlation coefficients are statistically insignificant; the p-values are greater than 0.05 meaning the negative correlation was identified by chance.

Fig. 5
figure 5

RQ1- Spearman’s rank correlation results for class corpora (VSM) vs identifier (N-Gram, Disco word synonym category) based semantic similarity techniques (box-plot distribution of Spearman’s rank correlation coefficients)

Based on the Spearman’s correlation coefficient results, the semantic coupling metrics derived from IR techniques based on class identifiers and class corpora do not covary. However, the use of thresholds when testing for their independence shows a significant association at a semantic dissimilarity threshold of 0.5. This is expected as using a higher dissimilarity threshold of 0.5 for the semantic coupling means that only a subset of all pairs of classes will be reported as semantically coupled. Therefore, the Chi-square independence tests only reveal a significant association between identifier and corpora-based IR methods for a subset of classes – classes that are highly semantically related (semantic coupling \(\geqslant \) 0.5).

Based on the overall results, we cannot reject the null hypothesis and fail to reject the alternative hypothesis H 1for these tests - There is an association between the semantic similarity measures of the corpora and identifier based techniques, as semantic coupling metrics that exploit class identifiers (Kagdi et al. 2013) capture different information with respect to semantic coupling metrics using the entire class corpus.

3.1.1 Summary on RQ1 and its results

Similarly to the results derived in our pilot study (Ajienka and Capiluppi 2016), the Chi-square independence test results indicate the association between the semantic coupling metrics derived when measuring the semantic similarity between OO software classes based on their identifiers and the whole source code corpus. However, this association only applies to classes which as highly semantically related (semantic coupling \(\geqslant \) 0.5). This is an important result for further studies that wish to consider only highly semantically coupled classes, also considering the efficiency of both approaches (corpora and identifiers) in terms of computation time. The fifth column in Table 4 in Appendix A shows that time was saved when we computed the semantic similarity between classes using their identifiers in all but the first project. For example, the semanticdiscoverytoolkit project highlighted in Table 4. Extracting the corpus is time consuming as well as computing the semantic coupling metrics compared to using the identifiers alone. Especially for ‘large’ projects with hundreds of thousands of lines of source code, these results are essential for both researchers and practitioners.

On the other hand, the Spearman’s correlation coefficient results confirm that the semantic coupling metrics derived from identifier and corpora-based IR techniques do not covary. Therefore, to recap the identifier-based metrics and corpora-based cannot be used interchangeably apart from when considering highly semantically linked classes (semantic coupling \(\geqslant \) 0.5) .

Notwithstanding, there are cases or software activities for which one sentence similarity measurement technique will be more useful compared to the other two. For example, in a scenario whereby two class identifiers are similar but these classes do not have related comments, the corpora based method will return a low similarity while identifier based methods will return a high semantic similarity metric. For example, considering the class identifiers GeocoderGeometry and GeocoderIT in the geocoder-java OSS project. The following metrics are returned by the corpora (VSM), N-gram and Disco techniques respectively: 0.0, 0.5 and 0.7. To make use of identifier based techniques, class identifiers are split into short sentences or phrases before adopting the identifier based techniques. The N-gram technique returns a metric closest to that returned by the corpora based technique in comparison to Disco because Disco relies on the English dictionary and will not scale well on non-English terms. For example, considering the class identifiers Data and AnzeigeSpielfield in the 4-connect OSS project. The following metrics are returned by the corpora, n-gram and disco techniques respectively: 0.1, 0.1 and -1.0. At 0.5. The Disco technique compares words based on the similarities of their synonyms while the N-Gram technique is based on the edit distance and shared sub-strings of length n between sentences and has been widely used in the literature on text analysis (Kešelj et al. 2003). We have used n-grams of size 4 in this because research in the area of text mining (Mcnamee and Mayfield 2004; Kešelj et al. 2003) has shown that n = 4 maximizes precision when analyzing words or terms in several languages including English, French, German, Italian and Swedish. To add to that, long lengths of n increase lexicon size, will not represent short words well and processing N-grams sizes larger than 10 is very slow (Kešelj et al. 2003).

While identifier based techniques are more efficient when measuring semantic coupling between classes in terms of computation time, the corpus based technique is useful when recovering traceability links between source code and design documents (Marcus et al. 2005; Witte et al. 2008). Identifier based techniques are unable to extract the meaning or semantics of the documentation and source code to produce similarity measures that can be used to identify traceability links. This is because the identifier of a design document will be too vague and will likely be unrelated to a number of class identifiers. But when parts of the documents are parsed and compared with the terms embedded within the comments and source code of classes, then parts of design documents can be linked to classes in an OO software. Traceability is particularly useful when a developer is trying to comprehend someone else’s code and following any provided documentation as is usually required during maintenance and evolution. This is usually done manually and can be time consuming (especially with large systems consisting of millions of lines of code) without tools that can automatically recover traceability links between source code and documentation.

3.2 RQ2. Is there a << linear >> relationship between semantic and logical coupling?

In order to answer RQ2, it is necessary for us to compute the Spearman’s rank correlation (ρ) between the strengths of the logical and semantic coupling between related class pairs in the studied projects. The strength of the logical coupling is measured in terms of the confidence metrics of identified association rules or frequently co-changed class pairs. Similarly to the confidence metric for logical coupling, the semantic coupling metric range between 0 and 1.

The measurement of how loosely or closely two classes are semantically coupled is based on the corpora-based method (VSM), having identified that identifier-based metrics and corpora-based metrics do not share a linear relationship or covary in Section 3.1. Answering RQ2 will shed more light on whether or not the strengths of the semantic and logical coupling of OO software classes covary. For instance, if they do covary, such statistical results will enable the prediction of the co-change frequency of class pairs based on the strength of their semantic coupling. To recap, the linear relationship between both semantic and logical coupling metrics is investigated using the Spearman’s rank correlation coefficient in this section and the results are now presented.

The charts in Fig. 6 show the correlation results including the p-values obtained. While Fig. 7 further gives a clearer picture of the distribution of the p-values. Similarly to the correlation tests explained in Section 3.1, we reject the null hypothesis at the 95% confidence level with only a 5% error margin for the Spearman’s rank correlation.

Fig. 6
figure 6

RQ2- Correlation between VSM based semantic similarity measures and confidence

Fig. 7
figure 7

RQ2- Correlation between VSM based semantic similarity measures and confidence (box-plot distribution of p-values)

3.2.1 Summary on RQ2 and its results

Figure 6 shows that there is no substantial evidence to infer a particular type of correlation (+ve or -ve) exists between semantic and logical coupling strengths. There is a positive correlation in some projects, while a negative correlation in others. The p-values in 7 further show that the correlations might have been identified by chance and are not statistically significant. This is because most of the p-values are above 0.05 except for only a very few out of the sample of studied projects.

Therefore, due to the lack of any considerable evidence to suggest that there is a correlation between semantic and logical coupling strengths or related OO software classes, we fail to reject the null hypothesis (H 0) for RQ2 presented in Table 1: No linear relationship between the strengths of logical and semantic dependencies.

In summary, to answer RQ2 we have computed the linear correlation between the strengths of the semantic and logical coupling class pairs. We have used the semantic similarity of class identifiers and the confidence of their co-evolution. The results indicate that these coupling strengths do not covary, so they should be considered independent. A pair of classes with a higher co-evolution frequency are not necessarily bound to be linked by a semantic link.

This overall observation has two effects:

  1. 1.

    inferring the co-evolution degree or frequency of class pairs based on the strength of their semantic coupling and vice versa will produce a lot of false positives.

  2. 2.

    Using only semantic coupling information to predict co-evolution will produce a prediction model with low precision.

Previous research by Abdeen et al. (2015) has shown that combining semantic and structural coupling information when predicting change impact sets outperforms using either of them individually. However, semantic coupling metrics produced better recall values compared to structural coupling metrics. Research has also shown that the lack of a linear correlation does not imply a lack of causation (Perdicoúlis 2013). In RQ3, we investigate the possibility of a causal relationship between the semantic and logical coupling of classes.

3.3 RQ3. Is there a << directional >> relationship between semantic and logical coupling?

With the aim of contributing to the interplay between semantic coupling and logical coupling we went a step further to empirically investigate the presence or absence of a (bi-)directional relationship between these types of software dependencies. In Section 3.1, we identified that identifier and corpora-based semantic coupling metrics do not covary. Consequentially, similarly to Section 3.2 in this section the semantic coupling metric is calculated using the corpora technique (VSM).

In order to answer RQ3, it is imperative to gain an understanding of the overlapping or intersection of the logical and semantic class dependencies per project. The intersection set per project is defined as the proportion of class pairs linked both logically and semantically. Classes linked logically have either been co-changed once or more while classes linked semantically share are all class pairs excluding those without any semantic similarity whatsoever. The intersection set of class pairs is represented by the shaded area in Fig. 3. Depending on the size of the two sets, the Venn diagram could be far from symmetric.

Equations 1 and 2 are at the core of RQ3. Two formulas are presented: the Co-changed Semantic Dependencies (CSD, measured in percentages) and Semantic Logical Dependencies (SLD, also a percentage). These two formulas are used as a measure of the class dependencies that belong to the intersection set (both logically and semantically related classes). The CSD(%) represents co-changed semantic dependencies, these are class pairs that share a semantic and modification relationship (frequently co-changed). The SLD(%) represents classes that are logically or change related and also share a semantic relationship. Some classes might only share either a semantic relationship only or a logical relationship only and these classes do not belong to the intersection set.

$$ CSD (\%) = \frac{Semantic \cap Logical}{Semantic} $$
(1)
$$ SLD (\%) = \frac{Semantic \cap Logical}{Logical} $$
(2)

Figures 8 shows two summary plots with the CSD and SLD proportion extracted from the studied sample of OO software projects:

Fig. 8
figure 8

CSD and SLD Percentages per OSS Project (KEY: CSD = Co-changed Semantic Dependencies; SLD = Semantic Logical Dependencies)

While the proportion of co-changed semantic dependencies (CSD) is high (\(\geqslant \) 70%), the proportion of semantic logical dependencies (SLD) tends to also be high. Table 5 in Appendix B reveal for each project the number of distinct semantic dependencies in the third field, the number of distinct logical dependencies in the fourth field, the number of dependencies in the intersection set – pairs of classes that co-change and are semantically related, the percentage of semantic dependencies in the intersection set shown in the sixth field (see (1)), while the last field shows the percentage of logical dependencies in the intersection set (see (2)).

Table 5 is sorted by the project IDs and names for readability. The table shows that there is a directional connection between co-change and semantic coupling. When classes contain terms with similar meanings (i.e., the classes are semantically related) they require modifications at the same time. This also holds in the opposite direction.

From Table 5 in Appendix B, we know for instance, that the project with ID= 56 has a proportion of 60% of its semantic class dependencies including logical dependencies (as shown in the 6thcolumn). On the other hand, all of the pairs that co-changed include semantic dependencies, in the same project. This is a recurring pattern: in all of the projects as shown in Table 5, we have evidence to indicate that very often, semantically related classes involve logically related classes. In 17 of these projects, all the semantic dependencies are reflected into logical dependencies. In both Venn diagrams (left and right) in Fig. 9, the smaller circle represents the set of semantic dependencies while the larger circle represents the set of logical dependencies. Using the Venn diagram on the left (weighted) in Fig. 9, all the semantically coupled pairs of classes in the alleywayreinvented project (project ID = 12) need also co-changes. On the flip side, not all the pairs that co-change are semantically coupled.

Fig. 9
figure 9

Venn Diagrams (weighted) showing the two sets of coupling in two scenarios: project ID= 12 (left) and project ID= 69 (right)

The second most common scenario identified in the results is illustrated using the Venn diagram in Fig. 9 (right), showing the guitarjava project (project ID = 69). A subset of pairs of semantically coupled classes do not need co-change, while the majority of the others still do. Again, in this project most of its other co-changes are not conducive of semantic links.

3.3.1 Summary on RQ3 and its results

The results mentioned above are illustrated with two box-plots each in Fig. 8. The figure shows the distribution of class pairs belonging to the intersection set (classes with both semantic and logical dependencies; see (1) and (2)). The results indicate a bi-directional relationship between semantic relationships and co-change, as in Fig. 8 where both distributions are relatively high in the overall sample of studied OSS projects. Therefore, we reject the null hypothesis (H 0) for RQ3 presented in Table 1 and fail to reject the alternative hypothesis: There is a directional relationship between the semantic and logical dependencies among OO software classes.

In summary, after identifying the lack of a linear relationship between two identifier based techniques in relation to a corpora based information retrieval (IR) technique in semantic coupling measurement (in Section 3.1), we went further to investigate whether there is a linear relationship between the strengths of semantic and logical dependencies of OO software classes. Results presented in Section 3.2 revealed the absence of a linear relationship between the strengths or degrees of the two software dependency types (semantic and logical) at the file or class level. Lastly, as motivated by previous research by Yu (2007) and Fig. 2 where it has been shown that structural coupling of classes leads to their co-change, we wanted to identify where there was a bi-directional relationship between semantic and logical dependencies. Other results in Section 3.3 revealed the presence of a bi-directional link from semantic to change dependency (semantic ⇔logical coupling).

4 Discussion

In Section 3, we presented the results of a three-fold empirical study on the interplay between semantic and logical coupling among classes in OO systems. Previous studies have shown that a number of coupling measures, related to aggregation and invocation coupling, are related to a higher probability of common changes. This indicates that these coupling measures should be good indicators of ripple effects and are used as such in a decision model for ranking classes according to their probability to contain ripple effects associated with given change requests (Briand et al. 1999; Wilkie and Kitchenham 2000; Sun et al. 2015). According to Briand et al. (1999), it is also clear that a substantial number of ripple effects are not covered by the selected highly coupled classes. Thus, such models can be used to focus dependency analysis and help reduce the impact analysis effort. Nevertheless, other important dependencies are clearly not measured or accounted for, and may not be measurable from code alone.

The main findings from the analysis carried out in this study include:

  • RQ1 Identifiers vs Corpora – Firstly, identifier-based techniques (N-gram and Disco) yield similar results to analysing the whole corpora of software classes only for highly semantically related classes (semantic coupling \(\geqslant \) 0.5) and as such cannot always be used interchangeably when computing semantic coupling. Secondly, N-gram and Disco are much more computationally efficient than corpora-based techniques, time-wise. Finally, the N-gram technique is more efficient than the Disco technique, precision-wise: the latter is heavily dependent on the English dictionary, as it considers words with similar English synonyms as semantically related. This study has shown that over 50% of the software projects analyzed do not contain classes with only English identifiers, therefore the Disco technique will produce a lot of false negatives.

  • RQ2 Strengths of coupling – There is no linear correlation between the degree or strengths of the semantic similarity between classes and the frequency of their co-change. Statistical results prove that not all highly semantically related class pairs will require frequent co-changes.

  • RQ3 Direction/Causality of Coupling – There is a large overlapping between semantic and logical (change) class dependencies. If two classes are semantically coupled, there is a high chance that they will co-evolve in the future. However, from RQ2 we have shown that the degree of these dependency types do not show a linear correlation.

The last result is particularly important: for example, two (semantically similar) class pairs \(A\leftrightarrow \hat {A}\) and \(B\leftrightarrow \hat {B}\) could share a semantic similarity of 0.7, but not the same degree of co-change: the pair \(A\leftrightarrow \hat {A}\) could change much more often than \(B\leftrightarrow \hat {B}\). Even so, what we have shown is that it is highly likely that the pairs \(A\leftrightarrow \hat {A}\) and \(B\leftrightarrow \hat {B}\) will co-change at least once or more.

In addition, Spearman’s rank correlation coefficient only assesses linear relationships but some relationships can be curvillinear (Barnett and Salomon 2006). Earlier research has shown that lack of correlation does not imply lack of causation (Howard and Maxwell 1980; Wright et al. 1999; Verhulst et al. 2012).

Other researchers have emphasized the need to study the interplay between semantic and logical coupling in OO software as well as the interplay between structural and semantic coupling (Oliva and Gerosa 2011, 2015). It is noteworthy that this study has presented three novel results in (Sections 3.13.2 and 3.3) the software dependency and maintenance domain. These results will be useful and can guide software developers when building software maintenance tools for change impact analysis (CIA).

Studies on the relationship and interplay between structural and logical coupling have shown that a majority of co-evolving classes are not structurally linked. According to Geipel and Schweitzer (2012), this indirectly means that any model that tries to infer structural coupling from logical coupling or co-evolution will produce a lot of false positives. On the other hand, using the structural coupling information between pairs of classes to infer their future co-change is a more realistic objective (Oliva and Gerosa 2015). Differently from previous studies, this study has shown that over 70% of semantically related classes will usually co-change and the same proportion of change related classes will usually share a semantic relationship. Reflecting back to Section 1, these results are backed by the argument by Bavota et al. (2013b): “the peculiarity of the semantic coupling measure allows it to better estimate the mental model of developers than the other coupling measures. This is because, in several cases, the interactions between classes are encapsulated in the source code vocabulary”. However we cannot firmly assert that using the semantic coupling metrics between classes to infer the strength of their co-change is a realistic objective as our empirical study did not show a linear relationship between the strengths of semantic and logical coupling. But we believe that using a combination of structural and semantic information to predict co-change patterns (Abdeen et al. 2015) might be a more feasible objective.

5 Threats to validity

In this section we present the threats to validity of this study, dividing them in external, internal and construct threats.

External validity

This paper presents the results of an empirical analysis that should be applicable to all open-source projects. We cannot generalise our findings on any other sample of projects, or from any other repository but the lessons learned from this study can be instructive and transferred to similar studies carried out by others. Nonetheless, in order to make the findings from our study more generalisable and representative of open-source projects, we have carried out our analysis on a random sample of projects, with different sizes.

Internal validity

Our selection of the semantic dissimilarity threshold when investigating the association between the corpora-based technique and the identifier-based IR techniques for semantic coupling measurement is based on dissimilarity thresholds used in previous text mining and information retrieval studies. Therefore, to prevent any form of bias during the Chi-squared independence tests we have used three different values (t = 0.1, 0.2 and 0.5) and compared results. This is because different thresholds will reveal different results as shown in Section 3.1.

For measuring logical coupling, we have used the arules package in the R statistical environment (Hahsler et al. 2007). We set the confidence threshold to 0.01 and this might have affected the results. While this is a low threshold, it results in a higher recall (Dasseni et al. 2001) (i.e., identified a larger set of frequently co-changing classes). We further conducted a manual check on the returned association rules in the smaller projects to ensure that class pairs returned by the arules tool actually co-changed and to validate its accuracy. We also adopted 2×2 contingency tables when investigating the association between the identifier and corpora-based IR techniques using the Chi-squared independence test. This test results in false positives when one or more cells have no observations but this was not the case in our data set as each project had at least one class pair in each contingency table cell.

For parsing the corpora of classes and computing semantic coupling, we have adopted the vector space model (VSM) IR technique and we acknowledge that this can have an impact on our results. We also acknowledge that other text document comparison techniques exist, one of which is the latent semantic analysis (LSI); an extension of VSM (Bavota et al. 2013a) used in other domains apart form software engineering.

LSI uses an approach called singular vector decomposition (SVD) to reduce text documents (dimensionality or noise reduction to reflect semantic associations between words - latent semantic space) Footnote 14 by representing synonyms with topics in a latent semantic space before computing document similarity. As such, the dimensionality of a corpus is the number of distinct topics represented in it. Dimentionality reduction allows LSI to index or compare text documents based on topics/concepts instead of similar words. This means that LSI requires the use of fine tuned models.

However, in the context of semantic coupling the reduction of words in documents by grouping them into topics is time consuming as well as prone to low accuracy especially in cases where software teams or comments are multi-lingual (i.e., software built by developers who speak different languages and write comments in their native language). An example of this scenario is the two classes in Listings 1 and 2 both from the same software project (4-connect). Line 7 of Listing 2 contains English words while other comments in the class contain non-English words. The identifier of the class in Listing 1 is also not an English word. In such a case, it becomes imperative to translate words from one language to another before building a textual model for LSI to rely upon.

Listing 1
figure 10

SetzeStein.java

Listing 2
figure 11

DbConnector.java

Prior research has demonstrated that text similarity is based on the notion that the meaning of a sentence is made up of not only the meanings of its individual words, but also the structural way the words are combined (Oliva et al. 2011). Measuring the similarity between non-English documents based on models trained with the Wikipedia corpus for example will yield a low accuracy in the software domain. In Section 3.1.1, we have demonstrated using the Disco word synonym technique that text similarity methods based on the English dictionary does not perform well in the software domain.

In a different study (Bavota et al. 2013a) on software traceability link recovery, VSM outperformed LSI. But in an earlier study on the same topic the authors preferred the use of LSI over VSM. According to Marcus et al. (2005) “our main assumption is that developers use the same natural language (e.g., English, Romanian, etc.) in writing internal documentation and external documentation”. However, our examples and results have shown that the reverse is the case. A feasible research topic for future work will be to investigate or build techniques for the measurement of semantic coupling between multi-lingual OO software classes. Lastly, the performance of LSI depends on the contents of the documents used to build the model. As such, LSI is also not scalable when new documents, not analyzed during model building are parsed using pre-built models, as the concepts in such documents is not captured in the model which also has to be re-built.

Construct validity

The scope of our sample of projects was limited to open-source software projects written in the Java programming language (object-oriented), thus we encourage investigating projects written in other programming languages and non-object-oriented software projects. The study was also conducted at the class level of granularity. This is because overall,the measurement of semantic coupling is more affected by the difference in granularity than logical or evolutionary coupling. A previous study has shown that for the semantic dependencies, going from the coarse granularity of classes to the finer granularity of methods results in the reduction of the sizes of the documents (Kagdi et al. 2013). The documents are reduced in terms (and frequency). That is, a corpus for a class is typically much “bigger” than a corpus for a method (Kagdi et al. 2013). For logical coupling some commits do not contain changes made to methods while some do not contain changes made to classes, so there is not way to map changes made to classes and methods. This informs the choice of the class level of granularity.

6 Related work

Structural and logical (evolutionary) dependencies are at the core of software engineering. However, the study of semantic coupling is still evolving and relatively new compared to the number of studies undertaken on the structural and logical coupling of software classes. In the following section, we summarise the main results of related work on both aspects separately and jointly.

6.1 Semantic coupling

Poshyvanyk and Marcus (2006) defined a coupling metric for classes based on textual information extracted from source code identifiers and comments. Their conceptual coupling metric, CoCC (Conceptual Coupling of Classes), captures a new dimension of coupling not addressed by structural or dynamic measures. More recently, Újházi et al. (2010) extended the CoCC, defining the new CCBO metric (Conceptual Coupling between Object Classes).

Fluri et al. use a set-based similarity metric to explore how comments and code evolve over time (Fluri et al. 2007). Kuhn et al. (2007) proposed the use of IR techniques to exploit linguistic information found in source code, such as identifiers (i.e., class or method) names and comments. Revelle et al. (2011) define new feature coupling metrics based on structural and textual source code information.

Kagdi et al. (2013) in their study on integrating conceptual and logical coupling metrics for change impact analysis suggest that measurement of conceptual metrics is better employed at the class level than at the method level. A corpus for a class is typically much “bigger” than a corpus for a method (Kagdi et al. 2013). This informs our choice of conducting this study at the class level of granularity. In most of these studies, the semantic similarity between software classes using the LSI or VSM approach was adopted.

6.2 Logical coupling

In comparison to the broad research on structural coupling, the study of logical coupling, evolutionary or change dependencies (Yu 2007; Oliva and Gerosa 2011; Zimmermann et al. 2005) only began a few years ago because of the advances in data mining techniques (Yu 2007) used to extract co-evolution data. However, despite its short history, there have been several interesting studies published with promising results. Xia (Xia 1996) argued that the most widely used design metrics for the inter-module relation were based on information flow rather than the coupling or cohesion criteria, and proposed a metric to compute coupling complexity of modules of a system. Ying et al. (2004) proposed an approach for predicting source code changes by mining the change history of software systems. Zimmermann et al. (2005) applied data mining to version histories to guide programmers on related changes using the idea that “Programmers who changed these functions also changed....” (Zimmermann et al. 2005). Given a set of existing changes, the mined association rules 1) suggest and predict likely further changes, 2) show up item coupling that is undetectable by program analysis, and 3) can prevent errors due to incomplete changes.

6.3 The link between semantic and logical coupling

Recent studies (Oliva and Gerosa 2011, 2015; Geipel and Schweitzer 2012; Fluri et al. 2005) have shown that it is possible that structural and logical coupling are caused by other types of relationships (e.g., conceptual dependencies); some logically coupled classes were without structural coupling links between them and vice versa.

Kagdi et al. (2010, 2013) demonstrated that finer granularity decreased the accuracy of all approaches; however, it does not prevent the combination of the two from outperforming the standalone techniques. That is, the gain acquired by combining conceptual and evolutionary coupling exists regardless of the granularity (file-level and method-level) considered in the study. Additionally, they did not analyze the relationship between the coupling types and their information retrieval technique (LSI) or take into consideration “common English words and programming language keywords”. Since this study has shown that not all OO software contain only English words, this could have had an effect on the accuracy of their findings.

Related work shows that only a few studies have combined semantic and logical coupling to support software engineering activities and none have studied their correlation and interplay. Our study fills this gap by examining the interplay between semantic and logical coupling in OO software systems. Bavota et al. (2013b) investigated how class coupling as measured by dynamic, logical, structural and semantic coupling aligned with developer perception of coupling. They concluded that coupling was an important quality attribute of a software system which could not be captured by structural information such as method calls. More sophisticated approaches, and different source of information need to be analyzed to provide a better evaluation of developer perception of coupling. To this end, semantic coupling seems to reflect a developer’s mental model when identifying interaction between entities.

Poshyvanyk et al. (2009) propose new semantic coupling metrics based on the degree to which the identifiers and comments from different classes relate to each other at the class and method level of granularity. They suggest that their metrics capture new dimensions in software dependency measurement, compared with existing structural dependency metrics. For example, indicating change ripple effects better, compared to existing structural coupling measures and the new metrics can be used to rank classes in the course of impact analysis in large OO systems.

7 Conclusions and future work

We have presented two methods of measuring the semantic coupling of software classes using only their identifiers. We further compared each of these methods to measuring the semantic coupling of classes using their corpora. Results showed that using only the class identifiers is a more efficient approach but not in all cases (only when considering highly semantically similar classes). As such, identifier and corpora-based IR methods for computing semantic coupling cannot be used interchangeably in all studies. For projects with hundreds of thousands of lines of code, the extraction of text from all the classes to build their corpora can be time consuming and complex. Hence, the results derived from this study on class identifier-based semantic coupling metrics have some significance in the software dependency and maintenance domain and can be further explored.

On the interplay between semantic and logical dependencies, in 79 object-oriented and open-source software projects we could not detect a linear relationship between the strengths of semantic and logical dependencies. However, we identified a bi-directional link between semantic to logical dependencies. In other words, over 70% of classes that are semantically related will usually co-evolve and classes that are change related will usually share some degree of semantic coupling. Based on empirical results derived from a significant number of software projects, we conclude that identifying more efficient methods of semantic coupling computation as well as a directional relationship between semantic and change dependencies can help to improve CIA techniques that integrate semantic coupling information.

This will speed up the process of revealing ‘hidden dependencies’ not captured by source code dependencies. Our results can also guide software developers and researchers in developing future generations of tools for supporting program comprehension. Future work will involve comparing the change impact set identified when adopting identifier-based methods for semantic coupling measurement to class corpora-based methods based on precision and recall. Other future work will focus on the interplay between structural and semantic coupling as well as the measurement of semantic coupling between classes in OO software projects built by multi-lingual developers with comments written in several natural languages (e.g., English, French, German, etc.).