1 Introduction

Maintainability is an important external quality attribute of software, concerned with how well the software accommodates changes in user or real-world requirements [1]. One of the main reasons for high maintenance cost is complexity: the more complex a piece of software, the more maintenance it requires [2]. Figure 1 illustrates the relationship between complexity and external quality factors such as maintainability [3]. As can be inferred from Fig. 1, reducing complexity can yield more maintainable software.

Fig. 1. Structural complexity, cognitive complexity and their relationship with external quality attributes [3]

The use of the OO paradigm has become widespread. One reason is that the OO paradigm is built on the notion of classes and objects, and classes are expected to be high-quality units that are easy to maintain [4]. Also, the use of hierarchical decomposition and abstraction in the OO approach helps developers produce less complex software [5]. However, these claimed benefits of the OO paradigm have not fully paid off, and maintaining software has remained a time- and effort-consuming, costly activity [6]. In order to control and manage maintenance-related costs, there is an increasing need for predicting software maintainability. Researchers argue that accurate prediction of software maintainability can support effective decision making in areas such as resource and staff allocation and the comparison of productivity and costs [7].

One important approach to controlling software maintenance is the use of software metrics [8]. Many past studies have examined the link between OO metrics and maintainability [4, 7,8,9,10,11,12,13,14,15,16]. All of these studies used static metrics that quantify the structural complexity of OO code at the class level. However, OO code is inherently dynamic owing to concepts such as polymorphism and late binding. It is therefore important that the quality of an OO system also be measured using run-time information. To this end, a class of software metrics called dynamic metrics, collected while the software is executing, has emerged in recent years [17]. Various dynamic metrics have been proposed for OO software [17, 18]; however, little or no evidence exists about the effectiveness of these metrics for predicting software maintainability.

In this paper, we use a dynamic system complexity measure from the authors’ previous work [19] and evaluate its usefulness for predicting the maintainability of OO software. The measure is calculated at the system level by taking into account the dynamic complexity of all the system’s classes. A class-level dynamic complexity measure, in turn, is obtained from the dynamic complexity of the class’s objects. The object-level dynamic complexity measure is calculated as a function of the complexities arising from the following three factors:

  1. methods invoked by the object at run-time, i.e., the static complexity of the methods being invoked;

  2. the way in which the methods are invoked, i.e., the run-time object-method invocation relationship. Three types of run-time object-method invocation relationships are identified in the authors’ work [19]: direct method invocations, transitive method invocations, and coupling method invocations;

  3. the number of times a method is invoked, i.e., the frequency of method invocation by the object.

An experimental study is set up using 12 sample Java programs. A controlled experiment is then carried out, and statistical techniques, namely correlation and linear regression analysis, are applied to relate the proposed dynamic complexity measures to the maintainability of OO software as an external quality attribute. The results show a significant positive correlation between the proposed dynamic complexity measures and maintainability; the measures can therefore serve as useful indicators of maintainability.

This paper is organized as follows: Sect. 2 reviews related work; Sect. 3 briefly describes the authors’ previous work [19] on object-level, class-level and system-level dynamic complexity measures; Sect. 4 elaborates the empirical set-up and the controlled experiment; Sect. 5 discusses the results of the study; Sect. 6 gives an overview of the validity threats to the experimental results; and Sect. 7 concludes the paper.

2 Literature Overview

Maintainability is defined in the IEEE Standard Glossary of Software Engineering Terminology [20] as “the ease with which a software system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment”. A systematic review of maintainability prediction studies [21] indicates that researchers in the field of software engineering have used different facets of maintainability in their respective studies, such as “time required to make changes” [22], “time to understand, develop, and implement modification” [23], “the number of revised lines of code” [12], and “the number of revisions in which the class was involved during the maintenance history” [11].

There is empirical evidence on the link between static OO measures and maintainability [4, 7,8,9,10,11,12,13,14,15,16]. However, no such claim can be made about dynamic OO measures. Several past reviews [18, 24] have indicated that very few authors have empirically validated their proposed dynamic measures. Yacoub et al. [25] proposed dynamic complexity measures for OO designs and used them to formulate an architecture-level reliability risk assessment methodology [26]. Arisholm et al. [27] proposed dynamic import and export coupling measures at different granularities and studied the relationship between these measures and change-proneness. For this, they presented a case study of the open source software Velocity, analyzing the changes (lines of code added and deleted) across its four sub-releases, and found that most of their dynamic export coupling measures serve as indicators of change-proneness. Gupta and Chhabra [28] proposed dynamic cohesion measures and examined their effectiveness in predicting the change-proneness of classes; their study suggested that the proposed dynamic cohesion measures can serve as better indicators of change-proneness than existing cohesion measures.

3 Measuring Dynamic Complexity

In their previous work [19], the authors proposed that object-level dynamic complexity is a function of the complexities arising from the following three factors:

  • methods invoked by the object at run-time, i.e., static complexity of the methods being invoked.

  • the way in which the methods are invoked, i.e., the run-time object-method invocation relationship. Three types of run-time object-method invocation relationships are identified in that work: direct, transitive, and coupling method invocations. A direct invocation occurs when an object invokes a method of its class directly; a transitive invocation occurs when an object invokes a method of its class which in turn invokes other methods of the same class; a coupling invocation results when an object invokes a method of its class which in turn invokes methods of other classes. These three run-time relations contribute towards measuring the dynamic complexity of an object.

  • the number of times a method is invoked, i.e., frequency of method invocation by an object.

The dynamic complexity of a class (system) is then obtained by aggregating the dynamic complexity of all its objects (classes).

The measure DOCPXx(o), the dynamic object complexity of an object o under execution scenario x, is defined as

$$ \text{DOCPX}_x(o) = \frac{w_1 \cdot DC\_D_x(o) + w_2 \cdot DC\_T_x(o) + w_3 \cdot DC\_C_x(o)}{w_1 + w_2 + w_3} $$
(1)

where

DC_Dx(o), DC_Tx(o) and DC_Cx(o) are the dynamic complexities of object o due to direct, transitive and coupling method invocations respectively, while the wi are the cognitive weights [29] assigned to the three types of dynamic complexity relations (i.e., direct, transitive and coupling) according to their importance for dynamic complexity. With the cognitive weights for direct, transitive and coupling invocations denoted w1, w2 and w3 respectively, it is expected that w3 > w2 > w1. These weights can be assigned values based on the opinions of experienced analysts and software engineering experts [29].

The values DC_Dx(o), DC_Tx(o) and DC_Cx(o) are measured in a similar fashion; we illustrate how DC_Dx(o) is calculated. If no methods are invoked directly, then DC_Dx(o) is 0. Otherwise, DC_Dx(o) is the sum, over all directly invoked methods, of the product of a method’s invocation frequency and its static complexity. The dynamic complexity of object o over the entire application is then defined as the average of the dynamic complexity values of object o over the set X of all execution scenarios, i.e.,

$$ \text{DOCPX}(o) = \frac{\sum_{x \in X} \text{DOCPX}_x(o)}{|X|} $$
(2)
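To make the computation concrete, the following minimal Java sketch implements Eqs. (1) and (2); the type and method names are our own illustrative choices, not an API prescribed by [19].

```java
import java.util.List;
import java.util.Map;

final class ObjectComplexity {

    /** DC_D_x(o), DC_T_x(o) or DC_C_x(o): for one invocation category,
     *  sum over the invoked methods of (invocation frequency) times
     *  (static complexity of the method); 0 if no methods were invoked. */
    static double componentComplexity(Map<String, Integer> frequency,
                                      Map<String, Double> staticComplexity) {
        double dc = 0.0;
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            dc += e.getValue() * staticComplexity.get(e.getKey());
        }
        return dc;
    }

    /** DOCPX_x(o) per Eq. (1): cognitively weighted mean of the direct,
     *  transitive and coupling components, with w3 > w2 > w1. */
    static double docpxScenario(double dcD, double dcT, double dcC,
                                double w1, double w2, double w3) {
        return (w1 * dcD + w2 * dcT + w3 * dcC) / (w1 + w2 + w3);
    }

    /** DOCPX(o) per Eq. (2): average of the per-scenario values over the
     *  set X of execution scenarios. */
    static double docpx(List<Double> perScenarioValues) {
        double sum = 0.0;
        for (double v : perScenarioValues) sum += v;
        return perScenarioValues.isEmpty() ? 0.0 : sum / perScenarioValues.size();
    }
}
```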

The measure, DCCPX, called dynamic class complexity, is defined as follows

$$ \text{DCCPX}(C) = \sum_{i=1}^{k} \text{DOCPX}(o_i) $$
(3)

where k is the number of objects created by class C. The measure, DSCPX, called dynamic system complexity, is defined as follows

$$ \text{DSCPX}(S) = \sum_{i=1}^{n} \text{DCCPX}(C_i) $$
(4)

where n is the number of application classes in OO system S.
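The aggregations in Eqs. (3) and (4) are plain summations; continuing the sketch above (again with illustrative helper names of our own):

```java
import java.util.List;

final class SystemComplexity {

    /** DCCPX(C) per Eq. (3): sum of DOCPX over the k objects created by class C. */
    static double dccpx(List<Double> docpxOfObjects) {
        double sum = 0.0;
        for (double v : docpxOfObjects) sum += v;
        return sum;
    }

    /** DSCPX(S) per Eq. (4): sum of DCCPX over the n application classes of S. */
    static double dscpx(List<Double> dccpxOfClasses) {
        double sum = 0.0;
        for (double v : dccpxOfClasses) sum += v;
        return sum;
    }
}
```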

Now consider the example in Fig. 2. Here, obja1 makes one direct invocation of methodA(), two direct invocations of methodAA() and two transitive invocations of methodA(); obja2 makes one direct invocation of methodAA() and one transitive invocation of methodA(); objb makes one direct invocation of methodB() and one coupling invocation of methodA(). Taking w1 = 1, w2 = 2 and w3 = 3, and the static complexities of all methods as unity, we get:

$$ \begin{aligned} \text{DOCPX}(\text{obja1}) &= (1 \cdot 3 + 2 \cdot 2 + 0)/(1 + 2 + 3) = 7/6 \\ \text{DOCPX}(\text{obja2}) &= (1 \cdot 1 + 2 \cdot 1 + 0)/(1 + 2 + 3) = 1/2 \\ \text{DOCPX}(\text{objb}) &= (1 \cdot 1 + 0 + 3 \cdot 1)/(1 + 2 + 3) = 2/3 \end{aligned} $$
$$ \begin{aligned} \text{Hence}\;\text{DCCPX}(\text{A}) &= 7/6 + 1/2 = 10/6 = 5/3 \\ \text{DCCPX}(\text{B}) &= 2/3 \\ \text{DSCPX}(\text{S}) &= 5/3 + 2/3 = 7/3 \approx 2.33 \end{aligned} $$
Fig. 2. Sample Java code
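For concreteness, the following minimal Java sketch reproduces the invocation pattern described above; it is our own illustration and not necessarily the exact code of Fig. 2.

```java
class A {
    void methodA() { /* leaf method: no further calls */ }

    void methodAA() {
        methodA();   // every direct call to methodAA() also yields a
    }                // transitive invocation of methodA()
}

class B {
    void methodB(A a) {
        a.methodA(); // invocation of a method of another class:
    }                // counted as a coupling invocation for the caller
}

public class Sample {
    public static void main(String[] args) {
        A obja1 = new A(), obja2 = new A();
        B objb = new B();

        obja1.methodA();     // obja1: 1 direct invocation of methodA()
        obja1.methodAA();    // obja1: 2 direct invocations of methodAA(),
        obja1.methodAA();    //        hence 2 transitive invocations of methodA()

        obja2.methodAA();    // obja2: 1 direct, 1 transitive invocation

        objb.methodB(obja1); // objb: 1 direct invocation of methodB()
                             //       plus 1 coupling invocation of methodA()
    }
}
```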

4 Empirical Validation

Empirical validation of a measure tries to establish its practical utility by correlating it with some external quality attribute [30, 31]. It involves performing controlled experiments, case studies, etc. to gather empirical data and then statistically analyzing this data to demonstrate the practical utility of the metric [32]. In this paper, we have performed a controlled experiment to correlate the dynamic complexity measure DSCPX with the maintainability of OO software.

4.1 Empirical Set-up

We have used 12 sample Java programs, randomly selected from sources such as [33] and the web, for our experimental study. The authors developed a dynamic-analyzer tracer in their previous work [19] using AspectJ [34], an aspect-oriented programming (AOP) [35] extension of Java, to collect dynamic complexity metric data for these sample programs. AOP [35] is a way of modularizing cross-cutting concerns, such as tracing and logging, that would otherwise be scattered throughout an application. Researchers [28] have suggested AOP as an efficient technique for dynamic metric data collection.
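By way of illustration, a tracing aspect of the kind used for such data collection might look as follows; this is a hedged AspectJ sketch of the technique, not the actual tracer of [19].

```java
// Minimal AspectJ tracing sketch: logs one record per method call so that
// per-object invocation frequencies can be derived offline.
public aspect InvocationTracer {

    // Every method call made from application code, excluding calls to JDK
    // classes and excluding the aspect itself (to avoid tracing our own logging).
    pointcut traced(): call(* *(..))
                       && !call(* java..*.*(..))
                       && !within(InvocationTracer);

    before(): traced() {
        Object caller = thisJoinPoint.getThis(); // invoking object (null in a static context)
        String callee = thisJoinPoint.getSignature().toShortString();
        System.out.println(caller + " -> " + callee); // one trace record
    }
}
```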

Table 1 summarizes the sample Java programs, listing lines of code (LOC) and number of classes (NC) along with the values of the dynamic complexity measure DSCPX for each program.

Table 1. DSCPX values for the sample programs

4.1.1 Experimental Goal

The main goal of our experiment is to determine how the dynamic complexity measures relate to the maintainability of OO software. As suggested in Wohlin et al. [32], we use the GQM approach [36] to define this goal as follows:

$$ \begin{array}{ll} \textit{Analyse} & \text{dynamic complexity measures} \\ \textit{for the purpose of} & \text{evaluation} \\ \textit{with respect to} & \text{their relationship with the maintainability of OO software} \\ \textit{from the point of view of} & \text{researchers} \\ \textit{in the context of} & \text{final-year postgraduate computer science students} \end{array} $$

4.1.2 Planning

Context Selection:

This experiment addressed the question of whether dynamic complexity measures can be used as indicators of the maintainability of OO software.

Selection of subjects:

The experiment was carried out with final-year postgraduate computer science students of USICT, GGSIPU, New Delhi as subjects. There were 28 subjects in total, with an average CGPA of 7.6. The subjects participated voluntarily and had adequate knowledge of OOP concepts, Java and software engineering. Being postgraduate students, some of them also had industrial experience. Students were chosen as subjects for convenience; many empirical studies in the field of metric validation have done the same [8, 37]. Researchers have suggested that students are acceptable as subjects [38] and that under some conditions there is no difference between students and professionals [39]. It has also been suggested to run pilot investigations in an academic environment before moving to an industrial set-up [36].

Selection of variables:

We took OO dynamic complexity as the independent variable and OO software maintainability as the dependent variable.

Instrumentation:

The independent variable (OO dynamic complexity) was measured using the measure DSCPX (defined in Sect. 3).

The dependent variable (maintainability) was operationalized as “the time/effort expended in understanding the software artefact and then incorporating a new/changed requirement”. This operationalization has been used by many researchers in the field of maintainability prediction [8, 37, 40]. We call it maintainability time (maint-time for short) and measure it in person-hours, as has been done in various other studies [40].

Experimental Design:

Twelve groups of 2–3 students each were formed, and each group was given one sample Java program. Both the formation of the groups and the assignment of a program to a group were done randomly; randomization helps curb bias in the experiment [41].

Hypothesis Formulation:

The hypothesis to be tested is stated as follows:

  • Null hypothesis H0: There is no statistically significant correlation between the dynamic complexity measure DSCPX (independent variable) and the maintainability of OO software (dependent variable).

  • Alternate hypothesis H1: There is a statistically significant correlation between the dynamic complexity measure DSCPX (independent variable) and the maintainability of OO software (dependent variable).

4.1.3 Operation

Preparation:

Before the experiment, the subjects attended a training session in which they were instructed on how to perform the experimental task, e.g., how to behave during the task and how to report maint-time values. Care was taken to ensure that the subjects never learned the intended study aspects and were never told about the hypothesis under test.

The subjects were provided with a document for their experimental task containing the source code of the sample Java program under study, a brief (two- or three-line) description of what the program did, and one new requirement to be incorporated into its functionality. For example, the Employee program (P7) required a new functionality to be added: computing the tax from the salary of a regular employee.

Execution:

The experiment was conducted as a take-home assignment. All the subjects were provided with the material described in the previous paragraph, took the tasks home, and performed them without supervision. The subjects were asked to incorporate the new functionality and report the time they spent carrying out the task.

Table 2 depicts the DSCPX values along with the collected maint-time values.

Table 2. Average Maint-Time (in person-hours)

4.2 Data Analysis

We performed correlation and linear regression analysis on the data. For correlation analysis, we used both non-parametric and parametric tests in order to avoid assumptions about the distribution of the data (the data set being very small): Kendall’s tau-b, Spearman’s rho and Pearson’s product-moment coefficient. The significance level α = 0.01 was used. Linear regression was performed to model the relationship between DSCPX and maint-time; regression analysis indicates how much of the variation in the dependent variable is explained by the independent variable and is the most basic and commonly used predictive analysis.
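For reproducibility, such an analysis can be scripted. The following minimal Java sketch uses the Apache Commons Math library (our choice of tool; the study itself may have used a dedicated statistics package), with placeholder values standing in for the DSCPX and maint-time data of Tables 1 and 2.

```java
import org.apache.commons.math3.stat.correlation.KendallsCorrelation;
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class MaintainabilityAnalysis {
    public static void main(String[] args) {
        // Placeholder values for illustration only; the study's actual
        // DSCPX and maint-time data are reported in Tables 1 and 2.
        double[] dscpx     = {1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.3, 7.0, 7.9, 8.4, 9.2, 10.1};
        double[] maintTime = {2.0, 2.8, 3.5, 3.9, 4.6, 5.2, 5.7, 6.5, 7.1, 7.6, 8.3, 9.0};

        System.out.println("Pearson r:     " + new PearsonsCorrelation().correlation(dscpx, maintTime));
        System.out.println("Spearman rho:  " + new SpearmansCorrelation().correlation(dscpx, maintTime));
        System.out.println("Kendall tau-b: " + new KendallsCorrelation().correlation(dscpx, maintTime));

        // Simple linear regression: maint-time = intercept + slope * DSCPX
        SimpleRegression reg = new SimpleRegression();
        for (int i = 0; i < dscpx.length; i++) {
            reg.addData(dscpx[i], maintTime[i]);
        }
        System.out.println("slope (unstandardized B): " + reg.getSlope());
        System.out.println("R^2:                      " + reg.getRSquare());
    }
}
```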

5 Results

Table 3 presents the Pearson, Kendall tau-b and Spearman rho correlation coefficients. All three coefficients indicate a significant positive correlation between DSCPX and maint-time. Therefore, the null hypothesis H0 is rejected and the alternate hypothesis H1 is accepted.

Table 3. Summary of correlation coefficients

Table 4 presents the regression coefficients. The unstandardized coefficient is 0.363, meaning that a one-unit increase in the independent variable (DSCPX) changes the predicted value of the dependent variable (maint-time) by 0.363 person-hours (approximately 21 person-minutes).

Table 4. Regression coefficients

Table 5 presents the regression model summary. This table provides information about the regression line’s ability to explain the total variation in the dependent variable. The adjusted R2 gives an unbiased estimate of the population R2 and accounts for the variance in the dependent variable explained by the independent variable. In our case, the adjusted R2 value is 0.645, meaning that the independent variable (DSCPX) explains 64.5% of the variance in the dependent variable (maint-time), which is quite high for a single-variable prediction model. The standard error of the estimate indicates how close the predicted values of the dependent variable are to the observed values; in our case it is 0.85077, i.e., observed maint-time values typically deviate from the fitted regression line by about 0.85 person-hours.

Table 5. Regression model summary
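As a rough worked check of this reading (our own back-of-the-envelope arithmetic, assuming approximately normal residuals and ignoring parameter uncertainty):

$$ \hat{y} \pm 2 \times 0.85077 \approx \hat{y} \pm 1.70\ \text{person-hours}, $$

which, under those assumptions, covers roughly 95% of the observed maint-time values.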

6 Threats to Validity

Researchers [32] identify four types of validity threats in an experimental study: construct, internal, external and conclusion validity. The following sub-sections provide a brief overview of these threats, and the methods that we used to mitigate them.

6.1 Construct Validity

It describes the degree to which a variable (dependent or independent) is accurately measured by the measurement instruments used in the study. The construct validity of the independent variable DSCPX was established in the authors’ previous work [19] through theoretical validation using the framework of Briand et al. [42]. A dynamic-analyzer tracer was used to collect the metric values. One problem with such tracer code, however, is deciding when the trace should be stopped so that the collected dynamic complexity values represent the complete application [28]; it may therefore be argued that the dynamic complexity values obtained in this study depend on the authors’ understanding of the source code of the sample Java programs. The dependent variable, maint-time, is recorded as the time spent by the subjects performing the experimental task. Since this approach has been used to measure maintainability in experimental studies such as [8, 37, 40], we consider this variable constructively valid as well.

6.2 Internal Validity

It describes the degree to which a study can establish the cause-effect relationship between the independent variables and the dependent variable by controlling for extraneous factors. We took several measures to curb such factors. First, we used randomization wherever necessary to minimize bias. Although the subjects participated voluntarily, we also encouraged them by explaining that the effort they put in would help them grow into good software professionals in the long run, which boosted their enthusiasm and morale. Dividing the subjects into groups and assigning a single experimental task to each group also helped minimize fatigue effects. Further, since the subjects had never participated in a similar experimental task, we can claim that no carry-over effect was present.

Although the subjects were given clear instructions on how to behave during the task, plagiarism and mutual influence among subjects could still be an issue, because the task was carried out without supervision; the subjects might have discussed among themselves the type of problem they were solving. Finally, the effect of confounding variables (such as size) has not been taken into account in the current study; we plan to address this issue in a separate study in the future.

6.3 External Validity

It refers to the extent to which the results of an experimental study can be generalized to the population under study or to other research settings. The two important external-validity issues in the current study are (i) the material and tasks used, and (ii) the subjects. Regarding the material and tasks, the sample programs in our data set were chosen randomly, and care was taken that they represent a wide variety of domains; still, it can be argued that our conclusions may be biased by the representativeness of the data set. Also, owing to the difficulty of obtaining professional subjects, the experiment was conducted with postgraduate students. Using students as subjects is not a major issue, however, as researchers such as [39] argue that students are the next generation of software professionals and are therefore close to the population of interest.

6.4 Conclusion Validity

It is defined as the degree to which the conclusions drawn in the experimental study are statistically valid. The major factor affecting the conclusion validity of this study is the small size of the sample data (12 programs). We are currently collecting a larger data set in order to conduct a replication study.

7 Conclusion

In this paper, we have empirically validated the system-level OO dynamic complexity measure DSCPX from the authors’ previous work to determine its usefulness for predicting the maintainability of OO software. An experimental study involving 12 sample Java programs and 28 subjects was set up, and correlation and linear regression analyses were performed on the data obtained. The correlation results suggest a significant positive correlation between DSCPX and maintenance time, and the regression results indicate that DSCPX can serve as a useful indicator of maintainability as an external quality attribute.