1 Introduction

Large software systems are usually composed of many small constituent elements (e.g., methods, fields, classes, and packages), and a small error in any one element may lead to catastrophic consequences [1]. Ensuring the quality of software systems has therefore become a central problem in software engineering. Since we cannot control what we cannot measure, understanding and measuring complex software systems is an increasingly important step toward ensuring high quality [2].

The complexity of a software system usually originates from its internal structure. In recent years, researchers have proposed a number of approaches that explore software complexity from the perspective of internal structure, and many promising results have been reported. Studies on software structure analysis can generally be divided into two groups: i) traditional software structure metrics, and ii) software structure metrics based on complex network theory.

Traditional software structure metrics mainly analyze the local structure of software systems and fail to characterize the properties of the software as a whole. With the development of complex networks, some researchers have introduced complex network theory into the examination of software systems by building software networks from source code, which provides a new way to understand the internal structure of software systems. At present, however, studies on software network analysis remain few, the construction of software networks is not accurate enough, and the metrics and data sets used in the experiments are not comprehensive enough.

In this paper, we combine software structure analysis with complex network theory and propose SCANT (Software Complexity Analysis using complex Network Theory), an approach to probe the internal complexity of software systems. Specifically, we build more accurate software network models from the source code of a software system and then introduce a set of statistical parameters from complex network theory to characterize its structural properties, with the aim of revealing common structural laws in the software structure. By doing so, we can shed some light on the essence of software complexity.

2 Related Work

The traditional analysis of software system structure focuses on a single module. The McCabe metrics [3] are based on graph theory and program control structure theory: they use a directed graph to represent the program control flow and measure the complexity of the program by the cyclomatic complexity of that graph. The Halstead metrics [4] measure the complexity of a software system by counting the number of operators and operands in the program. The C&K metric suite [5] is based on the theory of object-oriented metrics and mainly includes six metrics. The MOOD metric suite [6] proposed by Abreu et al. indirectly reflects some basic structural mechanisms of the object-oriented paradigm.

With the development of complex networks, some researchers have introduced complex network theory into the examination of software systems by building software networks from source code. In these software networks, software elements such as attributes, methods, classes, and packages are represented by nodes, and the couplings between elements, such as inheritance, method calls, and interface implementation, are represented by undirected (or directed) edges. Based on this network representation of the software structure, they applied complex network theory to characterize the structural properties of a software system and, further, to improve its quality. This line of research provides a new way to understand the internal structure of software systems, and much related work has been reported.

3 The Proposed SCANT Approach

Our SCANT approach is mainly composed of three steps: i) constructing the software network model, ii) calculating the values of the statistical parameters, and iii) analyzing the parameter values to reveal the structural characteristics.

3.1 The Software Network Model

The software systems studied in this work are all open-source systems developed in the Java programming language. We analyze their source code to extract the topological information, i.e., the software elements and the relationships between them.

Most statistical parameters in complex network theory do not consider the weights on the edges (or links), i.e., they can only be applied to un-weighted networks. Thus, to apply these parameters to characterize the software structure, we construct an un-weighted software network at the class level, i.e., the Un-weighted Class Relationship Network (UCRN for short), to represent classes and the relationships between them. In UCRN, nodes represent the software elements at the class level (i.e., classes and interfaces), edges between nodes represent the relationships between classes, and the direction of an edge represents the direction of the relationship. We consider the following seven types of relationships [7]: inheritance, implementation, parameter, global variable, method call, local variable, and return type.

If at least one of the seven kinds of relationships exists between two classes, we establish a directed edge in the UCRN between the nodes denoting the two classes. This edge describes the coupling relationship. Thus, UCRN is essentially an un-weighted directed network, which can be defined as

$$ \begin{array}{*{20}l} {{\text{UCRN}} = (V,L),n \in V,l \in L,} \hfill \\ {l = < n_{i} ,n_{j} > ,n_{i} ,n_{j} \in V} \hfill \\ \end{array} , $$
(1)

where V denotes the class (or interface) set in the software system, and L denotes the coupling relationship set between all pairs of nodes. Generally, if one class uses the service provided by another class, then a directed edge connecting the two classes will be established in the UCRN. We do not consider the weight on the edges. Thus, the weight on the edges will be the same, i.e., 1.
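As an illustration of Eq. (1), the following minimal Python sketch builds a UCRN from a list of already extracted relationships. The input format (triples of source class, target class, and relationship type), the function name `build_ucrn`, the relationship labels, and the example class names are all assumptions made for illustration; how the relationships are extracted from Java source code is not covered here.

```python
# Hypothetical sketch: the triple format and every name below are assumptions.
RELATIONSHIP_TYPES = {
    "inheritance", "implementation", "parameter", "global_variable",
    "method_call", "local_variable", "return_type",
}

def build_ucrn(relationships):
    """Return (V, L): the node set and the set of directed, un-weighted edges."""
    nodes, edges = set(), set()
    for source, target, kind in relationships:
        if kind not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {kind}")
        nodes.update((source, target))
        if source != target:             # assumption: self-loops are ignored
            # A set keeps one edge per ordered pair, so several relationships
            # between the same two classes collapse into a single edge.
            edges.add((source, target))
    return nodes, edges

example = [
    ("Circle", "Shape", "inheritance"),
    ("Circle", "Shape", "method_call"),      # second relationship, same edge
    ("Canvas", "Circle", "local_variable"),
    ("Canvas", "Drawable", "implementation"),
]
V, L = build_ucrn(example)
print(sorted(V))
print(sorted(L))
```

Storing the edges in a set is one simple way to obtain the un-weighted property: the number of relationships between two classes does not affect the network, only their existence and direction do.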

3.2 The Statistical Parameters

Here we introduce some statistical parameters widely used in complex network theory to characterize the structural properties of software systems. These statistical parameters are borrowed from [8].

Definition 1.

Betweenness Centrality.

Betweenness is a very important parameter in complex network theory and is usually used to reflect the importance of nodes. The betweenness centrality of node v can be described as the ratio of the number of shortest paths passing through node v to the total number of shortest paths in the network. Betweenness centrality has been applied to a wide range of networks, such as biological networks, transportation networks, and social networks. It can be formally described as

$$ B(v) = \sum\nolimits_{s \ne v \ne t} {\frac{{\phi_{st} (v)}}{{\phi_{st} }}} , $$
(2)

where \(\phi_{st}\) is the number of shortest paths between nodes s and t, and \(\phi_{st} (v)\) is the number of those paths that pass through node v.
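To make Eq. (2) concrete, the sketch below computes the un-normalised betweenness of every node in a directed, un-weighted network. It uses Brandes' accumulation scheme, which is a standard way to evaluate this sum and is not necessarily the procedure used in this work. The function name `betweenness` and the representation of the network as a node collection plus a collection of `(source, target)` pairs are assumptions.

```python
from collections import deque

def betweenness(nodes, edges):
    """Un-normalised betweenness of Eq. (2) on a directed, un-weighted graph."""
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in succ[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:   # u lies on a shortest path to w
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Back-propagate the pair dependencies from the farthest nodes to s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Two equally short paths from a to d, one through b and one through c.
diamond = betweenness(["a", "b", "c", "d"],
                      [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
print(diamond)
```

In the diamond example, b and c each lie on one of the two shortest paths from a to d, so each receives a score of 0.5. If values in the range [0, 1] are wanted, a common convention for directed graphs is to divide by (n - 1)(n - 2); whether the reported values use this normalisation is an assumption.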

Definition 2.

Closeness Centrality.

Closeness centrality refers to the degree of closeness between a specific node and other nodes in the network. The higher the closeness centrality of a node is, the closer it is to other nodes. The closeness centrality of a node is the reciprocal of the average of the shortest path lengths between the node and all other nodes in the network and thus can be defined as

$$ C(i) = \frac{n}{{\sum\nolimits_{j} {d(j,i)} }}, $$
(3)

where \(d(j,i)\) is the shortest path length between nodes i and j, and n is the number of nodes in the whole network.
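A minimal sketch of Eq. (3) follows. The distances d(j, i) are obtained by a breadth-first search along reversed edges. Eq. (3) does not say how unreachable pairs are handled in a directed network; the sketch assumes that such a pair has infinite distance, so the closeness becomes 0. This choice, the function name `closeness`, and the edge-list representation are assumptions, and other conventions (for example, summing only over reachable nodes) give different values.

```python
from collections import deque

def closeness(nodes, edges, i):
    """Closeness of node i following Eq. (3): n divided by the sum of d(j, i)."""
    pred = {v: [] for v in nodes}
    for u, v in edges:
        pred[v].append(u)              # reversed adjacency: who points at v
    dist = {i: 0}
    queue = deque([i])
    while queue:                       # BFS along reversed edges yields d(j, i)
        v = queue.popleft()
        for u in pred[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    if len(dist) < len(nodes):
        return 0.0                     # assumption: unreachable means infinite distance
    total = sum(dist.values())
    return len(nodes) / total if total else 0.0

cycle = [("a", "b"), ("b", "c"), ("c", "a")]
path = [("a", "b"), ("b", "c")]
print(closeness(["a", "b", "c"], cycle, "a"))
print(closeness(["a", "b", "c"], path, "a"))
```

In the three-node cycle every node can reach a, with distances 1 and 2, so the closeness of a is 3 / 3. In the path, no node can reach a, so under the stated assumption its closeness is 0.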

Definition 3.

Degree Distribution.

The degree of a node is the number of edges that connect the node to other nodes. The degree distribution is a general description of the degrees of the nodes in a graph (or network), i.e., the probability distribution or frequency distribution of the node degrees.

If a graph (or network) is composed of n nodes, of which \(n_{k}\) nodes have degree k, then the degree distribution is \(P(k) = \frac{{n_{k} }}{n}\). For a directed graph (or network), P(k) has two versions, i.e., the in-degree distribution and the out-degree distribution.
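The definition of P(k) can be sketched in a few lines of Python. The function name `degree_distribution`, the `kind` parameter, and the edge-list representation are assumptions introduced for illustration.

```python
from collections import Counter

def degree_distribution(nodes, edges, kind="in"):
    """Return {k: P(k)} with P(k) = n_k / n for the in- or out-degree."""
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[v if kind == "in" else u] += 1
    counts = Counter(degree.values())   # n_k: number of nodes with degree k
    n = len(nodes)
    return {k: counts[k] / n for k in sorted(counts)}

# One class (a) that uses three other classes.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d")]
p_in = degree_distribution(nodes, edges, kind="in")
p_out = degree_distribution(nodes, edges, kind="out")
print(p_in)
print(p_out)
```

In this example three of the four nodes have in-degree 1, so the in-degree distribution gives P(1) = 0.75, while the out-degree distribution gives P(0) = 0.75 and P(3) = 0.25.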

Definition 4.

Clustering Coefficient.

The clustering coefficient measures the degree to which nodes in a graph (or network) tend to cluster together, i.e., the aggregation density of nodes. The clustering coefficient of a node is the ratio of the number of edges that actually exist among its neighbours to the maximum number of edges that could exist among them. The clustering coefficient of node i, \(C_{i}\), can be computed as

$$ C_{i} = \frac{{2e_{i} }}{{k_{i} (k_{i} - 1)}} = \frac{{\sum\nolimits_{jm} {a_{ij} a_{im} a_{mj} } }}{{k_{i} (k_{i} - 1)}}, $$
(4)

where \(e_{i}\) is the number of edges that actually exist among the neighbours of node i, \(k_{i}\) is the degree of node i, \(a_{ij}\) equals 1 if nodes i and j are connected and 0 otherwise, and \(\frac{{k_{i} (k_{i} - 1)}}{2}\) is the maximum possible number of edges among these neighbours. The clustering coefficient of the network is the average of the clustering coefficients of all its nodes, i.e.,

$$ C = \left\langle {C_{i} } \right\rangle = \frac{1}{N}\sum\nolimits_{i \in V} {C_{i} } , $$
(5)

where N is the number of nodes in the graph (or network), and V is the nodes set.
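Eqs. (4) and (5) can be sketched as follows. Because Eq. (4) is written for neighbourhoods without direction, the sketch ignores edge direction; this, the value 0 assigned to nodes with fewer than two neighbours, the function name `clustering`, and the edge-list representation are assumptions.

```python
from itertools import combinations

def clustering(nodes, edges):
    """Return (per-node C_i of Eq. (4), network average C of Eq. (5))."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:                      # assumption: edge direction is ignored
            nbrs[u].add(v)
            nbrs[v].add(u)
    local = {}
    for i in nodes:
        k = len(nbrs[i])
        if k < 2:
            local[i] = 0.0              # assumption: C_i = 0 below two neighbours
            continue
        # e_i: edges that actually exist among the neighbours of i.
        e = sum(1 for j, m in combinations(nbrs[i], 2) if m in nbrs[j])
        local[i] = 2.0 * e / (k * (k - 1))
    average = sum(local.values()) / len(local) if local else 0.0
    return local, average

# A triangle a-b-c with one extra node d attached to a.
local, average = clustering(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")],
)
print(local)
print(average)
```

In the example, node a has three neighbours with one edge among them, giving \(C_{a} = 2 \cdot 1 / (3 \cdot 2) = 1/3\); b and c each have two mutually connected neighbours, giving 1; and d has a single neighbour, giving 0. The network average is therefore 7/12.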

Definition 5.

Average Shortest Path Length.

For an un-weighted network, the shortest path length is the minimum number of edges from one node to another; for a weighted network, it is the minimum sum of the edge weights along a path from one node to another. The average shortest path length of a network is the average of the shortest path lengths between all pairs of nodes and can be defined as

$$ L = \frac{2}{N(N - 1)}\sum\nolimits_{i \ne j} {d_{ij} } , $$
(6)

where \(d_{ij}\) is the number of edges on the shortest path between nodes i and j, and N denotes the number of nodes in the network.
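The following sketch evaluates Eq. (6) with one breadth-first search per node. It reads the factor 2 / (N(N - 1)) as an average over unordered node pairs, so it ignores edge direction and counts each pair once; it also requires the network to be connected, since the distance is otherwise undefined for some pairs. Both choices, the function name `average_shortest_path_length`, and the edge-list representation are assumptions.

```python
from collections import deque

def average_shortest_path_length(nodes, edges):
    """Average shortest path length of Eq. (6) on the undirected view of the graph."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)                  # assumption: edge direction is ignored
        nbrs[v].add(u)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in nbrs[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < n:
            raise ValueError("graph is not connected; Eq. (6) is undefined")
        total += sum(dist.values())
    # total counts every unordered pair twice, so total / (n(n-1)) equals
    # (2 / (n(n-1))) times the sum over unordered pairs.
    return total / (n * (n - 1))

avg = average_shortest_path_length(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(avg)
```

For the three-node path a, b, c the pairwise distances are 1, 1, and 2, so the average is 4/3.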

4 Software Structure Analysis

In this section, we use a set of four open source software systems as case studies to probe their topological properties.

4.1 Subject Systems

We selected four open-source Java systems from different domains and of different scales as our research subjects: ant, jedit, jhotdraw, and wro4j. Table 1 shows some simple statistics of the four subject systems, where System is the name of the subject system, Version is its version, Directory is the directory we analysed, LOC is the number of lines of code, and #C is the number of classes and interfaces.

Table 1. Statistics of the subject systems.

4.2 Results and Analysis

In this section, we construct the software networks for all subject systems and then use the statistical parameters to characterize their topological properties.

Node Centrality Analysis.

Network centrality metrics are mainly used to find the nodes which play an important role in the complex network. In this section, two centrality metrics are used, i.e., betweenness centrality and closeness centrality.

Betweenness centrality is one of the most important centrality metrics in complex network theory and is widely used to characterize the importance of nodes. As shown in Fig. 1, in nearly all the subject systems about 90% of the nodes have a betweenness value less than 0.05. This suggests that only a small fraction of the classes (roughly the remaining 10%) carry important information and play an important role in implementing the key functionalities of the software system, whereas the large majority of classes do not. Betweenness centrality reflects the degree of interdependence between a class node and the other class nodes: the higher the betweenness centrality of a class node, the more important it is to the software network.

In practice, class invocations usually form call chains, and an important class tends both to call and to be called by many other classes; for example, a class implementing core functionality is usually invoked by many kinds of classes to perform the corresponding action. Consequently, key classes in a software system tend to have larger betweenness centrality values.

Fig. 1. The distribution of betweenness centrality values.

As shown in Fig. 2, there are no class nodes whose closeness value is larger than 0.5, and in all four subject systems the closeness centrality values of most nodes are close to 0. A closeness value of 0 indicates an isolated node, i.e., a node without any connections to other nodes. The larger the closeness centrality value of a class node, the more closely the class is related to all other class nodes; such a node occupies a favorable position in the network and can perceive the dynamics of the whole software network, including the direction of information flow. Key classes usually use the services provided by many other classes to complete core functionality. Thus, in the software network, some key classes are more closely related to other class nodes.

Fig. 2. The distribution of closeness centrality values.

Clustering Coefficient Analysis.

Figure 3 shows the distribution of clustering coefficient values. The clustering coefficient values of most class nodes in ant, jedit, jhotdraw, and wro4j are close to 0, which means that the neighbors of most nodes are not closely coupled with each other; only a few class nodes have high clustering coefficient values.

For all the subject software systems, only a few class nodes have a relatively high clustering coefficient, i.e., only a few classes use many other classes or are used by many other classes. This is in line with the characteristics of key classes in software systems. In practice, classes that provide core functionalities (i.e., key classes) are usually called by many other classes. Developers typically write small, single-functionality classes, and key classes then combine the services of these classes to provide complex functionalities. Thus, the neighbours of key classes are usually closely coupled, which is reflected in a larger clustering coefficient.

Fig. 3. The distribution of clustering coefficient values.

Degree Distribution.

Figure 4 shows the degree distribution of the nodes in the software networks. The number of nodes decreases as the degree increases, and the more nodes a software network has, the more obvious this trend is.

It can be observed from Fig. 4 that nodes with a degree less than 10 account for almost 90% of the nodes in the software network, whereas the number of nodes with a degree greater than 50 is close to 0. Therefore, most nodes are connected to only a few nodes, and a few nodes are connected to a large number of nodes, which is in line with the typical characteristics of scale-free networks. This indicates that, in a software system, most classes call or are called by only a very small number of classes, and only a few classes call or are called by a large number of classes.

Fig. 4. The degree distribution.

Average Path Length Analysis.

As shown in Table 2, although the subject systems differ in scale, their average shortest path lengths are all roughly equal to 3: the maximum is 3.379 and the minimum is 2.806. Such short average path lengths are consistent with the small-world property of software networks.

Table 2. The average path length of software networks.

5 Conclusions

In this work, we used un-weighted software networks to represent software structure and introduced some statistical parameters from complex network theory to characterize the structural properties of software systems. We used four open-source software systems as subject systems to reveal some topological properties of software systems. Specifically, we analyzed the distributions of several statistical parameters, namely centrality metrics (i.e., betweenness and closeness), clustering coefficient, degree distribution, and average shortest path length.

The results show that the software networks built in this work exhibit small-world and scale-free properties. The analysis of these structural properties is of great significance to the field of software metrics.