Software defect prediction techniques using metrics based on neural network classifier

Jayanthi, R.; Florence, Lilly

doi:10.1007/s10586-018-1730-1


Published: 07 February 2018

Volume 22, pages 77–88, (2019)

Cluster Computing


Abstract

Software industries strive to improve software quality through consistent bug prediction, bug removal and prediction of fault-prone modules. This area has attracted researchers because of its significant involvement in software industries. Various techniques have been presented for software defect prediction. Recent research has recommended data mining using machine learning as an important paradigm for software bug prediction. State-of-the-art software defect prediction still suffers from various issues such as limited classification accuracy. Moreover, software defect datasets are imbalanced in nature and are known to be error-prone because of their high dimensionality. To address this issue, we present a combined approach for software defect prediction and prediction of software bugs. The proposed approach combines feature reduction and artificial intelligence: feature reduction is carried out by the well-known principal component analysis (PCA) scheme, which is further improved by incorporating maximum-likelihood estimation for error reduction in PCA data reconstruction. Finally, a neural network based classification technique is applied, which produces the prediction results. The framework is formulated and implemented on NASA software datasets, where four datasets, i.e., KC1, PC3, PC4 and JM1, are considered for performance analysis using the MATLAB simulation tool. An extensive experimental study is performed in which the confusion matrix, precision, recall, classification accuracy and other parameters are computed and compared with existing software defect prediction techniques. The experimental study shows that the proposed approach can provide better performance for software defect prediction.


1 Introduction

Software engineering is an enormous field that is widely exploited in software-based industrial applications. In recent years, people have come to depend increasingly on software-based applications, where the quality of software is considered the most salient factor for user experience [1]. In real-time software-based applications, various problems may occur, such as failure of software functionality or wrong output, which may lead to customer dissatisfaction. Moreover, when huge numbers of tasks are processed through software, functional failures can result. Software failures and software defects generally occur in real-time applications. A software failure occurs when the actual behavior deviates from the required functionality; this inconsistency between actual and required functionality is also known as a fault or bug in the software [2]. The increasing demand for user applications induces more complexity in software applications, which leads to the development of poor software and affects the business of various software companies.

Previous studies and experience in this field can assist in predicting bugs in software products. Initially, software industries adopted manual testing for software defect detection. Manual software testing requires 27% of the human effort in the overall development of an application [3]. Manual testing methods require more time and human effort and cannot resolve all the bugs present in the software. To address this issue, defect prediction models are widely used by industries. These models help in defect prediction, effort estimation, reliability checking, risk analysis, etc. during the development phase [4]. Using a software quality prediction model at an early stage of the software development lifecycle (SDLC) can also help minimize risk, resulting in user satisfaction, cost reduction and reduced human effort. Various techniques have been proposed to address software defect prediction. In this context, software metrics play an important role in predicting bugs in a software product by analyzing the relationship between software metrics and the quality of the output product. Generally, software metrics are categorized as process metrics, project metrics and product metrics. Process metrics are used for software development and for maintaining the software functionality and lifespan. Project metrics provide multiple parameters, i.e., total number of developers, skills of developers, work scheduling, organization and size of software, whereas product metrics describe various characteristics such as design features, quality level requirements and performance level requirements for a particular software application.

Identifying, locating and detecting software defects becomes a tedious task for researchers because of the huge size of software applications. Furthermore, defect density is also a challenging issue in software defect detection and prediction. In software engineering and defect prediction, machine learning and data mining techniques are considered the most promising techniques. Data mining techniques mainly aim at classifying a software dataset into faulty and non-faulty data as a bug prediction model. In this process, an input software dataset whose actual class values are known is provided to the classifier. At this stage, the classifier analyzes all input parameters and builds a trained model based on the patterns of the input software datasets. In the next stage, a random test input is given to the classifier and compared with the trained model, which provides the output in terms of predicted software bugs. Prior to this scheme, requirement based [5] and design metrics based [6] approaches showed significant performance, but software complexity and prediction accuracy remained challenging. Hence, data mining techniques were introduced, such as bagging [7], boosting [8], naïve Bayes [9], support vector machine [10] and J48 decision tree [11]. These techniques provide better performance for software defect prediction modeling. To predict software defects, input data acquired from software metrics, known as the attributes or features of a particular software product, is passed to the classifier. Conventional techniques suffer from accuracy issues and long run times because of the complex architecture of software. This complexity issue can be addressed by using feature reduction schemes.

Generally, these schemes are applied to multidimensional datasets where multiple attributes are present and some of these attributes may be irrelevant to the learned pattern for classification. Feature selection helps to remove irrelevant features from the dataset, improves computational efficiency and reduces complexity. Various techniques are available for feature selection and reduction in data mining, such as principal component analysis (PCA), kernel PCA, graph-based kernel PCA, linear discriminant analysis and generalized discriminant analysis.

Recently, artificial intelligence has also been considered a promising technique for prediction. Gnana et al. [12] presented a neural network based study for predicting wind speed and obtained better performance for uncertain data. Hence, neural networks can also be implemented for software defect prediction in industrial applications. Furthermore, this technique can be improved by combining a feature reduction scheme with neural network classification, so that better software defect prediction performance is obtained in terms of classification accuracy. This work mainly concentrates on a software defect prediction technique in which neural network classification is implemented and an enhanced PCA is incorporated to obtain significant performance when compared with state-of-the-art software defect prediction techniques.

The rest of the manuscript is organized as follows: Sect. 2 presents a brief discussion of recent techniques for software defect prediction, the proposed approach is discussed in Sect. 3, Sect. 4 deals with the experimental study and Sect. 5 presents concluding remarks.

2 Literature survey

This section provides a brief study of recent techniques in the field of software defect prediction. As discussed in the previous section, data mining based machine learning techniques have attracted researchers for software defect prediction because of their significant performance.

Machine learning is a promising technique for prediction. Wang et al. [13] presented a software defect prediction model for improving the quality of software application systems. Defective software databases consist of imbalanced data, which generates randomness in pattern characteristics. This issue motivates the development of an efficient and precise classifier for academic and industrial application scenarios. The authors discussed that kernel-based and ensemble learning can provide better performance for various machine learning applications. Multiple kernels are capable of efficient learning on high-dimensional datasets, which helps achieve better feature representation and can provide better performance when weak classifiers are applied. Their work therefore considers both multiple kernel and ensemble learning for software defect prediction and develops a new approach, multiple kernel ensemble learning. For cost reduction, a weight-updating strategy is applied in which misclassified instances are considered and the weight computation is applied again. Xu et al. [14] studied software defect prediction techniques and concluded that conventional techniques use preprocessing and feature selection schemes to reduce irrelevant features, but some important features still get discarded, resulting in degraded performance of the defect prediction technique. To address this issue, a maximal information correlation based technique is presented. This technique consists of two main stages: first, maximal information is computed for each candidate feature, and clustering is applied in the later stage.

Recently, Duksan et al. [15] discussed that software defect data are imbalanced in nature and that very few instances show attributes which belong to the defective class during the prediction stage. This causes performance degradation in software industries; hence a precise classification scheme is required. To overcome this issue, the whole problem is converted into a multi-objective optimization problem, where a multi-objective learning scheme is applied by considering a varied cross-project environment. Detection probability maximization, false-alarm probability minimization and overall performance maximization are the main objectives of this work, which are achieved by applying multi-objective optimization and a naïve Bayes learning scheme.

In machine learning and classification techniques, the amount of training and testing data also plays an important role, and the imbalance problem increases complexity during the learning phase. Abdi et al. [16] derived a new approach for imbalanced data and the train-test ratio required for a classifier to provide significant performance. This scheme uses a combination of nearest-neighbor and instance selection: k-nearest neighbor is applied for learning, whereas naïve Bayes is applied for global knowledge learning and classification. This helps to reduce the false alarm probability and improves classification accuracy.

Conventional techniques for software defect prediction fail to provide precise results and require more computational time and complexity. Barajas et al. [17] introduced a technique for defect prediction which can help to finish software development within a given time. This work mainly presents a comparative study between two learning techniques, fuzzy linear regression and statistical linear regression, for software defect prediction. These techniques differ in how they model uncertainty: linear regression treats it as randomness, whereas the fuzzy model treats it as fuzziness for further analysis.

In this field of software defect prediction, Shan et al. [18] used a well-known machine learning technique, the support vector machine (SVM). Furthermore, randomness in attributes is addressed by applying the locally linear embedding (LLE) technique with a support vector classifier. In this approach, the SVM parameters are further optimized using ten-fold cross validation and a grid search scheme. The experimental study shows that LLE-SVM gives better performance for defect prediction.

As discussed in the previous section, artificial intelligence is also considered a promising technique for classification and prediction. Neural network based software defect detection techniques have also been introduced during recent years of software industry growth. Yang et al. [19] introduced software defect prediction using a neural network technique in which a radial basis function neural network is used along with a Bayesian method. The performance of the radial basis neural network can be improved by improving the weight update structure; this is carried out by applying single-Gaussian and two-Gaussian structures, and an expectation-maximization scheme is also applied for weight realization. In [20], an improved approach for classification and prediction is presented for software defect prediction. In this research, the artificial bee colony (ABC) optimization approach is combined with an artificial neural network. The ABC scheme is applied during the neural network training phase, which helps to obtain the optimal weights for neural network computation.

Bautista et al. [21] also presented a study of software defect prediction using a neural network based machine learning approach. In this work, GitHub repositories are considered for defect prediction analysis. With the help of the repositories, the relationship between software code and its defects is analyzed, and a neural network is implemented to obtain classification and prediction. In machine learning based classification and prediction, feature selection and reduction can improve performance. Considering this an important factor, Khoshgoftaar et al. [22] developed a feature selection scheme for imbalanced software defect datasets. First, wrapper-based attribute selection is applied, resulting in the selection of attribute subsets. In the next stage, random under-sampling is applied, which helps to mitigate the negative effect of the imbalanced dataset. This stage follows five steps for data preprocessing: the first step applies training on the original (raw, unaltered) data, the second step trains on the sampled or fit dataset, in the third step the unsampled version is considered for attribute selection, and later only the selected data is considered for training along with the up-sampled version of the input data. These studies show better performance and conclude that if a better approach for feature selection can be implemented, software defect prediction can be improved and applied to real-time industrial applications.

This brief literature survey discusses machine learning techniques, optimization techniques, feature selection (reduction) techniques and artificial intelligence. The studies show that artificial intelligence can provide precise performance for software defect prediction and that it can be improved further by using a feature reduction scheme, resulting in complexity reduction and overall performance enhancement.

3 Proposed approach

Software defect prediction is a crucial task in the field of software engineering. The previous section briefly reviewed machine learning based software defect prediction techniques. These techniques target the imbalance problem of software bugs, but classification accuracy and overall performance remain a challenging task for researchers. To address this issue, we present a combined scheme of feature reduction and an artificial neural network technique for software defect prediction. The first subsection presents the improved PCA approach for dimension reduction and its mathematical modeling, whereas the second subsection provides details about the combined implementation of the neural network and the proposed PCA approach.

4 Principal component analysis

Principal component analysis (PCA) is a technique for reducing the dimensionality of datasets, increasing interpretability while minimizing the loss of information. It is a mathematical procedure, also described as an orthogonal linear transformation, that transforms the data into a new coordinate system. Importantly, PCA is a feature extraction technique rather than a feature selection method: linear combinations of the original attributes yield new attributes, and the new attributes with the highest variance are retained to perform the reduction. Some papers, such as [23], used PCA to improve the performance of their experiments. According to [24], the PCA technique transforms $n$ vectors $\{x_1, x_2, \ldots , x_n\}$ from the $d$-dimensional space to $n$ vectors $\{x'_1, x'_2, \ldots , x'_n\}$ in a new $d'$-dimensional space.

$$\begin{aligned} x'_i = \sum _{k=1}^{d'} a_{k,i}\, e_k,\quad d'\le d, \end{aligned}$$

where $e_k$ are the eigenvectors corresponding to the $d'$ largest eigenvalues of the scatter matrix $S$ and $a_{k,i}$ are the projections (principal components of the original data set) of the original vectors $x_i$ on the eigenvectors $e_k$.

Principal component analysis is a correlation matrix based technique, where the matrix is obtained by second-order moment computation and characterizes any given input random vector. If zero-mean data is considered, the matrix has the characteristics of a covariance matrix. In computer applications, PCA is similar to the Karhunen–Loeve transform (KLT), where correlation can be extracted between pixel groups or neighbouring pixels. Moreover, PCA helps to remove the second-order correlation generated by a random process. To obtain low-dimensional uncorrelated data, eigenvector computation is applied to the covariance matrix of the input vector. This process linearly transforms high-dimensional data into a low-dimensional space. Generally, PCA is performed by applying singular value decomposition (SVD) to the given input data matrix.

An effective PCA model can be constructed using information optimization techniques, in which either the data reconstruction error or the variance of the projected input data is taken as the optimization criterion. PCA computes $\mathcal{O}$ orthonormal directions in a given subspace, i.e., $\overline{\overline{\mathcal{W}}}_i \in \mathcal{S}^{n}, i=1, 2, 3\ldots \mathcal{O}, \mathcal{O}<n $; these orthonormal directions are chosen to maximize the data variance. Furthermore, any given input vector $\left( {d\in \mathcal{S}^{n}} \right) $ can be converted into the $\mathcal{O}$-dimensional space without losing indispensable information about the data. The input data vector $d$ is projected into the $\mathcal{O}$-dimensional space using $\overline{\overline{\mathcal{W}}} $, where the inner products $( {d^{T}\overline{\overline{\mathcal{W}}} _i } )$ give the resulting dimensionality reduction. During this process, PCA computes the unit directions used for projecting the input data vector, known as principal components, i.e., $y=d^{T}\overline{\overline{\mathcal{W}}} _i $, which have the largest variance. This can be represented as follows:

$$\begin{aligned} \sigma _{PCA} \left( \mathcal{W} \right) = \sigma \left[ {y^{2}} \right] = d^{T}\mathcal{C} \overline{\overline{\mathcal{W}}} _i =\frac{\mathcal{W}^{T}\mathcal{C}\mathcal{W}}{\left| {\left| \mathcal{W} \right| } \right| ^{2}}, \end{aligned}$$

(1)

where $\overline{\overline{\mathcal{W}}} = \mathcal{W}/{\left| {\left| \mathcal{W} \right| } \right| }$

In the next stage, linear least-squares estimation is applied to reconstruct the input data, i.e., $\hat{d}$. This can be computed using Eq. (2).

$$\begin{aligned} \hat{d}_t =\sum \limits _{i=1}^\mathcal{O} \mathcal{G}_i \left( t \right) \overline{\overline{\mathcal{W}}} _i \end{aligned}$$

(2)

With the help of reconstructed data, reconstruction error can be computed by taking the difference between original and reconstructed data.

$$\begin{aligned} e=d-\hat{d}_t =\sum \limits _{i= \mathcal{O}+1}^n a_i \overline{\overline{\mathcal{W}}} _i \end{aligned}$$

(3)

Reconstruction error is orthogonal to the reconstructed data. Here our main aim is to reduce this reconstruction error for software metrics during PCA reconstruction which can improve the performance by reducing error in dimension reduced dataset.

To address this issue, we present a new approach for dimension reduction and improve the performance of PCA resulting in overall performance improvement. PCA can also be used as the basis that minimizes the reconstruction error arising when projecting the data onto a k-dimensional subspace.

$$\begin{aligned} \hbox {PCA reconstruction}= \hbox {PC scores} \cdot \hbox {Eigenvectors}^{\top } +\hbox {Mean} \end{aligned}$$

PCA computes eigenvectors of the covariance matrix and sorts them by their eigenvalues i.e., amount of explained variance. The centered data can then be projected onto these principal axes to obtain the principal components or scores. To reduce the dimensionality we can use a subset of principal components and discard the rest.
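The reconstruction rule above (scores times eigenvectors plus mean) can be illustrated with a minimal sketch. It is not the MATLAB implementation used in this study: it keeps only one principal axis, finds it by power iteration instead of a full eigendecomposition, and all function names and the toy data are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mu):
    """Covariance matrix of the rows (1/n normalisation)."""
    n, p = len(rows), len(mu)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(p)]
            for i in range(p)]


def top_eigenvector(C, iters=500):
    """Leading eigenvector of a symmetric matrix by power iteration.

    Assumption: the start vector is not orthogonal to the leading
    eigenvector; a production implementation would use an eigensolver.
    """
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_reconstruct_1d(rows):
    """Project centered data on the first principal axis and reconstruct."""
    mu = mean_vector(rows)
    e = top_eigenvector(covariance(rows, mu))
    # PC scores: inner product of each centered row with the eigenvector.
    scores = [sum((x - m) * ek for x, m, ek in zip(r, mu, e)) for r in rows]
    # Reconstruction = score * eigenvector + mean.
    recon = [[s * ek + m for ek, m in zip(e, mu)] for s in scores]
    return scores, recon, e


# Toy data lying exactly on a line, so one component reconstructs it fully.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, recon, axis = pca_reconstruct_1d(data)
```

With data that does not lie on a line, the difference between `data` and `recon` is the reconstruction error that the proposed approach aims to reduce.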

Let us consider that the input data $d$ is in matrix form of size $\left( {n\times p} \right) $, is modeled as Gaussian, and that its covariance matrix needs to be computed. In the software defect prediction model, the input data is converted such that the sample size $n$ is smaller than $p$ and stored in vector form. Under this assumption, the maximum likelihood estimate can be computed as follows:

$$\begin{aligned} {\Sigma }_{ML} =\frac{1}{n}{d}'\left[ \mathcal {J}_p-\frac{1}{n}vv'\right] d. \end{aligned}$$

(4)

In high-dimensional cases, this matrix becomes non-positive definite, ill-conditioned, unstructured or even singular, which causes performance degradation. This issue can be addressed by using Gaussian maximum likelihood based principal component analysis. In the proposed approach, a maximum likelihood based model is used for mapping the underlying space into the data space as:

$$\begin{aligned} x=\varepsilon +\mu +f{\Lambda }, \end{aligned}$$

(5)

where $x$ denotes the high-dimensional variable of size $\left( {p\times 1} \right) $, ${\Lambda }$ is the linear transformation of size $\left( {p\times \mathcal{O}} \right) $ that maps $f$ to $x$, $f$ is the $\left( {\mathcal{O}\times 1} \right) $ variable in the underlying space, $\mu $ is the mean vector of size $\left( {p\times 1} \right) $ and $\varepsilon $ denotes the Gaussian random error or noise for the input signal vector.

To obtain the efficient solution for given input vector, probability distribution $p\left( {x|f} \right) $ is formulated by using probability model of random error $ \left( \varepsilon \right) $ which can be expressed as:

$$\begin{aligned} p\left( {\varepsilon ;\sigma ^{2}} \right) =\left( {2\pi \sigma ^{2}} \right) ^{-\frac{p}{2}}\exp \left( {-\frac{1}{2}{\varepsilon }'\varepsilon }\right) . \end{aligned}$$

(6)

It is assumed that probability density unit is spherical. According to Eq. (5), $\varepsilon =x-\mu -f{\Lambda }$ and conditional probability can be obtained by computing $p\left( \varepsilon \right) $, this can be expressed as:

$$\begin{aligned}&p\left( {x{|}f;{\Lambda },\mu ,\sigma ^{2}} \right) \nonumber \\&\quad =\left( {2\pi \sigma ^{2}} \right) ^{-\frac{p}{2}}\exp \left( {-\,\frac{1}{2}\left| {\left| {x-\mu -f{\Lambda }} \right| } \right| ^{2}} \right) \end{aligned}$$

(7)

This technique provides a complete probability distribution for the dataset in the given subspace, so that no data falls outside the defined probabilistic limit.

The main aim of the proposed PCA approach is to compute the unknown parameters ${\Lambda }, \mu $ and the noise variance $\sigma ^{2}$ from the observations by maximum likelihood. To obtain them, the likelihood function needs to be computed, which can be expressed as:

$$\begin{aligned}&L\left( {{\Lambda }, \mu ,\sigma ^{2}|x} \right) \nonumber \\&\quad = \prod \limits _{i=1}^n p\left( {x_i ;\Lambda ,\mu ,\sigma ^{2}} \right) \nonumber \\&\quad =\left( {2\pi } \right) ^{-\frac{np}{2}}\left| {\Sigma } \right| ^{-\frac{n}{2}}\exp \left[ {-\,\frac{1}{2}\sum \limits _{i=1}^n \left( {x_i -\mu } \right) ^{{'}}\Sigma ^{-1}\left( {x_i -\mu } \right) } \right] \nonumber \\ \end{aligned}$$

(8)

The sum $\sum \nolimits _{i=1}^n \left( {x_i -\mu } \right) ^{{'}}\Sigma ^{-1}\left( {x_i -\mu } \right) $ in the exponent can also be expressed as $n\, tr\left( {\Sigma }^{-1}S \right) $ when $\mu $ is replaced by its estimate $\hat{\mu }$, where $S= \frac{1}{n}\sum \nolimits _{i=1}^n \left( {x_i -\hat{\mu } } \right) \left( {x_i -\hat{\mu }} \right) '$.

S is known as sample covariance matrix of observed software metrics data and $\hat{\mu }$ denotes maximum likelihood estimation of considered mean vector $\mu $ which is computed as $\hat{\mu } =\frac{1}{n}\sum \nolimits _{i=1}^n x_i = \bar{x}$.
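The trace form used in the log-likelihood can be checked in three steps. The derivation below is a sketch that assumes $\mu $ has already been replaced by its estimate $\hat{\mu }$; it uses only the facts that a scalar equals its own trace, that the trace is invariant under cyclic permutation, and that the trace is linear:

$$\begin{aligned} \sum \limits _{i=1}^n \left( {x_i -\hat{\mu }} \right) '\Sigma ^{-1}\left( {x_i -\hat{\mu }} \right)&= \sum \limits _{i=1}^n tr\left[ {\Sigma ^{-1}\left( {x_i -\hat{\mu }} \right) \left( {x_i -\hat{\mu }} \right) '} \right] \\&= tr\left[ {\Sigma ^{-1}\sum \limits _{i=1}^n \left( {x_i -\hat{\mu }} \right) \left( {x_i -\hat{\mu }} \right) '} \right] \\&= n\, tr\left[ {\Sigma ^{-1}S} \right] . \end{aligned}$$

Multiplying by $-\frac{1}{2}$ gives the last term $-\frac{n}{2}tr[\Sigma ^{-1}S]$ of the log-likelihood.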

Hence, log-likelihood function can be expressed as:

$$\begin{aligned} L\left( {{\Lambda }, \mu ,\sigma ^{2}|x} \right)= & {} -\,\frac{np}{2}\log \left( {2\pi } \right) -\,\frac{n}{2}\log \left| {\Sigma } \right| \nonumber \\&-\frac{n}{2}tr[{{\Sigma }^{-1}S}] \end{aligned}$$

(9)

Maximizing the log-likelihood with respect to ${\Lambda }$ and $\sigma ^{2}$ yields a closed-form solution for error reduction in PCA data reconstruction. The maximum likelihood estimates of $ {\Lambda }$ and $\sigma ^{2}$ can be obtained as:

$$\begin{aligned} \hat{\Lambda }_{ML}= & {} \left( {p\times m} \right) \left( {L_m -\sigma ^{2}I_m } \right) ^{\frac{1}{2}} R\nonumber \\ \hat{\sigma }_{ML} ^{2}= & {} \sum \limits _{j=m+1}^p \lambda _j \times \frac{1}{p-m} \end{aligned}$$

(10)

The above technique improves the performance of PCA feature reduction by reducing the error during feature reconstruction in a given space, using the maximum likelihood model and maximizing the likelihood over the parameters that affect it.
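As a rough numerical illustration of Eq. (10), the sketch below computes the noise variance estimate as the average of the $p-m$ smallest eigenvalues, together with the diagonal entries of $\left( {L_m -\sigma ^{2}I_m } \right) ^{\frac{1}{2}}$. It assumes that $\lambda _j$ are the eigenvalues of the sample covariance matrix $S$ sorted in descending order and that $L_m$ is the diagonal matrix of the $m$ largest ones, as in the standard probabilistic PCA formulation; the function names and the eigenvalues are hypothetical and not taken from the experiments.

```python
import math


def ppca_noise_variance(eigenvalues, m):
    """ML noise variance: mean of the p - m discarded eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    p = len(lam)
    if not 0 < m < p:
        raise ValueError("m must satisfy 0 < m < p")
    return sum(lam[m:]) / (p - m)


def ppca_loading_scales(eigenvalues, m):
    """Diagonal of (L_m - sigma^2 I_m)^(1/2) for the m retained eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    sigma2 = ppca_noise_variance(lam, m)
    # max(..., 0.0) guards against tiny negative values from rounding.
    return [math.sqrt(max(l - sigma2, 0.0)) for l in lam[:m]]


# Hypothetical eigenvalues of a 5-dimensional sample covariance matrix.
eigenvalues = [4.0, 2.0, 0.5, 0.3, 0.2]
sigma2 = ppca_noise_variance(eigenvalues, m=2)
scales = ppca_loading_scales(eigenvalues, m=2)
```

Here the three discarded eigenvalues sum to 1.0, so the estimated noise variance is 1/3, and each retained direction is scaled by the square root of its eigenvalue minus that noise level.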

Next stage considers implementation of neural network model for software defect prediction. Neural network implementation is discussed in next section.

4.1 Advantages of proposed approach

Feature selection technique identifies and extracts the most useful features of the dataset for the purpose of learning, and these features are very valuable and helpful for analysis and future prediction. The redundancy in the data is removed and the learning algorithm performance can be improved. Training data plays vital role in classifying and data prediction. If the data fail to reveal the statistical consistency that machine learning algorithms exploit, then learning will fail, so, it is important to remove the redundant data from the training set and it is made easy by using feature selection technique.

4.2 Neural network implementation for software defect prediction

This section presents an implementation study of the neural network for the software defect prediction model. The neural network technique is based on the working of the human brain [25]. It contains multiple information-processing units, known as neurons. The complete network contains three main computational layers: the input layer, the hidden layer and the output layer. In this work, the input software metrics data are passed to the neural network, where the data are processed by each layer; layers and neurons are connected through weights, which show the importance of each neuron. During the learning process, the weight of each neuron is considered and adjusted as required. Each neuron receives inputs from the preceding layer; these inputs are multiplied by their weights and the products are added together. The neuron computes its activation level from this sum, and the output is sent to the following layer, where the final solution is estimated. A simple working process of a neural network with all elements is depicted in Fig. 1.

An activation function can be a step, sign, sigmoid or linear function. The choice of activation function depends on the expected task of the network, i.e., classification or regression. The output of a neuron in the $i$th layer can be described by Eq. (11).

$$\begin{aligned} y_i = f_i \left( {\sum \limits _{j=1}^n W_{ij} x_j + \theta _i } \right) \end{aligned}$$

(11)

Where $y_i $ is the output of a neuron, n is the total number of inputs to this neuron, $x_j $ is the jth input, $W_{ij}$ is the weight between the current neuron and jth input, and $\theta _i $ is the bias of the neuron. $f_i$ represents the activation function of this layer. Generally, the activation function is a nonlinear function such as sigmoid, Gaussian and so on. This enables ANN to model nonlinear relationships.
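Eq. (11) can be written out directly as a short sketch. The sigmoid is used here as one possible choice of $f_i$; the function names and the example weights are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, bias, activation=sigmoid):
    """Eq. (11): y = f(sum_j W_j * x_j + theta)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)


# Two inputs whose weighted sum cancels out, so the sigmoid sees 0.
y_sigmoid = neuron_output([0.5, -0.5], [1.0, 1.0], bias=0.0)

# The same neuron form with a linear (identity) activation.
y_linear = neuron_output([2.0, 3.0], [1.0, 1.0], bias=1.0, activation=lambda z: z)
```

Passing the activation as a parameter mirrors the text: the same weighted sum can feed a step, sign, sigmoid or linear function depending on the task.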

Associations between software quality metrics and module defect proneness are often complex and nonlinear, so ANN is an appropriate choice for software defect prediction problem. The optimization goal of the network is to minimize the error function by optimizing the network weights (all $W_{ij}$). The network error at each iteration is calculated by using different methods such as the root mean squared error, mean absolute error, relative absolute error, and root relative squared error. This error is propagated backward in the network and weights are adjusted to minimize the error. The iteration continues until a stopping criterion is met. Stopping criteria can be either a maximum iteration number or minimum error value. The neural network created for the PC1 dataset is shown in Fig. 2.
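The training loop described above (compute the error, propagate it backward, adjust the weights, stop at a maximum iteration count or a minimum error) can be sketched for the simplest possible case. This is an illustration only: it trains a single sigmoid neuron with batch gradient descent on mean squared error, not the multilayer network used in this study, and the learning rate, stopping thresholds and toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, labels, lr=0.5, max_iter=5000, min_error=1e-3):
    """Batch gradient descent on squared error for one sigmoid neuron.

    Stops when the mean squared error drops below min_error or after
    max_iter iterations, whichever comes first.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    mse = float("inf")
    for _ in range(max_iter):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        sq_err = 0.0
        for x, t in zip(samples, labels):
            y = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = y - t
            sq_err += err * err
            # Error propagated back through the sigmoid: d(0.5*err^2)/dz.
            delta = err * y * (1.0 - y)
            for j, xi in enumerate(x):
                grad_w[j] += delta * xi
            grad_b += delta
        mse = sq_err / len(samples)
        if mse < min_error:
            break
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias, mse


# Toy, linearly separable data (logical OR), standing in for metric vectors.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0.0, 1.0, 1.0, 1.0]
weights, bias, final_mse = train_neuron(samples, labels)
predictions = [round(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias))
               for x in samples]
```

With all weights at zero the neuron outputs 0.5 for every sample, so the initial mean squared error is 0.25; training reduces it until one of the two stopping criteria is met.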

5 Experimental study

This section provides an experimental study of software defect prediction using the proposed feature reduction and classification. The complete study is implemented using the MATLAB 2013 simulation tool, and PROMISE [26] open source datasets are considered for experimental analysis.

In this work, we have considered four datasets from PROMISE repository which are named as KC1, JM1, PC3 and PC4 where various attributes are present in the given dataset. Table 1 shows various parameters about considered dataset where total number of attributes, available modules, defective modules and percentage defect are depicted.

Table 1 PROMISE software defect prediction dataset details


These software datasets contain some general attributes, which are presented in Table 2 with the name of each attribute and its details.

Table 2 PROMISE software defect prediction attribute details


Using these datasets, we apply the software defect prediction system, and the performance of the proposed model is compared with other state-of-the-art techniques. Generally, various measurement metrics are available for data-mining applications, such as true positive rate, false positive rate, accuracy, confusion matrix, precision and recall. In these applications, the confusion matrix is the most significant parameter for performance analysis. This matrix contains the values of the actual and predicted classes, and the classification results are evaluated based on these values. A general model of the confusion matrix is presented in Table 3.

This confusion matrix helps us to compute total accuracy, precision, specificity, sensitivity and F-measure of the proposed approach.

Accuracy measures the rate of correct classification. It is computed as the ratio of correct predictions to the total number of predictions:

$$\begin{aligned} Acc= \left( {TP+TN} \right) /\left( {TP+TN+FP+FN} \right) \end{aligned} \tag{12}$$

Another parameter is sensitivity, which measures the true positive rate. It is computed by identifying the correctly classified non-defective modules and can be expressed as:

$$\begin{aligned} Sensitivity= TP/\left( {TP+FN} \right) \end{aligned}$$

Table 3 Confusion matrix

The next parameter is specificity, which measures the true negative rate, i.e., the proportion of correctly classified defective software modules. It can be expressed as:

$$\begin{aligned} Specificity= TN/\left( {TN+FP} \right) \end{aligned}$$

Then, we compute the precision of the proposed approach, which is the ratio of true positives to all predicted positives (true positives plus false positives):

$$\begin{aligned} P= TP/\left( {TP+FP} \right) \end{aligned}$$

Finally, the F-measure is computed, which is the harmonic mean of precision and sensitivity. It is expressed as:

$$\begin{aligned} F= 2\times P\times Sensitivity/\left( {P+Sensitivity} \right) \end{aligned}$$
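The measures defined above can be derived directly from the four cells of a confusion matrix. The sketch below is a minimal illustration with the denominators grouped explicitly; the function name `classification_metrics` and the counts passed to it are hypothetical and do not come from any table in this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, precision and F-measure."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    # Harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical confusion-matrix counts, for illustration only.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A production version would also guard against zero denominators, which can occur on strongly imbalanced data when a classifier never predicts one of the classes.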

In the next stage, performance analysis is carried out on the considered software defect datasets. The complete experimental study is divided into four experimental scenarios, one for each dataset.

5.1 Test case 1: KC1 dataset

In this case, the PROMISE dataset “KC1” is considered for experimental analysis. This dataset contains 2096 instances, of which 325 are defective (15.5%). Here we apply the proposed hybrid approach for software defect prediction to the KC1 dataset. First, we compute the confusion matrix, as given in Table 4.

Table 4 Confusion matrix for KC1 dataset using proposed hybrid classifier

Similarly, we compute other statistical performance parameters such as precision, sensitivity (recall), specificity, F-score and accuracy. This analysis is presented in Table 5.

Table 5 Statistical performance analysis for KC1 dataset using proposed hybrid classifier

This shows that the proposed approach gives an accuracy of 86.91% for the KC1 dataset.

In the next stage, we compute the ROC curve and the precision–recall curve for test case 1. These two analyses are depicted in Figs. 3 and 4.
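For readers unfamiliar with how an ROC curve and its AUC are obtained from classifier scores, the following minimal sketch sweeps the decision threshold over the scores and integrates the resulting curve with the trapezoidal rule. The function names and the labels and scores are assumptions for illustration; they are not outputs of the proposed classifier.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all instances that share the same score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive class) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc = roc_points(labels, scores)
print(roc)
print(auc(roc))
```

The resulting area equals the fraction of positive/negative pairs in which the positive instance receives the higher score, which is why AUC is commonly read as a ranking quality measure.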

To validate the performance of the proposed approach on the KC1 dataset, a comparative study is also presented, in which the proposed approach achieves an AUC of 81.08 for software bug prediction.

5.2 Test case 2: JM1 dataset

In this case, the JM1 dataset is considered, which contains 9535 modules, of which 18.35% are defective. The same experiments as in test case 1 are performed on this dataset. The confusion matrix for this test case is given in Table 6, and Table 7 shows the other statistical analysis parameters (Fig. 5).

Table 6 Confusion matrix for JM1 dataset using proposed hybrid classifier


Table 7 Statistical performance analysis for JM1 dataset using proposed hybrid classifier

Similarly, we compute the precision–recall curve for the JM1 dataset. The P–R curve performance is depicted in Fig. 6.
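A precision–recall curve of the kind referred to above can be traced by ranking the instances by score and recording precision and recall after each instance is admitted as a positive prediction. The sketch below is a minimal illustration that assumes distinct scores; the function name and the sample data are hypothetical.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) pairs, one per ranked instance.

    Assumes all scores are distinct; tied scores would need to be grouped.
    """
    pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    points = []
    tp = fp = 0
    for _score, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / pos, tp / (tp + fp)))
    return points


# Hypothetical labels (1 = positive class) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pr = precision_recall_points(labels, scores)
for recall, precision in pr:
    print(f"recall={recall:.3f} precision={precision:.3f}")
```

Unlike the ROC curve, this curve does not use true negatives, which makes it a useful complement on imbalanced defect datasets.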

5.3 Test case 3: PC3 dataset

In this case, we consider the PC3 dataset, which contains 1125 modules with 10.23% defects. The confusion matrix for this test case is given in Table 8, and Table 9 shows the other statistical analysis parameters.

ROC curve analysis and P–R curve analysis are also performed here. Figures 7 and 8 show the ROC and P–R curve performance for the PC3 dataset, respectively. In this study, the area under the curve (AUC) obtained is 0.8918 (i.e., 89.18%).

Table 8 Confusion matrix for PC3 dataset using proposed hybrid classifier

Table 9 Statistical performance analysis for PC3 dataset using proposed hybrid classifier

5.4 Test case 4: PC4 dataset

A similar study is performed for the PC4 dataset, which contains 1399 modules with 1.72% defects. For this dataset, the confusion matrix is given in Table 10, and the statistical performance is presented in Table 11.

Table 10 Confusion matrix for PC4 dataset using proposed hybrid classifier


Table 11 Statistical performance analysis for PC4 dataset using proposed hybrid classifier

ROC curve analysis and P–R curves are given in Figs. 8 and 9, respectively. This study shows that the proposed approach obtains an AUC (area under curve) of 97.20%, which is a significant improvement over other state-of-the-art models. Similarly, it shows better performance in terms of classification accuracy (Fig. 10).

Figure 9 shows the ROC curve analysis for the PC4 dataset, and the precision–recall graph is presented in Fig. 11.

The proposed method improves accuracy in defect prediction while using fewer attributes than previous studies. We used attribute selection to show that, even when the number of attributes was reduced by around 80%, the classifiers gave equal or even greater accuracy than with the full attribute set. Performance testing is done using the full training set on four different datasets: KC1, JM1, PC3 and PC4. The proposed hybrid approach gives 86.91% accuracy for the KC1 dataset (2096 instances, of which 325, or 15.5%, are defective), 83.03% for JM1, 89% for PC3 and 93.64% for PC4.
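To make the idea of attribute reduction concrete, the sketch below applies plain PCA to a tiny table of two module metrics and keeps only the first principal component, found by power iteration. This is an illustrative assumption only: it is not the maximum-likelihood-enhanced PCA proposed in this work, and the helper names and metric values are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Project each centered row onto the given component."""
    return [sum(a * b for a, b in zip(r, component)) for r in centered]


# Hypothetical (lines of code, cyclomatic complexity) values for four modules.
data = [[10.0, 2.0], [20.0, 4.0], [30.0, 6.0], [40.0, 8.0]]
centered = center(data)
v = top_component(covariance(centered))
z = project(centered, v)
print(v)
print(z)
```

Here two correlated attributes collapse onto a single derived feature; on a real dataset one would keep as many components as needed to retain most of the variance before training the classifier.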

Table 12 Comparative analysis in term of AUC

Finally, we present a comparative study for software defect prediction. Table 12 shows a comparative analysis against various state-of-the-art techniques in terms of area under the curve (AUC).

The table covers state-of-the-art techniques such as k-NN, SVM, Naïve Bayes and LDA. The comparative study shows that the proposed approach gives better performance in terms of AUC (Fig. 12).

6 Conclusion and future work

This work focuses on software defect prediction using data-mining techniques. This area has become an active field of research in which various techniques have been discussed for further improving the performance of software defect detection or bug prediction. In this work, we have addressed the issue of classification accuracy for large datasets by developing a new combined approach using feature reduction and classification. For feature reduction, PCA is applied in combination with maximum-likelihood estimation, which reduces the error in the PCA data reconstruction. Furthermore, a neural network classification technique is implemented for software bug prediction. The experimental study shows that the proposed approach provides better performance and obtains an AUC of 97.20%, which is a significant improvement over other state-of-the-art models. Similarly, it shows better performance in terms of classification accuracy. To statistically test the impact of feature selection, the hypotheses are formed as follows:

H1 There is no difference in the accuracy of the classifiers between using no feature selection technique and using feature selection techniques.

H2 There exists a difference in the accuracy of the classifiers between using no feature selection technique and using feature selection techniques.

Thus it is concluded that by using feature selection techniques the time and space complexity of defect prediction is reduced without affecting the prediction accuracy.

1. It reduces the time and the amount of storage space required.
2. Removal of multi-collinearity improves the performance of the machine learning model.
3. It becomes easier to visualize the data when we reduce the data to low dimensions such as 2D or 3D.

Future work lies in double pre-processing of the datasets, by applying instance filtering along with attribute selection, and in performing t-tests.
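One way the hypotheses H1 and H2 stated above could be examined is a paired t-test on classifier accuracies measured with and without feature selection. The sketch below only computes the test statistic and degrees of freedom; the function name and the accuracy values are hypothetical placeholders and are not results of this study.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical accuracies on four datasets (placeholders, not measured values).
without_selection = [0.80, 0.82, 0.85, 0.90]
with_selection = [0.82, 0.83, 0.88, 0.92]

t, df = paired_t_statistic(without_selection, with_selection)
print(f"t = {t:.3f}, degrees of freedom = {df}")
```

The statistic would then be compared with the critical value of Student's t distribution for the chosen significance level and the reported degrees of freedom to decide between the two hypotheses.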

References

Tian, J.: Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement. Wiley, Hoboken (2005)
Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Comput. Surv. 42(3), 10 (2010). https://doi.org/10.1145/1670679.1670680
http://www.softwaretestingtimes.com/2010/04/softwaretestingeffort-estimation.htm
Chauhan, N.S., Saxena, A.: A green software development life cycle for cloud computing. IT Prof. 15(1), 28–34 (2013)
Sandhu, P.S., Brar, A.S., Goel, R., Kaur, J., Anand, S.: A model for early prediction of faults in software systems. In: 2nd International Conference on Computer and Automation Engineering, Singapore, pp. 281–285 (2010)
Emam, K.E., Melo, W., Machado, J.C.: The prediction of faulty classes using object-oriented design metrics. J. Syst. Softw. 56, 63–75 (2001)
Kuncheva, L.I., Skurichina, M., Duin, R.P.W.: An experimental study on diversity for bagging and boosting with linear classifiers. Inf. Fus. 3(4), 245–258 (2002)
Aljamaan, H.I., Elish, M.O.: An empirical study of bagging and boosting ensembles for identifying faulty classes in object-oriented software. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM ’09), pp. 187–194, IEEE, Nashville (2009)
Okutan, A., Yıldız, O.T.: Software defect prediction using Bayesian networks. Empir. Softw. Eng. 19(1), 154–181 (2014)
Shan, C., Chen, B., Hu, C., Xue, J., Li, N.: Software defect prediction model based on LLE and SVM. In: Proceedings of the Communications Security Conference (CSC ’14), pp. 1–5 (2014)
Koru, A.G., Liu, H.: Building effective defect-prediction models in practice. IEEE Softw. 22(6), 23–29 (2005)
Sheela, K.G., Deepa, S.N.: Neural network based hybrid computing model for wind speed prediction. Neurocomputing 122, 425–429 (2013)
Wang, T., Zhang, Z., Jing, X., Zhang, L.: Multiple kernel ensemble learning for software defect prediction. Autom. Softw. Eng. 23, 569–590 (2015)
Xu, Z., Xuan, J., Liu, J., Cui, X.: MICHAC: defect prediction via feature selection based on maximal information coefficient with hierarchical agglomerative clustering. In: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Suita, pp. 370–381 (2016)
Ryu, D., Baik, J.: Effective multi-objective naïve Bayes learning for cross-project defect prediction. Appl. Soft Comput. 49, 1062 (2016). https://doi.org/10.1016/j.asoc.2016.04.009
Abdi, Y., Parsa, S., Seyfari, Y.: A hybrid one-class rule learning approach based on swarm intelligence for software fault prediction. Innov. Syst. Softw. Eng. 11(4), 289–301 (2015). https://doi.org/10.1007/s11334-015-0258-2
Valles-Barajas, F.: A comparative analysis between two techniques for the prediction of software defects: fuzzy and statistical linear regression. Innov. Syst. Softw. Eng. 11(4), 277–287 (2015). https://doi.org/10.1007/s11334-015-0256-4
Shan, C., Chen, B., Hu, C., Xue, J., Li, N.: Software defect prediction model based on LLE and SVM. In: Proceedings of the Communications Security Conference (CSC ’14), pp. 1–5 (2014)
Yang, Z.R.: A novel radial basis function neural network for discriminant analysis. IEEE Trans. Neural Netw. 17(3), 604–612 (2006). https://doi.org/10.1109/TNN.2006.873282
Arar, Ö.F., Ayan, K.: Software defect prediction using cost-sensitive neural network. Appl. Soft Comput. J. 33, 263–277 (2015)
Bautista, A.M., Feliu, T.S.: Defect prediction in software repositories with artificial neural networks. In: Mejia, J., Munoz, M., Rocha, Á., Calvo-Manzano, J. (eds.) Trends and Applications in Software Engineering. Advances in Intelligent Systems and Computing, vol. 405. Springer, Cham (2016)
Khoshgoftaar, T.M., Gao, K.: Feature selection with imbalanced data for software defect prediction. In: 2009 International Conference on Machine Learning and Applications, Miami Beach, pp. 235–240 (2009)
Khoshgoftaar, T.M., Seliya, N., Sundaresh, N.: An empirical study of predicting software faults with case-based reasoning. Softw. Qual. J. 14(2), 85–111 (2006)
Malhi, A.: PCA-based feature selection scheme for machine defect classification. IEEE Trans. Instrum. Meas. 53(6), 1517–1525 (2004)
Clark, C.C.T., et al.: A review of emerging analytical techniques for objective physical activity measurement in humans. Sports Med. 47, 439–447 (2016)
Software Defect Dataset: Promise repository, http://promise.site.uottawa.ca/SERepository/datasets-page.html
Andersson, C.: A replicated empirical study of a selection method for software reliability growth models. Empir. Softw. Eng. 12(2), 161–182 (2007)
Andersson, C., Runeson, P.: A replicated quantitative analysis of fault distributions in complex software systems. IEEE Trans. Softw. Eng. 33(5), 273–286 (2007)
Mangasarian, O.L., Musicant, D.R.: Lagrangian support vector machines. J. Mach. Learn. Res. 1, 161–177 (2001)
Suykens, J.A.K., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999)
Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans. Softw. Eng. 34(4), 485–496 (2008)

Acknowledgements

I would like to express my gratitude to Dr. Lilly Florence, Prof. & HOD, MCA programme, Adhiyamaan College of Engineering, Hosur, for providing the guidance and support that made this work possible.

Author information

Authors and Affiliations

MCA Department, PESIT-BSC, Bangalore, 560100, India
R. Jayanthi
MCA Department, Adhiyamaan College of Engineering, Hosur, India
Lilly Florence


Corresponding author

Correspondence to R. Jayanthi.


About this article

Cite this article

Jayanthi, R., Florence, L. Software defect prediction techniques using metrics based on neural network classifier. Cluster Comput 22 (Suppl 1), 77–88 (2019). https://doi.org/10.1007/s10586-018-1730-1


Received: 14 November 2017
Revised: 22 December 2017
Accepted: 05 January 2018
Published: 07 February 2018
Issue Date: 16 January 2019
DOI: https://doi.org/10.1007/s10586-018-1730-1
