1 Introduction

Software engineering is an enormous field that is widely exploited in software-based industrial applications. In recent years, people have come to depend increasingly on software-based applications, where the quality of software is considered the most salient factor for user experience [1]. In real-time software-based applications, various problems may occur, such as failure of software functionality and wrong output, which may lead to customer dissatisfaction. Moreover, when a huge number of tasks are processed and carried out through software, functional failures can result. Software failures and software defects generally occur in real-time applications. A software failure occurs when the actual behavior deviates from the required functionality; this inconsistency between actual and required functionality is also known as a fault or bug in the software [2]. The increasing demand for user applications induces more complexity in software applications, which leads to the development of poor-quality software and affects the business of various software companies.

Previous studies and experience in this field can assist in predicting the bugs in software products. Initially, software industries adopted manual testing for software defect detection. Manual software testing requires 27% of the human effort in the overall development of an application [3]. Manual testing methods require more time and human effort and cannot resolve all bugs present in the software. To address this issue, defect prediction models are widely used by industries. These models help in defect prediction, effort estimation, reliability checking, risk analysis etc. during the development phase [4]. Using a software quality prediction model in an early stage of the software development lifecycle (SDLC) can also minimize risk, resulting in user satisfaction, cost reduction and reduced human effort. Various techniques have been proposed to address software defect prediction. In this context, software metrics play an important role in predicting bugs in a software product by analyzing the relationship between software metrics and the quality of the output product. Generally, software metrics are categorized as process metrics, project metrics and product metrics. Process metrics are used for software development and for maintaining software functionality and lifespan. Project metrics provide multiple parameters, i.e., total number of developers, skills of developers, work scheduling, organization and size of software, whereas product metrics describe various characteristics such as design features, quality level requirements and performance level requirements for a particular software application.

Identifying, locating and detecting software defects becomes a tedious task for researchers because of the huge size of software applications. Furthermore, estimating defect density is also a challenging task in software defect detection and prediction. In software engineering and defect prediction, machine learning and data mining techniques are considered the most promising techniques. Data mining techniques used as bug prediction models mainly aim at classifying a software dataset into faulty and non-faulty classes. In this process, an input software dataset whose actual class values are known is provided to the classifier. At this stage, the classifier analyzes all input parameters and formulates a trained model for further processing. The trained model is based on the patterns of the input software datasets. In the next stage, a random test input is given to the classifier and compared with the trained model, which provides the output as a prediction of software bugs. Prior to this scheme, requirement-based [5] and design-metrics-based [6] approaches showed significant performance, but software complexity and prediction accuracy remain challenging. Hence, data mining techniques were introduced, such as bagging [7], boosting [8], naïve Bayes [9], support vector machine [10] and J48 decision tree [11]. These techniques provide better performance for software defect prediction modeling. In order to predict software defects, input data acquired from software metrics, known as the attributes or features of a particular software product, is passed to the classifier. Conventional techniques suffer from accuracy and time-requirement issues due to the complex architecture of software. This complexity issue can be addressed by using feature reduction schemes.
Generally, these schemes are applied to multidimensional datasets where multiple attributes are present and some of these attributes may be irrelevant to the learned pattern for classification. The feature selection technique removes irrelevant features from the dataset, improves computational efficiency and reduces complexity. Various techniques are available for feature selection and reduction in data mining, such as principal component analysis (PCA), kernel PCA, graph-based kernel PCA, linear discriminant analysis and generalized discriminant analysis.

Recently, artificial intelligence has also been considered a promising technique for prediction. Gnana et al. [12] presented a neural-network-based study for predicting wind speed and obtained better performance for uncertain data. Hence, a neural network can also be applied to software defect prediction for industrial applications. Furthermore, this technique can be improved by combining a feature reduction scheme with neural network classification, so that better software defect prediction performance can be obtained in terms of classification accuracy. This work mainly concentrates on a software defect prediction technique in which neural network classification is implemented and an enhanced PCA is incorporated to obtain significant performance when compared with state-of-the-art software defect prediction techniques.

The rest of the manuscript is organized as follows: Sect. 2 presents a brief discussion of recent techniques for software defect prediction, the proposed approach is discussed in Sect. 3, Sect. 4 deals with the experimental study and Sect. 5 presents concluding remarks.

2 Literature survey

This section provides a brief study of recent techniques in the field of software defect prediction. As discussed in the previous section, data-mining-based machine learning techniques have attracted researchers for software defect prediction owing to their significant performance.

Machine learning is a promising technique for prediction. Wang et al. [13] presented a software defect prediction model for improving the quality of software application systems. Defective software databases consist of imbalanced data, which generates randomness in pattern characteristics. This issue motivates the development of an efficient and precise classifier for academic and industrial application scenarios. The authors discussed that kernel-based and ensemble learning can provide better performance for various machine learning applications. Multiple kernels are capable of efficient learning when high-dimensional datasets are considered, which helps to achieve better feature representation and can provide better performance when weak classifiers are applied. Their work therefore considers both multiple kernel and ensemble learning for software defect prediction and develops a new approach, multiple kernel ensemble learning. For cost reduction, a weight updating strategy is applied in which misclassified instances are considered and weight computation is applied again. Xu et al. [14] studied software defect prediction techniques and concluded that conventional techniques use preprocessing and feature selection schemes to reduce irrelevant features, but some important features still get discarded, which degrades the performance of the defect prediction technique. To address this issue, a maximal information correlation based technique is presented. This technique consists of two main stages: first, maximal information is computed for each candidate feature, and clustering is applied in a later stage.

Recently, Duksan et al. [15] discussed that software defect data are imbalanced in nature and that very few instances show attributes which belong to the defective class during the prediction stage. This causes performance degradation in software industries; hence, a precise classification scheme is required. To overcome this issue, the whole problem is converted into a multi-objective optimization problem, where a multi-objective learning scheme is applied by considering varied cross-project environments. Detection probability maximization, false-alarm probability minimization and overall performance maximization are the main objectives of this work, which are achieved by applying multi-objective optimization and a naïve Bayes learning scheme.

In machine learning and classification techniques, the amount of training and testing data also plays an important role, and the imbalance problem increases complexity during the learning phase. Abdi et al. [16] derived a new approach for imbalanced data and the train-test ratio required for a classifier to provide significant performance. This scheme uses a combination of nearest-neighbor and instance selection, where k-nearest neighbor is applied for learning and naïve Bayes is applied for global knowledge learning and classification. This helps to reduce the false alarm probability and improves classification accuracy.

Conventional techniques for software defect prediction fail to provide precise results and require more computational time and complexity. Barajas et al. [17] introduced a technique for defect prediction which can help to finish software development within a given time. This work mainly presents a comparative study between two learning techniques for software defect prediction: fuzzy linear regression and statistical linear regression. The two techniques model uncertainty differently: statistical linear regression treats it as randomness, whereas the fuzzy model treats it as fuzziness.

In the field of software defect prediction, Shan et al. [18] used a well-known machine learning technique, the support vector machine (SVM). Furthermore, randomness in attributes is addressed by applying the locally linear embedding (LLE) technique together with the support vector classifier. In this approach, the SVM parameters are further optimized using ten-fold cross validation and a grid search scheme. The experimental study shows that LLE-SVM gives better performance for defect prediction.

As discussed in the previous section, artificial intelligence is also considered a promising technique for classification and prediction. Neural-network-based software defect detection techniques have also been introduced during recent years of software industry growth. Yang et al. [19] introduced software defect prediction using a neural network technique in which a radial basis function neural network is used along with a Bayesian method. The performance of the radial basis neural network can be improved by improving the weight update structure; this is carried out by applying single-Gaussian and two-Gaussian structures, and an expectation-maximization scheme is also applied for weight realization. In [20], an improved approach for classification and prediction is presented for software defect prediction. In this research, the artificial bee colony (ABC) optimization approach is combined with an artificial neural network. The ABC scheme is applied during the neural network training phase, which helps to obtain the optimal weights for neural network computation.

Bautista et al. [21] also presented a study on software defect prediction using a neural-network-based machine learning approach. In this work, GitHub repositories are considered for defect prediction analysis. With the help of these repositories, the relationship between software code and its defects is analyzed, and a neural network is implemented for classification and prediction. In machine-learning-based classification and prediction techniques, feature selection and reduction can improve performance. Considering this an important factor, Khoshgoftaar et al. [22] developed a feature selection scheme for imbalanced software defect datasets. First, wrapper-based attribute selection is applied, resulting in the selection of attribute subsets. In the next stage, random undersampling is applied, which helps to mitigate the negative effect of the imbalanced dataset. This stage follows five steps for data preprocessing: the first step applies training on the original (raw, unaltered) data; the second step applies training on the sampled (fit) dataset; in the third step the unsampled version is considered for attribute selection; later, only the selected data is considered for training, along with the up-sampled version of the input data. These studies show better performance and conclude that if a better approach for feature selection can be implemented, the software defect prediction scheme can be improved and applied to real-time industrial applications.

This brief literature survey discusses machine learning techniques, optimization techniques, feature selection (reduction) techniques and artificial intelligence. The studies show that artificial intelligence can provide precise performance for software defect prediction and that it can be improved further by using a feature reduction scheme, resulting in complexity reduction and overall performance enhancement.

3 Proposed approach

Software defect prediction is a crucial task in the field of software engineering. The previous section briefly reviewed machine-learning-based software defect prediction techniques. These techniques address the imbalance problem of software bugs, but classification accuracy and overall performance remain challenging for researchers. To address this issue, we present a combined scheme of feature reduction and an artificial neural network technique for software defect prediction. The first subsection presents the improved PCA approach for dimension reduction and its mathematical modeling, whereas the second subsection provides details about the combined implementation of the neural network and the proposed PCA approach.

4 Principal component analysis

Principal component analysis (PCA) is a technique for reducing the dimensionality of datasets, increasing interpretability while at the same time minimizing the loss of information. It is a mathematical procedure, also called an orthogonal linear transformation, that transforms the data into a new coordinate system. Importantly, PCA is a feature extraction technique rather than a feature selection method: linear combinations of the original attributes yield new attributes, and the new attributes with the highest variance are retained to perform the reduction. Some papers, such as [23], used PCA to improve the performance of their experiments. According to [24], the PCA technique transforms n vectors \(\{x_1, x_2, \ldots , x_n\}\) from the d-dimensional space to n vectors \(\{x'_1, x'_2, \ldots , x'_n\}\) in a new \(d'\)-dimensional space.

$$\begin{aligned} x'_i = \sum _{k=1}^{d'} a_{k,i}\, \hbox {e}_k, \quad d'\le d, \end{aligned}$$

where \(\hbox {e}_k\) are the eigenvectors corresponding to the \(d'\) largest eigenvalues of the scatter matrix S and \(a_{k,i}\) are the projections (principal components of the original data sets) of the original vectors \(x_i\) on the eigenvectors \(\hbox {e}_k\).

Principal component analysis is a correlation-matrix-based technique, where the matrix is obtained by second-order moment computation and characterizes a given input random vector. If the data are zero-mean, the correlation matrix coincides with the covariance matrix. In computer applications, PCA is similar to the Karhunen–Loeve transform (KLT), where correlation can be extracted between pixel groups or neighbouring pixels. Moreover, PCA helps to remove the second-order correlation generated by a random process. To formulate low-dimensional uncorrelated data, eigenvector computation is applied to the covariance matrix of the input vector. This process linearly transforms high-dimensional data into a low-dimensional space. Generally, PCA is performed by applying singular value decomposition (SVD) to the given input data matrix.

A significant PCA model can be constructed by using information optimization techniques, where either the data reconstruction error or the variance of the projected input data is taken as the optimization criterion. PCA computes \(\mathcal{O}\) orthonormal directions in a given subspace, i.e., \(\overline{\overline{\mathcal{W}}}_i \in \mathcal{S}^{n}, i=1, 2, 3, \ldots , \mathcal{O}, \mathcal{O}<n \); these orthonormal directions are chosen to capture the maximum possible data variance. Furthermore, any given input vector \(\left( {d\in \mathcal{S}^{n}} \right) \) can be converted into the \(\mathcal{O}\)-dimensional space without losing indispensable information about the data. The input data vector d is projected into the \(\mathcal{O}\)-dimensional space by using \(\overline{\overline{\mathcal{W}}} \), where the inner products \(( {d^{T}\overline{\overline{\mathcal{W}}} _i } )\) give the resulting reduced-dimension representation. During this process, PCA computes the unit directions used for projecting the input data; the projections \(y=d^{T}\overline{\overline{\mathcal{W}}} _i \) with the largest variance are known as principal components. This variance can be represented as follows:

$$\begin{aligned} \sigma _{PCA} \left( \mathcal{W} \right) = \sigma \left[ {y^{2}} \right] = \overline{\overline{\mathcal{W}}}^{T}\mathcal{C}\, \overline{\overline{\mathcal{W}}} =\frac{\mathcal{W}^{T}\mathcal{C}\mathcal{W}}{\left| {\left| \mathcal{W} \right| } \right| ^{2}}, \end{aligned}$$
(1)

where \(\overline{\overline{\mathcal{W}}} = \mathcal{W}/{\left| {\left| \mathcal{W} \right| } \right| }\)

In the next stage, linear least squares estimation is applied to reconstruct the input data, i.e., \(\hat{d}\). This can be computed using Eq. (2).

$$\begin{aligned} \hat{d}_t =\sum \limits _{i=1}^\mathcal{O} \mathcal{G}_i \left( t \right) \overline{\overline{\mathcal{W}}} _i \end{aligned}$$
(2)

With the help of reconstructed data, reconstruction error can be computed by taking the difference between original and reconstructed data.

$$\begin{aligned} e=d-\hat{d}_t =\sum \limits _{i= \mathcal{O}+1}^n a_i \overline{\overline{\mathcal{W}}} _i \end{aligned}$$
(3)

The reconstruction error is orthogonal to the reconstructed data. Our main aim here is to reduce this reconstruction error for software metrics during PCA reconstruction, which can improve performance by reducing the error in the dimension-reduced dataset.

To address this issue, we present a new approach for dimension reduction that improves the performance of PCA, resulting in overall performance improvement. PCA can also be viewed as finding the basis that minimizes the reconstruction error arising when projecting the data onto a k-dimensional subspace.

$$\begin{aligned} \hbox {PCA reconstruction}= \hbox {PC scores} \cdot \hbox {Eigenvectors}^{\top } +\hbox {Mean} \end{aligned}$$

PCA computes eigenvectors of the covariance matrix and sorts them by their eigenvalues i.e., amount of explained variance. The centered data can then be projected onto these principal axes to obtain the principal components or scores. To reduce the dimensionality we can use a subset of principal components and discard the rest.
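The projection and reconstruction steps described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch of standard PCA, not the implementation used in this work; the function name `pca_reconstruct` and the randomly generated `metrics` table are assumptions made for the example.

```python
import numpy as np


def pca_reconstruct(data, n_components):
    """Project data onto the leading principal axes and reconstruct it."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Covariance matrix of the centered data (one feature per column).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort by explained variance and keep the leading axes.
    order = np.argsort(eigvals)[::-1][:n_components]
    axes = eigvecs[:, order]                  # principal axes, shape (p, k)
    scores = centered @ axes                  # PC scores, shape (n, k)
    # PCA reconstruction = PC scores . Eigenvectors^T + Mean
    reconstruction = scores @ axes.T + mean
    return scores, reconstruction


rng = np.random.default_rng(0)
# Hypothetical stand-in for a software-metrics table: 50 modules, 5 metrics.
metrics = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))
scores, recon = pca_reconstruct(metrics, n_components=2)
error = metrics - recon                       # reconstruction error
```

Keeping all five components reproduces the input exactly, while keeping fewer leaves a reconstruction error that is orthogonal to the centered reconstruction, in line with the statement made for Eq. (3).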

Let us consider that the input data d is an \(\left( {n\times p} \right) \) matrix modeled as Gaussian, and that its covariance matrix needs to be computed. In the software defect prediction model, the input data is arranged such that the sample size n is smaller than p and stored in vector form. Under this assumption, the maximum likelihood estimate of the covariance matrix can be computed as follows:

$$\begin{aligned} {\Sigma }_{ML} =\frac{1}{n}{d}'\left[ \mathcal {J}_p-\frac{1}{n}vv'\right] d. \end{aligned}$$
(4)

In high-dimensional cases, this matrix becomes non-positive definite, ill-conditioned or unstructured, and even singular, which causes performance degradation. This issue can be addressed by using Gaussian maximum likelihood based principal component analysis. According to the proposed approach, a maximum likelihood based model is used for mapping the underlying space into the data space as:

$$\begin{aligned} x=\varepsilon +\mu +f{\Lambda }, \end{aligned}$$
(5)

where x denotes the high-dimensional \(\left( {p\times 1} \right) \) variable, \({\Lambda }\) is the \(\left( {p\times \mathcal{O}} \right) \) linear transformation that maps f to x, f is the \(\left( {\mathcal{O}\times 1} \right) \) variable in the underlying space, \(\mu \) is the \(\left( {p\times 1} \right) \) mean vector and \(\varepsilon \) denotes the Gaussian random error or noise of the input signal vector.

To obtain an efficient solution for the given input vector, the probability distribution \(p\left( {x|f} \right) \) is formulated by using the probability model of the random error \( \left( \varepsilon \right) \), which can be expressed as:

$$\begin{aligned} p\left( {\varepsilon ;\sigma ^{2}} \right) =\left( {2\pi \sigma ^{2}} \right) ^{-\frac{p}{2}}\exp \left( {-\frac{1}{2\sigma ^{2}}{\varepsilon }'\varepsilon }\right) . \end{aligned}$$
(6)

The probability density is assumed to be spherical. According to Eq. (5), \(\varepsilon =x-\mu -f{\Lambda }\), so the conditional probability is obtained by substituting this expression into \(p\left( \varepsilon \right) \):

$$\begin{aligned}&p\left( {x{|}f;{\Lambda },\mu ,\sigma ^{2}} \right) \nonumber \\&\quad =\left( {2\pi \sigma ^{2}} \right) ^{-\frac{p}{2}}\exp \left( {-\,\frac{1}{2\sigma ^{2}}\left| {\left| {x-\mu -f{\Lambda }} \right| } \right| ^{2}} \right) \end{aligned}$$
(7)

This technique provides a complete distribution of the dataset in the given subspace, such that no data fall outside the defined probabilistic limit.

The main aim of the proposed PCA approach is to estimate the unknown parameters \({\Lambda }, \mu \) and the noise variance \(\sigma ^{2}\) by maximum likelihood from the observations. To this end, the likelihood function needs to be computed, which can be expressed as:

$$\begin{aligned}&L\left( {{\Lambda }, \mu ,\sigma ^{2}|x} \right) \nonumber \\&\quad = \prod \limits _{i=1}^n p\left( {x_i ;\Lambda ,\mu ,\sigma ^{2}} \right) \nonumber \\&\quad =\left( {2\pi } \right) ^{-\frac{np}{2}}\left| {\Sigma } \right| ^{-\frac{n}{2}}\exp \left[ {-\,\frac{1}{2}\sum \limits _{i=1}^n \left( {x_i -\mu } \right) ^{{'}}\Sigma ^{-1}\left( {x_i -\mu } \right) } \right] \nonumber \\ \end{aligned}$$
(8)

The term \(\sum \nolimits _{i=1}^n \left( {x_i -\mu } \right) ^{{'}}{\Sigma }^{-1}\left( {x_i -\mu } \right) \), evaluated at \(\mu =\hat{\mu }\), can also be expressed as \(n\,tr\left( {\Sigma }^{-1}S \right) \), where \(S= \frac{1}{n}\sum \nolimits _{i=1}^n \left( {x_i -\hat{\mu } } \right) \left( {x_i -\hat{\mu }} \right) '\).

S is known as the sample covariance matrix of the observed software metrics data and \(\hat{\mu }\) denotes the maximum likelihood estimate of the mean vector \(\mu \), which is computed as \(\hat{\mu } =\frac{1}{n}\sum \nolimits _{i=1}^n x_i = \bar{x}\).

Hence, log-likelihood function can be expressed as:

$$\begin{aligned} L\left( {{\Lambda }, \mu ,\sigma ^{2}|x} \right)= & {} -\,\frac{np}{2}\log \left( {2\pi } \right) -\,\frac{n}{2}\log \left| {\Sigma } \right| \nonumber \\&-\frac{n}{2}tr[{{\Sigma }^{-1}S}] \end{aligned}$$
(9)
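As a quick numerical check of Eq. (9), the log-likelihood can be evaluated from the sample covariance S alone. The sketch below assumes NumPy and a caller-supplied positive definite covariance matrix; the function name `gaussian_log_likelihood` and the random test data are illustrative assumptions, not part of the original study.

```python
import numpy as np


def gaussian_log_likelihood(data, sigma):
    """Eq. (9): -np/2 log(2*pi) - n/2 log|Sigma| - n/2 tr(Sigma^{-1} S)."""
    n, p = data.shape
    mu_hat = data.mean(axis=0)                 # ML estimate of the mean
    centered = data - mu_hat
    sample_cov = centered.T @ centered / n     # S with 1/n normalisation
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("sigma must be positive definite")
    # tr(Sigma^{-1} S) without forming the explicit inverse.
    trace_term = np.trace(np.linalg.solve(sigma, sample_cov))
    return -0.5 * n * (p * np.log(2 * np.pi) + logdet + trace_term)


rng = np.random.default_rng(1)
x = rng.normal(size=(30, 4))                   # hypothetical data, n=30, p=4
sigma = np.cov(x, rowvar=False) + 0.1 * np.eye(4)
ll = gaussian_log_likelihood(x, sigma)
```

The value agrees with summing the individual Gaussian log-densities of Eq. (8) at \(\mu =\hat{\mu }\), which is the identity the trace form relies on.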

Maximizing the log-likelihood with respect to \({\Lambda }\) and \(\sigma ^{2}\) provides a closed-form solution for error reduction in PCA data reconstruction. The maximum likelihood estimates of \( {\Lambda }\) and \(\sigma ^{2}\) can be obtained as:

$$\begin{aligned} \hat{\Lambda }_{ML}= & {} \left( {p\times m} \right) \left( {L_m -\sigma ^{2}I_m } \right) ^{\frac{1}{2}} R\nonumber \\ \hat{\sigma }_{ML} ^{2}= & {} \sum \limits _{j=m+1}^p \lambda _j \times \frac{1}{p-m} \end{aligned}$$
(10)

The above technique improves the performance of PCA feature reduction by reducing the error during feature reconstruction in a given space, using the maximum likelihood model and maximizing over the relevant parameters.
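The closed-form estimates of Eq. (10) can be sketched as follows. This is a hedged illustration based on the standard probabilistic PCA formulation, not the authors' MATLAB code: it assumes that the \(\left( {p\times m} \right) \) factor in Eq. (10) is the matrix of the m leading eigenvectors of S, that \(L_m\) holds the corresponding eigenvalues \(\lambda _j\), and that the rotation R is the identity. The function name `ppca_ml_estimates` and the synthetic data are likewise assumptions.

```python
import numpy as np


def ppca_ml_estimates(data, m):
    """Closed-form ML estimates of Lambda (p x m) and sigma^2, cf. Eq. (10)."""
    n, p = data.shape
    if not 0 < m < p:
        raise ValueError("m must satisfy 0 < m < p")
    centered = data - data.mean(axis=0)
    sample_cov = centered.T @ centered / n
    eigvals, eigvecs = np.linalg.eigh(sample_cov)
    order = np.argsort(eigvals)[::-1]          # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Noise variance: average of the p - m discarded eigenvalues.
    sigma2 = eigvals[m:].sum() / (p - m)
    # Assumption: leading eigenvectors scaled by sqrt(lambda_j - sigma^2),
    # with R taken as the identity rotation.
    scale = np.sqrt(np.maximum(eigvals[:m] - sigma2, 0.0))
    lam = eigvecs[:, :m] * scale
    return lam, sigma2


rng = np.random.default_rng(2)
# Hypothetical data: 200 samples, 6 features with two dominant directions.
data = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.5, 0.5])
lam, sigma2 = ppca_ml_estimates(data, m=2)
# Model covariance implied by the standard probabilistic PCA formulation.
model_cov = lam @ lam.T + sigma2 * np.eye(6)
```

Under these assumptions the model covariance preserves the total variance of S, since the variance of the discarded directions is spread evenly through \(\hat{\sigma }_{ML}^{2}\).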

The next stage is the implementation of a neural network model for software defect prediction, which is discussed in Sect. 4.2.

4.1 Advantages of proposed approach

The feature selection technique identifies and extracts the most useful features of the dataset for the purpose of learning; these features are valuable for analysis and future prediction. Redundancy in the data is removed, and the performance of the learning algorithm can be improved. Training data plays a vital role in classification and prediction. If the data fail to reveal the statistical consistency that machine learning algorithms exploit, then learning will fail; it is therefore important to remove redundant data from the training set, and the feature selection technique makes this easy.

4.2 Neural network implementation for software defect prediction

This section presents the implementation of a neural network for the software defect prediction model. The neural network technique is based on the working of the human brain [25]. It contains multiple information-processing units known as neurons. The complete network contains three main computational layers: the input layer, the hidden layer and the output layer. In this work, the input software metrics data are passed to the neural network and processed by each layer; layers and neurons are connected through weights, which show the importance of each neuron. During the learning process, the weight of each neuron is adjusted as required. Each neuron receives inputs from the preceding layer, multiplies each input by its weight and adds the products together. The neuron then computes its activation level from this sum, and the output is sent to the following layer, where the final solution is estimated. A simple working process of a neural network with all elements is depicted in Fig. 1.

Fig. 1 Neuron with various elements

An activation function can be a step, sign, sigmoid or linear function. The choice of activation function depends on the expected task of the network, i.e., classification or regression. The output of a neuron in the ith layer can be described by Eq. (11).

$$\begin{aligned} y_i = f_i \left( {\sum \limits _{j=1}^n W_{ij} x_j + \theta _i } \right) \end{aligned}$$
(11)

where \(y_i \) is the output of a neuron, n is the total number of inputs to this neuron, \(x_j \) is the jth input, \(W_{ij}\) is the weight between the current neuron and the jth input, and \(\theta _i \) is the bias of the neuron. \(f_i\) represents the activation function of this layer. Generally, the activation function is a nonlinear function such as a sigmoid or Gaussian. This enables an ANN to model nonlinear relationships.
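Equation (11) can be illustrated with a few lines of plain Python. The sketch below is a minimal example rather than the network used in this work; the weights, inputs and bias are hypothetical values chosen only to show the computation.

```python
import math


def sigmoid(z):
    """Logistic activation, one common nonlinear choice for f_i."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, bias, activation=sigmoid):
    """Eq. (11): y_i = f_i(sum_j W_ij * x_j + theta_i)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical example: three normalised software metrics feeding one neuron.
y = neuron_output(weights=[0.5, -0.25, 0.1], inputs=[1.0, 2.0, 3.0], bias=-0.3)
```

Passing a different callable as `activation`, for example `lambda z: z` for a linear unit, switches the neuron between the classification and regression settings mentioned above.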

Fig. 2 Architecture of the neural network for PC1 dataset

Associations between software quality metrics and module defect proneness are often complex and nonlinear, so ANN is an appropriate choice for software defect prediction problem. The optimization goal of the network is to minimize the error function by optimizing the network weights (all \(W_{ij}\)). The network error at each iteration is calculated by using different methods such as the root mean squared error, mean absolute error, relative absolute error, and root relative squared error. This error is propagated backward in the network and weights are adjusted to minimize the error. The iteration continues until a stopping criterion is met. Stopping criteria can be either a maximum iteration number or minimum error value. The neural network created for the PC1 dataset is shown in Fig. 2.
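The weight-update loop and its two stopping criteria can be sketched with a single sigmoid neuron trained by gradient descent on the squared error. This is a deliberately reduced illustration: a one-neuron model is the simplest case of backward error propagation, and the learning rate, iteration limit, error threshold and toy data are all assumptions rather than settings from this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, targets, lr=0.5, max_iter=5000, min_rmse=0.05):
    """Gradient descent on one sigmoid neuron with two stopping criteria."""
    n_inputs = len(samples[0])
    weights, bias = [0.0] * n_inputs, 0.0
    history = []                               # RMSE at each iteration
    for _ in range(max_iter):                  # criterion 1: max iterations
        grad_w, grad_b, sq_err = [0.0] * n_inputs, 0.0, 0.0
        for x, t in zip(samples, targets):
            y = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = y - t
            sq_err += err * err
            # Error signal passed back through the sigmoid derivative.
            delta = err * y * (1.0 - y)
            grad_w = [g + delta * xi for g, xi in zip(grad_w, x)]
            grad_b += delta
        rmse = math.sqrt(sq_err / len(samples))
        history.append(rmse)
        if rmse <= min_rmse:                   # criterion 2: minimum error
            break
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias, history


# Toy, linearly separable data standing in for (metrics, defect label) pairs.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 1]
weights, bias, history = train_neuron(samples, targets)
```

The `history` list records the root mean squared error per iteration, so the other error measures named above could be tracked in the same place.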

5 Experimental study

This section provides an experimental study of software defect prediction using the proposed feature reduction and classification. The complete study is implemented using the MATLAB 2013 simulation tool, and the PROMISE [26] open source datasets are considered for experimental analysis.

In this work, we consider four datasets from the PROMISE repository, named KC1, JM1, PC3 and PC4, each of which contains various attributes. Table 1 shows the parameters of the considered datasets: total number of attributes, available modules, defective modules and percentage of defects.

Table 1 PROMISE software defect prediction dataset details

These software datasets contain some general attributes, which are presented in Table 2 along with the name and details of each attribute.

Table 2 PROMISE software defect prediction attribute details

Using these datasets, we apply the software defect prediction system and compare the performance of the proposed model with other state-of-the-art techniques. Generally, various measurement metrics are available for data mining applications, such as true positive rate, false positive rate, accuracy, confusion matrix, precision and recall. In these applications, the confusion matrix is known as the most significant and important parameter for performance analysis. This matrix contains the values of the actual and predicted classes, and classification results can be evaluated on the basis of these values. A general model of the confusion matrix is presented in Table 3.

This confusion matrix helps us to compute total accuracy, precision, specificity, sensitivity and F-measure of the proposed approach.

Accuracy is the rate of correct classification. It is computed as the ratio of correct predictions to the total number of predictions. It can be expressed as:

$$\begin{aligned} Acc= \left( {TP+TN} \right) /\left( {TP+TN+FP+FN} \right) \end{aligned}$$
(12)

The next parameter is sensitivity, the true positive rate, which is computed by identifying the correctly classified non-defective modules. It can be expressed as:

$$\begin{aligned} Sensitivity= TP/\left( {TP+FN} \right) \end{aligned}$$
Table 3 Confusion matrix

The next parameter is specificity, the true negative rate, which measures the correctly classified defective software modules and can be expressed as:

$$\begin{aligned} Specificity= TN/\left( {TN+FP} \right) \end{aligned}$$

Then, we compute the precision of the proposed approach. It is computed as the ratio of true positives to all predicted positives (true and false positives).

$$\begin{aligned} P= TP/\left( {TP+FP} \right) \end{aligned}$$

Finally, the F-measure is computed, which is the harmonic mean of precision and sensitivity. It is expressed as:

$$\begin{aligned} F= 2\times P\times Sensitivity/\left( {P+Sensitivity} \right) \end{aligned}$$
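The five measures defined above follow directly from the four confusion-matrix counts. The sketch below computes them in plain Python; the counts in the usage line are hypothetical and are not results from this study, and returning 0.0 for an empty denominator is a convention assumed here for convenience.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and F-measure."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    sensitivity = ratio(tp, tp + fn)           # true positive rate
    specificity = ratio(tn, tn + fp)           # true negative rate
    precision = ratio(tp, tp + fp)
    # Harmonic mean of precision and sensitivity.
    f_measure = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical confusion-matrix counts, for illustration only.
result = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Writing the denominators with explicit parentheses, as in the function body, avoids the precedence ambiguity that arises when the ratios are written inline.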

In next stage, performance analysis is applied for considered software defect dataset. The complete experimental study is divided into four experimental scenarios according to the considered datasets.

5.1 Test case 1: KC1 dataset

In this case, the PROMISE dataset “KC1” is considered for experimental analysis. This dataset contains 2096 instances with 325 defects, i.e., 15.5% of the dataset is defective. Here we apply the proposed hybrid approach for software defect prediction to the KC1 PROMISE dataset. First, we compute the confusion matrix given in Table 4.

Table 4 Confusion matrix for KC1 dataset using proposed hybrid classifier

Similarly, we compute other statistical performance analysis parameters such as precision, sensitivity, specificity, recall, F-score and accuracy. This analysis is presented in Table 5.

Table 5 Statistical performance analysis for KC1 dataset using proposed hybrid classifier

This shows that the proposed approach achieves an accuracy of 86.91% for the KC1 dataset.

Fig. 3 ROC curve analysis for KC1 dataset

Next, we perform ROC curve and precision–recall curve analysis for test case 1. These two analyses are depicted in Figs. 3 and 4.
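For readers unfamiliar with ROC analysis, the sketch below shows one common way to obtain ROC points and the area under the curve (AUC) from classifier scores, by sweeping the decision threshold and applying the trapezoidal rule. It is an illustrative sketch only: the function names, the convention that label 1 marks the positive class, and the toy labels and scores are assumptions, not data or code from this study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.
    Assumes labels are 0/1 with 1 as the positive class, and that both
    classes are present."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples sharing this score so ties yield one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, purely illustrative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))  # 0.8889
```

For this toy data, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9.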

Fig. 4 Precision–recall curve analysis for KC1 dataset

To validate the performance of the proposed approach on the KC1 dataset, a comparative study is also presented, which shows strong performance for software bug prediction, with an AUC of 81.08%.

5.2 Test case 2: JM1 dataset

In this case, the JM1 dataset is considered, which contains 9535 modules, of which 18.35% are defective. The same experiments as in test case 1 are performed on this dataset. The confusion matrix for this test case is given in Table 6, and Table 7 shows the other statistical performance parameters. The ROC curve is shown in Fig. 5.

Table 6 Confusion matrix for JM1 dataset using proposed hybrid classifier
Table 7 Statistical performance analysis for JM1 dataset using proposed hybrid classifier
Fig. 5 ROC curve analysis for JM1 dataset

Similarly, we perform precision–recall curve analysis for the JM1 dataset. The P–R curve is depicted in Fig. 6.
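A precision–recall curve can be traced in the same spirit: the samples are ranked by score, and precision and recall are recomputed each time the threshold admits one more sample. The sketch below is illustrative only; the function name, the label convention (1 for the positive class) and the toy data are assumptions, and it assumes that all scores are distinct.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) pairs, one per ranked sample.
    Assumes 0/1 labels with 1 as the positive class and distinct scores."""
    pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / pos, tp / (tp + fp)))
    return points


# Toy data, purely illustrative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for recall, precision in precision_recall_points(labels, scores):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Unlike the ROC curve, the P–R curve ignores true negatives, which makes it more sensitive to class imbalance such as the low defect ratios in the datasets considered here.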

Fig. 6 Precision–recall curve analysis for JM1 dataset

5.3 Test case 3: PC3 dataset

In this case, we consider the PC3 dataset, which contains 1125 modules, of which 10.23% are defective. The confusion matrix for this test case is given in Table 8, and Table 9 shows the other statistical performance parameters.

ROC curve and P–R curve analyses are also performed here. Figures 7 and 8 show the ROC and P–R curves for the PC3 dataset, respectively. In this study, the area under the curve (AUC) is obtained as 0.8918.

Table 8 Confusion matrix for PC3 dataset using proposed hybrid classifier
Table 9 Statistical performance analysis for PC3 dataset using proposed hybrid classifier
Fig. 7 ROC curve analysis for PC3 dataset

Fig. 8 Precision–recall curve analysis for PC3 dataset

5.4 Test case 4: PC4 dataset

A similar study is performed for the PC4 dataset, which contains 1399 modules, of which 1.72% are defective. For this dataset, the confusion matrix is given in Table 10, and the statistical performance is presented in Table 11.

Table 10 Confusion matrix for PC4 dataset using proposed hybrid classifier
Table 11 Statistical performance analysis for PC4 dataset using proposed hybrid classifier

The ROC and P–R curves are given in Figs. 9 and 11, respectively. This study shows that the proposed approach obtains an AUC (area under the curve) of 97.20%, which is a significant improvement over other state-of-the-art models. Similarly, it shows better performance in terms of classification accuracy (Fig. 10).

Fig. 9 ROC curve analysis for PC4 dataset

Figure 9 shows the ROC curve analysis for the PC4 dataset, and the precision–recall curve is presented in Fig. 11.

The proposed method improves accuracy in defect prediction while using fewer attributes than previous studies. We used attribute selection to show that, even when the number of attributes was reduced by around 80%, the classifiers gave equal or even greater accuracy than with the full attribute set. Performance testing was done using the full training set on four different datasets: KC1, JM1, PC3 and PC4. The KC1 dataset contains 2096 instances, of which 325 (15.5%) are defective. The results show that the proposed hybrid approach gives 86.91% accuracy for the KC1 dataset, 83.03% for JM1, 89% for PC3 and 93.64% for PC4.

Fig. 10 Analysis in terms of weights given to attributes

Table 12 Comparative analysis in term of AUC
Fig. 11 Precision–recall curve analysis for PC4 dataset

Finally, we present a comparative study for software defect prediction. Table 12 shows a comparative analysis with various state-of-the-art techniques in terms of area under the curve (AUC).

Table 12 includes state-of-the-art techniques such as k-NN, SVM, Naïve Bayes and LDA. The comparative study shows that the proposed approach gives better performance in terms of AUC (Fig. 12).

Fig. 12 Comparative performance analysis in terms of AUC

6 Conclusion and future work

This work focuses on software defect prediction using data-mining techniques. This area has attracted considerable interest from researchers, and various techniques have been proposed to further improve the performance of software defect detection or bug prediction. In this work, we have addressed the issue of classification accuracy for large datasets by developing a new combined approach using feature reduction and classification. To obtain the feature reduction model, PCA is applied and combined with maximum likelihood estimation to reduce the PCA-reconstructed data. Furthermore, a neural network classification technique is implemented for software bug prediction. The experimental study shows that the proposed approach provides better performance and obtains an AUC of 97.20%, which is a significant improvement over other state-of-the-art models. Similarly, it shows better performance in terms of classification accuracy. To statistically test the impact of feature selection, the hypotheses are formed as follows:

H1 There is no difference in the accuracy of the classifiers between using no feature selection technique and using feature selection techniques.

H2 There is a difference in the accuracy of the classifiers between using no feature selection technique and using feature selection techniques.
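The text does not state which statistical test is applied to these hypotheses. One common choice for comparing the same classifiers with and without feature selection is a paired t-test, sketched below under that assumption. The function name and the accuracy values are hypothetical placeholders and are not results from this study.

```python
import math
import statistics


def paired_t_statistic(without_fs, with_fs):
    """Return the paired t statistic and its degrees of freedom for two
    equally long lists of accuracies measured on the same datasets."""
    diffs = [b - a for a, b in zip(without_fs, with_fs)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies (in %), NOT values reported in this paper.
acc_without_fs = [80.0, 82.0, 85.0, 90.0]
acc_with_fs = [81.0, 84.0, 86.0, 93.0]

t, df = paired_t_statistic(acc_without_fs, acc_with_fs)
print(f"t = {t:.3f}, degrees of freedom = {df}")
```

The resulting statistic would then be compared against the critical value of the t distribution with the given degrees of freedom at the chosen significance level: H1 is retained if the statistic falls below it in magnitude, and rejected in favour of H2 otherwise.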

Thus, it is concluded that using feature selection techniques reduces the time and space complexity of defect prediction without affecting the prediction accuracy. Feature reduction offers the following benefits:

1. It reduces the time and the amount of storage space required.

2. Removal of multi-collinearity improves the performance of the machine learning model.

3. It becomes easier to visualize the data when it is reduced to low dimensions such as 2D or 3D.

Future work lies in performing double pre-processing of the dataset, by applying instance filtering along with attribute selection, and in conducting t-tests.