1 Introduction

Software defect prediction (SDP) is a field of study that aims to develop predictive models that identify defect-prone software modules from code quality attributes. SDP supports the timely identification of defective modules, which helps industry allocate limited software testing resources effectively. Researchers have developed various machine learning (ML) techniques to predict the defect proneness of software modules [1]. Several SDP studies have also addressed challenges such as feature selection and class imbalance in the datasets. However, these models still suffer from low prediction performance and high training time [2]. Obtaining an effective feature representation [3] is another common issue among SDP models [4, 5]. This study addresses these issues by proposing a novel defect prediction model, PCA–RVFL.

The proposed model, PCA–RVFL, uses principal component analysis (PCA) for dimensionality reduction and a random vector functional link (RVFL) network to classify modules as defective or non-defective from the features obtained by PCA. RVFL is a randomized neural network architecture (RNNA) with direct links between the input and output layers. Unlike iteratively trained neural networks, RVFL assigns the weights between the input and hidden layers randomly, while the weights connecting the input and hidden layers to the output layer are calculated analytically. These structural differences make RVFL computationally faster than traditional neural networks such as the multi-layer perceptron (MLP), without compromising prediction performance [6]. RVFL has been widely used in research areas such as forecasting [7], online learning [8], classification [9], and regression [10]. However, its use in SDP has not been explored. Hence, the motivation of this paper is to study the predictive capability of RVFL in SDP.

The research questions addressed in this study are listed below.

  • RQ1. What is the prediction performance of PCA–RVFL in comparison with other ML techniques for within-project scenarios in SDP?

  • RQ2. What is the prediction performance of PCA–RVFL in comparison with other ML techniques for inter-release scenarios in SDP?

  • RQ3. What is the computational efficiency of PCA–RVFL in comparison with other traditional neural network architectures?

The rest of the paper is organized as follows. Section 2 reviews the previous literature on SDP and RVFL, Sect. 3 presents the proposed model, PCA–RVFL, in detail together with its pseudo-code, Sect. 4 presents the experimental framework, Sect. 5 presents the results and observations of the experiment, and Sect. 6 concludes the study.

2 Related Work

Previous studies have shown that neural network architectures tend to achieve better predictive performance but lag in computational efficiency [1, 11]. RNNAs were introduced to overcome this barrier by using analytical functions rather than backward passes to determine the weights and biases [12]. RVFL is one such RNNA. It has performed better than other techniques in other fields [7,8,9,10]. However, its prediction capability has not been assessed in SDP, so this study evaluates the performance of RVFL for SDP.

Many different ML techniques have been employed for SDP in previous studies. Dhamayanthi and Lavanya [13] explored the performance of PCA with Naive Bayes (NB) on seven NASA datasets and achieved an improvement of 10.3% in accuracy. Malhotra conducted several studies [14, 15] of ML techniques for SDP on Android projects and concluded that techniques such as NB, LogitBoost, and MLP outperform other ML techniques. Ghotra et al. [11] applied various ML techniques to cleaned versions of the NASA and PROMISE datasets, evaluating NB, support vector machine (SVM), K-nearest neighbors (KNN), artificial neural network (ANN), decision tree (DTree), and ensemble techniques. From these studies, we selected the ML techniques that previously outperformed others for comparison with the proposed model, PCA–RVFL.

3 PCA–RVFL: The Proposed Model for Software Defect Prediction

3.1 Dimensionality Reduction Using Principal Component Analysis

PCA [16] is a popular statistical technique that extracts the essential information from a given dataset and represents it as a set of mutually orthogonal features called principal components, each formed as a linear combination of the original features. PCA compresses the dataset by representing it using only the principal components with the highest variance.
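The projection described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the function name `pca_project` and the random demonstration data are assumptions made for illustration, and the singular value decomposition is only one of several equivalent ways to obtain the principal components.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top n_components principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # centre each feature
    # Rows of Vt are orthonormal principal directions, ordered by
    # decreasing singular value, i.e. by decreasing explained variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (s[:n_components] ** 2) / (X.shape[0] - 1)
    return Xc @ components.T, components, explained_variance

# Hypothetical data: 50 modules described by 20 metrics.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 20))
Z, components, var = pca_project(X_demo, 12)
print(Z.shape)
```

The components are mutually orthogonal by construction, and keeping the first rows of `Vt` corresponds to keeping the directions of highest variance.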

3.2 Classification Using Random Vector Functional Link Network

RVFL [17] is a feed-forward RNNA with randomized weights and biases. Figure 1 shows the structure of RVFL. The weights between the input layer and the hidden layer, together with the hidden layer biases, are randomly initialized from a fixed range and are not updated during training. The weights between the hidden and output layers and the weights between the input and output layers are calculated analytically.

Fig. 1 Structure of RVFL

Consider N arbitrary samples (xi, yi), i = 1, 2, 3, …, N, where xi is a d-dimensional feature vector and yi is the encoded class label. Each sample belongs to one of the output classes, i.e., yi ∈ {0, 1}.

The equation of RVFL for calculating the output labels is defined as:

$$\mathop \sum \limits_{t = 1}^{k} \beta_{t} g\left( {w_{t} x_{i} + b_{t} } \right) + \mathop \sum \limits_{t = k + 1}^{k + d} \beta_{t} x_{{i\left( {t - k} \right)}} = o_{i}$$
(1)

where i = 1, 2, 3, …, N, k is the number of hidden layer nodes, and g is the activation function. βt is the output weight that connects the tth hidden layer node (for t ≤ k) or the (t − k)th input layer node (for t > k) to the output layer. wt is the vector of input weights connecting the input layer nodes to the tth hidden layer node, bt is the bias of the tth hidden node, and oi is the output label for the ith sample.

Equation (1) can also be rewritten as:

$$H\cdot\beta = O$$
(2)

where β is the output weight matrix, O is the predicted output class label matrix, and H = {hi}, i = 1, 2, …, N, with hi denoting the ith row, is written as:

$$H = \left[ {\begin{array}{*{20}c} {g\left( {w_{1} x_{1} + b_{1} } \right)} & \cdots & {g\left( {w_{k} x_{1} + b_{k} } \right)} & {x_{11} } & \cdots & {x_{1d} } \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {g\left( {w_{1} x_{N} + b_{1} } \right)} & \cdots & {g\left( {w_{k} x_{N} + b_{k} } \right)} & {x_{N1} } & \cdots & {x_{Nd} } \\ \end{array} } \right]_{{N \times \left( {k + d} \right)}}$$
(3)

where H is formed by combining the values of both the hidden layer nodes and the input layer nodes into a matrix of size N × (k + d).
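A small NumPy sketch may help make the layout of H in Eq. (3) concrete. The function names, the toy dimensions, and the random inputs are illustrative assumptions. The sigmoid activation and the uniform range [−1, 1] follow the parameter settings reported in Sect. 4.4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_H(X, W, b):
    """Concatenate hidden activations g(XW + b) with the raw inputs X."""
    hidden = sigmoid(X @ W + b)        # shape (N, k): hidden layer outputs
    return np.hstack([hidden, X])      # shape (N, k + d): direct links appended

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
N, d, k = 5, 3, 4
X = rng.normal(size=(N, d))
W = rng.uniform(-1, 1, size=(d, k))    # random input-to-hidden weights, never trained
b = rng.uniform(-1, 1, size=k)         # random hidden biases, never trained
H = build_H(X, W, b)
print(H.shape)
```

The first k columns hold the hidden node outputs and the last d columns are the unchanged input features, which is what gives RVFL its direct input-to-output links.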

The goal is to minimize the classification error, i.e.,

$$\left| {O - Y} \right| = 0$$
(4)

where Y is actual output class labels.

Using Eq. (2) in Eq. (4), we get

$$\left| {H\cdot\beta \, - \, Y} \right| = \, 0.$$
(5)

In this study, ridge regression [18], i.e., the l2-regularized least squares method with a regularization parameter λ, is used to calculate the optimal value of β by minimizing the penalized squared error:

$$\mathop {\min }\limits_{\beta } \mathop \sum \limits_{i = 1}^{N} (y_{i} - h_{i}^{T} \beta )^{2} + \lambda \left\| \beta \right\|^{2} .$$
(6)

The optimal β derived from Eq. (6) is:

$$\beta = (H^{T} \cdot H + \lambda I)^{ - 1} \cdot H^{T} \cdot Y.$$
(7)

Here I is the identity matrix of size (k + d) × (k + d).

The optimized β contains the trained weights between the hidden and output layer nodes and between the input and output layer nodes. To obtain the predictions of the trained model, Eq. (2) is applied with the now-known β, and the result gives the predicted class labels.
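The analytical training step can be sketched in a few lines of NumPy, using the standard ridge regression closed form β = (HᵀH + λI)⁻¹HᵀY. The function names, the demonstration data, the value of λ, and the 0.5 decision threshold are assumptions made for illustration; the text does not specify the threshold or λ.

```python
import numpy as np

def ridge_output_weights(H, Y, lam):
    """Solve (H^T H + lam * I) beta = H^T Y for beta."""
    A = H.T @ H + lam * np.eye(H.shape[1])
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(A, H.T @ Y)

def predict_labels(H, beta, threshold=0.5):
    """Apply H . beta = O and threshold the outputs (threshold is assumed)."""
    return (H @ beta >= threshold).astype(int)

# Hypothetical feature matrix and binary labels.
rng = np.random.default_rng(0)
H_demo = rng.normal(size=(20, 5))
Y_demo = rng.integers(0, 2, size=20).astype(float)
lam = 0.1                               # assumed regularization strength
beta = ridge_output_weights(H_demo, Y_demo, lam)
labels = predict_labels(H_demo, beta)
print(beta.shape, labels[:5])
```

Because β is obtained from a single linear solve, no iterative weight updates are needed, which is the source of the training-time advantage discussed in Sect. 5.3.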

3.3 Pseudo-code for PCA–RVFL

The pseudo-code of PCA–RVFL is shown in Fig. 2.

Fig. 2 Pseudo-code for PCA–RVFL
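For readers without access to the figure, the overall procedure described in Sects. 3.1 and 3.2 can be approximated by the following sketch. It is an illustrative reconstruction under stated assumptions and not the authors' pseudo-code: the class name `PCARVFL`, the value of λ, the 0.5 threshold, and the synthetic data are all hypothetical, while the 12 components, 80 hidden nodes, sigmoid activation, and [−1, 1] range follow Sect. 4.4.

```python
import numpy as np

class PCARVFL:
    """Illustrative PCA + RVFL classifier (not the authors' implementation)."""

    def __init__(self, n_components=12, n_hidden=80, lam=1.0, seed=0):
        self.n_components = n_components
        self.n_hidden = n_hidden
        self.lam = lam                          # assumed; not given in the text
        self.rng = np.random.default_rng(seed)

    def _reduce(self, X):
        return (X - self.mean_) @ self.components_.T

    def _features(self, Z):
        act = 1.0 / (1.0 + np.exp(-(Z @ self.W_ + self.b_)))   # sigmoid
        return np.hstack([act, Z])                              # direct links

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # Step 1: PCA via SVD of the centred data.
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        Z = self._reduce(X)
        # Step 2: random, fixed input-to-hidden weights and biases in [-1, 1].
        self.W_ = self.rng.uniform(-1, 1, size=(Z.shape[1], self.n_hidden))
        self.b_ = self.rng.uniform(-1, 1, size=self.n_hidden)
        # Step 3: closed-form ridge solution for the output weights.
        H = self._features(Z)
        A = H.T @ H + self.lam * np.eye(H.shape[1])
        self.beta_ = np.linalg.solve(A, H.T @ np.asarray(y, dtype=float))
        return self

    def decision_scores(self, X):
        return self._features(self._reduce(np.asarray(X, dtype=float))) @ self.beta_

    def predict(self, X):
        return (self.decision_scores(X) >= 0.5).astype(int)    # assumed threshold

# Synthetic data: 200 samples, 20 features, label driven by two features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 20))
X_demo[:, :2] *= 5.0
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
model = PCARVFL().fit(X_demo, y_demo)
train_accuracy = float((model.predict(X_demo) == y_demo).mean())
print(train_accuracy)
```

The accuracy printed here is measured on the training data of a synthetic problem and only serves as a sanity check of the sketch; it says nothing about the results reported in Sect. 5.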

4 Experimental Setup and Framework

Figure 3 shows the experimental framework used in this study. The following subsections describe the various components of this framework.

Fig. 3 Experimental framework of this study

4.1 Dataset

For the experiment, 17 datasets are taken from the open-source PROMISE repository [19]. Different releases of the projects are collected to analyze the proposed model PCA–RVFL for both within-project and inter-release scenarios. Table 1 shows the statistics and properties of the datasets. For every dataset, the number of samples, the number of defective samples, and the percentage of defective samples are mentioned.

Table 1 Statistics of the datasets used in the study

4.2 Variables

The datasets described in Sect. 4.1 consist of 20 independent variables taken from the object-oriented (OO) metrics suite and one binary dependent variable which shows if the class in the project is defective (1) or non-defective (0). The 20 OO metrics defined by various researchers [20,21,22,23,24] are shown in Table 2. These independent variables have been used widely in previous studies [14] for training and defect prediction.

Table 2 Object-oriented metrics used in the study

4.3 Performance Measures

The performance of PCA–RVFL is evaluated using AUC–ROC and F-measure. The performance measures used in the study are defined below.

F-measure is the harmonic mean of precision and recall. Recall is the ratio of the correctly predicted defective samples to the total number of samples that are actually defective. Precision is the ratio of the correctly predicted defective samples to the total number of samples predicted as defective.

AUC–ROC is the area under the receiver operating characteristics curve. This curve shows the relationship between recall and false positive rate (FPR). The AUC–ROC indicates the ability of techniques to distinguish between defective and non-defective samples. Higher AUC–ROC values indicate better performance of the techniques.
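Both measures can be computed directly from their definitions, as in the following minimal sketch. The function names and the six-sample example are assumptions made for illustration. The AUC–ROC is computed through its equivalent pairwise form, i.e., the fraction of (defective, non-defective) pairs in which the defective sample receives the higher score, with ties counted as one half.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the defective class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def auc_roc(y_true, scores):
    """Area under the ROC curve via pairwise score comparisons."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions, and continuous scores.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
f_val = f_measure(y_true, y_pred)
auc_val = auc_roc(y_true, scores)
print(round(f_val, 3), round(auc_val, 3))
```

In this example precision and recall are both 2/3, so the F-measure is 2/3, and 8 of the 9 defective/non-defective pairs are ranked correctly, giving an AUC–ROC of 8/9.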

4.4 Parameter Settings

For PCA–RVFL, the variation of the average AUC–ROC with the number of PCA components is shown in Fig. 4. The highest average AUC–ROC value is achieved with 12 PCA components, so 12 components are used in the experiments.

Fig. 4 Variation of average AUC–ROC with number of components in PCA

The parameter settings for RVFL are decided by conducting experiments and based on previous studies [6]. In this study, RVFL is implemented using ridge regression with the sigmoid activation function. The randomization range of weights and biases is [−1,1] from a uniform distribution, and the number of hidden layer nodes is 80.

PCA–RVFL has been compared with various ML techniques: KNN, MLP, radial basis function (RBF) network, NB, DTree, logistic regression (LOG), SVM, random forest (RF) [25], and extreme learning machine (ELM) [26]. The DTree, LOG, SVM, RF, and NB techniques are implemented using the open-source Scikit-learn library of Python with default parameter settings. The MLP and RBF networks are implemented using the Keras library of Python. The parameters for MLP, RBF, and KNN are chosen by conducting experiments on the datasets described in Table 1. For MLP and RBF, the learning rate is 0.1 with the RMSProp optimizer, the number of hidden nodes is 50, and the batch size is 50. For MLP, the sigmoid activation function is used. For ELM, the number of hidden layer nodes is 15, the activation function is sigmoid, and the randomization ranges of the weights and biases are [−1,1] and [0,1], respectively.

5 Experimental Results and Analysis

5.1 RQ1: What is the Prediction Performance of PCA–RVFL in Comparison with Other ML Techniques for Within-Project Scenarios in SDP?

To answer this RQ, experiments are conducted according to the framework described in Sect. 4. Tables 3 and 4 show the comparison of AUC–ROC and F-measure of the ML techniques in the within-project scenario. The highest value for every dataset is highlighted in the tables.

Table 3 AUC–ROC comparison between ML techniques for within-project scenario
Table 4 F-measure comparison between ML techniques for within-project scenario

The proposed model achieves the best AUC–ROC value for most datasets, i.e., 15 out of 17. The average AUC–ROC value for PCA–RVFL is 0.753, which is higher than that of every other technique: 0.656 (PCA–MLP), 0.652 (PCA–NB), 0.652 (RVFL), 0.693 (PCA–ELM), 0.611 (PCA–SVM), 0.645 (PCA–RF), 0.672 (PCA–DTree), 0.626 (PCA–RBF), 0.644 (PCA–KNN), and 0.631 (PCA–LOG). PCA–RVFL performs significantly better than RVFL alone, which shows that dimensionality reduction can play a crucial role in a model's prediction performance. The F-measure values in Table 4 indicate that PCA–RVFL achieves an average F-measure of 0.613. This is an increase in F-measure of 12.47% over PCA–MLP, 15.44% over PCA–NB, 13.1% over RVFL, 16.3% over PCA–ELM, 17.54% over PCA–SVM, 14.7% over PCA–RF, 18.2% over PCA–DTree, 16.3% over PCA–RBF, 17.6% over PCA–KNN, and 30.34% over PCA–LOG. The better performance of PCA–RVFL indicates that it has good prediction capability in the within-project scenarios of SDP.
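As a worked check on the average AUC–ROC values quoted above, the relative improvement of PCA–RVFL over each technique can be computed as follows. The sketch assumes that a percentage increase means (new − old) / old × 100, a formula the text does not state explicitly; the dictionary and function names are illustrative.

```python
# Average within-project AUC-ROC values as quoted in the text.
avg_auc = {
    "PCA-MLP": 0.656, "PCA-NB": 0.652, "RVFL": 0.652, "PCA-ELM": 0.693,
    "PCA-SVM": 0.611, "PCA-RF": 0.645, "PCA-DTree": 0.672,
    "PCA-RBF": 0.626, "PCA-KNN": 0.644, "PCA-LOG": 0.631,
}
proposed_auc = 0.753   # average AUC-ROC of PCA-RVFL

def pct_increase(new, old):
    """Relative increase of new over old, in percent (assumed definition)."""
    return 100.0 * (new - old) / old

gains = {name: round(pct_increase(proposed_auc, value), 2)
         for name, value in avg_auc.items()}
for name, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(f"{name}: +{gain}%")
```

Under this assumed definition, the relative gain in average AUC–ROC is about 14.79% over PCA–MLP, smallest over PCA–ELM, and largest over PCA–SVM.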

5.2 RQ2: What is the Prediction Performance of PCA–RVFL in Comparison with Other ML Techniques for Inter-release Scenarios in SDP?

To answer this RQ, experiments are conducted according to the framework described in Sect. 4. To perform defect prediction in an inter-release scenario, one release of a project is selected for training, and the next release is selected for testing. Table 5 presents the comparison of the AUC–ROC values of the various ML techniques. PCA–RVFL shows the best results for 8 out of 10 datasets, with an AUC–ROC value above 0.7 for 6 out of 10 datasets. The proposed model shows a significant percentage increase in average AUC–ROC over the other ML techniques: 14.07% (PCA–MLP), 14.07% (PCA–NB), 20.79% (PCA–DTree), 22.1% (PCA–KNN), 30.03% (PCA–LOG), 14.63% (PCA–RBF), 8.9% (RVFL), 10.11% (PCA–ELM), 13.14% (PCA–SVM), and 10.28% (PCA–RF). This suggests that PCA–RVFL can cope with class imbalance in the datasets. Table 6 presents the comparison of the F-measure values of the various ML techniques. The proposed model shows the second-highest average F-measure (0.512), behind PCA–NB, which shows the highest (0.587).

Table 5 AUC–ROC comparison between ML techniques for inter-release scenario
Table 6 F-measure comparison between ML techniques for inter-release scenario

The model achieves high AUC–ROC values for most datasets in the experiment, which indicates that PCA–RVFL is suitable for the inter-release scenario of defect prediction. The number of samples and the statistics of two adjacent releases can differ considerably, which may explain why the ML techniques perform worse in the inter-release scenario than in the within-project scenario. In the case of ant_1.5, the highest AUC–ROC value in the within-project scenario is 0.889, while in the inter-release scenario it is 0.671 when the model is trained on ant_1.4. These variations reflect the differences between subsequent releases, which affect the performance of ML techniques in inter-release scenarios.
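The inter-release protocol, in which each release is used for training and its successor for testing, can be sketched as below. The function name and the release identifiers are hypothetical, and the sketch assumes the releases are already listed in chronological order.

```python
def adjacent_release_pairs(releases):
    """Return (train, test) pairs of consecutive releases."""
    return list(zip(releases, releases[1:]))

# Hypothetical release names, assumed to be in chronological order.
releases = ["proj_1.0", "proj_1.1", "proj_1.2"]
pairs = adjacent_release_pairs(releases)
for train_name, test_name in pairs:
    print(f"train on {train_name}, test on {test_name}")
```

A project with n releases yields n − 1 train/test pairs under this scheme, so a project with a single release contributes none.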

5.3 RQ3: What is the Computational Efficiency of PCA–RVFL in Comparison with Other Traditional Neural Network Architectures?

Apart from predictive capability, one of the most important aspects that differentiates PCA–RVFL from traditional neural network architectures is the computational efficiency that RVFL gains from its randomized, non-iterative training, in contrast to the iterative training of MLP and RBF. This section investigates the training time efficiency of PCA–RVFL by comparing it with the training time of two neural network techniques, MLP and RBF. The results in Table 7 indicate that PCA–RVFL is the clear winner in terms of computational efficiency. The average training time for PCA–RVFL is 13.7 ms, far lower than the average training times for RBF (52.9) and MLP (195.1).

Table 7 Training time comparison between neural network techniques

The non-iterative nature of RVFL makes it a computationally efficient ML technique compared to the traditional neural network architectures and makes it suitable for application in problem domains that involve training on large datasets.
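Training times of this kind can be measured with a simple wall-clock harness such as the one below. This is a sketch under assumptions, not the measurement procedure used in the study: the function name, the number of repeats, and the placeholder workload are all illustrative, and the text does not say how the reported times were obtained.

```python
import time

def mean_training_time_ms(train_fn, repeats=5):
    """Average wall-clock time of train_fn() over several runs, in milliseconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn()
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return sum(elapsed) / len(elapsed)

# Placeholder workload standing in for a model's training call.
def dummy_training():
    return sum(i * i for i in range(10000))

avg_ms = mean_training_time_ms(dummy_training)
print(f"{avg_ms:.3f} ms")
```

Averaging over several runs reduces the influence of transient system load; the measured value depends on the hardware, so times from different machines are not directly comparable.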

6 Conclusion

This study proposed a novel SDP model, PCA–RVFL, based on the randomized neural network RVFL and the dimensionality reduction technique PCA. Experiments were conducted to assess the performance of PCA–RVFL for within-project and inter-release scenarios in SDP. The results indicate that, in both scenarios, PCA–RVFL has good prediction capability compared to ten classic ML techniques previously used in SDP. The AUC–ROC values for PCA–RVFL range from 0.624 to 0.889 for the within-project scenario and from 0.671 to 0.754 for the inter-release scenario. The training time results also favor the proposed model, which is computationally efficient compared to neural network architectures like MLP and RBF. The study concludes that randomized neural networks like RVFL can be employed in the field of SDP to achieve good prediction performance. Another important conclusion is that appropriate feature representation is necessary for building effective prediction models in SDP, which is evident from the impact of PCA on the model's performance.