Application of Random Vector Functional Link Network for Software Defect Prediction

Malhotra, Ruchika; Aggarwal, Deepti; Garg, Priya

doi:10.1007/978-981-16-3097-2_11

Conference paper
First Online: 02 October 2021

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1371))

Abstract

Software defect prediction (SDP) aims to develop predictive techniques that identify the defect proneness of software modules using structural and quality attributes. The use of randomized neural networks, especially the random vector functional link (RVFL) network, has been limited in SDP. This study proposes a novel SDP model based on principal component analysis (PCA) and RVFL. The model uses PCA for dimensionality reduction of the data and employs RVFL to classify the defective modules. Extensive experiments are conducted on 17 PROMISE repository datasets, and the proposed model is compared with ten classic machine learning (ML) techniques for within-project and inter-release scenarios. The experimental results indicate that PCA–RVFL outperforms the classic ML techniques for most of the datasets, which supports the predictive capability of the proposed model in the field of SDP.

1 Introduction

Software defect prediction (SDP) is a field of study that aims at developing predictive models to identify defect-prone software modules using code quality attributes. SDP helps in the timely identification of defective modules, which helps industry allocate its limited software testing resources effectively. Researchers have developed various ML techniques to predict the defect proneness of software modules [1]. Various SDP studies have also been conducted to address challenges like feature selection and class imbalance in the datasets. However, these models face several challenges like low prediction performance and high training time [2]. The issue of obtaining an effective feature representation [3] is also common among these SDP models [4, 5]. This study tries to address these issues by proposing a novel defect prediction model, PCA–RVFL.

The proposed model, PCA–RVFL, uses PCA for dimensionality reduction and RVFL to classify the defective and non-defective classes from the features obtained by PCA. RVFL is a randomized neural network architecture (RNNA) with direct links between the input and output layers. Unlike iterative neural networks, in RVFL the weights between the input and the hidden layer are randomly assigned, while the weights from the input and hidden layers to the output layer are calculated analytically. These structural variations make RVFL computationally faster than traditional neural networks like the multi-layer perceptron (MLP), without compromising prediction performance [6]. RVFL has been widely used in many different research areas like forecasting [7], online learning [8], classification [9], regression [10], etc. However, its use in SDP has not been explored. Hence, the motivation of this paper is to study the predictive capability of RVFL in SDP.

The research questions addressed in this study are listed below.

RQ1. What is the prediction performance of PCA–RVFL in comparison with other ML techniques for within-project scenarios in SDP?
RQ2. What is the prediction performance of PCA–RVFL in comparison with other ML techniques for inter-release scenarios in SDP?
RQ3. What is the computational efficiency of PCA–RVFL in comparison with other traditional neural network architectures?

The rest of the paper is organized as follows. Section 2 reviews the previous literature on SDP and RVFL, Sect. 3 presents the proposed model PCA–RVFL in detail along with its pseudo-code, Sect. 4 presents the experimental framework, Sect. 5 presents the results and observations of the experiments, and Sect. 6 concludes the study.

2 Related Work

Various previous studies have shown that neural network architectures tend to perform better in terms of predictive performance but lag in computation time efficiency [1, 11]. RNNA was introduced to overcome this barrier by using analytical functions rather than backward passes to update the weights and biases [12]. One such category of RNNA is RVFL. RVFL has performed better than other techniques in other fields [7,8,9,10]. However, the prediction capability of RVFL has not been assessed in SDP. Hence, this study evaluates the performance of RVFL in SDP.

In previous studies, many different ML techniques have been employed for SDP. Dhamayanthi and Lavanya [13] explored PCA's performance with Naive Bayes (NB) for seven NASA datasets and achieved an improvement of 10.3% in terms of accuracy. Malhotra also conducted various studies [14, 15] to observe ML techniques for SDP on Android projects. The study concluded that ML techniques like NB, LogitBoost, and MLP outperform other ML techniques. Ghotra et al. [11] applied various machine learning techniques on cleaned versions of NASA and PROMISE datasets. ML techniques like NB, SVM, K-nearest neighbors (KNN), ANN, DTree, and ensemble technique were evaluated in the study. We selected different ML techniques from these studies that have previously outperformed others for comparison with the proposed model PCA–RVFL.

3 PCA–RVFL: The Proposed Model for Software Defect Prediction

3.1 Dimensionality Reduction Using Principal Component Analysis

PCA [16] is a popular statistical technique that extracts the essential information from a given dataset and represents it as a set of mutually orthogonal features called principal components, which are formed by linear combinations of the original features of the dataset. PCA compresses the dataset by representing it using the principal components with the highest variance.
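As an illustration, the projection onto the highest-variance principal components can be sketched with NumPy as follows. This is a minimal sketch and not the authors' implementation: the function name `pca_fit_transform`, the synthetic data, and the choice to center the data without scaling it are assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    X_centered = X - mean  # assumption: centering only, no scaling
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    Z = X_centered @ components.T  # reduced representation
    return Z, components, mean, explained_variance[:n_components]

# Hypothetical data: 100 modules described by 20 metrics.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
Z_demo, components, mean, var = pca_fit_transform(X_demo, n_components=12)
print(Z_demo.shape)
```

The returned `components` and `mean` would be reused to project unseen test data, so that training and test samples share the same reduced feature space.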

3.2 Classification Using Random Vector Functional Link Network

RVFL [17] is a feed-forward RNNA with randomized weights and biases. Figure 1 demonstrates the structure of RVFL. The weights between the input layer and the hidden layer are randomly initialized from a particular range. These weights and biases are not manipulated or updated throughout the training. The weights between the hidden and output layers and the weights between the input and output layers are analytically calculated.

Consider N arbitrary samples (x_i, y_i), i = 1, 2, 3, … N, where x_i is a d-dimensional feature vector and y_i is the encoded class label. Every sample belongs to one of the output class labels, i.e., {0,1}.

The equation of RVFL for calculating the output labels is defined as:

$$\mathop \sum \limits_{t = 1}^{k} \beta_{t} g\left( {w_{t} x_{i} + b_{t} } \right) + \mathop \sum \limits_{t = k + 1}^{k + d} \beta_{t} x_{{i\left( {t - k} \right)}} = o_{i}$$

(1)

where i = 1, 2, 3, … N, g(·) is the activation function of the hidden nodes, and β_t is the output weight vector that connects the tth hidden layer node (for t ≤ k) or the (t − k)th input layer node (for t > k) to the output layer nodes. w_t is the vector of input weights connecting the input layer nodes to the tth hidden layer node, b_t is the bias for the tth hidden node, and o_i is the output label for the ith sample.

Equation (1) can also be rewritten as:

$$H\cdot\beta = O$$

(2)

where β is the output weight matrix, O is the predicted output class label matrix, and H = {h_i}, i = 1, 2, … N, is written as:

$$H = \left[ {\begin{array}{*{20}c} {g\left( {w_{1} x_{1} + b_{1} } \right)} & \cdots & {g\left( {w_{k} x_{1} + b_{k} } \right)} & {x_{11} } & \cdots & {x_{1d} } \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {g\left( {w_{1} x_{N} + b_{1} } \right)} & \cdots & {g\left( {w_{k} x_{N} + b_{k} } \right)} & {x_{N1} } & \cdots & {x_{Nd} } \\ \end{array} } \right]_{{N \times \left( {k + d} \right)}}$$

(3)

where H is formed by combining the values of both the input layer nodes and the hidden layer nodes to form a matrix of size N × (k + d).

The goal is to minimize the classification error, i.e.,

$$\left| {O - Y} \right| = 0$$

(4)

where Y is actual output class labels.

Using Eq. (2) in Eq. (4), we get

$$\left| {H\cdot\beta \, - \, Y} \right| = \, 0.$$

(5)

In this study, we use ridge regression [18], i.e., the ℓ2-regularized least squares method with a regularization parameter λ, to calculate the optimum value of β by minimizing:

$$\mathop {\min }\limits_{\beta } \mathop \sum \limits_{i = 1}^{N} (y_{i} - h_{i}^{T} \beta )^{2} + \lambda \left\| \beta \right\|^{2} .$$

(6)

The optimal β derived from Eq. (6) would be:

$$\beta = (H^{T} \cdot H + \lambda I)^{ - 1} \cdot H^{T} \cdot Y.$$

(7)

This optimized β value is now the trained weights between the hidden and the output layer nodes and between the input and the output layer nodes. To predict the trained model's output, Eq. (2) is used where β is now known, and the equivalent result would be the predicted class labels.
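The training and prediction steps of this subsection can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `RVFL`, the value of λ (`lam`), the 0.5 decision threshold, and the synthetic data are all hypothetical, and the output weights use the standard ridge regression solution β = (HᵀH + λI)⁻¹HᵀY. The sigmoid activation, the uniform [−1, 1] randomization range, and the 80 hidden nodes follow the settings reported in Sect. 4.4.

```python
import numpy as np

class RVFL:
    """Sketch of an RVFL classifier with direct input-output links."""

    def __init__(self, n_hidden=80, lam=1.0, seed=0):
        self.n_hidden = n_hidden  # k hidden nodes
        self.lam = lam            # ridge parameter (value is an assumption)
        self.rng = np.random.default_rng(seed)

    def _features(self, X):
        # H = [g(XW + b), X]: hidden activations plus direct links, N x (k + d).
        G = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))
        return np.hstack([G, X])

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        d = X.shape[1]
        # Random input weights and biases, fixed after initialization.
        self.W = self.rng.uniform(-1.0, 1.0, size=(d, self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._features(X)
        # Solve (H^T H + lam * I) beta = H^T y instead of inverting explicitly.
        A = H.T @ H + self.lam * np.eye(H.shape[1])
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def decision_function(self, X):
        return (self._features(np.asarray(X, dtype=float)) @ self.beta).ravel()

    def predict(self, X, threshold=0.5):
        # Assumption: outputs at or above 0.5 are labeled defective (1).
        return (self.decision_function(X) >= threshold).astype(int)

# Hypothetical, linearly separable toy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RVFL(n_hidden=80, lam=1e-2).fit(X, y)
train_acc = (model.predict(X) == y).mean()
print(model.beta.shape, train_acc)
```

Solving the linear system with `np.linalg.solve` is numerically preferable to forming the inverse explicitly; the resulting β is the same in exact arithmetic.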

3.3 Pseudo-code for PCA–RVFL

The pseudo-code of PCA–RVFL is shown in Fig. 2.

4 Experimental Setup and Framework

Figure 3 shows the experimental framework used in this study. The following subsections describe the various components of this framework.

4.1 Dataset

For the experiment, 17 datasets are taken from the open-source PROMISE repository [19]. Different releases of the projects are collected to analyze the proposed model PCA–RVFL for both within-project and inter-release scenarios. Table 1 shows the statistics and properties of the datasets. For every dataset, the number of samples, the number of defective samples, and the percentage of defective samples are mentioned.

Table 1 Statistics of dataset used in the study

4.2 Variables

The datasets described in Sect. 4.1 consist of 20 independent variables taken from the object-oriented (OO) metrics suite and one binary dependent variable which shows if the class in the project is defective (1) or non-defective (0). The 20 OO metrics defined by various researchers [20,21,22,23,24] are shown in Table 2. These independent variables have been used widely in previous studies [14] for training and defect prediction.

Table 2 Object-oriented metrics used in the study

4.3 Performance Measures

The performance of PCA–RVFL is evaluated using AUC–ROC and F-measure. The performance measures used in the study are defined below.

F-measure is the harmonic mean of precision and recall. Recall is the ratio of the correctly predicted defective samples to the total number of samples that are actually defective. Precision is the ratio of the correctly predicted defective samples to the total number of samples predicted as defective.

AUC–ROC is the area under the receiver operating characteristics curve. This curve shows the relationship between recall and false positive rate (FPR). The AUC–ROC indicates the ability of techniques to distinguish between defective and non-defective samples. Higher AUC–ROC values indicate better performance of the techniques.
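The two measures can be illustrated with a short, dependency-free Python sketch. The function names and the toy labels and scores are hypothetical. The AUC–ROC is computed here through its rank interpretation, i.e., the probability that a randomly chosen defective sample receives a higher score than a randomly chosen non-defective one (ties count as one half), which equals the area under the ROC curve.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall, and F-measure for the defective class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def auc_roc(y_true, scores):
    """AUC-ROC via pairwise comparison of defective and non-defective scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions, and continuous scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.35]

precision, recall, f_measure = precision_recall_f(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(round(precision, 3), recall, round(f_measure, 3), auc)
# -> 0.667 0.5 0.571 0.875
```

Note that the F-measure depends on the hard predictions, whereas the AUC–ROC depends only on the ranking induced by the continuous scores.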

4.4 Parameter Settings

For PCA–RVFL, the average AUC–ROC variation with the number of PCA components is shown in Fig. 4. The highest average AUC–ROC value is achieved for 12 PCA components, so it is used in the experiments.

The parameter settings for RVFL are decided by conducting experiments and based on previous studies [6]. In this study, RVFL is implemented using ridge regression with the sigmoid activation function. The randomization range of weights and biases is [−1,1] from a uniform distribution, and the number of hidden layer nodes is 80.

PCA–RVFL has been compared with various ML techniques: KNN, MLP, RBF, NB, DTree, LOG, SVM, RF [25], and ELM [26]. The DTree, LOG, SVM, RF, and NB techniques are implemented using the open-source Scikit-learn library of Python with default parameter settings. The MLP and RBF networks are implemented using the Keras library of Python. The parameters for MLP, RBF, and KNN are chosen by conducting experiments on the datasets described in Table 1. For MLP and RBF, the learning rate is 0.1, the optimizer is RMSProp, the number of hidden nodes is 50, and the batch size is 50. For MLP, the sigmoid activation function is used. For ELM, the number of hidden layer nodes is 15, the activation function is sigmoid, and the randomization ranges of the weights and biases are [−1,1] and [0,1], respectively.

5 Experimental Results and Analysis

5.1 RQ1. What is the Prediction Performance of PCA–RVFL in Comparison with Other ML Techniques for Within-Project Scenarios in SDP?

To answer this RQ, experiments are conducted according to the framework described in Sect. 4. Tables 3 and 4 show the comparison of AUC–ROC and F-measure of the ML techniques in the within-project scenario. The highest value for every dataset is highlighted in the tables.

Table 3 AUC–ROC comparison between ML techniques for within-project scenario

Table 4 F-measure comparison between ML techniques for within-project scenario

The proposed model achieved the best AUC–ROC value for most datasets, i.e., 15 out of 17 datasets. The average AUC–ROC value for PCA–RVFL is 0.753, which is higher than that of every other ML technique: 0.656 (PCA–MLP), 0.652 (PCA–NB), 0.652 (RVFL), 0.693 (PCA–ELM), 0.611 (PCA–SVM), 0.645 (PCA–RF), 0.672 (PCA–DTree), 0.626 (PCA–RBF), 0.644 (PCA–KNN), and 0.631 (PCA–LOG). PCA–RVFL performs markedly better than RVFL alone, which shows that dimensionality reduction can play a crucial role in a model's prediction performance. The F-measure values in Table 4 indicate that PCA–RVFL achieved an average F-measure value of 0.613. This is an increase in F-measure of 12.47% over PCA–MLP, 15.44% over PCA–NB, 13.1% over RVFL, 16.3% over PCA–ELM, 17.54% over PCA–SVM, 14.7% over PCA–RF, 18.2% over PCA–DTree, 16.3% over PCA–RBF, 17.6% over PCA–KNN, and 30.34% over PCA–LOG. The better performance of PCA–RVFL indicates that it has good prediction capability in the within-project scenarios of SDP.
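For readers who wish to relate the average values to percentage gains, a relative increase can be computed as sketched below. The helper `percent_increase` is hypothetical, and treating the reported percentages as relative increases over the baseline is an assumption; the three baselines are a subset of the average AUC–ROC values quoted above.

```python
def percent_increase(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Average within-project AUC-ROC values quoted in the text.
auc_pca_rvfl = 0.753
baselines = {"PCA-MLP": 0.656, "RVFL": 0.652, "PCA-ELM": 0.693}

gains = {name: percent_increase(auc_pca_rvfl, value)
         for name, value in baselines.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```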

5.2 RQ2: What is the Prediction Performance of PCA–RVFL in Comparison with Other ML Techniques for Inter-release Scenarios in SDP?

To answer this RQ, experiments are conducted according to the framework described in Sect. 4. To perform defect prediction in an inter-release scenario, one release of a dataset is selected for training, and the next release is selected for testing. Table 5 presents the comparison of the AUC–ROC values of the various ML techniques. PCA–RVFL shows the best results for 8 out of 10 datasets, with an AUC–ROC value of more than 0.7 for 6 out of 10 datasets. The proposed model shows a substantial percentage increase in average AUC–ROC over the other ML techniques: 14.07% (PCA–MLP), 14.07% (PCA–NB), 20.79% (PCA–DTree), 22.1% (PCA–KNN), 30.03% (PCA–LOG), 14.63% (PCA–RBF), 8.9% (RVFL), 10.11% (PCA–ELM), 13.14% (PCA–SVM), and 10.28% (PCA–RF). This indicates that PCA–RVFL can handle class imbalance in the datasets. Table 6 presents the comparison of the F-measure values of the various ML techniques. The proposed model shows the second-highest average F-measure (0.512), while PCA–NB (0.587) shows the highest.

Table 5 AUC–ROC comparison between ML techniques for inter-release scenario

Table 6 F-measure comparison between ML techniques for inter-release scenario

The model achieved high AUC–ROC values for most datasets in the experiment, which indicates that PCA–RVFL is suitable for the inter-release scenario of defect prediction. The lower performance of the ML techniques in the inter-release scenario compared to the within-project scenario may be due to the relatively large differences in the number of samples and the statistics of two adjacent releases. In the case of ant_1.5, the highest value of AUC–ROC in the within-project scenario is 0.889, while in the inter-release scenario, it is 0.671 when trained on ant_1.4. Such variations reflect the differences between subsequent releases, which affect the performance of ML techniques in inter-release scenarios.
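The pairing of releases used in the inter-release scenario (train on one release, test on the next) can be sketched in Python as follows. The function names and the `project_version` naming convention are assumptions based on the dataset names ant_1.4 and ant_1.5 mentioned above; the other names in the example list are placeholders.

```python
from collections import defaultdict

def version_key(version):
    """Turn a version string such as '1.4' into a sortable tuple (1, 4)."""
    return tuple(int(part) for part in version.split("."))

def inter_release_pairs(dataset_names):
    """Return (train, test) pairs of consecutive releases per project."""
    by_project = defaultdict(list)
    for name in dataset_names:
        project, version = name.rsplit("_", 1)
        by_project[project].append(version)
    pairs = []
    for project in sorted(by_project):
        versions = sorted(by_project[project], key=version_key)
        # Each release is used to train a model tested on the next release.
        for older, newer in zip(versions, versions[1:]):
            pairs.append((f"{project}_{older}", f"{project}_{newer}"))
    return pairs

# Placeholder dataset names following the assumed convention.
names = ["ant_1.5", "ant_1.4", "ant_1.3", "camel_1.2", "camel_1.4"]
pairs = inter_release_pairs(names)
print(pairs)
```

In such a setup, any preprocessing such as PCA would be fitted on the training release only and then applied to the test release.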

5.3 RQ3: What is the Computational Efficiency of PCA–RVFL in Comparison with Other Traditional Neural Network Architectures?

Apart from predictive capability, one of the most important aspects that differentiates PCA–RVFL from traditional neural network architectures is the computational efficiency that RVFL gains from its randomized, non-iterative training, unlike MLP and RBF, which are trained iteratively. This section investigates the training time efficiency of PCA–RVFL. The training time of two neural network techniques, MLP and RBF, is compared with that of PCA–RVFL to assess its computational efficiency. The results shown in Table 7 indicate that PCA–RVFL is a clear winner in terms of computational efficiency. The average training time for PCA–RVFL is 13.7 ms, far lower than the average training time for RBF (52.9 ms) and MLP (195.1 ms).

Table 7 Training time comparison between neural network techniques

The non-iterative nature of RVFL makes it a computationally efficient ML technique compared to the traditional neural network architectures and makes it suitable for application in problem domains that involve training on large datasets.
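A training time comparison of this kind could be measured as sketched below. The paper does not describe its timing procedure, so the helper `time_training_ms`, the number of repetitions, and the stand-in workload `dummy_fit` are all assumptions made for illustration.

```python
import time

def time_training_ms(fit, repeats=5):
    """Mean wall-clock time of calling fit(), in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit()
        durations.append((time.perf_counter() - start) * 1000.0)
    return sum(durations) / len(durations)

def dummy_fit():
    # Stand-in for a model's training routine.
    sum(i * i for i in range(10_000))

elapsed = time_training_ms(dummy_fit)
print(f"{elapsed:.3f} ms")
```

In practice, `fit` would wrap the training call of each technique on the same data, so that the averaged times are directly comparable.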

6 Conclusion

The study proposed a novel SDP model PCA–RVFL based on randomized neural network RVFL and dimensionality reduction technique PCA. The experiments were conducted to assess the performance of PCA–RVFL for within-project and inter-release scenarios in SDP. The results indicate that PCA–RVFL has good prediction capability compared to ten classic ML techniques previously used in SDP in both scenarios. The AUC–ROC values for PCA–RVFL range from 0.624 to 0.889 for within-project scenario and from 0.671 to 0.754 for the inter-release scenario. The training time results are also in favor of the proposed model, making it computationally efficient compared to neural network architectures like MLP and RBF. The study concludes that randomized neural networks like RVFL can be employed in the field of SDP to achieve good prediction performance. Another important conclusion is that appropriate feature representation is necessary for building effective prediction models in SDP, which is evident from the impact of PCA on the model’s performance.

References

1. Omri, S., Sinz, C.: Deep learning for software defect prediction: a survey. In: IEEE/ACM 42nd International Conference on Software Engineering Workshops (ICSEW), pp. 209–214 (2020). https://doi.org/10.1145/3387940.3391463
2. Li, L., Lessmann, S., Baesens, B.: Evaluating software defect prediction performance: an updated benchmarking study (2019). arXiv:1901.01726 [cs.SE]
3. Cui, M., Sun, Y., Lu, Y., Jiang, Y.: Study on the influence of the number of features on the performance of software defect prediction model. In: Proceedings of the 2019 3rd International Conference on Deep Learning Technologies, pp. 32–37 (2019). https://doi.org/10.1145/3342999.3343010
4. Arora, I., Tetarwal, V., Saha, A.: Open issues in software defect prediction. Procedia Computer Sci. 46, 906–912 (2015). https://doi.org/10.1016/j.procs.2015.02.161
5. Malhotra, R.: A systematic review of machine learning techniques for software fault prediction. Appl. Soft Comput. 27, 504–518 (2015). https://doi.org/10.1016/j.asoc.2014.11.023
6. Zhang, L., Suganthan, P.N.: A comprehensive evaluation of random vector functional link networks. Inf. Sci. 367–368, 1094–1105 (2016). https://doi.org/10.1016/j.ins.2015.09.025
7. Bisoi, R., Dash, P., Mishra, S.P.: Modes decomposition method in fusion with robust random vector functional link network for crude oil price forecasting. Appl. Soft Comput. 80, 475–493 (2019). https://doi.org/10.1016/j.asoc.2019.04.026
8. Zhang, L., Suganthan, P.N.: Visual tracking with convolutional random vector functional link network. IEEE Trans. Cybern. 47(10), 3243–3253 (2017). https://doi.org/10.1109/TCYB.2016.2588526
9. Zhang, Y., Wu, J., Cai, Z., Du, B., Yu, P.S.: An unsupervised parameter learning model for RVFL neural network. Neural Netw. 112, 85–97 (2019). https://doi.org/10.1016/j.neunet.2019.01.007
10. Vuković, N., Petrović, M., Miljković, Z.: A comprehensive experimental evaluation of orthogonal polynomial expanded random vector functional link neural networks for regression. Appl. Soft Comput. 70, 1083–1096 (2018). https://doi.org/10.1016/j.asoc.2017.10.010
11. Ghotra, B., McIntosh, S., Hassan, A.E.: Revisiting the impact of classification techniques on the performance of defect prediction models. In: IEEE/ACM 37th IEEE International Conference on Software Engineering (2015). https://doi.org/10.1109/ICSE.2015.91
12. Cao, W., Wang, X., Ming, Z., Gao, J.: A review on neural networks with random weights. Neurocomputing 275(31), 278–287 (2018). https://doi.org/10.1016/j.neucom.2017.08.040
13. Dhamayanthi, N., Lavanya, B.: Software defect prediction using principal component analysis and naïve bayes algorithm. In: Chaki N., Devarakonda N., Sarkar A., Debnath N. (eds.) Proceedings of International Conference on Computational Intelligence and Data Engineering. Lecture Notes on Data Engineering and Communications Technologies, 28. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-6459-4_24
14. Malhotra, R.: An empirical framework for defect prediction using machine learning techniques with Android software. Appl. Soft Comput. 49, 1034–1050 (2016). https://doi.org/10.1016/j.asoc.2016.04.0324
15. Malhotra, R.: A systematic review of machine learning techniques for software fault prediction. Appl. Soft Comput. J. 27, 504–518 (2015). https://doi.org/10.1016/j.asoc.2014.11.023
16. Abdi, H., Williams, L.J.: Principal component analysis. WIREs Comput. Statistics 2(4), 433–459 (2010). https://doi.org/10.1002/wics.101
17. Husmeier, D.: Random vector functional link (RVFL) networks. Neural networks for conditional probability estimation. In: Perspectives in Neural Computing, pp. 87–97. Springer, London (1999). https://doi.org/10.1007/978-1-4471-0847-4_6
18. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970). https://doi.org/10.1080/00401706.1970.10488634
19. Menzies, T., Krishna, R., Pryor, D.: The promise repository of empirical software engineering data. Department of Computer Science, North Carolina State University (2016) [Online] Available: http://promisedata.org/repository
20. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Software Eng. 20(6), 476–493 (1994). https://doi.org/10.1109/32.295895
21. Henderson-Sellers, B.: Object-Oriented Metrics: Measures of Complexity, pp. 142–147. Prentice-Hall, New Jersey (1996)
22. Martin, R.: OO design quality metrics—an analysis of dependencies. In: Workshop Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (1994)
23. Tang, M.H., Kao, M.H., Chen, M.H.: An empirical study on object-oriented metrics. In: Proceedings of Metrics, pp. 242–249 (1999). https://doi.org/10.1109/METRIC.1999.809745
24. McCabe, T.J.: A complexity measure. IEEE Trans. Software Eng. SE-2(4), 308–320 (1976). https://doi.org/10.1109/TSE.1976.233837
25. Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2014)
26. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006). https://doi.org/10.1016/j.neucom.2005.12.126

Author information

Authors and Affiliations

Department of Software Engineering, Delhi Technological University, Bawana Road, New Delhi, India
Ruchika Malhotra, Deepti Aggarwal & Priya Garg

Editor information

Editors and Affiliations

Education and Training Division, Centre for Development of Advanced Computing (CDAC), Noida, Uttar Pradesh, India
Arti Noor
Computer Science and Information Technology, Kwantlen Polytechnic University, Surrey, BC, Canada
Abhijit Sen
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India
Gaurav Trivedi

Cite this paper

Malhotra, R., Aggarwal, D., Garg, P. (2022). Application of Random Vector Functional Link Network for Software Defect Prediction. In: Noor, A., Sen, A., Trivedi, G. (eds) Proceedings of Emerging Trends and Technologies on Intelligent Systems . ETTIS 2021. Advances in Intelligent Systems and Computing, vol 1371. Springer, Singapore. https://doi.org/10.1007/978-981-16-3097-2_11

DOI: https://doi.org/10.1007/978-981-16-3097-2_11
Published: 02 October 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3096-5
Online ISBN: 978-981-16-3097-2
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)
