Introduction

The ease and speed that come with the use of software have made it an indispensable tool in daily life. Embedded software is used in a wide range of sectors, including health, finance, manufacturing, transportation, and sales. The importance and merits of software cannot be overemphasized: it speeds up decision-making, increases productivity, automates manual processes, and provides a better overall customer experience. As new developments emerge, software becomes more complex in order to keep up with ever-growing and changing user requirements.

The proliferation of Internet of Things (IoT) technology has placed new requirements on the design of embedded software. Three categories of devices are used in the IoT ecosystem: low-end, middle-end, and high-end devices [1]. Low-end devices are built around microcontrollers that typically have limited RAM (less than 50 kB) and flash memory that does not exceed 250 kB. Low-end devices come in three architectures: 8-bit, 16-bit, and 32-bit. Several real-time operating systems have been developed for low-end devices, including RIOT, Contiki, TinyOS, FreeRTOS, and Zephyr [2].

For IoT systems, the principal goal of software engineering is to produce and deliver effective and efficient software within the specified timeframe and budget. Various non-functional features of software can be used to determine its quality, namely reliability, security, maintainability, and scalability. Scalability measures the greatest workload that the system can handle while still meeting performance criteria. Reliability describes the likelihood that the system or its component will operate without failure for a certain amount of time under preset conditions; it is expressed as a percentage. The security criterion ensures that any data held within the system or its components is safe from malware attacks and unauthorized access. Maintainability is defined as the time necessary for a solution or its component to be corrected, updated to improve performance or other attributes, or adapted to a changing environment. The most important feature to consider here is reliability [3, 4]. To assess reliability, the defects in the software have to be detected and counted: the lower the number of defects, the more reliable the software is. Several factors can cause defects in software, but the most common are misinterpreted, incomplete, or unimplemented requirements [5].

Embedded systems are prone to security attacks, opaque, and less controllable by the end user. Finding bugs in embedded software has received scant attention from the research community. The reasons narrated in [6] include, among other things, the proliferation of embedded devices and the lack of forensic tools for low-end devices. The widely used approaches for analysing embedded software are obtaining the device firmware and performing static and dynamic firmware analyses.

A defect is essentially an error or flaw which prevents the software from meeting its requirements [7]. It is impossible to create software that is 100% defect-free, so the only way to curb this is to detect and fix as many defects as possible before the final product is delivered to the users. Software testing is a vital stage in the production of software for ensuring its quality. The early detection of defective software components assists software quality assurance teams in identifying the units to prioritize during the testing phase [8]. However, there are challenges associated with this stage of production. The high cost, long duration and limited resources prevent the intensive testing of every software component [9]. It is therefore crucial to create smart tools capable of detecting defective software components in the testing phase [10, 11]. The prediction and detection of software defects is a fast-developing field in the research community. Software defect detection is the development of models to be used early in the testing process to pinpoint which components in the software are defective [12, 13]. It is an automatic approach which greatly enhances the testing process [14]. A survey conducted by GitLab revealed that in the software development process, the testing phase consumes the most time and causes the most delays [15]. According to the World Quality Report 2019-2020, the software testing phase takes up almost 30% of the entire software project cost [16]. A software defect is a coding fault that results in inaccurate or unexpected output from a software programme that does not fulfil actual requirements. Human factors, communication failures, unrealistic development timetables, poor design logic, faulty debuggers, poor coding methods, poor tools, lack of version control, and buggy third-party tools are the primary causes of software errors.
Early defect detection tests are a cost-effective and accurate method of developing secure, resilient software. When defects or vulnerabilities go undiscovered, they generate ongoing costs for fixing or remediation.

The primary contribution of this research is the design of a software defect detection model based on the Multi-Layer Perceptron (MLP) neural network, a deep learning technique, with the aim of improving defect detection accuracy. This research also evaluates the performance of the proposed model and uses it as a base model to assess the approach of other deep learning techniques. In summary, the contributions of this research are:

  1. To examine deep learning and machine learning approaches used in software defect detection.

  2. To identify public software defect datasets which can be used to train the software defect detection models.

  3. To propose a detection model using the Multi-Layer Perceptron Neural Network.

  4. To examine the performance of the proposed model and present its experimental results.

The remainder of our paper is organized as follows. Section 2 provides a literature review of software defect detection. Section 3 illustrates the main concepts that are used in the proposed system. The performance evaluation of the proposed system is introduced in Section 4. In Section 5, the results are analysed. Finally, we conclude the paper in Section 6.

Literature review

Developing a software defect detection model with high accuracy has proven to be a herculean task. Over the past decade, several approaches have been proposed, but most of them have not met the required standards of accuracy in predicting and detecting defects [17, 18]. According to [19], the introduction of network computing technologies such as cloud computing has provided users all around the world with an affordable and flexible network-based service provision scheme. This storage service allows users to outsource their large local data to cheaper remote storage servers, reducing cost and protecting the integrity of the data. Blockchain has proven to be effective in preventing data leakage in 5G environments. Unnecessary reliance on a trusted third party and the burden of significant overhead are some of the challenges that blockchain has mitigated, allowing the secure generation and sharing of watermarked content [20]. As a result of the widespread adoption of network storage services, performance and security issues that affect scalability are emerging. High cost, reliance on third parties and repeated auditing of data are also challenges faced by existing data auditing mechanisms. To resolve this, a blockchain-based deduplication scheme was proposed in [21] to help check data integrity and the credibility of audit results. Deep learning, built on a hierarchical structure comprising multiple neural layers, can extract and learn information from the input data layer by layer in order to generate reconstruction features. It has a wide range of uses, including language understanding, visual recognition and threat detection in networks [22]. In [23], an efficient attribute-based scheme was proposed to prevent breaches of the privacy of the access subject during decision-making through the introduction of a state-of-the-art hash-based binary search tree.
Renewable energy sources (RES) are of vital importance in modern power systems, but they are easily affected by the environment. Most dispatching mechanisms depend on centralized organizations; the authors in [24] addressed this by proposing a blockchain-based scheme to dispatch energy for RES systems.

Deep learning is a hot topic in computing, but building an effective deep learning model is very challenging because of its dynamic nature and the variation in real-world data and problems. In [25], a comprehensive review of deep learning techniques was presented, covering different types of real-world tasks, such as supervised and unsupervised ones, and their application areas. Kantardzic [26] presented state-of-the-art techniques for the analysis and extraction of information from massive amounts of data in high-dimensional data spaces, as well as an extensive view of software tools. Han and Kamber [27] and Han et al. [28] showed how deep learning techniques have been used to solve a wide range of real-world problems with great success. In [29], the theoretical bases of the Multi-Layer Perceptron Neural Network were presented, covering the backpropagation learning algorithm and the architecture. The MLP model comprises the input layer, hidden layer, and output layer. The nodes in the MLP model are activated through the Sigmoid function by the Weka 3.8.6 tool, which is used to test software defect prediction models with varying configurations of MLP networks [30]. In [31], various metric-based bug datasets were collected and assessed in order to obtain a common set of source code metrics. The primary aim was to show how effective the dataset is in bug prediction. Cetiner and Sahingoz [32] presented a comprehensive comparative analysis of deep learning and machine learning-based software defect prediction models by comparing 10 different learning algorithms on public datasets. The experimental results revealed that the proposed model achieved high accuracy in software defect prediction and therefore increased the quality of the software.
The authors in [33] presented an Intelligent Cloud enabled Internet of Everything infrastructure as a first step in combining these two broad sectors and offering important services to end users. The Wind Driven Optimization technique is used to improve energy usage by clustering the different IoT networks. Rajput et al. [34] proposed a reference model for assisting diabetics in remote areas. The concept enhances communication and interaction between patients and physicians. The goal of the analysis in that study is to examine the risk variables and the correlations that exist among them. For prediction, Naive Bayes, SVM, random forest, logistic regression, decision tree and KNN classifiers are employed. Rupa et al. [35] offered a blockchain-based cloud-integrated IoT solution that can aid in the detection of intruders via virtual surveillance. The key feature of this technique is that it can work in regions where monitoring and control are difficult, and data is saved in a tamper-proof blockchain environment.

Proposed method

Multi-Layer Perceptron (MLP)

Multi-Layer Perceptron (MLP) is one of the supervised learning models used in deep learning [25]. MLP neural networks have been used to solve a variety of complex and diverse real-world problems with great success [26]. This section describes the different aspects of an MLP. The model is made up of three types of layers: the input layer, hidden layers, and the output layer [27]. An MLP neural network has one or more hidden layers between the input layer and the output layer. Each layer is made up of small units called neurons. Artificial neurons function similarly to biological neurons: they receive inputs from other neurons, process them and produce an output.

Tools

The Weka 3.8.6 tool was used to train and test the software defect prediction models with different configurations of the Multi-Layer Perceptron neural network. Weka uses the Sigmoid function as the activation function of all the nodes in the MLP neural network [29]. The Sigmoid function is defined as follows:

$$\begin{aligned} f(x)=\frac{1}{1+{e}^{-x}} \end{aligned}$$
(1)

The Sigmoid function performs well in classification problems with linearly non-separable classes [11]. It can be exploited in complex classification tasks because it provides non-linear decision boundaries when combined with a non-linear framework. It is well suited to such tasks because it is monotonic, continuous, and differentiable everywhere, and its derivative can be expressed in terms of the function itself: \(f'(x)=f(x)(1-f(x))\). Furthermore, the Sigmoid function accepts any real number as input and returns a value between 0 and 1 as output. Figure 1 illustrates the Sigmoid function.

Fig. 1 Sigmoid function
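As a small illustration of Eq. (1), the following Python sketch evaluates the Sigmoid function and its derivative at a few sample inputs. It is not taken from Weka; the function names and the branch used to avoid overflow for large negative inputs are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid from Eq. (1): f(x) = 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the algebraically equivalent form e^x / (1 + e^x),
    # which avoids overflow in exp() for large |x| (an implementation choice).
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_derivative(x: float) -> float:
    """Derivative expressed through the function itself: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Sample inputs chosen only for illustration.
for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  f(x)={sigmoid(x):.4f}  f'(x)={sigmoid_derivative(x):.4f}")
```

The output stays between 0 and 1 for every real input, and the derivative is largest at x = 0, where the curve is steepest.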

During the learning phase, the weights are adjusted using the gradient descent approach. The gradient descent algorithm is an adaptation of traditional back propagation in which the network weights are moved along the negative gradient of the response surface, scaled by the learning rate and combined with a momentum term. A solution to the learning problem is a combination of weights that minimizes the error function. Weights are updated during the learning process using the following formulas [19]. ‘w’ refers to the weight assigned to a connection. ‘\(\Delta\)w’ denotes the change in weight ‘w’ and is calculated using (2). The gradient is determined using the back propagation algorithm. The next value of weight ‘w’ is denoted by ‘w\(_{next}\)’ and is calculated using (3).

$$\begin{aligned} \Delta w= -learning\; rate\times gradient + momentum\times \Delta w_{previous} \end{aligned}$$
(2)
$$\begin{aligned} w_{next}= w+\Delta w \end{aligned}$$
(3)

The learning rate and momentum are applied to update the weights during the learning process. The learning rate is a hyperparameter that controls how strongly the weights respond to the estimated error each time they are updated. Momentum can help to expedite training, and learning rate schedules can aid the optimization process.

In this study, both the learning rate and the momentum were assigned fixed values, chosen according to the problem in order to achieve stable convergence.
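The update rule in Eqs. (2) and (3) can be sketched for a single weight as follows. This is a minimal illustration, not the Weka implementation: the toy error function, the function name `momentum_step`, and the values 0.1 and 0.2 for the learning rate and momentum are assumptions chosen only for the example, since the fixed values used in the study are not stated here.

```python
def momentum_step(w, gradient, previous_delta, learning_rate, momentum):
    """Apply one weight update.

    Eq. (2): delta_w = -learning_rate * gradient + momentum * previous_delta
    Eq. (3): w_next  = w + delta_w
    """
    delta_w = -learning_rate * gradient + momentum * previous_delta
    return w + delta_w, delta_w


# Toy error function E(w) = (w - 3)^2 with gradient dE/dw = 2 * (w - 3).
# In an MLP the gradient would come from back propagation instead.
w, delta_w = 0.0, 0.0
for _ in range(200):
    gradient = 2.0 * (w - 3.0)
    w, delta_w = momentum_step(w, gradient, delta_w,
                               learning_rate=0.1,  # illustrative value
                               momentum=0.2)       # illustrative value

print(f"weight after training: {w:.6f}")
```

With these illustrative settings the weight settles at the minimum of the toy error function; a learning rate that is too large for the problem would make the same loop oscillate or diverge, which is why both values have to be chosen with the problem in mind.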

Datasets

This study considered using public datasets to train and verify the proposed software defect detection model. Furthermore, the literature suggests exploring and using new datasets for developing software defect detection models, so this research also investigated and used new software defect datasets to train and test the proposed model. Because one of the main concerns of this study is to use new software defect datasets, five public datasets compiled in [31] were selected: three from Tera-Promise and two from the GitHub Bug Repository. The datasets selected from Tera-Promise are Xalan 2.6, Velocity 1.5 and Poi 3.0; Netty 3.6.3 and mcMMO 1.4.06 are the two datasets chosen from the GitHub Bug Repository. Table 1 lists the properties of the selected datasets: the number of code metrics found in the original dataset, the number of code metrics calculated using OSA, the total number of instances, the number of defective instances and the number of non-defective instances.

Table 1 Properties of the selected datasets

Proposed workflow

The workflow used to propose a software defect prediction model using MLP is illustrated in Fig. 2. The workflow consists of four main steps. The first step is dataset selection. The second step is the generation of the prediction model using a specific MLP network configuration. In the third step, the generated prediction model is trained and tested using training and test datasets. The final step is the performance evaluation of the software defect prediction model.

Fig. 2 Proposed workflow

According to [28], there are no hard and fast guidelines for deciding the ideal number of hidden layers and the ideal number of neurons in each layer. Therefore, this research conducted an experiment with four different network configurations on the five selected datasets to determine the best possible number of hidden layers and number of neurons in each hidden layer. The following guidelines were followed when deciding the combination of the number of hidden layers and the number of neurons in each hidden layer.

  • The experiment was conducted with the number of hidden layers set to 2 and 3.

  • The number of neurons in each hidden layer was selected between the number of neurons in the input layer (60) and the number of neurons in the output layer (2).

  • The number of neurons was gradually decreased from the first hidden layer to the last hidden layer.
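The guidelines above can be expressed as a simple check. The sketch below applies it to the four hidden-layer configurations that are discussed in the results analysis (25-5, 15-5, 15-10-5 and 30-15-5). The function name and the decision to treat "between" as inclusive and "gradually decreased" as strictly decreasing are assumptions made for this illustration.

```python
INPUT_NEURONS = 60   # neurons in the input layer
OUTPUT_NEURONS = 2   # neurons in the output layer

# Hidden-layer sizes of the four configurations reported in the results analysis,
# listed from the first hidden layer to the last.
CONFIGURATIONS = [(25, 5), (15, 5), (15, 10, 5), (30, 15, 5)]


def follows_guidelines(hidden_layers):
    """Return True if a configuration satisfies the three guidelines."""
    # Guideline 1: two or three hidden layers.
    depth_ok = len(hidden_layers) in (2, 3)
    # Guideline 2: each size lies between the output and input layer sizes
    # (inclusive bounds are an assumption).
    size_ok = all(OUTPUT_NEURONS <= n <= INPUT_NEURONS for n in hidden_layers)
    # Guideline 3: sizes decrease from the first to the last hidden layer
    # (interpreted here as strictly decreasing).
    decreasing = all(a > b for a, b in zip(hidden_layers, hidden_layers[1:]))
    return depth_ok and size_ok and decreasing


for config in CONFIGURATIONS:
    print(config, follows_guidelines(config))
```

All four configurations pass the check, while a layout such as (5, 25), which grows toward the output layer, would be rejected.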

Performance evaluation

This study used the k-fold cross-validation approach with k set to 10. Thus, 10-fold cross-validation was used to obtain the evaluation results and an estimate of the error.

In 10-fold cross-validation, the original dataset is randomly divided into 10 subsets of equal size [28]. Training and testing are repeated 10 times for a selected dataset. Figure 3 illustrates the procedure of 10-fold cross-validation.

Fig. 3 10-fold cross-validation technique

The process of 10-fold cross-validation can be described as follows. In the first iteration, the first subset of data is used as the testing dataset while the remaining 9 subsets are used as the training dataset. In the second iteration, the second subset is used as the testing dataset while the remaining 9 subsets are used for training the model. This process is repeated 10 times. It produces 10 results, and their average is computed as the final result. In k-fold cross-validation, each subset of data is used exactly once for testing and k − 1 times for training.
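The splitting procedure described above can be sketched in a few lines of Python. This is a plain, unstratified illustration and not the routine Weka runs internally; the function name `k_fold_indices`, the fixed random seed, and the round-robin assignment of shuffled indices to folds are assumptions made for the example.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # random division of the dataset
    folds = [indices[i::k] for i in range(k)]     # k subsets, sizes differ by at most 1
    for i in range(k):
        test = folds[i]                           # the i-th subset is held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with a hypothetical dataset of 100 instances.
for iteration, (train, test) in enumerate(k_fold_indices(100), start=1):
    print(f"iteration {iteration}: {len(train)} training, {len(test)} testing instances")
```

A model would be trained and evaluated once per yielded pair, and the 10 per-fold scores would then be averaged to give the final result.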

As illustrated in Fig. 4, a confusion matrix for the binary classification problem produces 4 outcomes which are True Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN) [32].

  • True Positive (TP) presents the number of positive instances which were correctly predicted as positive [1];

  • False Negative (FN) presents the number of positive instances which were incorrectly predicted as negative [1];

  • False Positive (FP) presents the number of negative instances which were incorrectly predicted as positive [1];

  • True Negative (TN) presents the number of negative instances which were correctly predicted as negative [1].

Fig. 4 Confusion Matrix

Therefore, True Positive (TP) and True Negative (TN) reflect that the prediction has been made correctly, whereas False Positive (FP) and False Negative (FN) indicate that the prediction has been made incorrectly.
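To make the four outcomes concrete, the sketch below counts them from a list of actual and predicted labels. The labels are invented purely for illustration (True stands for a defective module) and do not come from the datasets used in this study; the function name is likewise an assumption.

```python
def confusion_counts(actual, predicted, positive=True):
    """Count TP, FN, FP and TN for a binary classification problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # positive instance correctly predicted as positive
            else:
                fn += 1   # positive instance incorrectly predicted as negative
        else:
            if p == positive:
                fp += 1   # negative instance incorrectly predicted as positive
            else:
                tn += 1   # negative instance correctly predicted as negative
    return tp, fn, fp, tn


# Made-up labels for eight modules (True = defective).
actual    = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

The correct predictions are the sum TP + TN and the incorrect ones are FP + FN, so the four counts always add up to the number of instances.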

The proposed configurations of the MLP neural network for software defect detection were evaluated using several performance measures: accuracy, precision, recall, F-measure and ROC Area. The Weka tool provides all these performance indicators. The established performance metrics applied are defined as follows:

$$\begin{aligned} Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(4)
$$\begin{aligned} Recall=\frac{TP}{TP+FN} \end{aligned}$$
(5)

The score of 1.0 reflects a perfect recall whereas the score of 0.0 reflects the worst recall. A higher value of recall means the prediction model has higher ability of detecting true positives as positive.

$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(6)

The perfect precision is defined by the score of 1.0 whereas the worst precision is defined by the score of 0.0. A higher value of precision reflects that the prediction model shows a low rate of incorrectly classifying negative instances as positive.

Neither recall nor precision can give a complete measure on its own. F-Measure combines recall and precision into a single measure by calculating their harmonic mean. F-Measure therefore gives equal weight to recall and precision, which is particularly important when imbalanced data is used for training.

$$\begin{aligned} F\text{-}Measure=\frac{2\times Precision\times Recall}{Precision+Recall} \end{aligned}$$
(7)
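Equations (4) to (7) translate directly into code. The sketch below computes the four measures from confusion matrix counts; the counts in the example are invented for illustration, and returning 0.0 when a denominator is zero is a convention assumed here rather than something specified in the text.

```python
def accuracy(tp, fn, fp, tn):
    """Eq. (4): share of all instances that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    """Eq. (5): share of actual positives that were predicted as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Eq. (6): share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(p, r):
    """Eq. (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for illustration only.
tp, fn, fp, tn = 2, 1, 1, 4
p, r = precision(tp, fp), recall(tp, fn)
print(f"accuracy={accuracy(tp, fn, fp, tn):.3f} precision={p:.3f} "
      f"recall={r:.3f} F-measure={f_measure(p, r):.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot obtain a high F-Measure by maximizing recall at the expense of precision or the reverse.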

Results analysis

This section analyses the experimental results obtained for all MLP network configurations on the five selected datasets. The performance of each MLP architecture was evaluated using accuracy, precision, recall, F-Measure and ROC Area. Table 2 illustrates the percentage of prediction accuracy of each MLP network configuration on each dataset.

Table 2 Prediction accuracy results

The MLP model with two hidden layers, with 25 and 5 neurons in the first and second hidden layers respectively, shows the highest prediction accuracy for the Xalan 2.6, Velocity 1.5 and Poi 3.0 datasets. However, the same network configuration shows the lowest prediction accuracy for the Netty 3.6.3 dataset. The MLP with three hidden layers, with 15, 10 and 5 nodes in the first, second and final hidden layers respectively, performed better on the Netty 3.6.3 dataset, with higher prediction accuracy. The average accuracy of this MLP model is 78.2120%.

Table 3 shows the evaluation results, including precision, recall and F-Measure, for all considered MLP network configurations on the Xalan 2.6 dataset. The results present the precision, recall and F-Measure for each target class as well as the weighted average of the per-class values. According to the findings, the MLP model with two hidden layers, with 25 and 5 neurons in the first and second hidden layers respectively, performed well in precision, recall and F-Measure for both output classes when compared to the other configurations of the MLP neural network.

Table 3 Results obtained for Xalan 2.6 dataset

Table 4 presents the evaluation results on the Velocity 1.5 dataset. In precision, the MLP with two hidden layers, with 15 and 5 neurons in the first and second hidden layers respectively, performed well for the true class, whereas the MLP with two hidden layers, with 25 and 5 neurons in the first and second hidden layers respectively, performed well for the false class. In recall, the opposite holds for the true and false classes. However, when the weighted averages of recall and precision are considered, the MLP with two hidden layers and 25 and 5 neurons performed well on the Velocity 1.5 dataset.

Table 4 Results obtained for Velocity 1.5 dataset

The evaluation results for the Poi 3.0 dataset are provided in Table 5. According to the results, the MLP neural network with two hidden layers, with 25 and 5 neurons in the first and second hidden layers respectively, showed the best performance in recall, precision and F-Measure for both the true and false classes when compared to the other network configurations.

Table 5 Results obtained for Poi 3.0 dataset

The results obtained for the Netty 3.6.3 dataset with each MLP model are presented in Table 6. In precision, recall and F-Measure, the MLP with three hidden layers, with 15, 10 and 5 nodes from the first hidden layer to the third hidden layer respectively, detected both the true and false classes well.

Table 6 Results obtained for Netty 3.6.3 dataset

Table 7 illustrates the results for the mcMMO 1.4.06 dataset with all tested MLP model configurations. As per the findings, the MLP neural network with two hidden layers, with 15 and 5 nodes in the first and second hidden layers respectively, has relatively high precision, recall and F-Measure for both the true and false classes when compared to the other MLP network configurations.

Table 7 Results obtained for mcMMO 1.4.06 dataset

In terms of precision, recall and F-Measure, the MLP neural network with three hidden layers, with 30 neurons in the first hidden layer, 15 neurons in the second hidden layer and 5 neurons in the final hidden layer, did not outperform the other configurations on any of the datasets.

According to the ROC Area results presented in Table 8, the MLP neural network with two hidden layers, with 25 neurons in the first hidden layer and 5 neurons in the second hidden layer, performed better on the Xalan 2.6 and Poi 3.0 datasets in discriminating between the defective and non-defective classes. For the Velocity 1.5 dataset, the MLP with two hidden layers, with 15 and 5 nodes in the first and second hidden layers respectively, produced the highest ROC Area. Furthermore, the MLP with three hidden layers, with 15, 10 and 5 nodes from the first hidden layer to the last hidden layer respectively, showed a higher ROC Area on the mcMMO 1.4.06 dataset. The MLP with three hidden layers, with 30 neurons in the first hidden layer, 15 neurons in the second hidden layer and 5 neurons in the last hidden layer, performed well on the Netty 3.6.3 dataset, with a ROC Area value of 0.789.

Table 8 ROC Area results

The ROC curves generated by the MLP with two hidden layers, with 25 neurons in the first hidden layer and 5 neurons in the second hidden layer, for the true class of the Xalan 2.6, Velocity 1.5, Poi 3.0, Netty 3.6.3 and mcMMO 1.4.06 datasets are presented in Figs. 5, 6, 7, 8 and 9 respectively. The X-axis in each plot represents the False Positive rate while the Y-axis represents the True Positive rate.
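For readers unfamiliar with how such a curve and its area are obtained, the following sketch builds the (False Positive rate, True Positive rate) points from predicted scores and integrates them with the trapezoidal rule. It is a generic illustration and not the routine Weka uses; the scores and labels are invented, the function names are assumptions, and the code assumes that both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), from (0, 0) to (1, 1).

    scores: predicted probability of the positive (true) class.
    labels: True for positive instances, False for negative ones.
    Assumes at least one positive and one negative instance.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_positive) in enumerate(ranked):
        if is_positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score,
        # so tied scores form a single diagonal segment.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores and labels for six instances.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]

points = roc_points(scores, labels)
print(points)
print(f"ROC Area = {roc_area(points):.4f}")
```

An area of 1.0 corresponds to a model that ranks every positive instance above every negative one, while 0.5 corresponds to a ranking no better than chance.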

Fig. 5 ROC curve of Xalan 2.6

Fig. 6 ROC curve of Velocity 1.5

Fig. 7 ROC curve of Poi 3.0

Fig. 8 ROC curve of Netty 3.6.3

Fig. 9 ROC curve of mcMMO 1.4.06

After analysing the results obtained for all performance metrics, including accuracy, precision, recall, F-Measure and ROC Area, and considering the class imbalance present in the datasets, this study found that the best network configuration among those tested for a software defect detection model using MLP is the one with two hidden layers, with 25 neurons in the first hidden layer and 5 neurons in the second hidden layer. Figure 10 illustrates the proposed MLP architecture.

Fig. 10 Proposed architecture of MLP model for software defect prediction

Conclusion

The field of software engineering mainly focuses on delivering high-quality software within a specified budget and period. However, defects found in the software can delay the delivery process while reducing the quality of the software. Therefore, detecting defects early in the software being developed and fixing them before it is delivered to the end users is a crucial task. Due to the complexity of user requirements as well as the infeasibility of producing defect-free software, the software testing process requires considerable cost and time to deliver quality software. Early detection of defect-prone modules in the software being developed helps software quality assurance teams to identify which modules require more attention during the testing process and thereby to use the available testing resources efficiently and effectively. Automating the early detection of defective modules in software has been an active research area for many years, and it continues to evolve as new technologies emerge. Various approaches, including statistical methods, machine learning techniques and deep learning techniques, have been applied to software defect detection. This study investigated deep learning, one of the emerging areas in AI, for software defect detection. It analysed the performance of the Multi-Layer Perceptron neural network, a supervised deep learning technique, for software defect detection. An experiment with four different network configurations of the MLP neural network was conducted to propose the best possible MLP architecture among them for software defect detection. The performance of the four MLP neural network configurations was evaluated using several metrics computed from a confusion matrix. The evaluation metrics used are accuracy, precision, recall, F-Measure and ROC Area.
Among the MLP network configurations used for the experiment, this study found that the best network configuration for the software defect detection model using MLP is the one with two hidden layers, with 25 neurons in the first hidden layer and 5 neurons in the second hidden layer. The study concludes that the proposed MLP neural network performs better on 3 out of the 5 datasets used when compared to the other network configurations. However, more empirical studies should be conducted to assess the performance of the proposed software defect detection model and thereby help to refine it.