Introduction

In the "Internet +" era, software applications support nearly every daily-life and business activity. People are becoming increasingly dependent on software systems, and their requirements for software quality are rising accordingly. As software grows in scale, its structure becomes more and more complex, and the existence of software defects has become inevitable [1]. Therefore, effective software defect prediction models are urgently needed. Traditional software defect prediction analyzes files, software packages, or entire codebases, but in practice developers may introduce defects every time they commit code. Moreover, because a large amount of code may contain only a very small number of defects, code reviewers spend a great deal of time examining changes that are not defective. For these reasons, researchers proposed just-in-time (instant) defect prediction.

However, whether in traditional or just-in-time software defect prediction, the accuracy of model prediction is affected by class imbalance in the data set. In the software engineering field, 80% of the defects may be concentrated in 20% of the modules [2], as shown in Fig. 1. In most cases, code changes that do not introduce defects account for the larger proportion, so the data set is imbalanced between the minority and majority classes, which degrades the classification performance of the model. The majority class, that is, code changes that do not introduce defects, gives the model an artificially high prediction accuracy, so the expected results are difficult to obtain in practical applications. In addition, the data set contains many irrelevant or redundant features, which also affect the accuracy of model prediction.

To address the above problems, a software defect prediction framework based on Nested-Stacking is proposed. First, in the data preprocessing stage, Nested-Stacking uses heterogeneous feature selection and normalization to improve the data representation, so that the classifier can achieve better results. In the classification stage, the Nested-Stacking classifier stacks several strong and diverse base learners [3], and the nested second layer includes a customized MLP [4], a well-established neural network classifier, allowing the model to combine the advantages of traditional machine learning and deep learning. In addition, two comprehensive indicators, AUC and F1-score, are used to evaluate the classification performance of the model.

Fig. 1 Code repository

Finally, this paper conducts large-scale experiments on two defect data sets with different granularity levels and compares the classification performance of Nested-Stacking with that of other models. The experimental results show that the proposed Nested-Stacking outperforms the other models. To ensure the repeatability of the experiments and facilitate follow-up research, this article also releases the code and data of the experiments. The main contributions can be summarized as follows:

(1)

    As far as we know, we are the first to propose software defect prediction based on Nested-Stacking. Nested-Stacking can learn the characteristics of software defect data more deeply by combining traditional machine learning and deep learning. It first selects the features most suitable for each baseline model through heterogeneous feature engineering and then normalizes all feature values. The processed data are input into the Nested-Stacking classifier for the first stage of prediction. Finally, the meta-classifier LogisticRegression predicts defects and the evaluation indicators are output.

(2)

    In a large-scale empirical study, we compared the classification performance of Nested-Stacking and other models in two evaluation scenarios (WPDP/CPDP) on two data sets (Kamei/PROMISE).

(3)

    The empirical results show that in both evaluation scenarios, the classification performance of Nested-Stacking is better than that of the other models. In WPDP-Kamei in particular, the classification performance of Nested-Stacking is very competitive.

The organizational structure of this paper is as follows: The second section introduces the background and related work of software defect prediction. The third section introduces Nested-Stacking and heterogeneous feature selection in detail. The fourth section introduces the experimental setup. Section 5 presents the experimental results and analysis. Section 6 discusses threats to validity. Finally, the seventh section summarizes the work of this paper and looks forward to future work.

Related work

This section introduces related work on just-in-time software defect prediction, as well as related work on feature selection and ensemble models in the field of traditional software defect prediction.

Just-in-time software defect prediction

In recent years, JIT-SDP has become a research hotspot in the field of defect prediction because of its fine granularity and instant traceability. For the software defect prediction problem, Khuat et al. [5] empirically evaluated the importance of sampling for various classifier ensembles on imbalanced data by combining sampling techniques with ensemble learning models, reporting positive effects on data with the class imbalance problem. Zhu et al. [6] proposed a just-in-time defect prediction model, DAECNN-JDP, based on a denoising autoencoder and a convolutional neural network. DAECNN-JDP assigns different weights to the position vector of each feature dimension and trains them with adaptive trainable vectors; by training the denoising autoencoder, input features free from noise pollution can be obtained and a more robust feature representation can be learned. Pascarella et al. [7] proposed a novel fine-grained model to predict the defective files contained in a commit and to reduce, based on the model's classification performance, the effort required to locate defects. Yan et al. [8] proposed a two-stage framework consisting of defect identification and defect localization. Given a new change, the framework first identifies whether it is a buggy change or a clean change; if the change is identified as defective, JIT defect localization ranks the source code lines introduced by the change according to their suspiciousness scores. In the localization stage, they use a software-naturalness-based localization method built on the N-gram model. Bejjanki et al. proposed class imbalance reduction (CIR) [9], an algorithm that establishes symmetry between defective and non-defective records in an imbalanced data set by considering the distribution characteristics of the data set. Yang et al. proposed a supervised method, DEJIT [10], based on Differential Evolution (DE) to build a JIT-SDP model. Specifically, they first proposed a metric called Density Percent Average (DPA), which is used as the optimization target on the training set. Then, logistic regression (LR) is used to build the prediction model. To make LR achieve the maximum DPA on the training set, they use the DE algorithm to determine the coefficients of LR.

In JIT-SDP-related research, many processing methods tend to change the data distribution. In real data, the distribution of the training and test data cannot be confirmed in advance, and changes to the data distribution may lead to a decrease in model accuracy. This article does not perform any up-sampling or down-sampling of the data, nor does it modify the data distribution, which ensures that the model learns the original data distribution.

Software defect prediction based on ensemble model

In the field of software defect prediction, there has also been much related research on ensemble models. Alsawalqah [11] proposed a hybrid classification method for software defect prediction; its main idea is to develop expert and robust classification models based on groups of similar patterns. Malhotra [12] conducted an empirical comparison of software defect prediction models developed using various boosting-based ensemble methods on three open-source JAVA projects, all of which included resampling techniques; the results show that applying resampling before ensemble classification can significantly improve the prediction accuracy of the model. Matloob et al. [13] proposed an ensemble classification framework based on multiple feature selection algorithms and compared the results with ten machine learning classifiers. Li et al. [14] proposed a new two-stage ensemble learning (TSEL) method, which includes an ensemble multi-kernel domain adaptation (EMDA) stage and an ensemble data sampling (EDS) stage. Iqbal et al. [15] presented a classification framework that uses a multi-filter feature selection technique and a Multi-Layer Perceptron (MLP) to predict defect-prone software modules. Öztürk [16] studied the effect of hyperparameter optimization on ensemble learning algorithms in terms of defect prediction performance. The purpose of the study by Kakkar et al. [17] was to determine the most appropriate imputation technique for handling missing values in SDP data sets.

At present, ensemble models in the field of software defect prediction are mostly single soft-voting ensembles or traditional stacking, such as bagging, voting, and stacking. In terms of prediction performance, they may not be as effective as a single tree-based model, and they take longer to train. In addition, the ordinary K-Fold data splitting used in most recent studies does not consider the impact of class imbalance in software defect data. The Nested-Stacking proposed in this paper uses StratifiedKFold and integrates the advantages of various boosting algorithms to improve the accuracy of software defect prediction.

Software defect prediction based on feature selection

Ni et al. proposed a multi-objective feature selection method, MOFES (Multi-Objective Feature Selection) [18], which optimizes two objectives simultaneously. One objective is to minimize the number of selected features, which relates to the cost analysis of the problem; the other is to maximize the performance of the constructed model, which relates to the benefit analysis of the problem. Balogun et al. used 4 different classifiers to evaluate 5 software defect data sets from the NASA software defect database with 4 filter feature ranking (FFR) and 14 filter feature subset selection (FSS) methods [19]. Wrapper-based feature selection in traditional software defect prediction models achieves good prediction performance, but it has a high computational cost and lacks generalization ability. Therefore, Oluwagbemiga et al. proposed a hybrid multi-filter wrapper method [20], which exploits the relationships between filter–filter and filter–wrapper combinations to obtain an optimal feature subset and reduce the time consumed by the model. Bashir et al. proposed a new method based on Maximum Likelihood Logistic Regression (MLLR) [21]; on the data sets used in the study, analysis of variance (ANOVA) F-test results verified that the proposed method is superior to all FS techniques, on both sampled and unsampled data.

Software defect data sets are large and have high feature dimensionality. A large number of redundant features degrade the classification performance of the model, so feature selection is necessary. At present, feature selection methods can be divided into three types: filter, embedded, and wrapper. Filter algorithms (such as the Pearson correlation coefficient [22] and the Spearman rank correlation coefficient [23]) select the optimal feature subset independently of the learning algorithm and are therefore flexible, but the accuracy of the selected features is limited. Wrapper algorithms select the optimal feature subset based on the performance of the classification algorithm over multiple rounds of iteration; however, the huge search space leads to high computational complexity, and the method has limited versatility, that is, when the algorithm changes, a new round of expensive training must start again. In software defect data, most of the features that determine the label have small variance, so the commonly used methods above cannot handle these data well.

This study proposes Nested-Stacking to predict software defects and, at the data level, chooses different features for different baseline models. The experimental results show that nesting and stacking various boosting algorithms and an MLP can also reduce the impact of class imbalance in the data set on the classification performance. The innovations of this article are as follows:

(1) Customized baseline models can be nested to adapt to the various distributions of software defect data. (2) The data-splitting algorithm in Nested-Stacking uses StratifiedKFold instead of traditional KFold to adapt to the imbalanced distribution of the data. (3) Each baseline model in Nested-Stacking can randomly select or be assigned specific feature columns for training and prediction; for example, RandomForest can be restricted to the first, fourth, seventh, and eighth feature columns, while AdaBoost randomly selects 10% of the features for training and prediction.

Nested-Stacking framework

In this section, we will elaborate on the process and steps of the model framework proposed in this article, as shown in Fig. 2.

Fig. 2 Framework diagram of the model proposed in our study

(1) Data collection and preprocessing: First, two open-source software defect data sets with different granularity levels are collected. Second, each baseline model performs heterogeneous feature selection based on its own feature importance calculation, filtering out features of low importance to reduce the data dimensionality and improve model accuracy and computation speed. Finally, the data are normalized to prevent differences in feature scales from affecting the internal processing of the model. Heterogeneous feature selection is detailed later in this section. (2) Model construction: First, the combination of baseline models and meta-classifier is determined through repeated experiments. The predictions of the stacked baseline models are input to the meta-classifier, which performs the final classification. (3) Model evaluation: within-project Stratified K-Fold \((K=10)\) [24] cross-validation and cross-project validation are used, and two comprehensive evaluation indicators, Area Under Curve (AUC) and F1-score, assess the generalization performance of the model.

Nested-Stacking classifier

The core of the Nested-Stacking classifier is to build multiple different baseline models and stack them to obtain better classification performance; the flowchart is shown in Fig. 3. From the perspective of the baseline models, model stacking can be either homogeneous or heterogeneous: if the models are generated using the same induction algorithm, the ensemble is called homogeneous; otherwise, it is heterogeneous. Both are used for software defect prediction. In addition, from the perspective of the generation strategy of the baseline models, ensemble learning methods can be divided into two categories: parallel methods, mainly represented by bagging, and sequential methods, mainly represented by boosting. Beyond bagging and boosting, stacking is another very effective ensemble learning method. Its core idea is to reduce the generalization error by combining several strong and diverse baseline models and using a meta-classifier to combine their predictions, thereby improving the accuracy of software defect prediction. In this article, we stack representative algorithms of these two methods and nest an additional layer of basic stacking. Figure 3 compares the similarities and differences between the ensemble model framework of previous work [25] and the Nested-Stacking classifier framework. The heterogeneous integration of different types of base learners can improve the generalization ability of the model; in addition, combining the prediction results with a meta-learner can improve the fitting ability of the model. Through these nesting rules, Nested-Stacking further improves the classification performance.

Fig. 3 Nested-Stacking classifier

As shown in Fig. 3, the input data of the Nested-Stacking classifier have already been processed by heterogeneous feature selection and normalization. The Nested-Stacking classifier contains three layers. The first layer integrates three boosting algorithms, LightGBM, CatBoost, and AdaBoost, and nests a simple stacking model consisting of MLP and RandomForest whose meta-classifier is a Gradient Boosting Decision Tree. The last layer is the LogisticRegression meta-classifier, which performs the final classification prediction. In the early stage of the experiments, to obtain baseline models with better generalization ability and accuracy, we combined CART, SVM, Naïve Bayes, k-Nearest Neighbor, BaggingClassifier, DecisionTree, and other classifiers of different types and numbers in a wide range of experiments. The results show that the model achieves the best predictive performance when LightGBM, CatBoost, and AdaBoost are used as baseline models. Compared with other ensemble learning algorithms, Nested-Stacking uses the predictions of the first-level models as the input of the second level when stacking; therefore, model bias is effectively reduced, accuracy is improved, and the prediction results on the software defect data sets are optimized.
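For illustration, the following sketch shows one way this nested structure could be realized with scikit-learn's StackingClassifier together with the lightgbm and catboost packages; the hyperparameters are placeholders rather than the exact settings used in our experiments.

from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Inner (nested) stack: MLP + RandomForest, combined by a GBDT meta-classifier.
inner_stack = StackingClassifier(
    estimators=[
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    final_estimator=GradientBoostingClassifier(),
    cv=5,
)

# Outer stack: three boosting algorithms plus the nested stack,
# combined by the final LogisticRegression meta-classifier.
nested_stacking = StackingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier()),
        ("cat", CatBoostClassifier(verbose=0)),
        ("ada", AdaBoostClassifier()),
        ("inner", inner_stack),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Usage: nested_stacking.fit(X_train, y_train); nested_stacking.predict(X_test)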

Heterogeneous feature selection

In the Nested-Stacking framework, each baseline model can perform its own feature selection, which we call heterogeneous feature selection. Letting each baseline model select the features on which it performs best (the selected features need not be the same across models) and then stacking their predictions improves the overall classification performance of Nested-Stacking. In heterogeneous feature selection, a baseline model can also be set to select features by percentage, thereby reducing the overall running time of the model. The sketch below illustrates per-model feature subsetting; the remainder of this section describes the feature selection procedures of some of the baseline models in the Nested-Stacking framework.
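As a minimal sketch of per-model feature subsetting, each base learner can be wrapped in a pipeline whose first step keeps only the desired columns; the column indices, the total feature count, and the 10% fraction below are illustrative assumptions, not the exact settings of our experiments.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

n_features = 14  # assumed total number of metrics in the data set
rng = np.random.default_rng(0)

# RandomForest restricted to the 1st, 4th, 7th, and 8th feature columns.
rf_fixed = make_pipeline(
    ColumnTransformer([("keep", "passthrough", [0, 3, 6, 7])], remainder="drop"),
    RandomForestClassifier(n_estimators=100),
)

# AdaBoost trained on a randomly chosen 10% of the feature columns.
random_cols = sorted(rng.choice(n_features, size=max(1, n_features // 10), replace=False))
ada_random = make_pipeline(
    ColumnTransformer([("keep", "passthrough", random_cols)], remainder="drop"),
    AdaBoostClassifier(),
)

# Both pipelines expose fit/predict and can serve directly as base
# estimators inside the Nested-Stacking classifier.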

AdaBoost [26] can serve as an embedded algorithm for feature selection: it uses an ensemble of weak classifiers to select the best-performing features and thereby obtain the optimal target feature subset. Its weak classifiers are usually decision stumps, BP neural networks, SVMs, or similar learners. Some code changes differ only in minor feature values, and such small differences are easily ignored by classifiers with soft thresholds such as SVMs and BP neural networks. Therefore, to improve the model's sensitivity to these small differences, threshold-based decision stumps are used as the weak classifiers of AdaBoost. Specifically, the AdaBoost feature selection process for the software defect data set is as follows:

Input: Software defect data set

$$\begin{aligned} \begin{aligned} {X}=&\left\{ \left( {x}_{1}^{1}, {x}_{1}^{2}, \ldots , {x}_{1}^{m}\right) , \right. \\&\left. \left( {x}_{2}^{1}, {x}_{2}^{2}, \ldots , {x}_{2}^{m}\right) , \ldots , \right. \\&\left. \left( {x}_{n}^{1}, {x}_{n}^{2}, \ldots , {x}_{n}^{m}\right) \right\} , \end{aligned} \end{aligned}$$
(1)

each defect data label

$$\begin{aligned} Y=\left\{ y_{1}, y_{2}, \ldots , y_{n}\right\} , \end{aligned}$$
(2)

where m is the number of features and n is the number of samples.

Output: The optimal target feature subset F.

Step 1. For a software defect data feature j, train a corresponding weak classifier h\(_j\) to evaluate its importance.

Step 2. Set the correspondence hypothesis between features and labels

$$\begin{aligned} h_{t}=\left\{ x_{i}^{j} \rightarrow Y\right\} . \end{aligned}$$
(3)

Step 3. The error corresponding to \({D}({X}_{{i}}^{{j}})\) is expressed as

$$\begin{aligned} \varepsilon _{t}=\sum _{i: h_{t}\left( x_{i}^{j}\right) \ne y_{i}} D_{t}\left( x_{i}^{j}\right) . \end{aligned}$$
(4)

Step 4. The feature f whose weak classifier has the smallest error \(\varepsilon _t\) is removed from the initial feature set X and added to the optimal target feature subset F: \(X=X{-}\{f\}\), \(F=F{+}\{f\}\).

Step 5. Use the error of the best-performing weak classifier h\(_t\) to compute its weight

$$\begin{aligned} \beta _{t}=\frac{1}{2} \ln \left( \frac{1-\varepsilon _{t}}{\varepsilon _{t}}\right) . \end{aligned}$$
(5)

Step 6. Update \({D}({X}_{{i}}^{{j}})\) to

$$\begin{aligned} D_{t+1}\left( x_{i}^{j}\right) =\frac{D_{t}\left( x_{i}^{j}\right) }{N_{t}} \times \left\{ \begin{array}{ll} e^{-\beta _{t}}, &{} \text{ if } h_{t}\left( x_{i}^{j}\right) =y_{i} \\ e^{\beta _{t}}, &{} \text{ otherwise. } \end{array}\right. \end{aligned}$$
(6)

Among them, N\(_t\) is a normalization constant, such that

$$\begin{aligned} \sum _{i=1}^{m} D_{t}\left( x_{i}^{j}\right) =1. \end{aligned}$$
(7)

Step 7. After completing the iteration, the final strong classifier is obtained

$$\begin{aligned} H\left( x_{i}^{j}\right) ={\text {sign}}\left( \sum _{t=1}^{T} \beta _{t} \times h_{t}\left( x_{i}^{j}\right) \right) . \end{aligned}$$
(8)

By setting the weak classifier to traverse all the features, the optimal target feature subset is formed according to the optimal feature selected in each iteration

$$\begin{aligned} \begin{aligned} F=&\left\{ \left( {x}_{1}^{1}, {x}_{1}^{2}, \ldots , {x}_{1}^{k}\right) ,\right. \\&\left. \left( {x}_{2}^{1}, {x}_{2}^{2}, \ldots , {x}_{2}^{k}\right) , \ldots ,\right. \\&\left. \left( {x}_{n}^{1}, {x}_{n}^{2}, \ldots , {x}_{n}^{k}\right) \right\} , \end{aligned} \end{aligned}$$
(9)

where k represents the number of features of the optimal target feature obtained.
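The steps above can be approximated in practice by training an AdaBoost ensemble of threshold-based decision stumps and keeping the top-ranked features; this sketch uses the ensemble's aggregated feature importances instead of per-feature weak classifiers, and the value of k is a placeholder.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def adaboost_feature_subset(X, y, k=8):
    """Return indices of the k features ranked highest by an AdaBoost
    ensemble whose weak classifiers are decision stumps."""
    stump = DecisionTreeClassifier(max_depth=1)  # threshold-based decision stump
    # Older scikit-learn versions name this parameter base_estimator.
    ada = AdaBoostClassifier(estimator=stump, n_estimators=200)
    ada.fit(X, y)
    ranked = np.argsort(ada.feature_importances_)[::-1]  # most important first
    return ranked[:k]

# Usage: F = X[:, adaboost_feature_subset(X, y, k=8)]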

RandomForest [27] is a combined classifier based on decision trees and can also be used for feature selection. RandomForest uses the bagging method to draw samples with replacement from the original sample set to train each base classifier, so about 37% of the samples are never selected; these samples are called Out-of-Bag (OOB) data. To calculate the importance of a defect feature, the OOB data are used as a test set for each trained base learner and the test error rate is recorded as errOOB; then noise is added to the feature under evaluation in the OOB samples and the error is calculated again as errOOB'. The average change in error over all base learners, the mean decrease in accuracy (MDA), is used as the indicator of feature importance

$$\begin{aligned} MDA=\frac{1}{n} \sum _{t=1}^{n}\left( {\text {errOOB}}_{t}^{\prime }-{\text {errOOB}}_{t}\right) . \end{aligned}$$
(10)

In Eq. (10), n is the number of base learners and \(\hbox {errOOB}^{\prime }\) is the out-of-bag error after adding noise. The larger the MDA value, the greater the impact of the corresponding feature on the prediction result, and the higher its importance. This feature importance calculation method is called RandomForest's out-of-bag estimation. According to this method, the importance of the features in the software defect data is ranked and feature selection is performed.
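The sketch below approximates this out-of-bag estimation with scikit-learn; permutation_importance permutes features on a held-out split rather than strictly on each tree's OOB samples, so it is an approximation of the MDA procedure described above, and the parameter values are placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rf_importance_ranking(X, y, n_repeats=10):
    """Rank features by the mean decrease in accuracy under permutation noise."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X_tr, y_tr)
    # Permuting a feature plays the role of "adding noise"; the resulting
    # drop in accuracy approximates the MDA indicator of Eq. (10).
    result = permutation_importance(rf, X_val, y_val, n_repeats=n_repeats,
                                    scoring="accuracy", random_state=0)
    return np.argsort(result.importances_mean)[::-1]  # most important first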

Data preprocessing

In terms of data preprocessing, this paper uses StandardScaler to standardize the experimental data (zero mean and unit variance), so that Nested-Stacking can process and fit the data faster.
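A minimal sketch of this step; fitting the scaler only on the training split avoids leaking test-set statistics into the model.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics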

Experimental setup

This section introduces the experimental setup from three aspects: the data sets, the performance evaluation indicators, and the data analysis methods. Our experiments were performed on hardware with an Intel(R) Core(TM) i5-10600KF CPU at 4.10 GHz, 16 GB of RAM, and an RTX 3090 GPU. The operating system is Windows 10 Pro, and the scripts are written in Python 3.8.

With the analysis of the experimental results, this article answers the following two questions:

(1)

    In the within-project cross-validation scenario, how does the classification performance of the Nested-Stacking model compare with that of other models?

(2)

    In the cross-project validation scenario, how does the classification performance of the Nested-Stacking model compare with that of other models?

Datasets

To verify the effectiveness of our proposed framework, this study uses Kamei [28] and PROMISE [29] as experimental data sets. The data set information provided by Kamei is shown in Table 1, and its features are shown in Table 2.

Table 1 Datasets information (Kamei)
Table 2 Features of datasets (Kamei)

Table 3 shows the specific information of the PROMISE data set, including the project name, project version, number of code files, and defect rate. In addition, the 20 static metric features selected in this experiment were all defined by Jureczko et al. [30] for object-oriented programs, including Depth of Inheritance Tree, Number of Children, Lines of Code, and related code complexity features, as shown in Table 4.

Table 3 Datasets information (PROMISE)
Table 4 Features of datasets (PROMISE)

Performance indicators

This section introduces the evaluation indicators selected for this experiment: the confusion matrix, precision, recall, F1-score, and the area under the ROC curve (AUC). The software defect prediction problem addressed in this paper is a binary classification problem, so comprehensive indicators such as AUC and F1-score are used to evaluate the prediction model. The confusion matrix is shown in Table 5. TP (True Positive) is the number of defective instances correctly predicted as defective; TN (True Negative) is the number of non-defective instances correctly predicted as non-defective; FP (False Positive) is the number of non-defective instances incorrectly classified as defective; FN (False Negative) is the number of defective instances incorrectly classified as non-defective.

Precision and recall are defined below; F1-score, the harmonic mean of precision and recall, is commonly used as a model evaluation indicator in the field of software defect prediction. Their calculation formulas are as follows:

$$\begin{aligned} { Precision }= & {} \frac{T P}{T P+F P} \end{aligned}$$
(11)
$$\begin{aligned} { Recall }= & {} \frac{T P}{T P+FN} \end{aligned}$$
(12)
$$\begin{aligned} F1\text {-}score= & {} 2 \times \frac{ Precision \times Recall }{ Precision + Recall }. \end{aligned}$$
(13)

Area Under Curve (AUC): AUC is defined as the area under the ROC curve and the coordinate axis. The corresponding points in the ROC space are calculated according to the model classification results, and the ROC curve is formed by connecting these points. The abscissa is False-Positive Rate (FPR), and the ordinate is True-Positive Rate (TPR).
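For reference, these indicators can be computed from a fitted classifier's outputs as follows; y_test, y_pred (predicted labels), and y_prob (predicted probability of the defective class) are assumed to come from any of the models discussed above.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)        # harmonic mean of precision and recall, Eq. (13)
auc = roc_auc_score(y_test, y_prob)  # area under the ROC curve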

Data analysis method

The experiments considered two practical scenarios to evaluate the predictive performance of the proposed framework: within-project StratifiedKFold \((K=10)\) cross-validation and cross-project validation. These experimental methods are described in detail below.

Within-project stratified K-Fold (K = 10) cross-validation

First, the data set is randomly shuffled and divided into ten folds; each fold in turn is used as the test set, and the other nine folds are used as the training set. The prediction model is trained and validated to compute the performance indicators, for a total of 10 runs. The performance indicators obtained in each run are averaged as the evaluation result of the tenfold cross-validation. The usage of StratifiedKFold is similar to KFold, but because of its stratified sampling principle, it ensures that the proportion of samples of each class in the training and test sets is the same as in the original data set.
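A minimal sketch of this procedure, assuming model is any classifier with the scikit-learn fit/predict interface (for example, the nested stacking classifier sketched earlier):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_score

def stratified_cv_scores(model, X, y, k=10):
    """Within-project StratifiedKFold (K=10) cross-validation returning mean AUC and F1."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    aucs, f1s = [], []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]  # probability of the defective class
        pred = model.predict(X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], prob))
        f1s.append(f1_score(y[test_idx], pred))
    return np.mean(aucs), np.mean(f1s)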

Cross-project validation

In recent years, cross-project defect prediction has received widespread attention. The basic idea of cross-project defect prediction is to train a defect prediction model with the defect data set of one project, and then use it to predict the defects of another project.

Statistical test

The above experiments all use the Wilcoxon signed-rank test [31] and Cliff's \(\delta \) [32] to assess the significance of the performance differences between Nested-Stacking and the other models. The Wilcoxon signed-rank test is a commonly used non-parametric statistical hypothesis test; this article uses the p value to test whether the performance difference is statistically significant at the 0.05 significance level. The experiments also use Cliff's \(\delta \) to determine the magnitude of the difference in prediction performance between models. The size of the difference is usually divided into four levels, as shown in Table 6. In summary, when the p value is less than 0.05 and Cliff's \(\delta \) is greater than or equal to 0.147, the difference in prediction performance between the models is considered significant.

Table 5 Confusion matrix
Table 6 Cliff’s \(\delta \) values and their effective levels
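The two tests can be computed as sketched below, assuming scores_a and scores_b are paired per-project performance values (e.g., AUC) of two models; the Cliff's delta helper is written out because it is not provided by SciPy.

import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Cliff's delta effect size between two lists of scores."""
    a, b = np.asarray(a), np.asarray(b)
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

def compare_models(scores_a, scores_b, alpha=0.05, delta_threshold=0.147):
    """Difference is considered significant if p < 0.05 and |delta| >= 0.147."""
    _, p_value = wilcoxon(scores_a, scores_b)
    delta = cliffs_delta(scores_a, scores_b)
    return (p_value < alpha) and (abs(delta) >= delta_threshold), p_value, delta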

Experimental results and analysis

This section uses within-project StratifiedKFold \((K=10)\) cross-validation and cross-project validation on the two data sets to compare the classification performance of Nested-Stacking and the other models.

Analysis of the RQ1

The experiments compare Nested-Stacking with other models and use within-project StratifiedKFold \((K = 10)\) cross-validation to classify and predict the two types of open-source project data sets. Since each model produces ten prediction results in this scenario, the AUC and F1-score values in the tables are the averages of the experimental results. The better AUC and F1-score values are marked in bold.

WPDP-Kamei

In the experiments on the Kamei data set, this study follows the experimental methods and performance indicators (AUC, F1-score) of LocalJIT [33] and Kamei, and compares with their experimental results. The results are shown in Fig. 4 and Table 7.

Table 7 Within-project StratifiedKFold \((K=10)\) cross-validation (Kamei)

In Table 7, rows 2–7 give the AUC results of Nested-Stacking and the other three models on the 6 data sets. Row 8 is the average AUC of each model, and row 9 (W/D/L) shows the number of wins for each prediction model on the 6 data sets. Among the other three models, the highest average AUC is 0.8198 (Stacking) and the lowest is 0.704 (LocalJIT). The average AUC of Nested-Stacking is 0.8322, an average increase of 8.25% over the other three models. It is worth noting that the AUC of Nested-Stacking on the Mozilla data set is 0.8517, the highest single-project AUC obtained by any model.

Fig. 4 WPDP-Kamei

The experiments in this section analyze and compare the classification performance of Nested-Stacking and other JIT-SDP models. The results of the within-project StratifiedKFold \((K=10)\) cross-validation show that, compared with the other JIT-SDP models, Nested-Stacking achieves an average increase of 8.25% in AUC and 22.08% in F1-score, which supports the choice of Nested-Stacking as the core classifier in this paper. It can be concluded from Fig. 4 that the classification performance of Nested-Stacking on JIT-SDP is better than that of the other JIT-SDP models.

WPDP-PROMISE

In the WPDP experiment of the PROMISE data set, this study follows the experimental method and performance indicator (F1-score) of ImprovedCNN [34] and COSTE [35]. The experimental results are shown in Fig. 5 and Table 8.

Table 8 Within-project StratifiedKFold \((K=10)\) cross-validation (PROMISE)

In Table 8, the first column is the PROMISE data set, and the other columns are the prediction models and their results. The second column is the Nested-Stacking proposed in this article; the third column is the Stacking model with the nested part removed, retaining only the three boosting algorithms; the fourth column is the ImprovedCNN model evaluated on PROMISE; the last column is COSTE, a model designed for class-imbalanced data.

Row 12 is the average F1-score of each model, and row 13 (W/D/L) shows the number of wins for each model on PROMISE. The predictions of Nested-Stacking on ant, camel, jedit, synapse, and xalan are better than those of the other three models; traditional Stacking performs better on log4j, poi, and xerces; ImprovedCNN performs best on lucene and velocity. Although the average performance of COSTE is relatively good, it does not stand out on any single data set. Apart from Nested-Stacking, the highest average F1-score is 0.7912 (traditional Stacking) and the lowest is 0.6003 (ImprovedCNN). The average F1-score of Nested-Stacking is 0.7988, an average increase of 14.83% over the other models.

Table 8 and Fig. 5 compare the classification performance of Nested-Stacking and the other prediction models on PROMISE. The within-project StratifiedKFold \((K=10)\) cross-validation results show that the F1-score of Nested-Stacking increased by 14.83% on average, which demonstrates the suitability of Nested-Stacking as a traditional software defect prediction model.

Based on the above two WPDP experiments, it can be concluded that Nested-Stacking is superior to other prediction models in classification performance on software defect data sets of different granularities.

Analysis of the RQ2

This section discusses the classification performance difference between Nested-Stacking and other models in cross-project validation scenarios. The AUC or F1-score in the graph and table is the average of the experimental results.

CPDP-Kamei

In this CPDP experiment, we use the six data sets provided by Kamei, so each software defect prediction model produces 30 prediction results, and the results are averaged. The experimental results are shown in Fig. 6.

The AUC comparison of CPDP-Kamei is shown in Fig. 6. The average of the 30 prediction results produced by Nested-Stacking is 0.6713, and the average AUCs of the remaining three models are 0.6648, 0.6065, and 0.6958, respectively. In comparison, Nested-Stacking is 0.9% higher than traditional Stacking in cross-project validation and 10.6% higher than LocalJIT; however, compared with Kamei, the originator of JIT-SDP, it is 3.6% lower, possibly due to overfitting of Nested-Stacking. To sum up, compared with the other models, Nested-Stacking shows an average increase of 2.4% in the cross-project validation scenario, which demonstrates a certain degree of progress.

Fig. 5 WPDP-PROMISE

Fig. 6 Cross-project validation (Kamei)

The comparison of CPDP-Kamei's F1-score results is shown in Fig. 6. The average of the 30 prediction results produced by Nested-Stacking is 0.7073, and the average F1-scores of the remaining three models are 0.6940, 0.5758, and 0.3814, respectively. In comparison, Nested-Stacking is 1.33% higher than traditional Stacking in the cross-project validation scenario, 22.84% higher than LocalJIT, and 85.45% higher than Kamei, a remarkable improvement. To sum up, compared with the other models, Nested-Stacking shows an average increase of 28.51% in the cross-project validation scenario, which proves that the model proposed in this paper still has high practical value in cross-project validation scenarios.

Fig. 7 CPDP-PROMISE

CPDP-PROMISE

In the CPDP experiment of the PROMISE data set, this study follows the experimental methods and performance indicator (F1-score) of DBN [36], TCA+ [37], and DTL-DP [38], and compares with their experimental results. The results are shown in Fig. 7 and Table 9.

In the CPDP experiment in this section, Nested-Stacking is compared with the other three models. Following previous research, we conducted 22 experiments; in each experiment, we randomly selected two project versions from two different projects, one as the training set and the other as the test set. The average F1-score of Nested-Stacking is 0.6712, while DBN, TCA+, and DTL-DP reach 0.5681, 0.4799, and 0.6179, respectively. Compared with the other three models, Nested-Stacking improves by 18.15%, 39.65%, and 8.46%, respectively. It can be seen from Fig. 7 and Table 9 that the model proposed in this paper performs well in most experiments, followed by DTL-DP. The four models won 12, 2, 0, and 8 of the experiments, respectively.

Combining the CPDP experiments on the two data sets, it can be concluded that in the CPDP scenarios of the two data sets with different granularities, the classification performance of Nested-Stacking improves on that of the other models. Therefore, the proposed model should also show its superiority in practical testing scenarios.

Table 9 Cross-project validation (PROMISE)

Discussion

In the experimental part of this study, we demonstrated the superiority of Nested-Stacking on two large data sets. But why does Nested-Stacking perform well? We discuss the main reasons why the proposed framework is superior to the other baseline models from the following aspects.

First, in JIT-SDP-related studies, many processing methods tend to change the data distribution. The distribution of training and test data in real data cannot be confirmed, and the change of data distribution may lead to a decrease in model accuracy. Nested-Stacking does not perform any up-sampling or down-sampling of the original data, nor does it modify the data distribution, ensuring that the software defect prediction model learns the original data distribution.

Second, most recent studies use ordinary KFold data partitioning and do not consider the effect of class imbalance in software defect data. In this paper, the StratifiedKFold data partitioning method, combined with different boosting algorithms and an MLP, significantly reduces the effect of class imbalance on classification performance.

Third, traditional feature selection methods can be divided into filter, embedded, and wrapper types. However, in software defect data, most of the features that determine the label have small variance, so the commonly used methods above cannot solve the feature engineering problem well. In Nested-Stacking, each baseline model selects heterogeneous features based on its own feature importance calculation, filtering out low-importance features, thereby reducing the data dimensionality and improving model accuracy and computing speed.

Finally, ensemble models in software defect prediction are mostly single soft-voting ensembles or traditional stacking, such as bagging, voting, and stacking strategies. In terms of prediction performance, they may not be as good as a single tree model and may take longer. In the early stage, to obtain a classifier with better generalization ability and accuracy, we selected baseline models of different types and numbers and conducted extensive combination experiments. The Nested-Stacking classifier contains three layers: the first layer integrates the LightGBM, CatBoost, and AdaBoost algorithms and nests a simple stacking model containing MLP and RandomForest, whose meta-classifier is a Gradient Boosting Decision Tree; the final LogisticRegression meta-classifier performs the final classification prediction. In contrast to other ensemble learning algorithms, Nested-Stacking takes the prediction results of the first-level models as the input of the second-level model when stacking. Therefore, model bias is effectively reduced, accuracy is improved, and the prediction results on the software defect data sets are optimized.

Threats to validity

This section analyzes the effectiveness of the proposed software defect prediction model from two main perspectives: external effectiveness and internal effectiveness.

External validity reflects whether the conclusions drawn from the experimental research are universal: this paper uses two software defect data sets of different granularity (Kamei, PROMISE), both of which provide all static metrics of the project program modules as open-source data; these representative data sets help ensure the correctness of the research conclusions. Another aspect of external validity concerns the experimental comparisons: although we strictly follow the processes described for the comparison models, they may not be fully restored in every original detail. Furthermore, methods based on traditional machine learning or deep learning involve randomness, which also makes the experimental results differ from those reported in the original literature.

Internal validity affects the accuracy of the experimental results: the code written for this article is mainly based on the Scikit-learn machine learning package for Python, which helps ensure the accuracy of model construction. From the perspective of model accuracy and stability, AUC and F1-score are used to ensure the reliability of the metrics; these two comprehensive evaluation indicators have been used in much previous literature to evaluate defect prediction models.

Conclusion validity refers to the rationality of the relationship between the treatment and the outcome. The experiments were repeated ten times to minimize random deviation. In addition, at the significance level of \(p=0.05\), two non-parametric hypothesis testing methods, the Wilcoxon signed-rank test and Cliff's \(\delta \) test, were used to assess the significance and effect size of the differences. Other statistical testing methods and significance levels will be explored in future work to further analyze our Nested-Stacking model.

Conclusions and future work

The Nested-Stacking framework proposed in this paper is based on the stacking principle, combines various boosting algorithms with a customized deep learning algorithm, and makes feature selection as principled as possible. A large number of experiments were conducted on feature engineering and baseline model selection. The experiments, using within-project StratifiedKFold \((K=10)\) cross-validation and cross-project validation, show the effectiveness of the model on software defect data at two levels of granularity, which demonstrates that Nested-Stacking can provide reliable decision support for software testing. Furthermore, the framework adopts a heterogeneous integration method in which the base learners and meta-learners come from various fields; therefore, this method improves both the fitting ability and the generalization ability of the model and outperforms currently popular software defect prediction methods.

However, the disadvantage of Nested-Stacking is that the optimal combination of the baseline model is derived from complex experiments, and the process is not very efficient. In the future, at the model combination level, our goal is to build a more intelligent and automated prediction system, in which all model combinations are traversed until the optimal combination is found and its parameters are optimized. Finally, the whole Nested-Stacking process becomes more efficient and intelligent.

Open access

To ensure the reproducibility of the experiments in this article and to facilitate follow-up research, all the data and code of this study are available at https://github.com/WangHuoShanPY.