1 Introduction

Google Play Store hosts 3.48 million Android applications as of the first quarter of 2021 (Statista Research Department, 2021b). In 2020, approximately 112.6 thousand Android applications were released monthly on average through Google Play Store (Statista Research Department, 2021a). Most, if not all, of these applications are updated frequently for corrective or perfective maintenance reasons, including adding new features, removing outdated features, fixing bugs, and improving performance. As a good practice, developers tend to release frequent updates to obtain quick feedback from users and keep their applications up to date. This practice adds complexity to the software development process, since frequent changes may lead to more defects and require additional effort for developing new test cases. However, there are many other reasons, such as complicated internal logic, why defects still occur in software systems (Wang et al., 2021). Time-to-market pressure, tight deadlines, and a limited testing budget are some of the other potential causes of software defects.

To efficiently allocate resources to defect-prone components in software systems, software engineering researchers have been developing software defect prediction models using statistical techniques and, mostly, machine learning algorithms since the 1990s (Alan & Catal, 2011). Most machine learning models assume that a sufficient number of labeled data points is available for building supervised models; however, in some cases there might be very little labeled data (a.k.a. semi-supervised defect prediction) or even no labeled data at all (a.k.a. unsupervised defect prediction). In these cases, supervised machine learning models cannot be applied because of the missing labels in the data. Researchers built different kinds of models to address these cases (Catal, 2014; Catal & Diri, 2008, 2009; Catal et al., 2010; Li et al., 2020; Sun et al., 2021; Zhang et al., 2017). Some researchers also aimed to utilize data from other companies to build their defect prediction models (a.k.a. cross-project defect prediction) (Jin, 2021; Wu et al., 2017).

Researchers working on software defect prediction aim to help developers improve software quality and testing efficiency by building machine learning models that predict defect-prone units in a timely manner (Lessmann et al., 2008). While such models provide benefits in some contexts, they also have the following drawbacks (Kamei et al., 2012): (1) developers still have to explore the defect-prone units to identify the defects; (2) a developer must be assigned to the defect identification task; (3) developers may have forgotten the details of the changes they made earlier. To overcome these drawbacks, researchers proposed predicting the code changes that may introduce defects, a.k.a. just-in-time defect prediction (JITDP) (Kamei et al., 2012; Kim et al., 2008; Mockus & Weiss, 2000). Change-level predictions are expected to help the developer who made a code change to identify defects at an early stage by reviewing only that change, while its details are still fresh (Kamei et al., 2012). Recently, many models have been developed to address JITDP, and some of them focused on deep learning (DL) algorithms (Yang et al., 2015; Zeng et al., 2021; Zhao et al., 2021a, b, c).

To build a defect prediction model, a dataset is required that includes data about the changes (e.g., lines of code added/modified and the number of modified files) and the associated defect labels (e.g., defective or non-defective). As in many available datasets (Mahmood et al., 2015; Wang et al., 2021), defect data for Android applications are imbalanced (Zhao et al., 2021c). In other words, the number of changes leading to defects is much smaller than the number of changes not leading to any defect. Imbalanced datasets make learning difficult (Wang & Yao, 2013) and decrease prediction performance (Mahmood et al., 2015). Sampling methods are the dominant approach to tackling imbalanced learning problems (He & Ma, 2013).

Since DL algorithms have recently achieved remarkable results in many application domains, software engineering researchers tend to apply them to all kinds of relevant problems (Giray, 2021), often without considering the scale of the available datasets and the applicability of the algorithm in the given context. In this research, our objective is to evaluate the performance of shallow learning algorithms (i.e., traditional machine learning algorithms) and the effect of sampling methods on defect prediction models, and to compare these results with the performance of DL algorithms. If acceptable performance can be achieved with traditional algorithms, there is no need to resort to DL algorithms, which require extra computing power, training time, and human effort to find the optimal model. Therefore, we evaluate three high-performance machine learning algorithms and eight sampling methods in this study. We report our findings on the comparison of the prediction performance of three machine learning algorithms (i.e., MLP, TabNet, and XGBoost) using eight sampling methods (i.e., ROS, RUS, SMOTE, SMOTEN, SVMSMOTE, SMOTET, BSMOTE, and ADASYN). Experiments were performed on 12 publicly available datasets built from Android apps (Catolino et al., 2019).

The contributions of this study are four-fold:

  • We demonstrated that DL algorithms using sampling methods perform significantly worse than a decision tree-based ensemble method (XGBoost).

  • We proposed a new XGBoost-based prediction model using the StandardScaler normalization and SVMSMOTE data balancing approaches and evaluated its performance on 12 publicly available Android datasets.

  • The XGBoost-based model is 116 times faster than the MLP method and achieves a 32% higher MCC (Matthews correlation coefficient) than the baseline method. This study shows that DL-based models are not always the answer for building highly accurate prediction models and that an XGBoost-based model can even provide better performance in terms of both computation time and prediction quality.

  • Our experiments are reproducible and improvable, as our code is publicly available at https://github.com/rvdinter/JIT-defect-prediction-Android-apps.

The rest of the paper is organized as follows: Section 2 introduces the related work. Section 3 explains the research methodology. Section 4 presents the results. Section 5 discusses our findings and reports the threats to validity. Section 6 concludes the paper.

2 Related work

Kamei et al. (2012) proposed predicting defects by analyzing changes instead of files or packages. They conducted a large-scale study involving six open-source and five commercial non-mobile applications to predict defects just-in-time.

Scandariato and Walden (2012) focused on developing a vulnerability model specific to Android applications. Their prediction model is based on object-oriented metrics, like the number of superclasses, depth of inheritance tree, cumulative Halstead bugs, and Halstead volume, to name a few. Kaur et al. obtained better prediction performance for mobile applications by using process metrics (i.e., number of lines added/deleted, number of developers, number of revisions) instead of code metrics (Kaur et al., 2015, 2016). Malhotra (2016) compared the prediction performance of 18 machine learning algorithms using object-oriented metrics as the feature set. Ricky et al. (2016) built a prediction model for the games developed for Android and Windows Phone using software metrics.

Catolino et al. (2019) analyzed the metrics useful for predicting defects just-in-time in mobile applications. They applied information gain for feature selection with a threshold of 0.1, as suggested by previous studies (Catolino et al., 2018; Quinlan, 1986). To cope with the imbalanced dataset problem, they applied the Synthetic Minority Over-sampling Technique (SMOTE) proposed by Chawla et al. (2002). They identified the following six metrics whose information gain exceeded 0.1: number of unique changes to modified files (nuc), number of lines added (la), number of lines deleted (ld), number of modified files (nf), number of modified directories (nd), and number of developers working on the file (ndev). Zhao et al. (2021b) proposed an imbalanced DL (IDL) methodology for JIT defect prediction in Android applications by applying a cost-sensitive cross-entropy loss function to a deep neural network. They compared their model against sampling-based imbalanced learning methods (ROS, RUS, SMOTE, SMOTEN, SVMSMOTE, SMOTET, BSMOTE, and ADASYN) through experiments on a benchmark dataset of 12 Android applications and found that their model outperformed the other imbalanced learning methods in terms of the Matthews correlation coefficient.
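For illustration, the kind of information-gain-based filtering described above could be sketched as follows; we use scikit-learn's mutual_info_classif as a stand-in for the information-gain measure reported by Catolino et al. (2019), and the DataFrame and the "buggy" label column name are hypothetical placeholders.

```python
# Illustrative sketch only: mutual_info_classif stands in for information gain;
# the DataFrame `df` and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

FEATURES = ["nuc", "la", "ld", "nf", "nd", "ndev"]  # change-level metrics
THRESHOLD = 0.1                                     # cut-off used by Catolino et al.

def select_informative_features(df: pd.DataFrame, label_col: str = "buggy"):
    """Return the features whose score exceeds THRESHOLD, plus all scores."""
    X, y = df[FEATURES], df[label_col]
    scores = mutual_info_classif(X, y, random_state=0)
    kept = [f for f, s in zip(FEATURES, scores) if s > THRESHOLD]
    return kept, dict(zip(FEATURES, scores))
```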

Bennin et al. (2017) state the drawbacks of sampling approaches: (1) the possibility of generating erroneous or duplicated data instances, (2) the tendency to generate less diverse data points within the minority class. To overcome these drawbacks, they propose an over-sampling approach called MAHAKIL. Their approach outperformed four other over-sampling approaches (ROS, SMOTE, Borderline-SMOTE, and ADASYN) using five classification models (C4.5, NNET, KNN, RF, SVM) on 20 imbalanced datasets consisting of non-mobile applications. Tantithamthavorn et al. (2018) investigated the impact of four resampling methods (over-sampling, under-sampling, SMOTE, and ROSE) by building defect prediction models based on seven classification techniques (random forest, logistic regression, Naive Bayes, neural network (AVNNet), C5.0 Boosting (C5.0), extreme gradient boosting (xGBTree), and gradient boosting method) and 101 publicly available datasets. Bennin et al. (2019) conducted an experiment on six sampling methods (SMOTE, Borderline-SMOTE, Safe-level SMOTE, ADASYN, random over-sampling, random under-sampling), five prediction models (KNN, SVM, C4.5, RF, and NNET), and 20 open-source projects.

Researchers have been utilizing DL for defect prediction increasingly since 2019 (Giray et al., 2023). Two recent surveys report the increasing use of DL in JITDP (Zhao et al., 2023) and specifically for mobile apps (Jorayeva et al., 2022a). Jorayeva et al. (2022b) investigated the performance of DL and data balancing approaches on cross-project defect prediction for mobile apps. Cheng et al. (2022) proposed a method for cross-project JITDP in the context of Android mobile apps. Huang et al. (2023) used multi-task learning and a deep neural network to alleviate the limited-labeled-data problem in JITDP for mobile apps.

In this study, we replicated the experiments conducted by Zhao et al. (2021b). Also, we investigated the performance of sampling methods when used with a base deep neural network (MLP), a neural network designed for tabular data (TabNet), and a traditional machine learning algorithm designed for tabular data (XGBoost) in JIT defect prediction for Android applications. We compare the results of our algorithms against the IDL methodology by Zhao et al. (2021b). To the best of our knowledge, this is the first study that evaluates the relative performance of shallow learning algorithms against DL algorithms in the case of JITDP.

3 Research methodology

In the following, we describe the steps of our method. First, we describe the adopted datasets necessary for JIT prediction. This is followed by the models used for JIT prediction. Since we are dealing with imbalanced datasets, we also elaborate on the sampling-based imbalanced learning methods. Finally, we describe the evaluation metric and the experimental procedure.

3.1 Dataset

We used 12 publicly available benchmark datasets from Android apps built by Catolino et al. (2019), who also provide an in-depth description of each of these apps. Table 1 provides an overview of the 12 Android apps used to evaluate the algorithms. For each app, the table shows the lines of code (#LOC), the total number of commit instances (#TC), the total number of defective instances (#DC), the total number of clean instances (#CC), and the ratio of defective instances (%DR). An instance is deemed defective when the commit introduces a defect; otherwise, it is deemed clean. The ratio of defective instances varies considerably, between 14 and 40%. The scale of the apps also varies, with lines of code ranging from 9506 to 275,637. The number of samples in a dataset equals the total number of commit instances, which means that for Turner there are only 164 samples to learn from. Catolino et al. (2019) analyzed several features that could be of interest for classifying defective commit instances and identified six features that can be categorized into three scopes: history, size, and diffusion (Zhao et al., 2021b).

Table 1 Metadata of the 12 Android apps
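For reproducibility, the per-app statistics in Table 1 can be recomputed from a change-level dataset with a short script such as the sketch below; the CSV file name and the "buggy" label column are hypothetical placeholders for the actual dataset files released by Catolino et al. (2019).

```python
# Minimal sketch, assuming one CSV of change-level metrics per app with a
# binary label column named "buggy"; file and column names are hypothetical.
import pandas as pd

def summarize_app(csv_path: str) -> dict:
    """Compute the Table 1 statistics (#TC, #DC, #CC, %DR) for one app."""
    df = pd.read_csv(csv_path)
    tc = len(df)                      # total commit instances (#TC)
    dc = int(df["buggy"].sum())       # defect-inducing instances (#DC)
    return {"#TC": tc, "#DC": dc, "#CC": tc - dc, "%DR": 100 * dc / tc}

# Example (hypothetical file): summarize_app("turner_commits.csv")
```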

Table 2 provides a brief description of the six features deemed most informative for JIT defect prediction for Android apps.

Table 2 Description of six features for JIT defect prediction

3.2 Models

In this study, we evaluate three models: a vanilla deep neural network, a neural network designed for tabular data, and a traditional machine learning algorithm designed for tabular data. We focus our models on tabular data, as the datasets by Catolino et al. (2019) are of tabular form.

3.2.1 Multilayer perceptron

We use a multilayer perceptron model based on the studies by Zhao et al. (2021b) and Xu et al. (2019). The multilayer perceptron (MLP) is a class of feedforward artificial neural networks (ANN). It consists of three types of network layers: (1) the input layer, (2) the hidden layer(s), and (3) the output layer. The first layer is the input layer, which has as many units as there are features in the dataset. The last layer is the output layer, which produces the requested result. In our case, as shown in Fig. 1, the input layer is of size 6 and the output layer is of size 1, as the model classifies, using a sigmoid activation function, whether the commit does (output 1) or does not (output 0) contain a defect. An MLP can have multiple hidden layers, and the number of units per hidden layer can vary; when more than one hidden layer is used, the model is nowadays called a deep neural network (DNN). Our MLP uses 2 hidden layers with 10 units each and the ReLU activation function. For the hyperparameters, we apply the RMSProp optimization algorithm with a batch size of 16 and 10,000 iterations. We use a learning rate of 0.001 without weight decay to keep the model as simple as possible. We did not apply early stopping for regularization.

Fig. 1
figure 1

The multilayer perceptron model architecture adapted from Zhao et al. (2021b) and Xu et al. (2019)
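For concreteness, a minimal PyTorch sketch of this architecture is given below; the training loop is only outlined in a comment, and data loading is assumed to be handled elsewhere.

```python
# Minimal PyTorch sketch of the MLP described above: 6-10-10-1 with ReLU hidden
# layers, a sigmoid output, and RMSProp with lr=0.001 and no weight decay.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_features: int = 6, n_hidden: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, weight_decay=0.0)
criterion = nn.BCELoss()
# Training (schematic): for each of the 10,000 iterations, draw a mini-batch of
# 16 samples (xb, yb), then:
#   optimizer.zero_grad(); loss = criterion(model(xb), yb)
#   loss.backward(); optimizer.step()
```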

3.2.2 TabNet

A major downside of MLP and other DL models is that their predictions cannot easily be explained. TabNet attempts to address this limitation (Arık & Pfister, 2020). This DL algorithm aims to combine the best of both worlds: the performance of DL algorithms with the explainability of decision tree-based classifiers. Developed by Google Cloud AI, it has already been rolled out widely on Google Cloud Platform (Arık & Le, 2020). As Arık and Pfister (2020) describe: “TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features” (Arık & Pfister, 2020).

We optimized TabNet’s hyperparameters using sklearn’s RandomizedSearchCV on the Reddit dataset, as the TabNet model has many hyperparameters, which makes GridSearchCV very resource-intensive. In this search, we mostly kept the default hyperparameters, while applying a clip value of 1 and a learning rate of 2e-3. Additionally, we trained for a maximum of 150 epochs with a patience of 20 epochs (training stops after 20 epochs without improvement in the loss), which required a validation set consisting of a 10% partition of the training set.
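The following sketch shows how this configuration could be expressed with the DreamQuark pytorch-tabnet API; the 90%/10% train/validation split and the random seeds are illustrative assumptions rather than values taken from our released code.

```python
# Sketch of the TabNet configuration described above (clip value 1, learning
# rate 2e-3, max 150 epochs, patience 20, 10% validation split). The split and
# the seeds are illustrative; other hyperparameters keep library defaults.
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.model_selection import train_test_split

def fit_tabnet(X_train, y_train):
    """X_train: 2-D numpy array of features, y_train: 1-D array of labels."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.10, stratify=y_train, random_state=0)
    clf = TabNetClassifier(
        clip_value=1,                        # gradient clipping, as above
        optimizer_fn=torch.optim.Adam,       # library default optimizer
        optimizer_params=dict(lr=2e-3),      # learning rate of 2e-3
    )
    clf.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],           # validation set for early stopping
        max_epochs=150,                      # at most 150 epochs
        patience=20,                         # stop after 20 epochs w/o improvement
    )
    return clf
```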

3.2.3 XGBoost

To address class imbalance, one can use sampling methods, ensemble-based algorithms, or cost-based methods (Zhao et al., 2021b). XGBoost, or eXtreme Gradient Boosting, is a decision tree-based ensemble method developed by Chen et al. (2015). It belongs to the family of boosting methods, such as AdaBoost, that have often been used to compare classifiers in this domain (Catolino et al., 2019; Zhao et al., 2021a, b, c). A gradient boosting machine uses a loss function, many weak learners, and an additive model that adds weak learners to minimize the loss (Brownlee, 2019). XGBoost’s advantage over other gradient boosting machines is its speed and performance. As Brownlee (2019) notes, XGBoost has been the go-to model for tabular machine learning challenges on Kaggle for years.

We used GridSearchCV to find XGBoost’s optimal hyperparameters on the Reddit dataset, searching over the maximum depth, the number of estimators, and the learning rate. This resulted in a maximum depth of 3, 100 estimators, and a learning rate of 0.1.
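A sketch of such a grid search is shown below; the grid values other than the reported optimum (max_depth = 3, n_estimators = 100, learning_rate = 0.1), the cv = 5 setting, and the use of MCC as the scoring function are illustrative assumptions.

```python
# Illustrative grid search over the three XGBoost hyperparameters named above.
# Grid values, cv=5, and the MCC scorer are assumptions; only the reported
# optimum (3, 100, 0.1) comes from the text.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.3],
}

def tune_xgboost(X_train, y_train):
    search = GridSearchCV(
        XGBClassifier(),
        param_grid,
        scoring="matthews_corrcoef",   # MCC, the metric used throughout this paper
        cv=5,
    )
    search.fit(X_train, y_train)
    return search.best_params_, search.best_estimator_
```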

3.3 Sampling-based imbalanced learning methods

In general, imbalanced learning methods are based on sampling, ensembles, or cost functions. We evaluate the models described in the previous section with eight sampling methods to analyze whether the models gain significant performance beyond what hyperparameter tuning provides. Table 3 lists the eight imbalanced learning methods we evaluated. RUS is the only under-sampling method and SMOTET combines over- and under-sampling, while all other methods are over-sampling methods.

Table 3 Imbalanced learning methods
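As a concrete reference, the sketch below maps the abbreviations in Table 3 to classes of the imbalanced-learn library; the mappings of SMOTEN and SMOTET to imblearn's SMOTEN and SMOTETomek classes are our reading of the abbreviations and should be treated as assumptions.

```python
# Sketch mapping the Table 3 abbreviations to imbalanced-learn samplers.
# SMOTEN -> imblearn.over_sampling.SMOTEN and SMOTET -> SMOTETomek are assumed
# mappings; all samplers share the same fit_resample interface.
from imblearn.over_sampling import (
    RandomOverSampler, SMOTE, SMOTEN, SVMSMOTE, BorderlineSMOTE, ADASYN)
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

SAMPLERS = {
    "ROS": RandomOverSampler(random_state=0),
    "RUS": RandomUnderSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "SMOTEN": SMOTEN(random_state=0),        # assumed mapping
    "SVMSMOTE": SVMSMOTE(random_state=0),
    "SMOTET": SMOTETomek(random_state=0),    # assumed: SMOTE + Tomek-links cleaning
    "BSMOTE": BorderlineSMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
}

# Usage: X_res, y_res = SAMPLERS["SVMSMOTE"].fit_resample(X_train, y_train)
```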

3.4 Metric

In JITDP research, the AUC, F-measure, and MCC metrics are often reported to summarize the results of a study. However, Ng (2017) notes that reporting multiple metrics makes it harder to tell which metric matters most. Therefore, he recommends using a single, all-encompassing metric. Previous studies have shown that MCC is the most appropriate metric for JIT defect prediction (Song et al., 2018; Yao & Shepperd, 2020).

The MCC, or Matthews correlation coefficient, is a metric used for binary and multiclass classification. It is designed for imbalanced datasets, such as those encountered in software defect prediction. MCC is derived from the Pearson correlation coefficient and takes into account all four terms of the confusion matrix. Its formula is expressed as

$$\mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}$$
(1)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The MCC is a correlation coefficient with a value between −1 and +1; this statistic is also known as the phi coefficient. A coefficient of +1 indicates a perfect prediction, 0 a prediction no better than random, and −1 a completely inverse prediction.
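A minimal sketch of computing the MCC, both directly from Eq. (1) and via scikit-learn's built-in implementation, is shown below; the toy labels are illustrative.

```python
# MCC computed from the confusion-matrix terms of Eq. (1) and, equivalently,
# with scikit-learn; the toy labels are illustrative only.
from math import sqrt
from sklearn.metrics import matthews_corrcoef

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Both calls give 0.5 here (TP=3, TN=3, FP=1, FN=1).
print(matthews_corrcoef(y_true, y_pred), mcc(tp=3, tn=3, fp=1, fn=1))
```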

3.5 Experiments

As in the experiments by Zhao et al. (2021b), we evaluate a set of model and sampler configurations. To do so, we run nested iterations over models, samplers, and cross-validation folds. Figure 2 shows a detailed visualization of the procedure.

Fig. 2
figure 2

Visualization of the experiment

First, we iterate over the models to be evaluated: MLP, TabNet, and XGBoost. Then, we iterate over the eight sampling methods. For each of these model-sampler configurations, we perform twofold cross-validation, which we repeat 25 times (i.e., N×M cross-validation with N = 2 and M = 25). Once the repeated cross-validation is finished, we move on to the next sampler until all samplers have been evaluated with the current model; then, we move to the next model and repeat the process.
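The following sketch summarizes this procedure for a single dataset, assuming numpy arrays X and y, an sklearn-compatible model, and a sampler such as those listed in Table 3; the scaling step follows the StandardScaler normalization mentioned in our pipeline, and a thin wrapper would be needed to make the PyTorch MLP sklearn-compatible.

```python
# Schematic of the nested experiment loop for one dataset: 2-fold CV repeated
# 25 times for a given model-sampler pair. X and y are numpy arrays; `model`
# must be clone-able (true for XGBClassifier and TabNetClassifier, while the
# PyTorch MLP needs a thin sklearn-style wrapper).
import numpy as np
from sklearn.base import clone
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler

def run_experiment(model, sampler, X, y, n_splits=2, n_repeats=25, seed=0):
    scores = []
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        scaler = StandardScaler().fit(X[train_idx])
        X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
        # resample the training folds only, never the test fold
        X_res, y_res = sampler.fit_resample(X_tr, y[train_idx])
        clf = clone(model)
        clf.fit(X_res, y_res)
        scores.append(matthews_corrcoef(y[test_idx], clf.predict(X_te)))
    return float(np.mean(scores)), float(np.std(scores))
```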

4 Results

In this section, we present the results of our models and each of the sampling-based learning methods. We include the results from Zhao et al. (2021b) in Table 4 as a baseline for comparison. The results show the mean and the corresponding standard deviation of each sampling-based method on the twelve datasets. The last row shows the average of each sampling-based method across all datasets. The highest mean values in Table 4 are shown in bold. Table 4 shows that the IDL method achieved the highest score on 8 out of 12 datasets, as well as the highest score on average.

Table 4 Baseline results, adapted from Zhao et al. (2021b)

Tables 5, 6 and 7 show the results of our study. As in Table 4, the highest means are in bold; additionally, an asterisk marks cases where the highest mean value exceeds the highest mean value of the baseline in Table 4. Table 5 shows the MCC results for the MLP model. The MLP outperforms the IDL method on 7 out of 12 datasets. Even though our MLP is based on the MLP from Zhao et al. (2021b), hyperparameter optimization resulted in large performance gains. Additionally, the average result of the MLP model with the SVMSMOTE sampling-based method was higher than that of the baseline IDL method. Table 6 shows the average MCC results for the TabNet model. As its hyperparameters were optimized on the Reddit dataset, the TabNet model achieved its highest performance on that dataset. Unfortunately, the hyperparameter settings do not transfer across datasets, as the average MCC results for the other datasets and sampling-based methods do not outperform the baseline. Table 7 shows the results for the XGBoost model with the eight sampling-based methods. The MCC results for this model are higher on average than those of the baseline and the DL models. Additionally, SVMSMOTE outperformed the baseline’s highest average values on 5 out of 12 datasets, and SMOTET achieved the highest average MCC of all methods.

Table 5 Average MCC results of the MLP model and sampling-based methods
Table 6 Average MCC results of the TabNet model and sampling-based methods
Table 7 Average MCC results of the XGBoost model and sampling-based methods

We performed a Scott-Knott ESD (SKESD) test to statistically verify which sampling method performs best per model. Figures 3, 4 and 5 show the SKESD results for the MLP, TabNet, and XGBoost models, respectively. For MLP and XGBoost, SVMSMOTE is ranked highest overall, while it is ranked third for the TabNet model. For TabNet, SMOTET is ranked as the overall best sampling method; SMOTET is ranked second for the XGBoost model and sixth for the MLP model. Overall, SVMSMOTE is ranked as the best sampling-based method across the models. The SVMSMOTE method for XGBoost improves the MCC results by 32% over the IDL baseline method. Additionally, we performed an SKESD test for the machine learning methods per dataset over all sampling methods. In Fig. 6, MLP and TabNet are ranked closely together, while XGBoost is ranked first for eleven out of twelve datasets.

Fig. 3
figure 3

SKESD test for sampling-based methods for MLP

Fig. 4
figure 4

SKESD test for sampling-based methods for TabNet

Fig. 5
figure 5

SKESD test for sampling-based methods for XGBoost

Fig. 6
figure 6

SKESD test for XGBoost vs. DL models

Lastly, we share our observations of the time consumption of the MLP, TabNet, and XGBoost algorithms. The TabNet model utilized the GPU, as it has been optimized for GPU processing. The MLP and XGBoost utilized a CPU. The experiments have been run on the Kaggle Kernels free cloud computing service. Figure 7 shows the time consumption of each of the machine learning algorithms for the full experiment. We see that the MLP took 1860 min (31 h) to complete the experiment. The TabNet algorithm took 414 min (6.9 h), while XGBoost took just 16 min to complete the experiment. This means that XGBoost completed the experiment over 116 times faster than the MLP algorithm and over 25 times faster than the TabNet algorithm.

Fig. 7
figure 7

The total time consumption of each machine learning algorithm for the full experiment

5 Discussion

This study presents a comparison between a baseline study, a slightly adapted deep neural network, a state-of-the-art DL method for tabular data, and a decision tree-based ensemble method. Our results show that the XGBoost decision tree-based ensemble method is the fastest and statistically highest-ranked method. XGBoost in combination with the SVMSMOTE over-sampling method improves the MCC results over the IDL baseline method by 32%. Furthermore, our XGBoost method requires roughly 99% less computation time than the MLP method. If the time consumption of the baseline IDL method is assumed to be similar to that of the MLP, the time cost of the IDL method would also be cut by about 99%.

Our results also show that the differences between the various sampling methods are small compared to the differences between the machine learning algorithms. Nevertheless, SVMSMOTE and SMOTET proved to be good sampling methods for increasing the performance of a given machine learning algorithm.

This study is subject to threats to validity that can be classified as construct, internal, and external.

Construct validity

As Ng (2017) described, using multiple evaluation metrics (e.g., precision and recall) makes it challenging to compare algorithms, whereas a single-number evaluation metric allows all models to be ranked by their performance. An all-encompassing single-number evaluation metric for the JIT defect prediction domain is the Matthews correlation coefficient, which is appropriate for imbalanced datasets. Additionally, to verify the statistical validity of our results, we apply a state-of-the-art statistical test, the Scott-Knott effect size difference test, which was designed for defect prediction research and is used here for significant-difference analysis.

Internal validity

In our work, we carefully implemented the sampling methods and machine learning algorithms using the DMLC XGBoost, DreamQuark TabNet, and PyTorch libraries. The optimal parameters were obtained using RandomizedSearchCV for the TabNet algorithm and GridSearchCV for XGBoost. The MLP was based on previous studies (Zhao et al., 2021b, c); however, better settings could potentially be found through hyperparameter optimization. For the comparative sampling methods, we used third-party library implementations with their default parameters.

External validity

The datasets we applied our methods to are publicly available and comprise 12 Android apps developed in the Java programming language. As described by Zhao et al. (2021b), we have not evaluated whether the methods are suitable for Android apps developed in other languages (e.g., Kotlin) or for iOS apps. Additionally, further studies must investigate whether our methods can be applied to other domains of JIT defect prediction. Our study also assumes that models are trained on imbalanced datasets; if balanced datasets are used, other metrics, such as precision and recall, might be preferred. Furthermore, the TabNet model might perform significantly better with a balanced dataset. Additionally, as more data is collected and dataset sizes grow, the use of DL methods may become increasingly relevant.

6 Conclusion and future work

In this study, we argue that deep neural networks are not always the optimal solution for tabular dataset challenges. To test this hypothesis, we compared a baseline method, an MLP, TabNet, and XGBoost. We also tested whether using eight different sampling-based methods creates significant improvements. To evaluate the effectiveness of our methodologies, we conducted experiments on 12 Android apps and used the Matthews correlation coefficient (MCC) together with a Scott-Knott effect size difference statistical test.

Our results show that DL algorithms leveraging sampling methods perform significantly worse than a decision tree-based ensemble method. The XGBoost method is 116 times faster than the MLP method and has a 32% higher MCC than the baseline method. Our XGBoost pipeline takes the six input features, normalizes the features based on the training data using the StandardScaler method, and over-samples the training data using the SVMSMOTE algorithm to overcome the class imbalance challenges.
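A minimal sketch of this pipeline, using imbalanced-learn's Pipeline so that over-sampling is applied to the training folds only, is given below; the XGBoost hyperparameters shown follow our reading of Sect. 3.2.3.

```python
# Minimal sketch of the final pipeline: StandardScaler -> SVMSMOTE -> XGBoost.
# The hyperparameters follow our reading of Sect. 3.2.3 (max_depth=3,
# n_estimators=100, learning_rate=0.1); the random seed is illustrative.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SVMSMOTE
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                 # normalize on training data
    ("oversample", SVMSMOTE(random_state=0)),    # applied only when fitting
    ("clf", XGBClassifier(max_depth=3, n_estimators=100, learning_rate=0.1)),
])

# Usage: pipeline.fit(X_train, y_train); y_pred = pipeline.predict(X_test)
```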

In our future work, we plan to adapt our methodology to different data sizes and different software applications. Recently, many deep learning algorithms have been developed (e.g., the transformer architecture), and these algorithms can be combined in different ways to build numerous types of deep learning models. We plan to evaluate both the effectiveness and the efficiency of such hybrid models for this problem. In addition, we will focus on interpretable machine learning models, which are crucial for understanding the decision process of the models. Many feature engineering techniques are available in machine learning, and some of them will also be evaluated in future work. Building deep learning models is time-consuming and requires considerable human effort; therefore, we will also focus on neural architecture search (NAS) and AutoML to minimize the effort of building deep learning models. While shallow learning looks promising in this research, the case might be different for other datasets; therefore, several research dimensions are planned for the future.