In this section, we first evaluate the unified bug dataset’s basic properties, such as the number of source code elements and the number of bugs, to gain a rough overview of its contents. Next, we present the metadata of the dataset, such as the code analyzer used and the calculated metric set. Finally, we present an experiment in which we used the unified bug dataset for its main purpose, namely bug prediction. Our aim was not to create the best possible bug prediction model, but to show that the dataset is indeed a usable source of learning data for researchers who want to experiment with bug prediction models.
Datasets and Bug Distribution
Table 15 shows the basic properties of each dataset. The SCE column presents the number of source code elements, which, depending on the granularity, means the number of classes or files in the system. The systems in the datasets vary widely in size, from 2,636 Logical Lines of Code (LLOC)Footnote 8 up to 1,594,471.
SCEwBug denotes the number of source code elements that have at least one bug; SCEwBug% is the percentage of source code elements with bugs in the dataset. SCEwBug%, as the percentage of buggy classes or files, describes how well-balanced the datasets are. Since it is difficult to get an overview from the raw numbers, Figs. 1 and 2 show the distribution of the percentages of faulty source code elements (SCEwBug%) for classes and files, respectively. The percentages are shown horizontally; for example, the first column gives the number of systems (parts of a dataset) in which between 0 and 10 percent of the source code elements are buggy (0 included, 10 excluded). For the systems that provide bug information at class level, this number is 11 (5 at file level). We can see that there are systems where the percentage of buggy classes is very high; for example, 98.79% of the classes are buggy in Xalan 2.7 and 92.20% in Log4J 1.2. Although the upper limit of SCEwBug% is 100%, these values are extremely high, and it is hard to believe that a release version of a system can contain so many bugs. At the other end, some systems hardly contain any bugs; for example, in MCT and Neo4j, less than 1% of the classes are buggy. From the project’s point of view this is very good, but such systems are probably less usable for building bug prediction models. This phenomenon further strengthens the motivation for a common unified bug dataset, which can smooth out these extreme outliers. Although there are many systems with extremely high or low SCEwBug% values, we used them “as is” in this research, because validating them was not the aim of this work.
Metadata of the Datasets
Table 16 lists some properties of the datasets that describe the circumstances of their creation rather than their data content. Our focus is on how the datasets were created and how reliable the applied tools and methods were. Since most of the information in the table was already described in previous sections (Analyzer, Granularity, Metrics, and Release), in this section we describe only the Bug information row.
The Bug Prediction Dataset used the commit logs of SVN and the modification time of each file in CVS to collect co-changed files, authors, and comments. Then, D’Ambros et al. (2010) linked the files with bugs from Bugzilla and Jira using the bug id from the commit messages. Finally, they verified the consistency of timestamps. They filtered out inner classes and test classes.
The PROMISE dataset used Buginfo to collect whether an SVN or CVS commit is a bugfix or not. Buginfo uses regular expressions to detect commit messages which contain bug information.
The bug information of the Eclipse Bug Dataset was extracted from the CVS repository and Bugzilla. In the first step, they identified the corrections or fixes in the version history by looking for patterns which are possible references to bug entries in Bugzilla. In the second step, they mapped the bug reports to versions using the version field of the bug report. Since the version of a bug report can change during the life cycle of a bug, they used the first version.
The Bugcatchers Bug Dataset followed the methodology of Zimmermann et al. (Eclipse Bug Dataset). They developed an Ant script using the SVN and CVS plugins of Ant to checkout the source code and associate each fault with a file.
The authors of the GitHub bug dataset gathered the relevant versions to be analyzed from GitHub. Since GitHub can handle references between commits and issues, it was quite handy to use this information to match commits with bugs. They collected the number of bugs located in each file/class for the selected release versions (about 6-month-long time intervals).
We evaluated the strength of bug prediction models built with the Weka (Hall et al. 2009) machine learning software. For each subject software system in the Unified Bug Dataset, we created 3 versions of ARFF files (the default input format of Weka) for the experiments, containing only the original metrics, only the OSA metrics, and both sets of metrics as predictors. In these files, we transformed the original bug occurrence values into two classes as follows: 0 bugs → non buggy class, at least 1 bug occurrence → buggy class. Using these ARFF files, we could run several tests on the effectiveness of fault prediction models built on the dataset.
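The label transformation and ARFF generation described above can be sketched as follows (a minimal illustration; the helper function and the metric names are hypothetical, not the scripts used in the study):

```python
def to_arff(rows, metric_names, relation="bug-dataset"):
    """Emit a Weka ARFF string with bug counts binarized into two classes."""
    lines = [f"@RELATION {relation}", ""]
    for name in metric_names:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    lines.append("@ATTRIBUTE bug {non_buggy,buggy}")  # the class attribute
    lines.extend(["", "@DATA"])
    for metrics, bug_count in rows:
        # 0 bugs -> non buggy class, at least 1 bug occurrence -> buggy class
        label = "buggy" if bug_count >= 1 else "non_buggy"
        lines.append(",".join(str(v) for v in metrics) + "," + label)
    return "\n".join(lines)


# Two illustrative source code elements with (metrics, bug count) pairs.
arff = to_arff([((10, 2), 0), ((55, 7), 3)], ["LLOC", "WMC"])
```

A file generated this way can be loaded directly into the Weka Explorer or passed to its command-line classifiers.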
Within-project Bug Prediction
As described in Section 4, we extended the original datasets with the source code metrics of the OSA tool, creating a unified bug dataset. We compared the bug prediction capabilities of the original metrics, the OSA metrics, and the extended metric suite (both metric suites together). First, we handled each system individually, so we trained and tested on the same system’s data using ten-fold cross-validation. To build the bug prediction models, we used the J48 (C4.5 decision tree) algorithm with default parameters. We used only J48 because we did not focus on finding the best machine learning method; rather, we wanted to compare the different predictors’ capabilities with this one algorithm. We chose J48 because it has shown its power on the GitHub Bug Dataset (Tóth et al. 2016), and because it is relatively easy to interpret the resulting decision trees and to identify the subset of metrics that are the most important predictors of bugs. Different machine learning algorithms (e.g., neural networks) might provide different results. It is also important to note that we did not use any sampling technique to balance the datasets; we used them ‘as is’. We will outline some connections between the machine learning results and the characteristics of the datasets (e.g., SCEwBug%).
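The within-project setup can be sketched as below. Since Weka’s J48 is a Java implementation of C4.5, we substitute scikit-learn’s CART decision tree here as an approximation; the two algorithms differ in their split criteria and pruning, so scores will not match the reported ones exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate


def within_project_scores(X, y, seed=0):
    """Ten-fold cross-validation on a single system's metric matrix X
    and binarized bug labels y; returns (mean F-measure, mean AUC).

    CART stands in for Weka's J48 (C4.5) here -- an approximation.
    """
    clf = DecisionTreeClassifier(random_state=seed)  # default parameters
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    res = cross_validate(clf, X, y, cv=cv,
                         scoring=("f1_weighted", "roc_auc"))
    return res["test_f1_weighted"].mean(), res["test_roc_auc"].mean()
```

Calling this once per system, once per predictor set (original, OSA, merged), reproduces the structure of Tables 17 and 18.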
The F-measureFootnote 9 results can be seen in Table 17 for classes and in Table 18 for files. We also included the SCEwBug% characteristic of each dataset to gain a more detailed view of the results. The tables contain two average values because the GitHub Bug Dataset used SourceMeter, which is based on OSA, to calculate the metrics; consequently, the results of the OSA and the Merged metrics would be the same. Since this could distort the averages, we decided to detach this dataset from the others when calculating them.
The class level results need some deeper explanation for clear understanding. There are a few missing values because some systems had fewer than 10 data entries, which is not enough to perform ten-fold cross-validation (Ckjm and Forrest 0.6).
There are some outstanding numbers in the tables. Considering Xalan 2.7, for example, we can see F-measure values of 0.992 and above. This is a consequence of the extremely high proportion of buggy classes in that dataset (98.79%). The reader could argue that resampling the dataset would help to overcome this deficiency; however, the corresponding AUC valueFootnote 10 for the Original dataset is not outstanding (only 0.654). Moreover, later in this section we investigate the bug prediction capabilities of the datasets in a cross-project manner, and it turns out that these extreme outliers generally performed poorly in cross-project learning. Let us consider Forrest 0.8 as an example where the SCEwBug% is low (6.25) but the F-measure values are still high (0.891-0.907). In this case, the high values come from the fact that there are 32 classes of which only 2 are buggy, so the decision tree tends to mark all classes as non buggy. On the other hand, the AUC values range from very poor (0.100) to very high (0.900), which suggests that for small systems with a small number of bugs, it is difficult to build “reliable” models.
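The Forrest 0.8 case is easy to reproduce with a short calculation: a trivial model that marks all 32 classes as non buggy already reaches a weighted F-measure at the top of the reported range, while constant scores give an uninformative AUC of 0.5 (a sketch with scikit-learn; the study itself used Weka):

```python
from sklearn.metrics import f1_score, roc_auc_score

# Forrest 0.8: 32 classes, only 2 of them buggy (SCEwBug% = 6.25).
y_true = [0] * 30 + [1] * 2
# A trivial classifier predicts "non buggy" for every class.
y_pred = [0] * 32

# Weighted F-measure is dominated by the large non-buggy class.
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(round(f1, 3))  # 0.907 -- at the top of the 0.891-0.907 range

# Constant scores carry no ranking information, so AUC falls to chance level.
auc = roc_auc_score(y_true, [0.0] * 32)
print(auc)  # 0.5
```

This illustrates why F-measure alone can be misleading on heavily imbalanced systems, and why the AUC values are reported alongside it.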
The averages of F-measure and AUC change only slightly for the class level datasets, and the average of the Merged metrics is only slightly better than that of the Original or the OSA metrics. Considering F-measure, the GitHub Bug Dataset performed about 10% better in general, while its average AUC decreased by only 1%, which is negligible. One possible explanation is that the PROMISE dataset includes all the smaller projects, while the GitHub Bug Dataset and the Bug Prediction Dataset contain mostly larger projects.
The small differences in case of the GitHub Bug Dataset come from differences between the versions of the static analyzers: the SourceMeter version used to create the GitHub Bug Dataset is older than the OSA version used here, in which some metric calculation enhancements took place. (SourceMeter is based on OSA.)
The file level results are quite similar to the class level results in terms of F-measure and AUC. The main difference is that the F-measure averages of OSA for the Bugcatchers and Eclipse bug datasets are notably smaller than the other values. The reason might be that OSA provides only a few file level metrics, and a possible contradiction in the metrics can decrease the training capabilities (we saw that even the LLOC values are very different). On the other hand, the corresponding AUC values are close to the others, so a deeper investigation is required to find the causes. Besides, the average F-measure of the Merged model is slightly worse, while its average AUC is slightly better than that of the Original, which means that in this case the OSA metrics were not able to improve the bug prediction capability of the model. The small difference between the Original and the OSA results for the GitHub Bug Dataset comes from the slightly different metric suites, since OSA calculates the Public Documented API (PDA) and Public Undocumented API (PUA) metrics as well. Furthermore, the aforementioned static analyzer version differences also contributed to this small change in the average F-measure and AUC values.
Merged Dataset Bug Prediction
So far, we have compared the bug prediction capabilities of the old and new metric suites on the systems separately, but we have not yet exploited the main advantage of the unified dataset, namely, that there are metrics calculated in the same way for all systems (the OSA metric suite). Seeing the results of the within-project bug prediction, we can state that creating a larger dataset, which includes projects varying in size and domain, could lead to a more general dataset with increased usability and reliability. Therefore, we merged all classes (and files) into one large dataset of 47,618 elements (43,744 for files) and, using ten-fold cross-validation, evaluated the bug prediction model built by J48 on this large dataset as well. The F-measure is 0.818 for classes and 0.755 for files, while the AUC is 0.721 for classes and 0.699 for files. Although the F-measure values are a little worse than, for example, the average of the GitHub results (0.892 for classes and 0.820 for files), the AUC values are better than the best averages (0.656 for classes and 0.676 for files), even though they were measured on a much larger and very heterogeneous dataset.
After training the models at class and file level, we dug deeper to see which predictors are the most dominant. Weka produces pruned trees as results at both class and file level. A pruned tree is obtained by removing nodes and branches that do not affect the performance too much; the main goal of pruning is to reduce the risk of overfitting. We constructed not only a Unified Bug Dataset for all systems together with the unified set of metrics, but also one for each collected dataset (one for the Bug Prediction Dataset, one for the PROMISE dataset, etc.), in which we could use a wider set of metrics (including the original ones).
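Extracting the dominant predictors from a pruned tree can be sketched as follows (a scikit-learn approximation: cost-complexity pruning stands in for J48’s pruning, the alpha value is an illustrative choice, and impurity-based importances stand in for reading the top split variables off the tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def dominant_metrics(X, y, metric_names, top_k=5):
    """Rank metrics by their impurity-based importance in a pruned tree.

    Metrics chosen near the root of the tree reduce impurity the most and
    therefore receive the highest importance scores.
    """
    tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X, y)
    order = np.argsort(tree.feature_importances_)[::-1]
    return [metric_names[i] for i in order[:top_k]]
```

Running this per dataset yields a ranking comparable in spirit to the per-dataset dominant metrics of Table 19.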
Table 19 presents the most dominant metrics for the datasets, extracted from the constructed decision trees. We marked in boldface the metrics that occur in multiple datasets. At class level, WMC (Weighted Methods per Class) and TNOS (Total Number of Statements) are the most important ones; however, they happened not to be the most dominant ones in the Unified Bug Dataset.
Regarding the Unified Bug Dataset, we found CLOC and TCLOC (comment lines of code metrics), DIT (Depth of Inheritance Tree), CBO (Coupling Between Objects), and NOI (Number of Outgoing Invocations) to be the most dominant metrics at class level. In other words, these metrics provide the largest information gain. It is reasonable that this set of metrics is the most dominant, since it is easier to modify and extend classes that are better documented. Furthermore, coupling metrics such as CBO and NOI have already demonstrated their power in bug prediction (Gyimóthy et al. 2005b; Briand et al. 1999), and DIT is also an expressive metric for fault prediction (El Emam et al. 2001).
At file level, the most dominant predictors in the different datasets overlap with those dominant in the Unified Bug Dataset. LOC (Lines of Code) and McCC (McCabe’s Cyclomatic Complexity) were the uppermost variables to branch on in the decision tree. These two metrics are reasonable as well, since the larger a file is, the more it tends to be faulty, and complexity metrics are also important factors in fault prediction (Zimmermann et al. 2007).
The diversity of the most dominant metrics shows the diversity of the different datasets themselves.
Cross-project Bug Prediction
As a third experiment, we trained a model using only one system from a dataset and tested it on all systems in that dataset. This experiment could not have been done without the unification of the datasets, since a common metric suite is needed to perform such machine learning tasks. The result of the cross training is an NxN matrix whose rows and columns are the systems of the dataset, and the value in the ith row and jth column shows how well the prediction model trained on the ith system performed when evaluated on the jth one.
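The construction of the NxN cross-training matrix can be sketched as follows (scikit-learn’s CART again stands in for J48; the `systems` mapping and the weighted F-measure choice are illustrative assumptions):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score


def cross_project_matrix(systems):
    """Train on each system, test on every system in the dataset.

    `systems` maps a system name to its (X, y) pair of unified metrics and
    binarized bug labels; the result maps (train, test) pairs to F-measures,
    with the diagonal holding self-testing results.
    """
    matrix = {}
    for train_name, (X_tr, y_tr) in systems.items():
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        for test_name, (X_te, y_te) in systems.items():
            pred = model.predict(X_te)
            matrix[(train_name, test_name)] = f1_score(
                y_te, pred, average="weighted", zero_division=0)
    return matrix
```

Averaging each row of the resulting matrix over the off-diagonal entries gives the per-model averages reported in the tables below.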
We used the OSA metrics to perform this experiment, while the bug occurrences were derived from the original datasets and transformed into buggy and non buggy labels. The matrix for the whole Unified Bug Dataset would be too large to show here; thus, we present only submatrices (the full matrices for file and class level are available in the Appendix). A submatrix for the PROMISE dataset can be seen in Tables 20 and 21. Only the first version is presented for each project, except for the ones where another version has significantly different prediction capability (for example, Xalan 2.7 is much worse than 2.4, and Pbeans 2 is much better than 1). The values of the matrix are the F-measure and AUC values provided by the J48 algorithm. The Avg. F-m. and Avg. AUC columnsFootnote 11 present the averages of the F-measure and AUC values the given model achieved on the other systems. For a better overview, we repeat the SCEwBug% values here as well (see Table 15).
We can observe in Table 20 that models built on systems having many buggy classes (more than 70%, highlighted in bold in the table) performed very poorly in the cross validation if we consider the average F-measure (the values are under 0.4). Moreover, these models do not work well on other very buggy systems either; for example, the model trained on Xalan 2.7 (the buggiest system) achieved only a 0.078 F-measure on Log4J 1.2 (the second buggiest system). Besides, in case of Ckjm 1.8, Lucene 2.2, and Poi 1.5, the rate of buggy classes is between 50 and 70 percent, and their models are still weak: the average F-measure is only between 0.4 and 0.6. Finally, the remaining systems can be used to build models that achieve better results, namely higher F-measures, on other systems, even on the ones having many bugs. On the other hand, this trend cannot be observed in the AUC values (see Table 21), which range between 0.483 and 0.629, and there is no “white line”, i.e., no model whose performance is very poor on all other systems. However, there are 0.000 and 0.900 AUC values in the table, which suggests that the bug prediction capabilities heavily depend on the training and testing datasets. For example, models evaluated on Forrest 0.6 have extreme values, and it is probably only a matter of luck whether a model takes the appropriate metrics into account or not.
Table 22 shows the cross training F-measure values for the GitHub Bug Dataset. As is clearly visible, testing on Android Universal Image Loader is the weakest point in the matrix. However, the values are not critical; the lowest value is still 0.611. Based on F-measure, Elasticsearch performed slightly better than the others in the role of a training set. This might be because of the size of the system, the average amount of bugs, and the adequate number of entries in the dataset. On the other hand, the AUC values in the column of Android Universal Image Loader are not significantly worse than any other value (see Table 23); rather, the values in its row seem a little lower. From the AUC point of view, Mission Control T. is the most critical test system because it has both the lowest (0.181) and the highest (0.821) AUC values; the reason can be the same as for Forrest 0.6. In general, the AUC values are very diverse, ranging from 0.181 to 0.821, with an average of 0.548.
Tables 24 and 25 show the F-measure and AUC values of cross training for the Bug Prediction Dataset. Based on F-measure, Eclipse JDT Core surpassed the other systems as a training set, but its AUC values, or at least the first two, seem worse than the others. Equinox performed the worst in the role of a training set, i.e., it has the lowest F-measures, but from the AUC point of view, Equinox is the most difficult test set because 3 out of 4 of its AUC values are much worse than the average. Considering both training and testing, and both F-measure and AUC, Mylyn is the best system.
The above described results were calculated for class level datasets. Let us now consider the file level results. The three file level datasets are the GitHub, the Bugcatchers, and the Eclipse bug datasets.
The GitHub Bug Dataset results can be seen in Tables 26 and 27. As in the class level case, the Android Universal Image Loader project performed the worst in the role of the test set based on F-measure, and it is the worst system for building models (low AUC values). It would be difficult to select the best system, but the Mission Control T. system is again the most critical test system because it has both the lowest and the highest AUC values.
The cross-training results of the Bugcatchers Bug Dataset are listed in Tables 28 and 29. The table contains 3 high and 3 low F-measure values, ranging widely from 0.234 to 0.800, while the AUC values are much closer to each other (0.500 − 0.606). Since only three systems are involved and there is no significantly best or worst system, we cannot draw any conclusions from these results.
The Eclipse Bug Dataset also contains only three systems, but they are three different versions of the same system, namely Eclipse; therefore, we expected better results. As Tables 30 and 31 show, in this case the F-measure and AUC values coincide, meaning that either both are high or both are low. The model built on version 2.0 performed better on the other two systems (0.705 and 0.700 F-measures, 0.723 and 0.708 AUC values) than the models built on the other two systems and evaluated on version 2.0 (0.638 and 0.604 F-measures, 0.639 and 0.633 AUC values). This is perhaps caused by the fact that this version contains the fewest bug entries. The other two systems are “symmetrical”: the results are almost the same regardless of which one is the training and which one is the test system.
We also performed a full cross-system experiment involving all systems from all datasets. The resulting matrix is too large to present here, so it can be found in the Appendix; Tables 32 and 33 show only the averages of the F-measure and AUC values the models achieved on the other datasets. More precisely, we trained a model on each system separately, tested it on the other systems, and calculated the averages of the F-measure and AUC values to see how the models perform on other datasets. For example, the average F-measure of the models trained and tested on PROMISE is only 0.607, while if these models are validated on the GitHub dataset, the average is 0.729. Examining the F-measures, we can see that, in general, models trained on the PROMISE dataset perform the worst, and, surprisingly, they gave the worst result (0.607) on PROMISE itself. On the other hand, if we consider the AUC values, the models trained on PROMISE achieved the best result on PROMISE (0.554 vs 0.544). This means that in this case the AUC values do not help us select the best or worst predictor/test dataset or even compare the results. Another interesting observation is that the models perform better on the GitHub dataset no matter which dataset was used for training (GitHub column). Moreover, GitHub models achieve only slightly worse F-measures on the other datasets than the best models on the given dataset (0.678 vs 0.680 on PROMISE and 0.727 vs 0.742 on Bug pred.), which suggests that the good testing results are not a consequence of some unique feature of the dataset, because it can also be used to train “portable” bug prediction models.
Table 34 shows the average F-measures for the file level datasets. We can observe a similar trend, namely that all models performed best on the GitHub dataset. Besides, the results of Bugcatchers on itself are poor, and the models built on the other two datasets performed better on it. On the other hand, the AUC values do not support this observation (see Table 35). Compared with the class level, the AUC values span a wider range, from 0.525 to 0.684, and suggest that Eclipse is the best dataset for model building, an observation also supported by the F-measures.
To sum up the findings of the full cross-system experiment, it may seem that file level models performed slightly better than class level predictors, because the average F-measure is 0.717 and the average AUC value is 0.594 for files, while they are only 0.661 and 0.552 for classes. On the other hand, this might be a consequence of the fact that, as we saw, several systems of PROMISE contain too many bugs and therefore cannot be used alone to build sophisticated prediction models. Another remarkable result is that all models performed notably better on the GitHub dataset, which requires further investigation in the future.