A public unified bug dataset for Java and its assessment regarding metrics and bug prediction

Bug datasets have been created and used by many researchers to build and validate novel bug prediction models. In this work, our aim is to collect existing public source code metric-based bug datasets and unify their contents. Furthermore, we wish to assess the plethora of collected metrics and the capabilities of the unified bug dataset in bug prediction. We considered 5 public datasets, downloaded the corresponding source code for each system in the datasets, and performed source code analysis to obtain a common set of source code metrics. This way, we produced a unified bug dataset at both class and file level. We investigated the divergence of metric definitions and values across the different bug datasets. Finally, we used a decision tree algorithm to show the capabilities of the dataset in bug prediction. We found that there are statistically significant differences in the values of the original and the newly calculated metrics; furthermore, notations and definitions can differ severely. We compared the bug prediction capabilities of the original and the extended metric suites (within-project learning). Afterwards, we merged all classes (and files) into one large dataset which consists of 47,618 elements (43,744 for files), and we evaluated the bug prediction model built on this large dataset as well. Finally, we also investigated the cross-project capabilities of the bug prediction models and datasets. We made the unified dataset publicly available. By using a public unified dataset as an input for different bug prediction-related investigations, researchers can make their studies reproducible and thus open to validation and verification.

from mistakes committed in the past and build a prediction model to leverage the location and amount of future bugs. Many research papers were published on bug prediction, which introduced new approaches that aimed to achieve better precision values (Zimmermann et al. 2009; Xu et al. 2000; Hall et al. 2012; Weyuker et al. 2010). Unfortunately, a reported bug is rarely associated with the source code lines that caused it or with the corresponding source code elements (e.g., classes, methods). Therefore, to carry out such experiments, bugs have to be associated with source code, which in and of itself is a difficult task. It is necessary to use a version control system and a bug tracking system properly during the development process, and even in this case, it is still challenging to associate bugs with the problematic source code locations.
Although several algorithms were published on how to associate a reported bug with the relevant, corresponding defective source code (Dallmeier and Zimmermann 2007; Wong et al. 2012; Cellier et al. 2011), only a few such bug association experiments were carried out. Furthermore, not all of these studies published the bug dataset, or even if they did, closed-source systems were used, which limits the verifiability and reusability of the bug dataset. In spite of these facts, several bug datasets (containing information about open-source software systems) were published and made publicly available for further investigations or to replicate previous approaches (Weyuker et al. 2011; Robles 2010). These datasets are very popular; for instance, the NASA and the Eclipse Bug Dataset were used in numerous experiments (Jureczko and Madeyski 2010; Shepperd et al. 2013).
The main advantage of these bug datasets is that if someone wants to create a new bug prediction model or validate an existing one, it is enough to use a previously created bug dataset instead of building a new one, which would be very resource consuming. It is common in these bug datasets that all of them store some specific information about the bugs, such as the containing source code element(s) with their source code metrics or any additional bug-related information. Since different bug prediction approaches use various sources of information as predictors (independent variables), different bug datasets were constructed. Defect prediction approaches and hereby bug datasets can be categorized into larger groups based on the captured characteristics (D'Ambros et al. 2012):
-Datasets using process metrics (Moser et al. 2008; Nagappan and Ball 2005).
Different bug prediction approaches use various public or private bug datasets. Although these datasets seem very similar, they often differ in some aspects, which is also true within the categories mentioned above. In this study, we gather datasets that can be found, but we will focus on datasets that use static source code metrics. Since this category itself has grown so large, it is worth studying it as a separate unit. This category also has many dissimilarities between the existing datasets, including the granularity of the data (source code elements can be files, classes, or methods, depending on the purpose of the given research or on the capabilities of the tool used to extract data) and the representation of element names (different tools may use different notations). For the same reason, the set of metrics can be different as well. Even if the names or the abbreviations of a metric calculated by different tools are the same, it can have different meanings because it can be defined or calculated in a slightly different way. The bug-related information given for a source code element can also differ. An element may only be labeled as to whether it contains a bug, or the dataset may show how many bugs are related to that source code element. From the information content perspective, it is less important, but not negligible, that the format of the files containing the data can be CSV (Comma Separated Values), XML, or ARFF (which is the input format of Weka (Hall et al. 2009)), and that these datasets can be found in different places on the Internet.
In this paper, we collected 5 publicly available datasets, downloaded the corresponding source code for each system in the datasets, and performed source code analysis to obtain a common set of source code metrics. This way, we produced a unified bug dataset at both class and file level. The Appendix explains the structure of the unified bug dataset, which is available online.
To make it easier to imagine what a dataset looks like, Table 1 shows an excerpt of an example table where each row contains a Java class with its basic properties like Name or Path, which are followed by the source code metrics (e.g., WMC, CBO), and the most important property, the number of bugs.
After constructing the unified bug dataset, we examined the diversity of the metric suites. We calculated Pearson correlation and Cohen's d effect size, and applied the Wilcoxon signed-rank test to reveal these possible differences. Finally, we used a decision tree algorithm to show the usefulness of the dataset in bug prediction.
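To illustrate two of the statistics named above, the following is a minimal sketch of Pearson correlation and Cohen's d for two paired series of metric values (for example, an original metric and its recalculated counterpart). The function names and the sample values are hypothetical, and the pooled-standard-deviation form of Cohen's d is an assumption; the exact variant used in the study is not specified here. The Wilcoxon signed-rank test is omitted; an implementation is available, for instance, as `scipy.stats.wilcoxon`.

```python
import math
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cohens_d(xs, ys):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    nx, ny = len(xs), len(ys)
    pooled = math.sqrt(
        ((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2) / (nx + ny - 2)
    )
    return (mean(xs) - mean(ys)) / pooled


# Hypothetical values of one metric for five classes:
original = [10, 12, 7, 30, 5]      # as stored in a downloaded dataset
recalculated = [11, 12, 9, 33, 5]  # as produced by a second tool

r = pearson(original, recalculated)
d = cohens_d(original, recalculated)
```

A high correlation combined with a non-negligible effect size would suggest that two tools rank the classes similarly while still disagreeing on the absolute values.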
We found that there are statistically significant differences in the values of the original and the newly calculated metrics; furthermore, notations and definitions can differ severely. We compared the bug prediction capabilities of the original and the extended metric suites (within-project learning). Afterwards, we merged all classes (and files) into one large dataset which consists of 47,618 elements (43,744 for files), and we evaluated the bug prediction model built on this large dataset as well. Finally, we also investigated the cross-project capabilities of the bug prediction models and datasets. We made the unified dataset publicly available. By using a public unified dataset as an input for different bug prediction-related investigations, researchers can make their studies reproducible and thus open to validation and verification.
Our contributions can be listed as follows:
-Collection of the public bug datasets and source code.
-Unification of the contents of the collected bug datasets.
-Calculation of a common set of source code metrics.
-Comparison of the metrics suites.
-Assessment of the meta data of the datasets.
-Assessment of bug prediction capabilities of the datasets.
-Making the results publicly available.
The first version of this work was published in our earlier conference paper (Ferenc et al. 2018). In this extended version, we collected recent systematic literature review papers in the field and considered their references as well. Moreover, some further related work was used in the paper. As another major extension, we investigated whether the differences in metric values are statistically significant. Finally, we built and evaluated within-project, merged, and cross-project bug prediction models on the unified dataset.
The paper is organized as follows. First, in Section 2, we give a brief overview of how we collected the datasets. We also introduce the collected public datasets and present their characteristics. Next, Section 3 presents the steps needed for the unification. We show the original metrics for each dataset and propose the extended metrics suite in Section 4, where we also compare the original and extended metric suites with a statistical method. In Section 5, we summarize the metadata of the datasets and we empirically assess the constructed unified bug dataset by showing its bug prediction capabilities. Afterwards, we list the threats to validity in Section 6. We conclude the paper and discuss future work in Section 7. Finally, the Appendix describes the structure of the unified bug dataset and shows its download location.

Data collection
In this section, we give a detailed overview of how we collected and analyzed the datasets. We applied a snowballing-like technique (Wohlin 2014) as our data collection process. In the following, we describe how our start set was defined, what the inclusion criteria were, and how we iterated over the relevant papers.

Start set
Starting from the early 70's (Randell 1975; Horning et al. 1974), a large number of studies were published in connection with software faults. According to Yu et al. (2016), 729 studies were published until 2005 and 1564 until 2015 on bug prediction (the number of studies has doubled in 10 years). From time to time, the enormous number of new publications on the topic of software faults made it unavoidable to collect the most important advances in literature review papers (Hosseini et al. 2019; Herbold et al. 2017; Wahono 2015).
Since these survey or literature review papers could serve as strong start set candidates, we used Scopus and Google Scholar to look for these papers. We used these two search sites to fulfill the diversity rule and cover as many different publishers, years, and authors as possible. We considered only peer-reviewed papers. Our search string was the following: '(defect OR fault OR bug) AND prediction AND (literature OR review OR survey)'. Based on the title and the abstract, we ended up with 32 candidates. We examined these papers and based on their content, we narrowed the start set to 12 (Catal and Diri 2009;Hall et al. 2012;Radjenović et al. 2013;Herbold et al. 2017;Strate and Laplante 2013;Catal 2011;Malhotra and Jain 2011;Jureczko and Madeyski 2011b;Malhotra 2015;Wahono 2015;Adewumi et al. 2016;Li et al. 2018). Other papers were excluded since they were out of scope, lacked peer review, or were not literature reviews. The included literature papers cover a time interval from 1990 to 2017.

Collecting bug datasets
Having the start set of literature review papers, we next applied backward snowballing to gather all the possible candidates which refer to a bug dataset. In other words, we considered all the references of the review papers to form the final set of candidates. Only one iteration of backward snowballing was used, since the survey papers had already included the most relevant studies in the field and sometimes they had also included reviews about the used datasets.
After having the final set of candidates (687), we filtered out irrelevant papers based on keywords, title, and abstract, and we also searched for the strings 'dataset', 'data set', or 'used projects'. Investigating the remaining set of papers, we took into consideration the following properties:
-Basic information (authors, title, date, publisher).
-Accessibility of the bug dataset (public, non public, partially public).
-Availability of the source code.
The latter two are extremely important when investigating the datasets, since we cannot construct a unified dataset without obtaining the appropriate underlying data.
From the final set of papers, we extracted all relevant datasets. We considered the following list to check whether a dataset meets our requirements:
-the dataset is publicly available,
-source code is accessible for the included systems,
-bug information is provided,
-bugs are associated with the relevant source code elements,
-included projects were written in Java,
-the dataset provides bug information at file/class level, and
-the source code element names are provided and unambiguous (the referenced source code is clearly identifiable).
If any condition was not met, we had to exclude the subject system or the whole dataset from the study, because it could not be included in the unified bug dataset. Initially, we did not insist on examining Java systems; however, relevant research papers mostly focus on Java language projects (Sayyad Shirabad and Menzies 2005; Catal 2011; Radjenović et al. 2013). Consequently, we narrowed our research topic to datasets capturing information about systems written in Java. This way, we could use one static analysis tool to extract the characteristics from all the systems; furthermore, including heterogeneous systems would have added a bias to the unified dataset, since the interpretation of the metrics, and even the set of interpretable metrics itself, can differ from language to language.
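The requirement list above can be read as a conjunction of simple checks. The following minimal sketch encodes it that way; the class, the field names, and the two example candidates are hypothetical and only serve illustration, they do not describe any real dataset.

```python
from dataclasses import dataclass


@dataclass
class CandidateDataset:
    """Properties recorded for a candidate dataset (hypothetical field names)."""
    name: str
    publicly_available: bool
    source_code_accessible: bool
    has_bug_info: bool
    bugs_linked_to_elements: bool
    language: str
    granularity: str  # e.g. "file", "class", "method"
    unambiguous_names: bool


def meets_requirements(d: CandidateDataset) -> bool:
    """True only if every inclusion criterion holds."""
    return (
        d.publicly_available
        and d.source_code_accessible
        and d.has_bug_info
        and d.bugs_linked_to_elements
        and d.language == "Java"
        and d.granularity in {"file", "class"}
        and d.unambiguous_names
    )


# Two made-up candidates: the second one fails because its source code
# cannot be obtained.
candidates = [
    CandidateDataset("example-a", True, True, True, True, "Java", "class", True),
    CandidateDataset("example-b", True, False, True, True, "Java", "file", True),
]
included = [d.name for d in candidates if meets_requirements(d)]
```

Because the criteria are combined with a logical AND, a single unmet condition is enough to exclude a candidate, which mirrors the exclusion rule described above.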
The public datasets we found and could use for our purposes are the following (references point to the original studies in which the datasets were first presented):
-PROMISE (Jureczko and Madeyski 2010)
-Eclipse Bug Dataset (Zimmermann et al. 2007)
-Bug Prediction Dataset (D'Ambros et al. 2010)
-Bugcatchers Bug Dataset (Hall et al. 2014)
-GitHub Bug Dataset (Tóth et al. 2016)
It is important to note that we will refer to the Jureczko dataset as the PROMISE dataset throughout the study; however, the PROMISE repository itself contains more datasets, such as the NASA MDP (Sayyad Shirabad and Menzies 2005) dataset, which had to be excluded since its source code is not accessible.

Public Datasets
In the following subsections, we describe the chosen datasets in more detail, investigate each dataset's peculiarities, and look for common characteristics. Before introducing each dataset, we show some basic size statistics about the chosen datasets in Table 2. We used the cloc (https://www.npmjs.com/package/cloc) program to measure the Lines of Code. We only took Java source files into consideration and we neglected blank lines.

PROMISE
PROMISE (Sayyad Shirabad and Menzies 2005) is one of the largest research data repositories in software engineering. It is a collection of many different datasets, including the NASA MDP (Metric Data Program) dataset, which was used by numerous studies in the past. However, one should always mistrust the data that comes from an external source (Petrić et al. 2016; Gray et al. 2011, 2012; Shepperd et al. 2013). The repository was created to encourage repeatable, verifiable, refutable, and improvable predictive models of software engineering. This is essential for the maturation of any research discipline. One main goal is to extend the repository to other research areas as well. The repository is community based; thus, anybody can donate a new dataset or public tools, which can help other researchers in building state-of-the-art predictive models. PROMISE provides the datasets under categories like code analysis, testing, software maintenance, and it also has a category for defects.
One main dataset in the repository is the one from Jureczko and Madeyski (2010), which we use in our study. The dataset uses the classic Chidamber and Kemerer (C&K) metrics (Chidamber and Kemerer 1994) to characterize the bugs in the systems.

Eclipse Bug Dataset
Zimmermann et al. (2007) mapped defects from the bug database of Eclipse 2.0, 2.1, and 3.0. The resulting dataset lists the number of pre- and post-release defects at the granularity of files and packages, collected from the BUGZILLA bug tracking system. They collected static code features using the built-in Java parser of Eclipse. They calculated some features at a finer granularity; these were aggregated by taking the average, total, and maximum values of the metrics. The data is publicly available and has been used in many studies since then. The last modification to the dataset was submitted on March 25, 2010.

Bug Prediction Dataset
The Bug Prediction Dataset (D'Ambros et al. 2010) contains data extracted from 5 Java projects by using inFusion and Moose to calculate the classic C&K metrics at class level. The source of information was mainly CVS, SVN, Bugzilla and Jira, from which the number of pre- and post-release defects were calculated. D'Ambros et al. also extended the source

Bugcatchers Bug Dataset
Hall et al. presented the Bugcatchers (Hall et al. 2014) Bug Dataset, which solely operates with bad smells, and found that coding rule violations have a small but significant effect on the occurrence of faults at file level. The Bugcatchers Bug Dataset contains bad smell information about Eclipse, ArgoUML, and some Apache software systems for which the authors used Bugzilla and Jira as the sources of the data.

GitHub Bug Dataset
Tóth et al. selected 15 Java systems from GitHub and constructed a bug dataset at class and file level (Tóth et al. 2016). This dataset was employed as an input for 13 different machine learning algorithms to investigate which algorithm family performs the best in bug prediction. They included many static source code metrics in the dataset and used these measurements as independent variables in the machine learning process.

Additional Bug Datasets
In this section, we show additional datasets which could not be included in the chosen set of datasets. Since this study focuses on datasets that fulfilled our selection criteria and could be used in the unification, we only briefly describe the most important but excluded datasets here.

Defects4J
Defects4J is a bug dataset which was first presented at the ISSTA conference in 2014 (Just et al. 2014). It focuses on bugs from a software testing perspective. Defects4J encapsulates reproducible real-world software bugs. Its repository includes software bugs with their manually cleaned patch files (irrelevant code parts were removed manually) and, most importantly, it includes a test suite in which at least one test case fails before the patch is applied and none fails after the patch is applied. Initially, the repository contained 357 software bugs from 5 software systems, but it reached 436 bugs from 6 systems owing to active maintenance.

IntroClassJava
The IntroClassJava (Durieux and Monperrus 2016) dataset is a collection of software programs, each with several revisions. The revisions were submitted by students and each revision is a Maven project. This benchmark is interesting since it contains C programs transformed into Java. Test cases are also transformed into standard JUnit test cases. The benchmark consists of 297 Java programs, each having at least one failing test case. The IntroClassJava dataset is very similar to Defects4J; however, it does not provide the manually cleaned fixing patches.

QuixBugs
QuixBugs is a benchmark for supporting automatic program repair research studies (Lin et al. 2017). QuixBugs consists of 40 programs written in both Python and Java. It also contains the failing test cases for the one-line bugs located in each program. Defects are categorized and each defect falls in exactly one category. The benchmark also includes the corrected versions of the programs.

Bugs.jar
Bugs.jar (Saha et al. 2018) is a large-scale, diverse dataset for automatic bug repair. Bugs.jar falls into the same dataset category as the previously mentioned ones. It consists of 1,158 bugs with their fixing patches from 8 large open-source software systems. This dataset also includes the bug reports and the test suite in order to support reproducibility.

Bears
The Bears dataset (Madeiral et al. 2019) also supports automatic program repair studies. This dataset makes use of the continuous integration tool named Travis to generate new entries in the dataset. It includes the buggy state of the source code, the test suite, and the fixing patch as well.

Summary
All the datasets described above focus on bugs from the software testing perspective and also support future automatic program repair studies. These datasets can be good candidates to be used in fault localization research studies as well. They capture buggy states of programs and provide the test suite and the patch. Our dataset is fundamentally different from these datasets. The datasets we collected gather information from a wider time interval and provide information for each source code element by characterizing them with static source code metrics.

Data Processing
Although the public datasets we found have similarities (e.g., containing source code metrics and bug information), they are very inhomogeneous. For example, they contain different metrics, which were calculated with different tools and for different kinds of code elements. The file formats are different as well; therefore, it is very difficult to use these datasets together. Consequently, our aim was to transform them into a unified format and to extend them with source code metrics that are calculated with the same tool for each system. In this section, we describe the steps we performed to produce the unified bug dataset. First, we transformed the existing datasets to a common format. This means that if a bug dataset for a system consisted of separate files, we merged them into one file. Next, we changed the CSV separator in each file to comma (,) and renamed the column holding the number of bugs in each dataset to 'bug' and the source code element column to 'filepath' or 'classname', depending on the granularity of the dataset. Finally, we transformed the source code element identifier into the standard form (e.g. org.apache.tools.ant.AntClassLoader).
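The separator and column renaming steps can be sketched as follows. This is only an illustration under stated assumptions: the function name, its parameters, and the sample input (a semicolon-separated file with columns `name` and `bugs`) are hypothetical, and the identifier normalization step is left out because the input notations vary between datasets.

```python
import csv
import io


def normalize_csv(text, delimiter, bug_column, element_column, granularity):
    """Rewrite a dataset CSV with comma separators and unified column names.

    granularity is "class" or "file" and selects the name of the
    source code element column ('classname' or 'filepath').
    """
    element_name = {"class": "classname", "file": "filepath"}[granularity]
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))

    # Rename only the two special columns; metric columns keep their names.
    header = [
        "bug" if h == bug_column else element_name if h == element_column else h
        for h in rows[0]
    ]

    out = io.StringIO()
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical input using ';' as separator:
raw = "name;wmc;bugs\norg.example.Foo;3;1\n"
unified = normalize_csv(raw, ";", "bugs", "name", "class")
```

Using the `csv` module instead of plain string replacement keeps quoted fields that contain the old separator intact.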

Metrics calculation
The bug datasets contain different kinds of metric sets, which were calculated with different tools; therefore, even if the same metric name appears in two or more different datasets, we cannot be sure they mean exactly the same metric. To eliminate this deficiency, we analyzed all the systems with the same tool. For this purpose, we used the free and open-source OpenStaticAnalyzer 1.0 (OSA) tool that is able to analyze Java systems (among other languages). It calculates more than 50 different kinds (size, complexity, coupling, cohesion, inheritance, and documentation) of source code metrics for packages and class-level elements, about 30 metrics for methods, and a few for files. OSA can detect code duplications (Type-1 and Type-2 clones) as well, and calculates code duplication metrics for packages, classes, and methods. OpenStaticAnalyzer has two different kinds of textual outputs: the first one is an XML file that contains, among others, the whole structure of the source code (files, packages, classes, methods), their relationships, and the metric values for each element (e.g. file, class, method). The other output format is CSV. Since different elements have different metrics, there is one CSV file for each kind of element (one for packages, one for classes, and so on).
The metrics in the bug datasets were calculated with 5 different tools (inFusion Moose, ckjm, Visitors written for the Java parser of Eclipse, Bad Smell Detector, and SourceMeter, which is a commercial product based on OSA; see Table 16). Of these 5 tools, only ckjm and SourceMeter/OSA are still available on the internet, but the last version of ckjm is from 2012 and the Java language has evolved a lot since then. Additionally, ckjm works on the bytecode representation of the code, which makes it necessary to compile the source code before analysis. Consequently, we selected OSA because it is a state-of-the-art analyzer that works on the source code, which, besides being easier to use, also enables more precise analysis.
Of course, further tools are also available, but it was not the aim of this work to find the best available tool.
For calculating the new metric values, we needed the source code itself. Since each dataset belonged to a release version of a given software system, we could download and analyze the source code if the software was open-source and the given release version was still available. This way, we obtained two results for each system: one from the downloaded bug datasets and one from the OSA analysis.

Dataset unification
We merged the original datasets with the results of OSA by using the "unique identifiers" of the elements (Java standard names at class level and paths at file level). More precisely, the basis of the unified dataset was our source code analysis result and it was extended with the data of the given bug dataset. This means that we went through all elements of the bug dataset and if the "unique identifier" of an element was found in our analysis result, then these two elements were conjugated (the original dataset entry was paired with the one found in the result of OSA); otherwise, it was left out from the unified dataset. Tables 3 and 4 show the results of this merging process at class and file level, respectively: column OSA shows how many elements OSA found in the analyzed systems, column Orig. presents the number of elements originally in the datasets, and column Dropped tells us how many elements of the bug datasets could not be paired, and so they were left out from the unified dataset. The numbers in parentheses show the number of dropped elements whose original sources were not real Java sources, such as package-info.java and Scala files (which are also compiled to byte code and hence included in the original dataset). Although these numbers are very promising, we had to "modify" a few systems to achieve this, but there were cases where we simply could not solve the inconsistencies. The details of the source code modifications and main reasons for the dropped elements were the following:
Camel 1.2: In the org.apache.commons.logging package, there were 13 classes in the original dataset that we did not find in the source code. There were 5 package-info.java files in the system, but these files never contain any Java classes, since they are used for package level Javadoc purposes; therefore, OSA did not find such classes.
Ckjm 1.8: There was a class in the original dataset that did not exist in version 1.8.
Forrest-0.8: There were two different classes that appeared twice in the source code; therefore, we deleted the 2 copies from the etc/test-whitespace subdirectory.
Log4j: There was a contribs directory which contained the source code of different contributors. These files were put into the appropriate sub-directories as well (where they belonged according to their packages), which means that they occurred twice in the analysis and this prevented their merging. Therefore, in these cases, we analyzed only those files that were in their appropriate subdirectories and excluded the files found in the contribs directory.
Lucene: In all three versions, there was an org.apache.lucene.search.RemoteSearchable_Stub class in the original dataset that did not exist in the source code.
Velocity: In versions 1.5 and 1.6 there were two org.apache.velocity.app.event.implement.EscapeReference classes in the source code; therefore, it was impossible to conjugate them by using their "unique identifiers" only.

Xerces 1.4.4: Although the name of the original dataset and the corresponding publication state that this is the result of Xerces 1.4.4 analysis, we found that 256 out of the 588 elements did not exist in that version. We examined a few previous and following versions as well, and it turned out that the dataset is much closer to 2.0.0 than to 1.4.4, because only 42 elements could not be conjugated with the analysis result of 2.0.0. Although version 2.0.0 was still not matched perfectly, we did not find a "closer version"; therefore, we used Xerces 2.0.0 in this case.
Eclipse JDT Core 3.4: There were a lot of classes which appeared twice in the source code: once in the "code" and once in the "test" directory; therefore, we deleted the test directory.
Eclipse PDE UI 3.4.1: The missing 6 classes were not found in its source code.

Equinox 3.4: Three classes could not be conjugated, because they did not have a unique name (there are more classes with the same name in the system) while two classes were not found in the system.

Lucene 2.4 (BPD): 21 classes from the original dataset were not present in the source code of the analyzed system.
Mylyn 3.1: 457 classes were missing from our analysis that were in the original dataset; therefore, we downloaded different versions of Mylyn, but still could not find the matching source code. We could not achieve a better result without knowing the proper version.
ArgoUML 0.26 Beta: There were 3 classes in the original dataset that did not exist in the source code.
Eclipse JDT Core 3.1: There were 25 classes that did not exist in the analyzed system.
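The pairing rule described in this subsection can be sketched as follows, assuming both inputs are lists of dictionaries that share an identifier key. The function name, the `orig_` prefix for the original columns, and the sample rows are hypothetical. Elements whose identifier is missing from the analysis result, or occurs more than once in it (as in the Velocity case), are reported as dropped.

```python
from collections import Counter


def unify(osa_rows, original_rows, key="classname"):
    """Pair original dataset entries with OSA results by unique identifier.

    Returns (unified_rows, dropped_ids). An original entry is dropped when
    its identifier is absent from, or ambiguous in, the OSA result.
    """
    id_counts = Counter(row[key] for row in osa_rows)
    osa_by_id = {row[key]: row for row in osa_rows if id_counts[row[key]] == 1}

    unified, dropped = [], []
    for row in original_rows:
        match = osa_by_id.get(row[key])
        if match is None:
            dropped.append(row[key])
            continue
        merged = dict(match)  # the analysis result is the basis
        # Keep the original columns next to the new ones (assumed prefix).
        merged.update({f"orig_{k}": v for k, v in row.items() if k != key})
        unified.append(merged)
    return unified, dropped


# Hypothetical rows:
osa = [
    {"classname": "a.B", "LOC": 10},
    {"classname": "a.D", "LOC": 1},
    {"classname": "a.D", "LOC": 2},  # duplicate identifier, cannot be paired
]
orig = [
    {"classname": "a.B", "bug": 1},
    {"classname": "a.X", "bug": 0},  # not present in the analyzed source
    {"classname": "a.D", "bug": 2},
]
unified_rows, dropped_ids = unify(osa, orig)
```

At file level the same function could be called with `key="filepath"`; combining name and path into one key would correspond to the workaround described for datasets whose class names are not unique.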

GitHub Bug Dataset
Since OSA is the open-source version of SourceMeter, the tool used to construct the GitHub Bug Dataset, we could easily merge the results. However, the class level bug datasets contained elements having the same "unique identifier" (since class names are not the standard Java names in that case), so this information was not enough to conjugate them. Luckily, the paths of the elements were also available and we used them as well; therefore, all elements could be conjugated. Since the authors performed a machine learning step on the versions that contain the most bugs, we decided to select these release versions, present their characteristics, and include them in the unified bug dataset.
As a result of this process, we obtained a unified bug dataset which contains all of the public datasets in a unified format; furthermore, they were extended with the same set of metrics provided by the OSA tool. The last lines of Tables 3 and 4 show that only 1.29% (624 out of 48,242) of the classes and 0.06% (28 out of 43,772) of the files could not be conjugated, which means that only 0.71% (652 out of 92,014) of the elements were left out from the unified dataset in total. In many cases, the analysis results of OSA contained more elements than the original datasets. Since we did not know how the bug datasets were produced, we could not give an exact explanation for the differences, but we list the two main possible causes:
-In some cases, we could not find the proper source code for the given system (e.g., Xerces 1.4.4 or Mylyn), so two different but close versions of the same system might be conjugated.
-OSA takes into account nested, local, and anonymous classes while some datasets simply associated Java classes with files.
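The reported shares follow directly from the counts given above, as the short check below shows. Only the numbers are taken from the text; the helper name is hypothetical.

```python
def dropped_share(dropped, original):
    """Percentage of original elements that could not be paired."""
    return round(100 * dropped / original, 2)


classes = (624, 48_242)  # (dropped, original) at class level
files = (28, 43_772)     # (dropped, original) at file level
total = (classes[0] + files[0], classes[1] + files[1])

class_share = dropped_share(*classes)
file_share = dropped_share(*files)
total_share = dropped_share(*total)

# Elements remaining in the unified dataset at each level.
remaining_classes = classes[1] - classes[0]
remaining_files = files[1] - files[0]
```

The remaining counts, 47,618 classes and 43,744 files, agree with the sizes of the merged datasets reported earlier in the paper.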

Original and extended metrics suites
In this section, we present the metrics proposed by each dataset. Additionally, we will show a metrics suite that is used by the newly constructed unified dataset.

PROMISE
The authors (Promiserepo 2018) calculated the metrics of the PROMISE dataset with the tool called ckjm. All metrics, except McCabe's Cyclomatic Complexity (CC), are class level metrics. Besides the C&K metrics, they also calculated some additional metrics shown in Table 5.

Eclipse Bug Dataset
In the Eclipse Bug Dataset, there are two types of predictors. By parsing the structure of the obtained abstract syntax tree, Zimmermann et al. (2007) calculated the number of nodes for each type in a package and in a file (e.g., the number of return statements in a file). By implementing visitors to the Java parser of Eclipse, they also calculated various complexity metrics at method, class, file, and package level. Then they used avg, max, total avg, and total max aggregation techniques to accumulate to file and package level. The complexity metrics used in the Eclipse dataset are listed in Table 6.
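As an illustration of such an aggregation, the following minimal sketch accumulates a method-level metric to file level by taking the average, maximum, and total per file. The function name and the sample values are hypothetical, and the sketch does not claim to reproduce the exact aggregation variants used for the Eclipse Bug Dataset.

```python
from collections import defaultdict


def aggregate_to_file(method_values):
    """Aggregate (file_path, metric_value) pairs to one record per file."""
    per_file = defaultdict(list)
    for path, value in method_values:
        per_file[path].append(value)
    return {
        path: {
            "avg": sum(values) / len(values),
            "max": max(values),
            "total": sum(values),
        }
        for path, values in per_file.items()
    }


# Hypothetical complexity values of three methods in two files:
methods = [("A.java", 1), ("A.java", 3), ("B.java", 5)]
file_level = aggregate_to_file(methods)
```

The same grouping could be repeated with package names as keys to accumulate file-level values to package level.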

Bug Prediction Dataset
The Bug Prediction Dataset collects product and change (process) metrics. The authors (D'Ambros et al. 2010) produced the corresponding product and process metrics at class level. Besides the classic C&K metrics, they calculated some additional object-oriented metrics that are listed in Table 7.

Bugcatchers Bug Dataset
The Bugcatchers Bug Dataset is a bit different from the previous datasets, since it does not contain traditional software metrics, but the number of bad smells for files. They used five bad smells which are presented in Table 8. Besides, in the CSV file, there are four source code metrics (blank, comment, code, codeLines), which are not explained in the corresponding publication (Hall et al. 2014).

GitHub Bug Dataset
The GitHub Bug Dataset (Tóth et al. 2016) used the free version of the SourceMeter (2019) static analysis tool to calculate the static source code metrics including software product metrics, code clone metrics, and rule violation metrics. The rule violation metrics were not used in our research; therefore, Table 9 shows only the list of the software product and code clone metrics at class level. At file level, only a narrowed set of metrics is calculated, but there are 4 additional process metrics included as Table 10 shows.

Unified Bug Dataset
The unified dataset contains all the datasets with their original metrics and with further metrics that we calculated with OSA. The set of metrics calculated by OSA concurs with the metric set of the GitHub Bug Dataset, because SourceMeter is a product based on the free and open-source OSA tool. Therefore, all datasets in the Unified Bug Dataset are extended with the metrics listed in Table 9 except the GitHub Bug Dataset, because it contains the same metrics originally.
In spite of the fact that several of the original metrics can be matched with the metrics calculated by OSA, we decided to keep all the original metrics for every system included in the unified dataset, because they can differ in their definitions or in the ways of their calculation. Those who want to work only with the original metrics can simply use the unified dataset and discard the metrics that were calculated by OSA. Furthermore, this provides an opportunity to compare the original and the OSA metrics. Instead of presenting all the definitions of metrics here, we give an external resource to show metric definitions because of the lack of space. All the metrics and their definitions can be found in the Unified Bug Dataset file reachable as an online appendix (see Appendix).

Comparison of the Metrics
In the unified dataset, each element has numerous metrics, but these values were calculated by different tools; therefore, we assessed them in more detail to get answers to questions like the following ones:
- Do two metrics with the same name have the same meaning?
- Do metrics with different names have the same definition?
- Can two metrics with the same definition be different?
- What are the root causes of the differences if the metrics share the definition?
Three out of the five datasets contain class level elements, but unfortunately, for each dataset, a different analyzer tool was used to calculate the metrics (see Table 16). To be able to compare class level metrics calculated by all the tools used, we needed at least one dataset for which all metrics of all three tools are available. We were already familiar with the usage of the ckjm tool, so we chose to calculate the ckjm metrics for the Bug Prediction dataset. This way, we could assess all metrics of all tools, because the Bug Prediction dataset was originally created with Moose, so we have extended it with the OSA metrics, and also, for the sake of this comparison, with ckjm metrics.
In the case of the three file level datasets, the used analyzer tools were unavailable; therefore, we could only compare the file level metrics of OpenStaticAnalyzer with the results of the other two tools separately on Eclipse and Bugcatchers Bug datasets.
In each comparison, we merged the different result files of each dataset into one, which contained the results of all systems in the given dataset, and deleted those elements that did not contain all metric values. For example, in case of the Bug Prediction Dataset, we calculated the OSA and ckjm metrics, then we removed the entries which were not identified by all three tools. Because we could not find the analyzers used in the file level datasets, we used the merging results seen in Section 3.2. For instance, in case of the Bugcatchers Bug Dataset, the new merged (unified) dataset has 14,543 entries (491 + 1,752 + 12,300) out of which 2,305 were in the original dataset and not dropped (191 + 1,582 + 560 - 28).
The resulting spreadsheet files can be found in the Appendix. Table 11 shows how many classes or files were in the given dataset and how many of them remained. We calculated the basic statistics (minimum, maximum, average, median, and standard deviation) of the examined metrics and compared them (see Table 12). Besides, we calculated the pairwise differences of the metrics for each element and examined its basic statistics as well. In addition, the Equal column shows the percentage of the classes for which the two examined tools gave the same result (for example, at class level OSA and Moose calculated the same WMC value for 2,635 out of the 4,167 elements, which is 63.2%, see Table 12).
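The pairwise comparison described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used for Table 12: the function name is hypothetical, and the choice of the population standard deviation is an assumption, since the text does not state which variant was reported.

```python
import statistics
from typing import Dict, Sequence


def pairwise_summary(tool1: Sequence[float], tool2: Sequence[float]) -> Dict[str, float]:
    """Basic statistics of the per-element difference tool1 - tool2."""
    diffs = [a - b for a, b in zip(tool1, tool2)]
    equal = sum(1 for d in diffs if d == 0)
    return {
        "min": min(diffs),
        "max": max(diffs),
        "avg": statistics.mean(diffs),
        "median": statistics.median(diffs),
        # Population standard deviation; the variant is an assumption.
        "std": statistics.pstdev(diffs),
        # Share of elements for which both tools report the same value.
        "equal_pct": 100.0 * equal / len(diffs),
    }


# Made-up values for four elements measured by two tools.
summary = pairwise_summary([3, 5, 7, 9], [3, 4, 7, 12])
print(summary["equal_pct"])

# The Equal value quoted in the text for WMC (OSA vs. Moose).
print(round(100.0 * 2635 / 4167, 1))
```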
Since the basic statistic values give only some impression about the similarity of the metric sets, we performed a Wilcoxon signed-rank test (Myles et al. 2014), which determines whether two dependent samples were selected from populations having the same distribution. In our test, the H0 hypothesis is that the difference between the pairs follows a symmetric distribution around zero, while the alternative H1 hypothesis is that the difference between the pairs does not follow a symmetric distribution around zero. We used a 95% confidence level (i.e., a significance level of 0.05) in the tests. This means that if a p-value is higher than 0.05, we cannot reject the H0 hypothesis; otherwise, we reject it and accept the H1 alternative hypothesis instead. In all cases, the p-values were less than 0.001; therefore, we had to reject H0 and accept that the difference between the pairs does not follow a symmetric distribution around zero.
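For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test using only the standard library. It makes several simplifying assumptions that are not taken from the text: zero differences are discarded, tied absolute differences receive average ranks, and the p-value comes from a normal approximation without tie or continuity correction, which is only reasonable for large samples. It is not the implementation used to produce the reported p-values.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W+, W-, approximate two-sided p-value) for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all pairs are equal; the test is undefined")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation of the null distribution of the rank sum.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Made-up paired measurements of one metric by two tools.
w_plus, w_minus, p = wilcoxon_signed_rank([10, 12, 9, 15, 11, 14], [8, 12, 10, 11, 8, 9])
print(w_plus, w_minus, p < 0.05)
```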
Although from a statistical point of view the metric sets are different, we see that in many cases, there are a lot of equal metric values. For example, in case of the file level dataset of Eclipse (see Table 14), only 11 out of 25,199 metrics are different, but 10 out of these 11 are larger only by 1 than their pairs so the test recognizes well that it is not symmetric. On the other hand, in this case, we can say that the two groups can be considered identical because less than 0.1% of the elements differ, and the difference is really negligible. Therefore, we calculated Cohen's d as well, which indicates the standardized difference between two means (Cohen 1988). If this difference, namely Cohen's d value, is less than 0.2, we say that the effect size is small and more than 90% of the two groups overlap. If the Cohen's d value is between 0.2 and 0.5, the effect size is medium and if the value is larger than 0.8, the effect size is large. Besides, to see how strong or weak the correlations between the metric values are, we calculated the Pearson correlation coefficient. The correlation coefficient ranges from -1 to 1, where 1 (or -1) means that there is an exact linear relationship between the two sets, while 0 means that there is no linear correlation at all. In general, the larger the absolute correlation coefficient is, the stronger the relationship is. In the literature, there are different recommendations for the threshold above which the correlation is considered strong (Gyimothy et al. 2005a;Jureczko 2011a); in this research, we used 0.8.
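The two measures can be sketched as follows. The pooled-standard-deviation form of Cohen's d is an assumption, since the text does not say which variant was computed, and the data are made up. The example also shows why a high Pearson value together with few equal pairs hints at a constant offset between two tools.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up metric values where the second tool always reports 3 more.
tool1 = [1, 2, 3, 4, 5]
tool2 = [v + 3 for v in tool1]

print(pearson_r(tool1, tool2))      # perfectly correlated
print(abs(cohens_d(tool1, tool2)))  # yet a large effect size
```

In this constructed case no pair is equal and the effect size is large, while the correlation is perfect, which is the pattern the constant-offset explanation relies on.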

Class Level Metrics
The unified bug dataset contains the class level metrics of OSA and Moose on the Bug Prediction dataset. We downloaded the Java binaries of the systems in this dataset and used ckjm version 2.2 to calculate the metrics. The first difference is that while OSA and Moose calculate metrics on source code, ckjm uses Java bytecode and takes "external dependencies" into account; therefore, we expected differences, for instance, in the coupling metric values.
We compared the metric sets of the three tools and found that, for example, CBO and WMC have different definitions. On the other hand, the efferent coupling metric is a good example of a metric which is calculated by all three tools, but with different names (see Table 12, CBO row). In the following paragraphs, we only examine those metrics whose definitions coincide in all three tools even if their names differ. Table 12 shows these metrics where the Metric column contains the abbreviation of the most widely used name of the metric. The Tool column presents the analyzer tools; in the Metric name column, the metric names are given using the notations of the different datasets. The "tool1 − tool2" means the pairwise difference where, for each element, we subtracted the value of tool2 from the value of tool1 and the name of this "new metric" is diff. The following columns present the basic statistics of the metrics. The Equal column denotes the percentage of the elements having the same metric value (i.e., the difference is 0), and the last two columns are the Cohen's d value and the Pearson correlation coefficient (where it is appropriate). We highlighted with bold face those values that suggest that the two metric set values are close to each other from a given aspect:
- if more than half of the element pairs are equal (i.e., Equal is above 50%),
- if the effect size is small (i.e., Cohen's d is less than 0.2),
- if there is strong linear correlation between the elements (i.e., the absolute value of the Pearson correlation coefficient is larger than 0.8).
Next, we will analyze the metrics one at a time.
WMC: This metric expresses the complexity of a class as the sum of the complexity of its methods. In its original definition, the method complexity is deliberately not defined exactly; and usually the uniform weight of 1 is used.

CBO: In this definition, CBO counts the number of classes the given class depends on. Although it is a coupling metric, it counts efferent (outer) couplings; therefore, the metric values should have been similar. On the other hand, based on the statistical values and the pairwise comparison including Equal and Cohen values, we can say that these metrics differ significantly. We can observe strong linear correlation among them (Pearson values are close to or above 0.8) but since there are few equal values we can suspect that they differ by a constant in most cases. The reasons can be, for example, that ckjm takes into account "external" dependencies (e.g., classes from java.util) or it counts coupling based on generated elements too (e.g., generated default constructor), but further investigation would be required to determine all causes.
CBOI: It counts those classes that use the given class. Although the basic statistics of OSA and ckjm are close to each other, its pairwise comparison suggests that they are different because there are large outliers and the averages of the diffs are commensurable with the averages of the tools. Based on Equal, Cohen, and Pearson values, it seems that the metric values of OSA and ckjm are close to each other but the metric values of Moose are different. The main reason can be, for example, that OSA found two times more classes; therefore, it is explicable that more classes use the given class or ckjm takes into account the generated classes and connections as well that exist in the bytecode, but not in the source code.
RFC: Although this metric has several variants and it is not defined exactly how Moose and ckjm count it, we used the closest one from OSA based on the metric values.
LOC: Lines of code should be the most unambiguous metric, but it also differs a lot. The very large value of ckjm is surprising, but it counts this value from the bytecode; therefore, it is not easy to validate it. Besides, OSA and Moose have different values, in spite of the fact that both of them calculate LOC from source code. The 0 minimal value of Moose is also interesting and suggests that either Moose used a different definition or the algorithm was not good enough. We found the fewest equal metric pairs for the LOC metric (Equal values are 3.1%, 0.6%, 0.1%) but the large Pearson correlation values (close to or above 0.9) suggest that the metrics differ mainly by a constant value only.
NPM: Based on the statistical results, the number of public methods metric seems to be the most unambiguous metric. Both the basic statistics and the Equal, Cohen, and Pearson triplet imply that the metrics are close to each other. OSA and ckjm are really close to each other (75% of the values are the same and the rest is also close to each other because the Pearson correlation coefficient is 0.999), while Moose has slightly different results. However, the average difference is around 1, which can be caused by counting implicit methods (constructors, static init blocks) or not.
The comparison of the three tools revealed that, even though they calculate the same metrics, in some cases, the results are very divergent. A few of the reasons can be that ckjm calculates metrics from bytecode while the other two tools work on source code, or ckjm takes into account external code as well while OSA does not. Besides, we could not compare the detailed and precise definitions of the metrics to be sure that they are really calculated in the same way; therefore, it is possible that they differ slightly which causes the differences.

File level metrics
Bugcatchers, Eclipse, and GitHub Bug Dataset are the ones that operate at file level (GitHub Bug Dataset contains class level too). Unfortunately, we could make only pairwise comparisons between file level metrics, since we could not replicate the measurements used in the Eclipse Bug Dataset (custom Eclipse JDT visitors were used) and in the Bugcatchers Bug Dataset (unknown bad smell detector was used).
In case of Bugcatchers Bug Dataset, we compared the results of OSA and the original metrics which were produced by a code smell detector. Since OSA only calculates a narrow set of file level metrics, Logical Lines of Code (LLOC) is the only metric we could use in this comparison. Table 13 presents the result of this comparison. Min, max, and median values are likely to be the same. Moreover, the average difference between LLOC values is less than 1 with a standard deviation of 6.05 which could be considered insignificant in case of LLOC at file level. Besides, more than 90% of the metric values are the same, and the remaining values are also close because the Cohen's d is almost 0 and the Pearson correlation coefficient is very close to 1. This means that the two tools calculate almost the same LLOC values.
There is an additional common metric (CLOC) which is not listed in Table 13 since OSA returned 0 values for all the files. This possible error in OSA makes it superfluous to examine CLOC in further detail.
In case of the Eclipse Bug Dataset, LLOC values are the same in most of the cases (see Table 14). OSA counted one extra line in 10 cases out of 25,210 and once it missed 7 lines which is a negligible difference. This is the cause of the "perfect" statistical values. Unfortunately, there is a serious deviation in case of McCabe's Cyclomatic Complexity. There is a case where the difference is 299 in the calculated values, which is extremely high for this metric. We investigated these cases and found that OSA does not include the number of methods in the final value. There are many cases when OSA gives 1 as a result while the Eclipse Visitor calculates 0 as complexity. This is because OSA counts class definitions but not method definitions. There are cases where OSA provides higher complexity values. It turned out that OSA took the ternary operator (?:) into consideration, which is correct, since these statements also form conditions. Both calculation techniques seem to have some minor issues or at least we have to say that the metric definitions of cyclomatic complexity differ. This is why there are only so few matching values (4%) but the Cohen's d and the Pearson correlation coefficient suggest that these values are still related to each other.
The significant differences both at class and file levels show that tools interpret and implement the definitions of metrics differently. Our results further strengthen the conclusion reported by Lincke et al. (2008), who described similar findings.

Evaluation
In this section, we first evaluate the unified bug dataset's basic properties like the number of source code elements and the number of bugs to gain a rough overview of its contents. Next, we show the metadata of the dataset like the used code analyzer or the calculated metric set. Finally, we present an experiment in which we used the unified bug dataset for its main purpose, namely for bug prediction. Our aim was not to create the best possible bug prediction model, but to show that the dataset is indeed a usable source of learning data for researchers who want to experiment with bug prediction models.
Table 15 shows the basic properties of each dataset. In the SCE column, the number of source code elements is presented. Based on the granularity, it means the number of classes or files. SCEwBug means the number of source code elements which have at least one bug; SCEwBug% is the percentage of the source code elements with bugs in the dataset. The SCEwBug% as the percentage of buggy classes or files describes how well-balanced the datasets are. Since it is difficult to overview the numbers, Figs. 1 and 2 show the distribution of the percentages of faulty source code elements (SCEwBug%) for classes and files, respectively. The percentages are shown horizontally, and, for example, the first column means the number of systems (part of a dataset) that have between 0 and 10 percent of their source code elements buggy (0 included, 10 excluded). In case of the systems that give bug information at class level, this number is 11 (5 at file level). We can see that there are systems where the percentages of the buggy classes are very high, for example, 98.79% of the classes are buggy for Xalan 2.7 or 92.20% for Log4J 1.2. Although the upper limit of SCEwBug% is 100%, the reader may feel that these values are extremely high and it is very difficult to believe that a release version of a system can contain so many bugs.
The other end is when a system hardly contains any bug, for example, in case of MCT and Neo4j, less than 1% of the classes are buggy. From the project point of view, it is very good but on the other hand, these systems are probably less usable when we want to build bug prediction models. This phenomenon further strengthens the motivation to have a common unified bug dataset which can blur these extreme outliers. Although there are many systems with extremely high or low SCEwBug% values, we used them "as is" later in this research because their validation was not the aim of this work.

Meta data of the Datasets
Table 16 lists some properties of the datasets, which show the circumstances of the dataset, rather than the data content. Our focus is on how the datasets were created, how reliable the used tools and the applied methods were. Since most of the information in the table was already described in previous sections (Analyzer, Granularity, Metrics, and Release), in this section, we will describe only the Bug information row.
The Bug Prediction Dataset used the commit logs of SVN and the modification time of each file in CVS to collect co-changed files, authors, and comments. Then, the authors (D'Ambros et al. 2010) linked the files with bugs from Bugzilla and Jira using the bug id from the commit messages. Finally, they verified the consistency of timestamps. They filtered out inner classes and test classes.
The PROMISE dataset used Buginfo to collect whether an SVN or CVS commit is a bugfix or not. Buginfo uses regular expressions to detect commit messages which contain bug information.
The bug information of the Eclipse Bug Dataset was extracted from the CVS repository and Bugzilla. In the first step, they identified the corrections or fixes in the version history by looking for patterns which are possible references to bug entries in Bugzilla. In the second step, they mapped the bug reports to versions using the version field of the bug report. Since the version of a bug report can change during the life cycle of a bug, they used the first version.
The Bugcatchers Bug Dataset followed the methodology of Zimmermann et al. (Eclipse Bug Dataset). They developed an Ant script using the SVN and CVS plugins of Ant to check out the source code and associate each fault with a file.
The authors of the GitHub bug dataset gathered the relevant versions to be analyzed from GitHub. Since GitHub can handle references between commits and issues, it was quite handy to use this information to match commits with bugs. They collected the number of bugs located in each file/class for the selected release versions (about 6-month-long time intervals).

Bug Prediction
We evaluated the strength of bug prediction models built with the Weka (Hall et al. 2009) machine learning software. For each subject software system in the Unified Bug Dataset, we created 3 versions of ARFF files (which is the default input format of Weka) for the experiments (containing only the original, only OSA, and both sets of metrics as predictors). In these files, we transformed the original bug occurrence values into two classes as follows: 0 bug → non buggy class, at least 1 bug occurrence → buggy class. Using these ARFF files, we could run several tests about the effectiveness of fault prediction models built based on the dataset.
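The transformation into a two-class ARFF file can be sketched as follows. This is an illustration only: the function name, the relation name, the attribute name `bug`, and the label spellings `buggy` and `non_buggy` are hypothetical and are not taken from the published files.

```python
from typing import Iterable, Sequence


def to_arff(relation: str, metric_names: Sequence[str], rows: Iterable[Sequence[float]]) -> str:
    """Render rows of (metric values..., bug count) as a two-class ARFF document."""
    lines = [f"@RELATION {relation}", ""]
    for name in metric_names:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    # Nominal class attribute; the label names are assumptions.
    lines.append("@ATTRIBUTE bug {buggy,non_buggy}")
    lines += ["", "@DATA"]
    for *metrics, bug_count in rows:
        # 0 bugs -> non buggy, at least 1 bug -> buggy.
        label = "buggy" if bug_count >= 1 else "non_buggy"
        lines.append(",".join(str(m) for m in metrics) + "," + label)
    return "\n".join(lines) + "\n"


# Two made-up classes described by WMC and CBO, with 0 and 3 bugs.
print(to_arff("demo", ["WMC", "CBO"], [(5, 2, 0), (12, 7, 3)]))
```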

Within-project Bug Prediction
As described in Section 4, we extended the original datasets with the source code metrics of the OSA tool and we created a unified bug dataset. We compared the bug prediction capabilities of the original metrics, the OSA metrics, and the extended metric suite (both metric suites together). First, we handled each system individually, so we trained and tested on the same system data using ten-fold cross-validation. To build the bug prediction models, we used the J48 (C4.5 decision tree) algorithm with default parameters. We used only J48 since we did not focus on finding the best machine learning method, but we rather wanted to show a comparison of the different predictors' capabilities with this one algorithm. We chose J48, because it has shown its power in case of the GitHub Bug Dataset (Tóth et al. 2016) and because it is relatively easy to interpret the resulting decision trees and to identify the subset of metrics which are the most important predictors of bugs. Different machine learning algorithms (e.g., neural networks) might provide different results. It is also important to note that we did not use any sampling technique to balance the datasets, we used the datasets 'as is'. We will outline some connections between the machine learning results and the characteristics of the datasets (e.g., SCEwBug%).
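The ten-fold cross-validation protocol mentioned above can be sketched as an index partition. This is a simplified, unstratified illustration with a hypothetical function name; it does not reproduce how Weka draws its folds and it does not train J48.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split n element indices into k (train, test) pairs.

    Every element is held out exactly once. Raises ValueError when there
    are fewer elements than folds.
    """
    if n < k:
        raise ValueError("need at least as many elements as folds")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        splits.append((train, held_out))
    return splits


# A system with 32 classes yields ten folds of three or four test elements.
for train, test in k_fold_indices(32):
    print(len(train), len(test))
```

A model would be trained on each `train` part and evaluated on the matching `test` part, and the per-fold results would then be aggregated.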
The F-measure results (F-measure is the harmonic mean of precision and recall) can be seen in Table 17 for classes and in Table 18 for files. We also included the SCEwBug% characteristic of each dataset to gain a more detailed view. The results of OSA and the Merged metrics would be the same. This could distort the averages, so we decided to detach this dataset from others when calculating the averages. Results for classes need some deeper explanation for clear understanding. There are a few missing values, since there were fewer than 10 data entries, which is not enough to do the ten-fold cross-validation on (Ckjm and Forrest-0.6).
There are some outstanding numbers in the tables. Considering Xalan 2.7, for example, we can see F-measure values with 0.992 and above. This is the consequence of the distribution of the bugs in that dataset which is extremely high as well (98.79%). The reader could argue that resampling the dataset would help to overcome this deficiency; however, the corresponding AUC value for the Original dataset is not outstanding (only 0.654). Besides, later in this section, we will investigate the bug prediction capabilities of the datasets in a cross-project manner and it turns out that these extreme outliers generally performed poorly in cross-project learning as we will describe it later in detail. Let us consider Forrest 0.8 as an example where the SCEwBug% is low (6.25); however, the F-measure values are still high (0.891-0.907). In this case, the high values came from the fact that there are 32 classes out of which only 2 are buggy, so the decision tree tends to mark all classes as non buggy. On the other hand, the AUC values range from very poor (0.100) to very high (0.900) which suggests that for small examples and a small number of bugs, it is difficult to build "reliable" models.
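The Forrest 0.8 case can be worked through numerically. The sketch below assumes that the reported F-measure is averaged over both labels weighted by class frequency; the text does not state the averaging scheme, so this is an assumption. The label spellings and function names are hypothetical.

```python
from typing import Sequence


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing is predicted correctly."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f_measure(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Per-label F-measure averaged with class-frequency weights (an assumption)."""
    total = 0.0
    for label in set(actual):
        tp = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
        fp = sum(1 for a, p in zip(actual, predicted) if a != label and p == label)
        fn = sum(1 for a, p in zip(actual, predicted) if a == label and p != label)
        total += (tp + fn) * f_measure(tp, fp, fn)
    return total / len(actual)


# 32 classes, 2 of them buggy, and a model that marks every class as non buggy.
actual = ["buggy"] * 2 + ["non_buggy"] * 30
predicted = ["non_buggy"] * 32
print(round(weighted_f_measure(actual, predicted), 3))  # 0.907
```

Under this assumption, a model that never predicts a bug already reaches 0.907, which illustrates why a high F-measure alone says little on such an imbalanced system.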
The averages of F-measure and AUC slightly change in case of class level datasets and the average of the merged dataset is only slightly better than the Original or the OSA. If we consider F-measure, the GitHub Bug Dataset performed 10% better generally, while the averages of AUC decreased by only 1% which is negligible. One possible explanation can be that the PROMISE dataset includes all the smaller projects, while the GitHub Bug Dataset and the Bug Prediction Dataset rather contain larger projects.
Small differences in case of the GitHub Bug Dataset come from the difference of the versions of the static analyzers. The SourceMeter version used in creating the GitHub Bug Dataset is older than the OSA version used here, in which some metric calculation enhancements took place. (SourceMeter is based on OSA.) Results at file level are quite similar to class level results in terms of F-measure and AUC. The only main difference is that the F-measure averages of OSA for Bugcatchers and Eclipse bug datasets are notably smaller than the other values. The reason for this might be that there are only a few file level metrics provided by OSA, and a possible contradiction in metrics can decrease the training capabilities (we saw that even LLOC values are very different). On the other hand, the corresponding AUC value is close to the others so deeper investigation is required to find out the causes. Besides, the average F-measure of the Merged model is slightly worse while the average AUC is slightly better than the Original which means that in this case the OSA metrics were not able to improve the bug prediction capability of the model. The small difference between the Original and the OSA results in case of the GitHub Bug Dataset comes from the slightly different metric suites, since OSA calculates the Public Documented API (PDA) and Public Undocumented API (PUA) metrics as well. Furthermore, the aforementioned static analyzer version differences also caused this small change in the average F-measure and AUC values.

Merged Dataset Bug Prediction
So far, we compared the bug prediction capabilities of the old and new metric suites on the systems separately, but we have not used the main advantage of the unified dataset, namely, that there are metrics that were calculated in the same way for all systems (OSA metric suite). Seeing the results of the within-project bug prediction, we can state that creating a larger dataset, which includes projects varying in size and domain, could lead to a more general dataset with increased usability and reliability. Therefore, we merged all classes (and files) into one large dataset which consists of 47,618 elements (43,744 for files) and by using ten-fold cross-validation, we evaluated the bug prediction model built by J48 on this large dataset as well. The F-measure is 0.818 for the classes and 0.755 for the files while the AUC is 0.721 for classes and 0.699 for files. Although the F-measure values are, for example, a little worse than the average of GitHub results (0.892 for classes and 0.820 for files), the AUC values are better than the best averages (0.656 for class and 0.676 for files) even if they were measured on a much larger and very heterogeneous dataset. After training the models at class and file levels, we dug deeper to see which predictors are the most dominant ones. Weka gives us pruned trees as results both at class and file levels. A pruned tree is a transformed one obtained by removing nodes and branches without affecting the performance too much. The main goal of the pruning is to reduce the risk of overfitting. We not only constructed a Unified Bug Dataset for all of the systems together with the unified set of metrics, but we also built one for each collected dataset, one for the Bug Prediction Dataset, one for the PROMISE dataset, etc. in which we could use a wider set of metrics (including the original ones). Table 19 presents the most dominant metrics for the datasets extracted from the constructed decision trees.
We marked those metrics with bold face that occur in multiple datasets. At class level, WMC (Weighted Methods per Class) and TNOS (Total Number of Statements) are the most important ones; however, they happened not to be the most dominant ones in the Unified Bug Dataset.
Regarding the Unified Bug Dataset, we found CLOC, TCLOC (comment lines of code metrics), DIT (Depth of Inheritance), CBO (Coupling Between Objects) and NOI (Number of Outgoing Invocations) as the most dominant metrics at class level. In other words, these metrics have the largest entropy. This set of metrics being the most dominant is reasonable, since it is easier to modify and extend classes that have better documentation. Furthermore, coupling metrics such as CBO and NOI have already demonstrated their power in bug prediction (Gyimóthy et al. 2005b;Briand et al. 1999). DIT is also an expressive metric for fault prediction (El Emam et al. 2001).
At file level, the most dominant predictors in the different datasets overlap with the ones being dominant in the Unified Bug Dataset. LOC (Lines of Code) and McCC (McCabe's Cyclomatic Complexity) were the topmost variables to branch on in the decision tree. These two metrics are reasonable as well, since the larger the file, the more it tends to be faulty. Complexity metrics are also important factors to include in fault prediction.
The diversity of the most dominant metrics shows the diversity of the different datasets themselves.
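How a decision tree learner decides which metric to branch on near the root can be illustrated with entropy and information gain. The sketch below is a simplification with hypothetical function names and made-up data: C4.5 ranks splits by gain ratio, a normalized form of information gain, which is not implemented here.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[str], threshold: float) -> float:
    """Entropy reduction achieved by splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Made-up metric values: low values are non buggy, high values are buggy.
metric = [1, 2, 8, 9]
labels = ["non_buggy", "non_buggy", "buggy", "buggy"]
print(information_gain(metric, labels, threshold=5))  # a perfect split
print(information_gain(metric, labels, threshold=0))  # a useless split
```

A metric whose best threshold yields a large gain separates buggy from non buggy elements well and is therefore a candidate for an early branch.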

Cross-project Bug Prediction
As a third experiment, we trained a model using only one system from a dataset and tested it on all systems in that dataset. This experiment could not have been done without the unification of the datasets, since a common metric suite is needed to perform such machine learning tasks. The result of the cross training is an N×N matrix where the rows and the columns of the matrix are the systems of the dataset and the value in the i-th row and j-th column shows how well the prediction model performed, which was trained on the i-th system and was evaluated on the j-th one.
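The structure of such a cross training matrix can be sketched independently of the learner. In the following illustration the function names are hypothetical, the learner is a stand-in that predicts the majority label (not J48), and the score is plain accuracy rather than F-measure or AUC. Only the train-on-row, test-on-column layout follows the description above.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Element = Tuple[List[float], str]  # (metric values, label)


def cross_project_matrix(
    systems: Dict[str, List[Element]],
    train: Callable[[List[Element]], object],
    evaluate: Callable[[object, List[Element]], float],
) -> Dict[str, Dict[str, float]]:
    """matrix[i][j] is the score of the model trained on system i and tested on system j."""
    models = {name: train(data) for name, data in systems.items()}
    return {
        i: {j: evaluate(models[i], systems[j]) for j in systems}
        for i in systems
    }


def train_majority(data: List[Element]) -> str:
    """Stand-in learner: remember the most frequent label."""
    return Counter(label for _, label in data).most_common(1)[0][0]


def accuracy(model: str, data: List[Element]) -> float:
    """Stand-in score: share of elements whose label equals the remembered one."""
    return sum(1 for _, label in data if label == model) / len(data)


# Two made-up systems with opposite bug distributions.
systems = {
    "A": [([1.0], "buggy")] * 3 + [([0.0], "non_buggy")],
    "B": [([1.0], "buggy")] + [([0.0], "non_buggy")] * 3,
}
matrix = cross_project_matrix(systems, train_majority, accuracy)
print(matrix["A"]["B"], matrix["B"]["B"])
```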
We used the OSA metrics to test this criterion, but the bug occurrences are derived from the original datasets which are transformed into buggy and non buggy labels. The matrix for the whole Unified Bug Dataset would be too large to show here; thus, we will only present submatrices (the full matrices for file and class levels are available in the Appendix). A submatrix for the PROMISE dataset can be seen in Tables 20 and 21. Only the first version is presented for each project except for the ones where another version has significantly different prediction capability (for example, Xalan 2.7 is much worse than 2.4 or Pbeans 2 is much better than 1). The values of the matrix are F-measure and AUC values, provided by the J48 algorithm. The Avg. F-m. and Avg. AUC columns present the average of the F-measure and AUC values the given model achieved on the other systems. For a better overview, we repeat the SCEwBug% value here as well (see Table 15).
We can observe (see Table 20) that models built on systems with a very high proportion of buggy classes (more than 70%; highlighted in bold in the table) performed very poorly in the cross-project evaluation if we consider the average F-measure (the values are under 0.4). Moreover, these models do not work well on other very buggy systems either; for example, the model trained on Xalan 2.7 (the most buggy system) achieved an F-measure of only 0.078 on Log4J 1.2 (the second most buggy system). Besides, in the case of Ckjm 1.8, Lucene 2.2, and Poi 1.5, the rate of buggy classes is between 50 and 70 percent, and their models are still weak: the average F-measure is only between 0.4 and 0.6. Finally, the remaining systems can be used to build models that achieve better results, namely higher F-measures, on other systems, even on the ones having lots of bugs. On the other hand, this trend cannot be observed in the AUC values (see Table 21), because they range between 0.483 and 0.629 but there is no "white line", which means that there is no model whose performance is very poor for all other systems. However, there are 0.000 and 0.900 AUC values in the table, which suggests that the bug prediction capabilities heavily depend on the training and testing datasets. For example, models evaluated on Forrest 0.6 have extreme values, and it is probably only a matter of luck whether the model takes into account the appropriate metrics or not.
Table 22 shows the cross-training F-measure values for the GitHub Bug Dataset. Testing on Android Universal Image Loader is clearly the weakest point in the matrix. However, the values are not critical; the lowest value is still 0.611. Based on F-measure, Elasticsearch performed slightly better than the others in the role of a training set. This might be because of the size of the system, the average amount of bugs, and the adequate number of entries in the dataset.
On the other hand, the AUC values in the column of Android Universal Image Loader are not significantly worse than any other value (see Table 23); rather, the values in its row seem a little lower. From the AUC point of view, Mission Control T. is the most critical test system because it has both the lowest (0.181) and the highest AUC values.
Tables 24 and 25 show the F-measure and AUC values of cross training for the Bug Prediction Dataset. Based on F-measure, Eclipse JDT Core outperformed the other systems as a training set, but its AUC values, or at least the first two, seem worse than the others. Equinox performed the worst in the role of a training set, i.e., it has the lowest F-measure values, but from the AUC point of view, Equinox is the most difficult test set because three out of four of its AUC values are much worse than the average. Considering both training and testing, and both F-measure and AUC values, Mylyn is the best system.
The results described above were calculated for the class level datasets. Let us now consider the file level results. The three file level datasets are the GitHub, the Bugcatchers, and the Eclipse bug datasets.
The GitHub Bug Dataset results can be seen in Tables 26 and 27. As in the case of class level, the Android Universal Image Loader project performed the worst as a test set based on F-measure, and it was also the worst system for building models (low AUC values). It would be difficult to select the best system, but the Mission Control T. system is the most critical test system again because it has the lowest and highest AUC values.
The cross-training results of the Bugcatchers Bug Dataset are listed in Tables 28 and 29. The F-measure values, three high and three low, range widely from 0.234 to 0.800, while the AUC values are much closer to each other (0.500 to 0.606). Since only three systems are used and no system is clearly the best or the worst, we cannot draw any conclusion from these results.
In the Eclipse Bug Dataset, there are only three systems as well, but they are three different versions of the same system, namely Eclipse; therefore, we expected better results. As we can see in Tables 30 and 31, in  . This is perhaps caused by the fact that this version contains the smallest number of bug entries. The other two systems are "symmetrical": the results are almost the same when one is the training system and the other one is the test system.
We also performed a full cross-system experiment involving all systems from all datasets. This matrix is, however, too large to present here; consequently, it can be found in the Appendix, and Tables 32 and 33 show only the averages of the F-measures and AUC values that the models achieved on the other datasets. More precisely, we trained a model using each system separately and tested this model on the other systems, and we calculated the averages of the F-measures and AUC values to see how they perform on other datasets. For example, we can see that the average F-measure of the models trained and tested on PROMISE is only 0.607, while if these models are validated on the GitHub dataset, the average is 0.729. Examining the F-measures, we can see that, in general, models trained on the PROMISE dataset perform the worst, and, surprisingly, they gave their worst result (0.607) on PROMISE itself. On the other hand, if we consider AUC values, the models trained on PROMISE achieved the best result on PROMISE (0.554 vs 0.544). This means that in this case AUC values do not help us to select the best or worst predictor/test dataset or even to compare the results. Another interesting observation is that the models perform better on the GitHub dataset no matter which dataset was used for training (GitHub column). On the other hand, GitHub models achieve only slightly worse F-measures on the other datasets than the best one on    , which suggests that the good testing results are not a consequence of some unique feature of the dataset because it can also be used to train "portable" bug prediction models.
Table 34 shows the average F-measures for the file level datasets. We can observe a similar trend, namely, all models performed the best on the GitHub dataset. Besides, the results of Bugcatchers on itself are poor, and the models of the other two datasets performed better on it. On the other hand, the AUC values do not support this observation (see Table 35).
Compared with class level, the AUC values range on a wider scale, from 0.525 to 0.684, and suggest that Eclipse is the best dataset for model building; this observation is supported by the F-measures as well.
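The dataset-level averages discussed in this section can be derived from a system-level cross-training matrix as sketched below. This is an illustrative sketch only: the system names and scores are invented placeholders (not values from the tables), and skipping the cell where a model is tested on its own training system is our reading of "tested on the other systems", stated here as an assumption.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple


def dataset_averages(
    matrix: Dict[str, Dict[str, float]],
    dataset_of: Dict[str, str],
) -> Dict[Tuple[str, str], float]:
    """Average score per (training dataset, test dataset) pair.

    matrix[i][j] is the score of the model trained on system i and
    tested on system j; cells with i == j are skipped (assumption).
    """
    cells = defaultdict(list)
    for train, row in matrix.items():
        for test, score in row.items():
            if train == test:
                continue
            cells[(dataset_of[train], dataset_of[test])].append(score)
    return {pair: mean(scores) for pair, scores in cells.items()}


# Placeholder systems and scores, invented for the example.
dataset_of = {"sysA": "PROMISE", "sysB": "PROMISE", "sysC": "GitHub"}
scores = {
    "sysA": {"sysA": 0.9, "sysB": 0.5, "sysC": 0.7},
    "sysB": {"sysA": 0.7, "sysB": 0.8, "sysC": 0.8},
    "sysC": {"sysA": 0.6, "sysB": 0.6, "sysC": 0.9},
}
averages = dataset_averages(scores, dataset_of)
for (train_ds, test_ds), value in sorted(averages.items()):
    print(f"{train_ds} -> {test_ds}: {value:.3f}")
```

A dataset pair with no remaining cells (here GitHub on GitHub, which has a single system) simply does not appear in the result, which avoids averaging over an empty list.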
To sum up the previous findings based on the full cross-system experiment, it may seem that file level models performed slightly better than class level predictors because the average F-measure is 0.717 and the average AUC value is 0.594 for files, while they are only 0.661 and 0.552 for classes. On the other hand, this might be because, as we saw, several systems of PROMISE contain too many bugs; therefore, they cannot be used alone to build sophisticated prediction models. Another remarkable result is that all models performed notably better on the GitHub dataset, which requires further investigation in the future.

Threats to validity
First of all, we accepted the collected datasets "as is", which means that we did not validate the data; we just used them to create the unified dataset and to examine the bug prediction capabilities of the different bug datasets. Since the bug datasets contained neither the source code nor step-by-step instructions on how to reproduce them, we had to accept them, even if there were a few notable anomalies in them. For example, Camel 1.4 contains classes with LOC metrics of 0; in the Bugcatchers dataset, there are two MessageChains metrics, and in several cases the two metric values are different; and there are several datasets with extreme SCEwBug% (buggy classes percentage) values (higher than 90% or lower than 1%). Although the version information was available for each system, in some cases there were notable differences between the result of OSA and the original result in the corresponding bug dataset. Even if the analyzers parsed the classes in different ways, the number of files should have been equal. If the analysis result of OSA contains the same number of elements or more, and (almost) all elements from the corresponding bug dataset could be paired, we can say that the unification is acceptable, because all elements of the bug dataset were put into the unified dataset. On the other hand, for a few systems, we could not find the proper source code version and we had to leave out a negligible number of elements from the unified dataset.
Many systems are more than 10 years old and date from a time when the current Java version was 1.4, and these systems were analyzed at that time according to that standard. The Java language has evolved a lot since then, and we analyzed all systems according to the latest standard, which might have caused minor, negligible mistakes in the analysis results.
In Section 4.7, we used ckjm 2.2 to analyze the projects included in the Bug Prediction Dataset. We chose version 2.2 because the original paper did not state the exact version of ckjm (D'Ambros et al. 2010); consequently, we experimented with different ckjm versions (1.9, 2.0, 2.1, 2.2) and found version 2.2 to be the best candidate, since it produced the smallest differences in metric values compared with the original metric values in the Bug Prediction Dataset.
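Selecting the tool version whose output is closest to the published values can be expressed as a small minimization. The sketch below is only an illustration: the text does not state how the differences were aggregated, so the mean absolute difference is an assumption, and the version labels (`v1`, `v2`, `v3`) and all metric values are invented placeholders rather than measurements.

```python
from statistics import mean
from typing import Dict, List


def mean_abs_diff(original: List[float], recalculated: List[float]) -> float:
    """Mean absolute difference over paired metric values (assumed measure)."""
    return mean(abs(o - r) for o, r in zip(original, recalculated))


def closest_version(original: List[float],
                    candidates: Dict[str, List[float]]) -> str:
    """Return the candidate whose values deviate least from the original."""
    return min(candidates, key=lambda v: mean_abs_diff(original, candidates[v]))


# Placeholder values for one metric over four classes (not real data).
original_values = [3, 5, 8, 2]
recalculated = {
    "v1": [4, 7, 8, 1],
    "v2": [3, 6, 9, 2],
    "v3": [3, 5, 8, 3],
}
best = closest_version(original_values, recalculated)
print(best, mean_abs_diff(original_values, recalculated[best]))
```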
We used a heuristic method based on name matching to pair the elements of the datasets. Although there were cases where the pairing was unsuccessful, we examined these cases manually, and it turned out that the heuristics worked well and the problem originated from the differences between the two datasets (all cases are listed in Section 3). We examined the successful pairings as well, and all of them were correct. Even though the heuristics could not handle elements having the same name during the pairing, only a negligible number of such cases occurred.
Even when the matching heuristics worked well, the same class name could have different meanings in different datasets. For example, OSA handles nested, local, and anonymous classes as different elements, while other datasets did not take such elements into account. Moreover, the whole file was associated with its public class. This way, a bug found in a nested or inner class is associated with the public class in the bug datasets, and during the matching, this bug will be associated with the wrong element of the more detailed analysis result of OSA.
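The two limitations above, elements sharing a name and nested classes that have no counterpart in the bug datasets, can be made concrete with a small sketch. It is not the heuristic used in this work: exact matching on fully qualified names, the `Outer$Inner` naming of nested classes, and all class names below are assumptions introduced for the illustration.

```python
from collections import Counter
from typing import List, Tuple


def pair_by_name(dataset_names: List[str],
                 analyzer_names: List[str]) -> Tuple[List[str], List[str]]:
    """Pair entries whose name occurs exactly once on both sides.

    Returns (paired, unpaired); duplicated or missing names stay
    unpaired so that they can be inspected manually.
    """
    in_dataset = Counter(dataset_names)
    in_analyzer = Counter(analyzer_names)
    paired, unpaired = [], []
    for name in dataset_names:
        if in_dataset[name] == 1 and in_analyzer[name] == 1:
            paired.append(name)
        else:
            unpaired.append(name)
    return paired, unpaired


def nested_elements(paired: List[str], analyzer_names: List[str]) -> List[str]:
    """Analyzer elements nested in a paired class ('Outer$Inner' assumed)."""
    outer = set(paired)
    return [n for n in analyzer_names
            if "$" in n and n.split("$", 1)[0] in outer]


# Hypothetical element names.
dataset_names = [
    "org.example.Parser", "org.example.Lexer",
    "org.example.Util", "org.example.Util",   # same name twice: ambiguous
    "org.example.Gone",                       # missing from the analysis
]
analyzer_names = [
    "org.example.Parser", "org.example.Parser$State",
    "org.example.Lexer", "org.example.Util", "org.example.Util",
]
paired, unpaired = pair_by_name(dataset_names, analyzer_names)
nested = nested_elements(paired, analyzer_names)
print(paired, unpaired, nested)
```

In this toy input, a bug that was actually fixed in `Parser$State` would be recorded on `org.example.Parser` after pairing, while the nested element itself would carry no bug label.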
The Unified Bug Dataset 1.2, which was created during this work, is available as an online appendix at https://doi.org/10.5281/zenodo.3693685 and at http://www.inf.u-szeged.hu/~ferenc/papers/UnifiedBugDataSet
The UnifiedBugDataset-1.2.zip file contains
- the original bug datasets in their original form,
- the list of projects contained in each dataset,
- the source code of the systems that was used to develop the datasets,
- the unified dataset in CSV and ARFF format at file/class level,
- the dataset containing only the results of OpenStaticAnalyzer in ARFF format at file/class level,
- description of the OpenStaticAnalyzer metrics,
- metrics comparisons in spreadsheet format of the PROMISE (2018)
For a more exhaustive description of the exact contents of the files and usage information, one should refer to the 'README.txt' file, which is located in the root folder of the Unified Bug Dataset package.