1 Introduction

Reproducing (building) executable programs from the source code of past snapshots of a project is fundamental for practitioners, to maintain old versions still in production, and for researchers, to study those old versions. To our knowledge, Tufano et al. (2017) presented the most complete study on the buildability of past snapshots. That study analyzed all past snapshots of 100 Java projects of one organization (the Apache Software Foundation), determining how many of them could be built and the main causes of build failures. Only 38% of the snapshots could be built, and almost all projects (96%) contained snapshots that could not be built.

However, not only the main executable program is of interest: both for maintenance and for research, building and executing the tests of past snapshots is also important. For maintenance, because if all tests can be run and pass, meaning they did not find any misbehavior in the exercised parts of the system, developers can fix bugs and vulnerabilities Bartel (2019) and backport functionality to an old version with some confidence Tian (2017). For researchers, because it allows for a more complete understanding of the project at those points in its past Santos et al. (2019).

Nowadays, tests are so important to projects whose code is used in production that they are usually an integral part of the process for accepting changes to the code base. In compiled languages such as Java, it is usual to run any new source code snapshot through a process that involves, in this order: compiling the main source code, compiling the tests, running the tests, and finally, building the package (e.g., a jar file) suitable for distribution. Only if all those steps succeed is the source code deemed ready for production.

However, we are not aware of studies that systematically analyze to what extent the tests present in past snapshots can be built and run successfully, thus leaving practitioners without a clue of whether their tests will still build and pass, or which practices they should follow to guarantee that tests will still work in the future. Building on previous research on the buildability of past snapshots, in this paper we present a study that extends that research by taking into account whether testing of those past snapshots is still possible. Since there is no precedent for studying the testability of past snapshots, we start with a case study of project history testability to understand in depth how to measure it.

Our main aims are to study, for past snapshots of a project:

(a) The buildability of tests: to what extent the tests are still buildable from the source code. This leads to the following research question:

\(\textbf{RQ}_\textbf{1}\): “On how many snapshots of the change history can we build tests?”

(b) The testability of the code: to what extent running the tests of a snapshot results in success. This leads to the following research question:

\(\textbf{RQ}_\textbf{2}\): “On how many snapshots of the change history can we run all tests with a ‘success’ result?”

We wanted to answer these questions in a manner that recognizes the diversity of software projects. For practical reasons, we considered only Java projects, which is already a reduction of scope. The projects have been selected from the ManySStuBs4J Karampatsis and Sutton (2020) dataset.

In summary, the method we followed is as follows: for each snapshot of each project, we automatically build its main code and its tests, and then run the tests whenever possible. We performed this process for a total of 86 different projects.

To define the testability of a snapshot, we verify whether all its tests pass, categorizing the snapshot as fully testable or not. A fully testable snapshot would allow a developer to modify the code and run all the tests again to check whether the change breaks something. To validate how useful this definition of testability is for describing how a project's tests behave throughout its change history, we conduct a case study on a subset of projects. The main result is that there is a large variation from project to project, with a few cases showing high testability, with all tests succeeding in almost all their snapshots.

The rest of the paper is structured as follows: Section 2 discusses previous research. Section 3 presents the methodology used in the studies and defines the terminology. Section 4 shows the case study performed. The results of applying the methodology are reported in Section 5, while Section 6 offers a more detailed analysis of these results. Section 7 discusses the results, and explores threats to their validity. Finally, Section 8 draws conclusions and presents further research.

2 Related work

2.1 Software buildability

Buildability of software projects is fundamental to be able to execute tests. Studying the buildability of past snapshots has been the topic of several publications. However, as far as the authors can tell, there are no studies of the testability of past snapshots. Among the works on buildability, Tufano et al. (2017) is one of the most recent studies on the topic. They analyzed 100 Java Apache projects, trying to build all the commits of each project. They concluded that buildability degrades over time, with the oldest commits being the ones that most commonly failed to build. Their work also included a categorization of build failures, with missing dependencies ranking first among the reasons collected. This work has been validated Sulír et al. (2020) and extended Maes-Bermejo et al. (2022) by other authors, confirming the degradation of buildability over time.

Rausch et al. (2017) specifically addressed the reasons why builds fail in the context of CI environments. They analyzed Travis logs from 14 open-source projects, finding that a significant fraction of errors corresponded to tests that failed because of a failure in a previous build. The study analyzed the build logs from the point of view of continuous integration systems (snapshot build and test execution), but it did not include a reproduction of the builds. Furthermore, due to the nature of CI systems, where data is removed after some time, not all past snapshots of the analyzed projects were considered.

The study of past snapshots to detect bug-introducing commits is tackled by Querel and Rigby (2021). The authors partially replicate the work by Tufano et al., trying to improve the buildability of 8 of the Maven-based projects. They do so by solving the missing dependency problem, selecting a dependency version close to the commit date. Through this technique a buildability of 78.4% is reported, roughly doubling the buildability reported in prior work. For those commits whose builds fail, they run the findbugs tool Ayewah et al. (2007) in an attempt to understand whether the failure could be caused by a bug. Their results were negative, although developers found the reports offered by findbugs useful.

Build breakage repair techniques have also been proposed. Macho et al. derived three automatic repair techniques from 37 broken Maven builds Macho et al. (2018). Using these techniques, they were able to automatically fix 45 of an additional 84 broken builds. Similarly, Lou et al. focused on repairing build failures with the HireBuild tool Lou et al. (2019). HireBuild analyzes the history of the project and other external sources, and automatically proposes build-fixing techniques to repair broken builds.

Trautsch et al. studied the commit history of 54 projects, reporting trends in the warnings raised by the automated static analysis tool PMD Trautsch et al. (2020). They found that large changes in ASAT (Automated Static Analysis Tool) warnings are due to changes in coding style. They also showed that the presence of PMD has no influence on removing warnings, although it does influence defect density.

Other authors used a different strategy to achieve higher buildability in their research. Behnamghader et al. focus solely on a module within the software (the one that changes) Behnamghader et al. (2018). They achieved high buildability for the three different datasets considered (Apache, Google and Netflix), above 94% in all cases. Surprisingly, another work by He et al. found that the presence of broken builds did not result in an increase in commit frequency (and thus in work on solving the breakage) He et al. (2020). After studying 68 Java repositories, the authors also reported a correlation between broken builds and commits tagged as feature additions or refactorings.

2.2 Software testability

Garousi et al. performed a survey on testability Garousi et al. (2019), finding several uses of this term in the research literature. The survey collects several definitions of testability that come from standards: IEEE standard 610.12-1990 IEEE standard glossary of software engineering terminology (1990) (the degree to which a system or component facilitates the establishment of test criteria and the performance of tests to determine whether those criteria have been met), ISO standard 9126-1-2001 ISO standard 9126-1 (2001) (attributes of software that bear on the effort needed to validate the software product), and ISO standard 25010-2011 ISO standard 25010 (2011) (degree of effectiveness and efficiency with which test criteria can be established for a system, product or component and tests can be performed to determine whether those criteria have been met). The last definition, oriented towards software maintainability, is the standard definition closest to ours.

Binder (1994) defines software testability for object-oriented software, identifying six facets: characteristics of the representation, characteristics of the implementation, built-in test capabilities, the test suite (test cases and associated information), the test support environment, and the software process in which testing is conducted. Bruntink and Van Deursen (2004) extended Binder's definition, focusing on an implementation-centric model that includes automation and process capability. Zhu et al. (2021) focus on observability as an important property of testability, which, following the definition by Staats et al. (2011), is a measure of how well the internal states of a system can be inferred, usually through the values of its external outputs. Whalen et al. (2013) also formally define observability (in relation to the tests): they state that an expression in a program is observable in a test case if we can modify its value, leaving the rest of the program intact, and observe changes in the output of the system.

Software testability can also be defined as the degree to which a software system or a unit under test supports its own testing Garousi et al. (2019), or in other words, the ease with which a system can be tested Software testability (2023).

None of these definitions considers the testability of past snapshots. In this regard, the way we refer to testability in this paper could be understood as project history testability: the ability to run the tests successfully in all snapshots of a project. This definition is intended to approximate what is expected in a continuous integration system. Beller et al. offer insight into the testing practices that are common in CI-based software development, finding that testing is the single most important reason why builds fail Beller et al. (2017).

Pecorelli et al. chose eight Apache projects to study the relation between test-related factors and software quality, based on the buildability information provided by Tufano et al. (2017) for these projects. On these projects, they performed several analyses to determine the quality of the tests and their impact on production failures. Some of the test-related factors considered require the execution of the tests; therefore the authors carefully selected projects with high buildability. However, they did not run the tests on all the commits in the history, only on those commits connected with a failure in a product release.

Terragni et al. Terragni et al. (2020) analyze software testability with particular attention to test quality (quantified as test code coverage and test mutation score).

The Defects4J Just et al. (2014) dataset is a collection of Java projects where the regressions in the code are well documented. It offers a reproducible scenario where researchers can check the correct behavior (or not) of the tests in different commits of the project history. This scenario has been provided by the authors themselves, who encapsulate the project dependencies together with the source code and ship all the tools necessary for test execution in the form of binaries (and, in the most recent versions, use Docker to isolate the execution). Although they run the tests on past commits, they do not run them on the entire history to check its testability. It is worth mentioning the remarkable effort of the authors to make previous commits buildable and to run tests on them.

3 Definitions, data sources and methods

In this section we define the terminology used throughout this paper, describe the collection of projects that we analyze, and present the methods used for the analysis.

3.1 Definitions

Following the approach by Sulír and Porubän (2016), the building process of software written in compiled languages follows these steps: (1) read the project build (configuration) file, (2) download the third-party components defined in the build file, (3) execute the compiler to generate binary files from source code, and (4) package the program in a format suitable for deployment.

A specific snapshot is buildable if these steps can be executed to generate a valid executable. We extend this terminology to distinguish between main source code (code that is compiled and included in the package) and testing code (code that verifies the correct behavior of the main source code, and will not be included in the package). We also consider a further step consisting of running the tests and collecting their results. The possible outcomes of a test execution are: success, if all test assertions are met; failure, if at least one of the assertions is not met; or error, if there is a problem in the test execution. In this way, step 3 defined above has three sub-steps: (3.1) compile the main source code, (3.2) compile the test code, and (3.3) execute the tests.

With this in mind, we define the following terms (a small code sketch illustrating them follows the list):

  • Commit: a snapshot of the source code of a project, obtained by checking out a commit hash from git. Only commits on a given branch (usually main or master) will be considered, to keep history linear.

  • Source-buildable: a commit (snapshot) whose main source code can be successfully built.

  • Test-buildable: a commit (snapshot) with at least one test, and for which all tests can be successfully built.

  • Fully testable: a commit (snapshot) for which all tests run with a success result.
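As an illustration, the following minimal Python sketch (not part of the tooling described in this paper; field and function names are ours) shows how a commit can be classified according to these definitions, given the outcome of each step:

from dataclasses import dataclass

@dataclass
class CommitResult:
    # Hypothetical per-commit record of the outcomes of the build and test steps.
    source_built: bool   # step 3.1 (compile the main source code) succeeded
    tests_built: bool    # step 3.2 (compile the test code) succeeded
    tests_total: int     # number of tests executed in step 3.3
    tests_passed: int    # tests whose result was "success"

def is_source_buildable(c: CommitResult) -> bool:
    return c.source_built

def is_test_buildable(c: CommitResult) -> bool:
    # At least one test must exist, and the test code must build.
    return c.source_built and c.tests_built and c.tests_total > 0

def is_fully_testable(c: CommitResult) -> bool:
    # All tests run with a success result.
    return is_test_buildable(c) and c.tests_passed == c.tests_total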

Based on this, we define the following metrics, which characterize a project (for the considered git branch):

  • Source Buildability: ratio of source-buildable commits with respect to the total number of commits.

  • Test Buildability: ratio of test-buildable commits with respect to the total number of commits.

  • Full Testability: ratio of fully testable commits with respect to the total number of commits.

Each of these metrics captures the ratio of snapshots progressing through the successive steps towards running the tests. Source Buildability reports the ratio of snapshots that pass the first blocker: it only makes sense to test a snapshot if its main source code could be built. This metric is quite similar to the one provided by Tufano et al. (2017) when studying buildability. On top of it, we introduce Test Buildability to determine the ratio of snapshots that pass the next blocker: their tests can be built.
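In formula form, writing \(C\) for the set of all considered commits of the project:

\[
\textit{Source Buildability} = \frac{|\{c \in C : c \text{ is source-buildable}\}|}{|C|}, \qquad
\textit{Test Buildability} = \frac{|\{c \in C : c \text{ is test-buildable}\}|}{|C|},
\]
\[
\textit{Full Testability} = \frac{|\{c \in C : c \text{ is fully testable}\}|}{|C|}.
\]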

Thus, we define testability at the project level (as we did with buildability), differently from other works that define it at the class or module level (2006).

Processing the result of all tests in binary form tells us whether a change in that commit would be supported by the tests. This way of characterizing a commit as fully testable or not is also used in Continuous Integration (CI) systems to validate whether a change (commit) can be integrated with the rest of the code of a project. We define Full Testability as a metric with respect to the total number of commits of the project: it measures how often commits are fully testable.

3.2 Dataset

We decided, mainly for practical reasons, to work with Java projects that are built with Maven. Even though this focuses our study on a fraction of all software, Java is a very popular language, with many mature projects used in production. Maven is recognized as the most widely used system for project building and test execution in Java and, as we have seen in previous research Maes-Bermejo et al. (2022), projects that use this technology are more likely to have a high source buildability.

However, we did not consider all Java-Maven projects. Android applications written in Java require dependencies on the Android SDK, and have some peculiarities that make them different from regular Maven projects. So, we discarded Android projects, following Sulír and Porubän (2016) in this respect. We also discarded projects whose tests took too long to run to completion: we set a per-project limit of 60 real-time days for running the tests of all snapshots.

To carry out the experiment, we selected a subset of projects from the ManySStuBs4J dataset Karampatsis and Sutton (2020), a well-known set of 100 Java Maven projects used in program repair research. The ManySStuBs4J dataset does not include the git repositories, only their references to GitHub. As we have seen in previous work Maes-Bermejo et al. (2022), this makes some projects unrecoverable if they have been deleted, made private or migrated to another git repository. From this dataset we discarded two projects that are no longer available in their public repositories and 11 Android projects. In addition, one project was not included because the execution of its experiment, which took longer than for any other project, had not finished. Therefore, 86 projects were selected.

3.3 Methods

For each of the 86 projects we proceed as follows. First, we clone the git repository of the project (from GitHub), and then we perform the following steps for each commit in its master/main branch (a simplified sketch of this loop is shown after the list):

  1. Checkout the commit.

  2. Build the main source code by executing mvn clean compile -X. If the command fails (the source build fails), the process stops. Otherwise, the commit is tagged as source-buildable.

  3. Build the test code by executing mvn test-compile. If the command fails (the test build fails), or we detect that there are no tests in the snapshot, the process stops. Otherwise, the commit is tagged as test-buildable.

  4. Run the project tests by executing mvn test, checking whether all tests pass (labeling the commit as fully testable). In addition, we save the individual report for each test class in XML format.
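The following simplified Python sketch illustrates this per-commit loop. It is only an approximation of our actual tooling (which also handles timeouts, multi-module layouts and logging); the helper names are ours, and the Surefire report location assumes a single-module project:

import glob
import subprocess
import xml.etree.ElementTree as ET

def run(cmd, cwd):
    # Run a command in the repository, returning True if it exits with status 0.
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def process_commit(repo_dir, sha):
    result = {"sha": sha, "source": False, "tests": False, "passed": 0, "total": 0}
    run(["git", "checkout", "-f", sha], repo_dir)                        # step 1
    result["source"] = run(["mvn", "clean", "compile", "-X"], repo_dir)  # step 2
    if not result["source"]:
        return result
    result["tests"] = run(["mvn", "test-compile"], repo_dir)             # step 3
    if not result["tests"]:
        return result
    run(["mvn", "test"], repo_dir)                                       # step 4
    # Parse the Surefire XML reports to count test outcomes per test class.
    for report in glob.glob(repo_dir + "/target/surefire-reports/TEST-*.xml"):
        suite = ET.parse(report).getroot()
        total = int(suite.get("tests", 0))
        bad = int(suite.get("failures", 0)) + int(suite.get("errors", 0))
        result["total"] += total
        result["passed"] += total - bad
    return result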

For each project, the following metrics are generated:

  • Age: Number of years since the first commit

  • LoC: Number of lines of code of the project in its last commit

  • Total Commits: Number of commits in the master branch

  • Source-buildable commits: Number of commits that were source-buildable

  • Test-buildable commits: Number of commits that were test-buildable

  • Fully testable commits: Number of fully testable commits

We also generate for each project the metrics defined in Section 3.1: Source Buildability, Test Buildability and Full Testability. We include as well metrics such as project age, lines of code (LoC) and total commits because these are widely used when analyzing software projects Yamashita et al. (2015); Mannan et al. (2016); Rosen et al. (2015), and give an idea of the size of the projects.
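As a hypothetical illustration (not the exact commands used by our tooling), two of these descriptive metrics could be obtained directly from the git repository:

import datetime
import subprocess

def git_output(args, cwd):
    # Run a git command and return its standard output as text.
    return subprocess.run(["git"] + args, cwd=cwd, capture_output=True, text=True).stdout

def total_commits(repo_dir, branch="master"):
    # Number of commits reachable from the branch head.
    return int(git_output(["rev-list", "--count", branch], repo_dir).strip())

def age_in_years(repo_dir, branch="master"):
    # Time elapsed since the first (oldest) commit of the branch.
    first_ts = int(git_output(["log", "--reverse", "--format=%at", branch], repo_dir).split()[0])
    elapsed = datetime.datetime.now() - datetime.datetime.fromtimestamp(first_ts)
    return elapsed.days / 365.25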

Fig. 1: Set of projects #1: DiskLRUCache, Spark and JSoup

4 Case Study

After performing the experimentation (following the methodology described in the previous section) and doing a preliminary analysis of the results, we realized that for some projects the proposed metrics may not reflect the full reality of the projects’ testability. In order to gain a deeper understanding of the testability of the commit history of software projects, we conduct a case study, designed as exploratory research with quantitative data analysis.

In this case study, we make an exploratory analysis of different projects, checking whether the proposed testability metric (Full Testability) is able to represent the real testability of the projects across their whole history. To analyze the projects quantitatively, we use one graph per project that represents, for each commit, the results of the steps described in the methodology.

As can be seen in Fig. 1, each sub-figure represents the commit history of a project, where the x-axis shows the commit number (0 for the first commit of the project, 1 for the second, etc.). On the y-axis there is a stacked bar for each commit, its height being the number of tests executed in that commit (denoted on the left y-axis) and its colors (red, orange and green) indicating the results of the tests (failure, error and success, respectively). Additionally, the graph includes, for each commit on the x-axis, the number of lines of code (in blue) and of test code (in purple) for that commit, with the values denoted on the right y-axis.
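As an illustration, a chart of this kind could be produced with a sketch like the following (assuming per-commit lists of test outcomes and code sizes; this is not the exact script used to generate the figures in this paper):

import matplotlib.pyplot as plt

def plot_history(failures, errors, successes, loc, test_loc):
    # One value per commit, in history order.
    x = range(len(successes))
    fig, ax_tests = plt.subplots()
    # Stacked bars: failures (red), errors (orange), successes (green).
    ax_tests.bar(x, failures, color="red", label="failure")
    ax_tests.bar(x, errors, bottom=failures, color="orange", label="error")
    bottom = [f + e for f, e in zip(failures, errors)]
    ax_tests.bar(x, successes, bottom=bottom, color="green", label="success")
    ax_tests.set_xlabel("commit number")
    ax_tests.set_ylabel("number of tests")
    # Right y-axis: lines of main code and of test code.
    ax_loc = ax_tests.twinx()
    ax_loc.plot(x, loc, color="blue", label="LoC")
    ax_loc.plot(x, test_loc, color="purple", label="test LoC")
    ax_loc.set_ylabel("lines of code")
    fig.legend()
    plt.show()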

The first set of projects that we study (Fig. 1) has been selected as representative of projects for which, judging by their visual representation, it seems possible to run the tests in almost all of their history. Looking at the Source (Src) Buildability, Test Buildability and Full Testability metrics of these projects (shown in Table 1), we observe that the results are in line with the charts. For instance, in the sub-figure of the DiskLruCache project there are commits with failures or errors in their tests, which affects its Full Testability to some extent. Specifically, this metric drops to 70.58%, which means that in almost 30% of the commits at least one test results in a failure or error.

Table 1 Metrics of set of projects #1: JSoup, Spark and DiskLRUCache

The second set of projects that we study (Fig. 2) has been selected as representative of projects for which, judging by their visual representation, it also seems possible to run the tests in almost all of their history, although in some of them there are multiple commits where failures or errors appear. However, if we look at the Full Testability values (Table 2), we find that they are surprisingly low. For example, for the Checkstyle project, despite the fact that a portion of its commits at the beginning of the project are not compilable, the figure seems to indicate that almost all of its tests pass. Its low Full Testability value is due to a few tests failing consistently (in the order of 10 out of 1,000) along the project history. The other two projects show a similar behavior; in the case of HikariCP the failures are more visible in the graph (more tests per commit result in failures or errors), despite it having the highest Full Testability value of this set.

Considering the above-mentioned results, Full Testability is a very restrictive metric: as soon as a single test does not pass consistently in several commits, the value of the metric drops drastically. We may therefore be missing information about the real testability of the project. From the point of view of a practitioner or a researcher who needs to add a change to a past version and has to evaluate whether the tests provide confidence that no regression is introduced in the code, knowing that in a commit 99.99% of the tests pass may be enough. This happened in the case of the Checkstyle project, where there are commits with 3,506 tests of which only 1 fails.

Following this discussion, we propose a new and less restrictive way of measuring testability, which captures the information in a non-binary way. Instead of using a binary value (a commit is either fully testable or not) we use a ratio. Hence, a commit has a Testable Rate, defined as the ratio of successful tests with respect to the total number of tests run for that commit. For a project, we obtain the Testability Rate as the mean of the Testable Rate values of all commits in the history.
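In formula form, for a commit \(c\) and the set \(C\) of all commits in the history (commits whose tests cannot be built or run contribute a Testable Rate of 0):

\[
\textit{Testable Rate}(c) = \frac{\#\{\text{tests of } c \text{ with result ``success''}\}}{\#\{\text{tests run in } c\}}, \qquad
\textit{Testability Rate} = \frac{1}{|C|}\sum_{c \in C} \textit{Testable Rate}(c).
\]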

Table 3 shows the Testability Rate results for the second set of projects. This new metric reflects a higher testability value for this set of projects, which is in line with the reality of the tests executed.

The third set of projects is shown in Fig. 3. In this case, we have chosen projects that appear to be very testable, but only in a part of their history (due to problems in building the source code or the test code). Table 4 shows the testability values for these projects.

Fig. 2: Set of projects #2: Fastjson, Checkstyle and HikariCP

Table 2 Metrics of set of projects #2: Fastjson, Checkstyle and HikariCP
Table 3 Metrics of set of projects #2: Fastjson, Checkstyle and HikariCP (including the new Testability Rate metric)
Fig. 3: Set of projects #3: Zxing, Okio and Closure

If the standard deviation of the Testable Rate over all commits is calculated, the values 47.97%, 49.02% and 43.53% are obtained for the okio, closure-compiler and zxing projects, respectively. These values indicate a high deviation, mainly because, when all commits are considered, those that are not even buildable are included (their Testable Rate is always 0), affecting the mean (Testability Rate). These projects are characterized by a low Test Buildability. The Test Buildability metric is greatly affected by Source Buildability: if the source code does not build, the tests cannot be built. The reasons why source code cannot be built have been studied in previous work Tufano et al. (2017); Sulír and Porubän (2016); results show that this happens mainly because of the impossibility of reproducing the context and retrieving the dependencies of a particular commit. For the okio, closure and zxing projects, the Source Buildability values are very close to the Test Buildability values. So, we can state that in these projects, if the source code builds, in most cases the test code builds as well. We capture this idea in a new metric: \(\textbf{Test Buildability}_\textbf{S}\), the ratio of test-buildable commits with respect to the source-buildable commits. The Test Buildability metric, when calculated considering all commits in the project history, is renamed to \(\mathbf {Test\, Buildability}_\textbf{A}\). Table 5 shows the values of these two variants of Test Buildability, in addition to the Source Buildability values mentioned above.
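In formula form:

\[
\textit{Test Buildability}_{S} = \frac{|\{c \in C : c \text{ is test-buildable}\}|}{|\{c \in C : c \text{ is source-buildable}\}|}, \qquad
\textit{Test Buildability}_{A} = \frac{|\{c \in C : c \text{ is test-buildable}\}|}{|C|}.
\]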

Table 4 Metrics of set of projects #3: Zxing, Okio and Closure
Table 5 Extended metrics of set of projects #3: Zxing, Okio and Closure (new metrics in bold)
Table 6 Extended metrics of set of projects #3: Zxing, Okio and Closure (including the new \({Testability\, Rate}_{T}\) and \({Full\, Testability}_{T}\) metrics)

If the standard deviation of the Testable Rate for the okio, closure-compiler and zxing projects is recalculated considering only those commits where the tests can be built, the results are 0.24%, 0.07% and 0.04%, respectively. These values are considerably lower than those previously calculated, since the commits whose Testable Rate was 0 because their tests could not be built are now left out.

Following the idea of focusing on those commits where we can build the test code, we should consider that the commits that do not build possibly never did, and therefore their testability information is not useful to us. For projects such as those in set #3, the testability of the commits where we can build the tests gives us valuable information that neither Full Testability nor Testability Rate capture, as both metrics are calculated over all commits in the history of the project.

To capture the testability of test-buildable commits, we define the following metrics:

  • \(\mathbf {Testability\, Rate}_\textbf{T}\): mean of the Testable Rate over the test-buildable commits of the project.

  • \(\mathbf {Full\, Testability}_\textbf{T}\): ratio of fully testable commits with respect to the number of test-buildable commits of the project.

The Full Testability and Testability Rate metrics, when calculated considering all the commits in the history, are renamed to \(\mathbf {Full\, Testability}_\textbf{A}\) and \(\mathbf {Testability\, Rate}_\textbf{A}\).
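A minimal Python sketch of how these variants could be computed, reusing the hypothetical per-commit records and helpers from the sketch in Section 3.1 (it assumes the project has at least one test-buildable commit):

def testable_rate(c):
    # Ratio of successful tests for one commit; 0 if its tests cannot be built or run.
    return c.tests_passed / c.tests_total if is_test_buildable(c) else 0.0

def project_testability_metrics(commits):
    test_buildable = [c for c in commits if is_test_buildable(c)]
    fully_testable = [c for c in commits if is_fully_testable(c)]
    n, n_tb = len(commits), len(test_buildable)
    return {
        "Testability Rate_A": sum(testable_rate(c) for c in commits) / n,
        "Testability Rate_T": sum(testable_rate(c) for c in test_buildable) / n_tb,
        "Full Testability_A": len(fully_testable) / n,
        "Full Testability_T": len(fully_testable) / n_tb,
    }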

In Table 6 we include the new metrics defined for the third set of projects. We note that, for the three projects, the new metrics capture more closely the testability of the commits we can build. \({Full\, Testability}_{T}\) shows that, for those commits where we can build the tests, a high percentage can run all their tests with a success result. For this same set of commits, \({Testability\, Rate}_{T}\) shows values close to 100%: almost all of the tests in these commits give a success result.

This case study helped us to define metrics that better describe a project’s testability. These new metrics lead to a new research question:

\(\textbf{RQ}_\textbf{3}\): “On average, how many tests of a snapshot can be run with a ‘success’ result?”

5 Experimental Results

Once we performed our case study and determined the metrics that might best represent the testability of the projects, in this section we describe the results in detail using the metrics defined in the previous sections.

After running the experiment over the 86 projects, we detected 20 projects in which not a single test was run on any commit in the history. The learning-spark project does not contain any tests in any commit of its history. The Mycat-Server project has a dependency on software that must be installed on the machine for it to be built (therefore its Source Buildability is 0%). The Clojure project is a programming language and, as such, its tests require a different execution method. The remaining 17 projects contain multiple Maven modules and cannot be built and tested with the proposed methodology. In general, when the multi-module parent project has the proper configuration, we are able to build and run the tests from the parent module, which is where we run the Maven command. However, in some cases this approach does not work, due to dependencies among modules that force some modules to be built before others; building and running the tests would require a parent project or components of the same framework that are no longer available (because SNAPSHOT versions were used or because the repository is no longer accessible). Since we cannot execute any tests on any of these projects, we have decided to leave them out of our study, and from this point on we will only consider the remaining 66 projects.

Table 7 shows a summary of the experiment for the 66 projects. The Count column shows the magnitude of the study for the metrics defined in the methodology. The Mean (\(\bar{x}\)) and Median (\(\tilde{x}\)) columns show the central tendency at the project level. We ran a normality test for each metric in this table, finding that, except for Age, none of the metrics follows a normal distribution. In general, the differences between mean and median show large internal variability in each of the samples.

Table 7 Absolute results for the 66 projects

Table 8 shows information on the buildability and testability metrics of the projects of the dataset. For a better understanding of the information in Table 8, the box plots of the relative values of the testability metrics for the 66 projects are shown in Fig. 4.

Table 8 Mean (x̄) and Median (x̃) values for Buildability and Testability of the projects
Fig. 4: Box plots of the relative values of the testability metrics for the 66 projects

To illustrate the diversity of the results, we have divided the 66 projects into three groups of the same size (22) according to the number of commits in their history: large-history, medium-history and short-history. Figures 5 (large-history projects), 6 (medium-history projects) and 7 (short-history projects) show, for each project, the results for each of the count metrics as overlapping bars. It is easy to appreciate, just by checking the colors, how each project tells a very different story. Due to the number of projects, we report on three representative projects (showing different results) from each of the sets.

For large-history projects (Fig. 5), it is noticeable that in many of them there are hardly any commits where the tests can even be built. The first project selected from this set, hadoop, illustrates this fact. This project has only 106 source-buildable commits out of 25,039. The main error preventing the build is the resolution of dependencies, given that in many cases these dependencies were located in cloud storage (Amazon S3) that is no longer accessible. The second project selected from this set is checkstyle. In this project, the tests can be compiled in most of the commits. However, not all tests pass in every commit. This project has a Full Testability close to 0% and belongs to the group of projects that motivated, in the case study, the definition of the new Testability Rate metric, for which it has a value of 74% when considering all commits (\({Testability\, Rate}_{A}\)) and a value of 100% when considering only the commits where the tests can be compiled (\({Testability\, Rate}_{T}\)). The third project selected from this set is closure-compiler, also shown in the case study. This project has a \({Full\, Testability}_{A}\) of 60%, with two fundamental problems preventing its compilation. The first is that at the beginning of the project Ant was used as the build system, with Maven adopted later. The second period where it was not buildable was due to the resolution of a dependency annotated as SNAPSHOT.

Fig. 5: Project metrics (Large-history projects)

Fig. 6: Project metrics (Medium-history projects)

For medium-history projects (Fig. 6), we find many more projects in which the tests can be compiled in at least a considerable part of their history. The first project selected from this set is android-volley, in which all tests can be executed only in a small fraction of its commits. This fraction corresponds to the most recent commits in the project, due to a change in the build tool from Ant to Maven. The second project selected from this set is fastjson. In this project, the tests can be compiled in most of the commits. However, not all tests pass in every commit. This project has a Full Testability of 0% and belongs to the group of projects that motivated, in the case study, the definition of the new Testability Rate metric, for which it has a value of 87.81%. The third project selected from this set is swagger-core. This project shows a \({Full\, Testability}_{A}\) of 63% and a \({Full\, Testability}_{T}\) of 92%. The commits where it is not possible to build the source code correspond to the first commits of the project, when Ant was used as the build system (replaced by Maven some time later).

Fig. 7: Project metrics (Short-history projects)

For short-history projects (Fig. 7), again we find several projects where all tests can be run in a considerable percentage of the commits. The first project selected from this set is auto, which is an interesting case because, although its source code appears to compile, its tests do not. It is a multi-module project where the source and test code is built from a parent pom.xml. However, at a certain point in the project, modules are disabled in this configuration file. The build command still works correctly (although it only downloads dependencies, it does not compile code), but it is no longer able to build any tests, as it can no longer find them. The second project selected from this set is okio, also discussed in the case study. In this case, we note that the project has a low number of source-buildable commits because the structure of the project in the repository changes over time (the pom.xml and the source code are moved to another folder). The third project selected from this set is webmagic. This project again evidences the usefulness of the Testability Rate metric for characterizing projects. In this case, its \({Full\, Testability}_{A}\) is 36% but its \({Testability\, Rate}_{A}\) is 83% (reaching 98% in its \({Testability\, Rate}_{T}\)). This is because the project is a web scraping library with a few tests that depend on external URLs that are no longer accessible.

6 Analysis

Let us now analyze in more detail the results presented in the previous section.

The mean Source Buildability of the projects (47.29%), although low, is slightly higher than in previous studies on Java projects such as that of Tufano et al. (38.13%). This is partly due to excluding 20 projects whose buildability was 0. But what is more interesting are the differences from project to project, something that was expected and was seen in previous studies on the topic Tufano et al. (2017); Sulír et al. (2020); Querel and Rigby (2021).

The mean value of \(\mathbf {Test\, Buildability}_\textbf{A}\) is significantly lower (41.73%), because Source Buildability is an upper bound for this metric. Moreover, considering only those commits where the source code is buildable, \(\mathbf {Test\, Buildability}_\textbf{S}\) offers a considerably higher value on average (88%). This value is reasonable, since once the main source code is built, it is more likely that the test code can also be built. We also note that for this metric, 50% of the projects offer a value higher than 97%. Therefore, test compilation does not seem to be a problem in general, and efforts should, in any case, be put into source compilation.

The clearest result when analyzing the testability of projects is its variability from project to project. Figure 8 illustrates this by showing the shape of the distribution of all the testability metrics we defined, for all projects.

Looking at the Testability Rate values, when we focus on those commits where the tests can be built (\(\mathbf {Testability\, Rate}_\textbf{T}\)) we observe an average value of 94.14%: a high percentage of the projects’ tests are executed successfully. When the Testability Rate is calculated considering all commits (\(\mathbf {Testability\, Rate}_\textbf{A}\)), the value gives an overview of the testability of the project and of how Source Buildability impacts running tests on past commits.

Focusing on \(\mathbf {Full\, Testability}_\textbf{T}\), we could expect it to be usually high: developers should write code that does not break any of the tests. The fact is that in more than half of the commits that were test-buildable, some tests fail when they are run. This result, which may seem surprising, deserves more discussion, which we provide in Section 7.2.

\(\mathbf {Full\, Testability}_\textbf{A}\) gives quite interesting information: the fraction of commits that are testable with respect to the total number of commits. When buildability is low, this fraction is necessarily low, just meaning that commits are not testable because they are not source- or test-buildable. When buildability is high, \({Full\, Testability}_{A}\) really shows how testable the project is, at least from the binary approach (all tests pass, or not) followed, for example, by continuous integration systems.

With the data presented up to here, we can already answer RQ1, RQ2 and RQ3.

Fig. 8: Overview of testabilities. Each bar represents a project. The projects in each column (A & T) are ordered by the corresponding Testability Rate (A & T) of each project

7 Discussion

After presenting the main results and their analysis, in this section we discuss some details of our studies, including the threats to their validity.

7.1 Projects with high testability

During this study, projects with very high values in the different variants of testability have been detected. This section analyzes some of these projects in more detail in order to identify good practices. We analyze those projects that obtained high values of Full Testability and Testability Rate when considering all commits. We consider that keeping the source code buildable is part of the good practices needed to keep the tests buildable and runnable. Thus, selecting projects based on a testability metric computed exclusively over their buildable commits could lead to inconclusive results. As an example, the Camel project has a \({Testability\, Rate}_{T}\) and a \({Full\, Testability}_{T}\) of 100%, but it has only 21 commits where tests can be run, out of a total of 53,286 commits.

Table 9 shows the top three projects with the highest \({Full\, Testability}_{A}\). The first project, jsoup, is a popular programming library for manipulating HTML code, which has only unit and integration tests. This project was selected in the case study (see Section 4) as an example of a project where the tests could be executed in almost all of its commits.

The second project, retrofit, is a simple HTTP client, with the particularity that it is a multi-module project. A high \({Full\, Testability}_{A}\) indicates that the tests of all its modules pass completely in most of the commits. All its tests use mocks, defined in its own module. The robustness of testing functionality with mocks rather than with real endpoints (subject to change over time) may be one of the reasons for the good results of this project in this metric.

The third project, spark, is a lightweight web development framework. It has end-to-end tests to verify the correct behavior when HTTP requests are performed; we have observed that these requests do not depend on remote services, though, which we have found in other projects to be a reason for low testability (see Section 7.2).

Table 9 Top three projects with the highest \(Full\, Testability_{A}\) (in bold)

Table 10 shows the top three projects with the highest \({Testability\, Rate}_{A}\). The first project, DiskLRUCache, was also selected in the case study as an example of a project where the tests could be executed in almost all of its commits. This project is a library implementing an LRU cache. As can be seen in Fig. 1, it starts with a test collection that barely grows during its development. Its implementation is simple (only 1,436 lines of code) and its number of commits is small (136), so not many conclusions can be drawn from the life cycle of its tests. It contains both unit and integration tests (the latter with input/output to the file system of the operating system).

The second project, jsoup, has already been discussed in the previous set.

The third project, elastic-job, is a library for distributed job scheduling. Since it is a library that must interact with several distributed components, its tests rely on mock libraries to facilitate testing its functionality without depending on external services. Hence, we again find a usage of mocking that leads to high testability.

Table 10 Top three projects with the highest \(Testability\, Rate_A\) (in bold)

7.2 Why do projects have low testability?

We have performed a preliminary study to identify the main reasons why some projects have a low \({Full\, Testability}_{A}\). These reasons can be categorized as follows:

7.2.1 Low buildability

The \({Full\, Testability}_{A}\) of a project depends on the steps prior to the execution of the tests: building the main source code and the tests. In projects with low buildability, testability is constrained by it. Previous studies on buildability Tufano et al. (2017); Maes-Bermejo et al. (2022) showed that one of the main reasons why a commit is no longer buildable is an error in obtaining the project dependencies. These dependency problems include the inability to resolve dependencies downloadable from external repositories, obsolete system tools no longer available, or old and unavailable programming language versions. Therefore, it is possible that if those dependencies were available, the project would not only build correctly, but its tests could also be executed successfully. Other relevant errors detected in these studies are problems derived from the compilation and parsing of the source code.

We found several projects where the \({Full\, Testability}_{A}\) is low due to low buildability, but \({Full\, Testability}_{T}\) is high, which is in line with this hypothesis.

7.2.2 Snapshots were never testable

When the project was being developed, not all of its commits were testable:

  • No tests in the snapshot: In some projects we found no tests for some commits. We consider these commits without tests as non-testable, resulting in low values if many snapshots are like that. In these cases, the project may not have developed tests in the early stages of development.

  • Snapshots with failing tests: When building and running tests on snapshots, we decided to try all of them in the master branch, since a recent study by Kovalenko et al. (2018) has shown that considering additional branches does not seem to have a significant impact compared to using only the master branch. But depending on the strategy used by the developers to merge changes into master, we may find commits with failing tests even at the time they were merged. For example, a commit includes a new feature which breaks some tests, but it is still merged, maybe because the next commit will fix the tests or the cause of the error. We found this case in the elastic-job project by taking advantage of commit messages to understand the development workflow. When refactoring, tests stop passing in some commits and are fixed in later commits (see, for example, commits 860, 929, 965, 1053 and 1120, where 0 is the first commit in the history). Therefore, in some snapshots the tests never passed, and we cannot make them pass in the present without the changes that fix them.

7.2.3 Context cannot be reproduced

Without the right context, some tests cannot perform exactly as they did in the past. Here are some of the problems related to test execution context:

  • Network services: Most integration tests check how the software integrates with other network services. If these services need to be started prior to running the tests, we will find connection-related exceptions when running the tests without them. A good example is the Jedis project, a library for managing a Redis database: if this database has not been started before running the tests, the tests throw exceptions such as “ConnectException” or “ConnectionRefused”. To address this type of problem, a good practice would be for the test itself to start the necessary services (see the sketch after this list).

  • Command line tools: There are tests that require command line tools like Python, Perl or G++ to be executed successfully. The antlr4 project has failing tests in the absence of these tools. For example, this project requires different versions of Python (2.7 or 3.5) in different commits. This issue could be addressed if developers provided and documented a Docker container in which to run the tests.

  • System resources: Some projects require access to certain system resources for their tests. For example, the Activiti and FastJSON projects require access to font-related resources (through the classes SunFontManager and X11FontManager, respectively). The experiments have been run on machines that do not have a GUI, and therefore these resources were not available. Some projects with similar requirements, like Selenium, provide their own Docker containers with the necessary resources (a fake GUI).

  • Remote resources: There are tests that require accessing remote services and making requests to them. If these resources have changed or are no longer accessible, they compromise the test results. We found an example in the checkstyle project, where an attempt is made to download an XML file and parse it, failing in this last step, probably because the file has been modified over time or moved to a different location. This type of bug has been characterized as extrinsic Rodríguez-Pérez et al. (2020a, 2020b, 2018), since the problem is not in the source code, but in an external service. The tests of a project should be self-contained, minimizing the external resources on which they depend in order to be more reliable. Developers could use contract testing to avoid depending on remote resources.

  • Reflection: In Java it is possible to load classes through reflection. These classes have not been checked by the compiler when building the tests, so in case of an error or the absence of the class, the error appears at test runtime. An example of this case can be found in the Okhttp project, where we found the “Could not initialize class SSLExtension” error, SSLExtension being a class loaded by reflection. In detail, the class is not available because, although a supported version of Java is used, the JDK distribution employed does not include that class.
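As a hedged illustration of the practice suggested for network services (the test run itself starting the service it needs), the following hypothetical sketch uses the Docker CLI from a Python unittest fixture. The studied projects are Java, so the same idea would typically be implemented there with JUnit fixtures or Docker-based test harnesses; this example is not taken from any of the studied projects:

import subprocess
import time
import unittest

class RedisBackedTest(unittest.TestCase):
    # Hypothetical example: start the Redis dependency as part of the test run,
    # so the suite does not rely on a service already running on the machine.

    @classmethod
    def setUpClass(cls):
        completed = subprocess.run(
            ["docker", "run", "-d", "-p", "6379:6379", "redis"],
            capture_output=True, text=True, check=True)
        cls.container_id = completed.stdout.strip()
        time.sleep(2)  # naive wait until the service accepts connections

    @classmethod
    def tearDownClass(cls):
        subprocess.run(["docker", "rm", "-f", cls.container_id], check=True)

    def test_service_started(self):
        # Placeholder assertion; a real test would exercise the code under test
        # against the service started above.
        self.assertTrue(self.container_id)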

So, when we find a non-testable snapshot, a question arises: is it non-testable because it was originally non-testable, or because we cannot properly reproduce the test execution context? If a project had a continuous integration (CI) system where tests were executed on every snapshot, we could answer the question by analyzing the test results in the CI logs. Unfortunately, CI logs are usually removed after some time to save storage space. Moreover, even with this information, not all of the cases above could be solved by improving the context of test execution: for example, there is little we can do about remote resources over which we have no control. Even so, the CI configuration files can provide information about how to reproduce the test execution context. It remains as future work to explore this line to improve project testability when the test context is not properly reproduced.

7.2.4 Issues with the build system

We have analyzed the prevalence of Maven as a build system in the projects (looking for the pom.xml file in the root of the project), because it is possible that some of them migrated from one technology to another. We found that, on average, 88% of a project's commits use Maven as the build system. If we consider the median, this value increases to 98.99%.

We have analyzed the projects where the percentage of commits using Maven is lower, finding that in 9 cases there was a migration from the Ant build system (common in older projects) to Maven. Furthermore, two projects migrated from Maven to more modern build systems (Gradle and Bazel). In other projects we found a restructuring of their modules or sub-folders, or a migration to another programming language (with its own build system).

These build system migrations are common in long-running Maven projects and have been previously detected in the literature Maes-Bermejo et al. (2022), being a consequence of limiting the experiment to Maven. However, the number of affected snapshots is relatively low and localized to specific projects.

7.3 Implications for researchers

Studying testability of previous snapshots can help researchers understand how changes in the codebase impact existing tests. This knowledge can inform the development of more robust and maintainable test suites, less prone to breaking with future modifications. Also, by analyzing how frequently tests fail on older snapshots, researchers can develop methods to prioritize test cases during regression testing. This can streamline the testing process and identify critical tests that need to be run first. Finally, research in this area can lead to improved techniques for testing legacy systems that lack proper documentation or existing test suites. By analyzing past snapshots, researchers can develop methods to infer test cases or identify areas of the codebase most susceptible to regression.

7.4 Implications for practitioners

Understanding how code changes impact historical test runs allows developers to identify potential regressions early in the development cycle. This can help prevent bugs from being introduced into production releases. Also, by analyzing historical test run data, practitioners can estimate the effort required for regression testing after code modifications. This information can be valuable for planning development sprints and resource allocation. Finally, studying test flakiness across different versions can help identify areas of the codebase that are prone to breaking tests. This can inform technical debt management strategies, prioritizing areas for refactoring or improvement.

There are several legacy systems that are still in use. Many organizations, especially in critical infrastructure sectors, might rely on older versions. These systems may not be feasible or cost-effective to upgrade immediately. Having testable versions allows for regression testing before deploying changes to connected systems that depend on the older software.

From a security point of view, having a testable version allows for faster patching and potential mitigation strategies, even without an official update.

Bugs or compatibility problems could arise in the current version that might be related to changes made earlier. Having access to testable older versions can be invaluable for debugging and pinpointing the root cause. A recent work Maes-Bermejo et al. (2024) relies on the ability to run tests on past snapshots to detect the change that introduced a bug.

7.5 Threats to validity

Our results are subject to construct validity issues, mostly due to how we define the testability of a project: as the mean percentage of successful tests with respect to the total number of tests, and as the percentage of snapshots in which all tests could be successfully executed. As we have shown, testability may vary depending on the set of snapshots selected to compute it (all snapshots, or just those that are test-buildable). It is possible that in some cases our testability metrics hide the real behavior of the project in this respect. We attempt to mitigate this through the proposed case study, where we introduce new metrics that complement the view of the project’s testability. It is also necessary to mention that only the commit history of the main branch (commonly called master, main or trunk) has been considered. In cases where a branch is merged into the main branch, its commits become part of the main branch and are considered in our experiment. However, if other merging strategies are used (such as squash merges), the history of changes that are added in a single commit is not considered. Since it is not easy to detect these cases automatically, the results obtained for certain commits do not correspond to a single change but to a set of them.

Results are also subject to internal validity issues. For example, without the environment where the tests were originally run (lacking libraries, binaries or specific configurations as defined in a continuous integration file), they may give a different result. They are also subject to any possible bug in our extraction and analysis tools. Therefore, all results, together with the tools used, are available in the reproduction package for inspection.

Finally, we can also have external validity issues. The set of selected projects is a subset of 66 projects from another dataset of 100 Java projects. The exclusion of the 34 projects for which we could not reproduce the original context (and therefore could not run the tests on any commit) or that were Android projects may reduce the representativeness of the research. Extension to other languages is still a matter of further research. The set of projects is also limited to projects that use Maven as the build system. Although Maven is one of the most popular build systems, it should be mentioned that build systems such as Ant (still present in many older projects) or Gradle (more recent and gaining popularity over Maven) are not considered.

In addition, the dataset is limited in that most of the tests included in these projects are unit tests, which differ in scope and fragility from integration or system tests.

8 Conclusions and future work

In this work we have started a path to analyze to what extent past snapshots of a project can be tested. For that, we have conducted an empirical analysis of many Java projects from a well-known dataset. The analysis has been driven by a case study, which proved useful to understand the different issues that can lead to low testability. By means of this case study, some new metrics were proposed to better characterize the testability of a project.

We also propose a framework for conducting further analysis, based on the different steps needed to successfully run tests for each snapshot. Using this framework, we have found that for half of the snapshots, not all tests could be run successfully. However, the main result is the high variability of testability from project to project, even within a relatively homogeneous dataset.

We found that many projects cannot rely on running tests on past commits, as these tests will not run or even build. There is also a strong correlation between the buildability of a project and the ability to run its tests, meaning that, in general, when the source code builds, the test code builds as well, and the tests can be run. Therefore, fixing buildability issues would increase the testability of a project. We suggest some ways of increasing the testability of past snapshots, based on the different issues that can lead to low testability.

Testing snapshots of the past is fundamental for the maintainability of old versions of a project which are still in production. Therefore, we expect more research in this area in the future. Fortunately, we have found some signals showing that good practices can be identified to increase the testability of current snapshots (which will become past snapshots with time). Extending our study to other samples of Java code, and to other programming languages, will improve our knowledge in this area.