1 Introduction

A flaky test is a test case that can exhibit both passing and failing behavior without changes to the code of the test case or the program under test (Parry et al. 2021). Flaky tests are a serious problem for software developers because they disrupt continuous integration, cause a loss of productivity, and limit the efficiency of testing. The pain of flaky tests is felt by developers in both the open-source domain (Durieux et al. 2020) and in large companies such as Google, Microsoft, and Facebook (Lam et al. 2019; Machalica et al. 2019; Memon et al. 2017). A survey of developers found that 56% observed flaky tests on at least a monthly basis in the projects on which they were currently working (Parry et al. 2022b).

Flaky tests that depend on the prior execution of other test cases in the test run order are known as order-dependent flaky tests. Another term for such flaky tests is victim, and the prior test cases that affect their outcome are known as polluters (Shi et al. 2019). Victim flaky tests are very prevalent, with one study finding that 51% of the 422 flaky tests in 82 Java projects were victims (Lam et al. 2019). They are a major obstacle to techniques that split up or reorder a test suite, such as test case prioritization, selection, and parallelization (Bell et al. 2015; Candido et al. 2017; Lam et al. 2020).

The research community has introduced a multitude of automated techniques to detect flaky tests. Many are rerunning-based, meaning they may require an excessive number of repeated test case executions, making them expensive to deploy in large software projects (Lam et al. 2019; Zhang et al. 2014). Alshammari et al. (2021) repeatedly executed the test suites of 24 Java projects and were still detecting non-order-dependent (NOD) flaky tests after 10,000 reruns. We estimated that the single-core time cost of detecting the 158 NOD flaky tests in our subject set of 89,668 test cases by rerunning them up to 2,500 times is 1.6 years. The time cost of rerunning-based detection led researchers to investigate techniques that do not require test case runs but are instead based on machine learning models trained on features of the test case code (Bertolino et al. 2021; Pinto et al. 2020). Later studies found that combining these static features with dynamic test case features, such as execution time and line coverage, increases detection performance at the cost of a single instrumented test suite run to measure these features (Alshammari et al. 2021; Parry et al. 2022a). Despite this, machine learning models in this domain offer only an approximate solution. For example, for detecting NOD flaky tests in Java projects, one previous study’s evaluation shows a Matthews correlation coefficient (MCC), a reliable metric for evaluating a machine learning model (Chicco and Jurman 2020), of 0.65 (Alshammari et al. 2021). Another, focusing on Python projects, shows an MCC of 0.53 (Parry et al. 2022a). For these results, we would expect a perfect machine learning model to score 1 and a model no better than random guessing to score 0. The prohibitive time cost of rerunning-based techniques and the limited performance of machine learning-based techniques leave practitioners with a stark choice when it comes to detecting flaky tests.

This paper introduces CANNIER (maChine leArNiNg assIsted tEst Rerunning), a high-level approach for reducing the time cost of rerunning-based detection techniques by combining them with machine learning models. It does this by using the output of the models as a heuristic to reduce the problem space for the rerunning-based technique. We demonstrate the applicability of CANNIER by instantiating it for three previously established detection techniques. We implemented these within an automated tool and empirically evaluated them using 30 Python projects as subjects. We found that CANNIER could significantly reduce time cost at the expense of only a minor reduction in detection performance. For example, by applying CANNIER to the Classification stage of iDFlakies (Lam et al. 2019) (which distinguishes NOD flaky tests from victim flaky tests), we were able to reduce its time cost by 84% at the expense of misclassifying just 8 flaky tests out of 1,130. Therefore, CANNIER represents a “best of both worlds” solution to flaky test detection.

In summary, the main contributions of this paper are:

Contribution 1: Approach. A novel approach, called CANNIER, that significantly reduces the time cost of rerunning-based flaky test detection with a minimal decrease in detection performance. See Section 3 for more details.

Contribution 2: Tooling. To facilitate our empirical evaluation and allow for replication of our results, we developed an extensive framework of automated tools that we make freely available (CANNIER framework 2022). See Section 4 for more details.

Contribution 3: Empirical Evaluation. A comprehensive empirical evaluation demonstrates the effectiveness of CANNIER’s combination of rerunning and machine learning techniques, revealing further novel findings about machine learning-based flaky test detection, such as the performance of machine learning models for detecting polluter test cases. See Section 5 for more details.

Contribution 4: Dataset. A dataset containing 89,668 tests from 30 Python projects that took over six weeks of compute time to produce. We make this available as part of our replication package (CANNIER experiment 2022). See Section 5.1 for more details.

2 Background

2.1 Rerunning-Based Flaky Test Detection

2.1.1 Rerun

The research community has presented many automated flaky test detection techniques that are based on rerunning test cases. The most straightforward such technique is to repeatedly execute a test case until it exhibits both passing and failing behavior. In its most basic form, this technique involves rerunning the test cases of a test suite in the same test run order and under the same environmental conditions each time (Bell et al. 2018). We refer to this specific technique as Rerun. Since the test run order remains constant, Rerun can only identify non-order-dependent (NOD) flaky tests. As its only parameter, Rerun requires an upper limit on the number of times to execute a test case without observing an inconsistent outcome. If the upper limit is reached, the technique classifies the test case as non-flaky and stops rerunning it. Since many test cases may require hundreds or even thousands of runs to manifest their flakiness (Alshammari et al. 2021), this technique can become very expensive for long-running test suites with numerous tests, limiting its applicability in practice.
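To make this concrete, the following minimal sketch (our own illustration, not part of any tool described in this paper) shows Rerun for a single test case; run_test stands in for a callable that executes the test case once and returns True on a pass.

```python
def rerun_detect(run_test, max_reruns):
    """Classify one test case as NOD flaky or presumed non-flaky.

    run_test: zero-argument callable returning True (pass) or False (fail).
    max_reruns: upper limit on executions without an inconsistent outcome.
    """
    first_outcome = run_test()
    for _ in range(max_reruns - 1):
        if run_test() != first_outcome:
            return "flaky"          # both passing and failing behavior observed
    return "presumed non-flaky"     # upper limit reached with a consistent outcome
```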

2.1.2 iDFlakies

Lam et al. (2019) presented iDFlakies, a technique for detecting flaky tests and classifying them as NOD or victims. The technique consists of three stages: Setup, Running, and Classification. In the Setup stage, iDFlakies repeatedly executes the test suite in its original order to identify and filter any consistently-failing test cases. In the Running stage, iDFlakies continues to rerun the test suite, but this time in modified test run orders. In the Classification stage, for every test case that failed during the Running stage, iDFlakies re-executes the test suite in both the original order and in the modified order that witnessed the failure, truncated up to and including the failing test case. We refer to this stage as iDFClass. Should the test case fail again in the truncated modified order and pass again in the truncated original order, iDFlakies classifies it as a victim. Otherwise, it classifies the test case as NOD. Should a test case fail multiple times during the Running stage, iDFlakies can repeat the Classification stage for a percentage of the additional failures for greater confidence in the final label.

iDFlakies has several parameters: the number of reruns during the Setup stage, the number of reruns during the Running stage, the method of generating the modified test run orders during the Running stage (e.g., shuffle), and the percentage of additional failures to recheck in the Classification stage. Depending on the choice of values for these parameters, iDFlakies can require a significant number of test executions and thus impose a prohibitive time cost.

2.1.3 Pairwise

While the iDFlakies technique can detect victim flaky tests, it cannot identify their associated polluters. Zhang et al. (2014) proposed a technique that can detect a subset of a test suite’s victims and their polluters (although the authors designed the technique primarily for detecting victims). It involves executing every permutation of test cases of length two (every pair in both orders) in isolation, such as in separate Java Virtual Machine or Python interpreter processes. We refer to this technique as Pairwise. Initially, Pairwise requires an expected outcome for every test case. It could obtain these by executing each test case in isolation to observe its outcome independent of the possible side-effects of other test cases. Once every test case has an expected outcome, Pairwise executes every 2-permutation of test cases, such that each test case has a turn at being both the first and second to be executed in the pair: the candidate polluter and victim, respectively. For more reliable results, Pairwise ought to filter out any pairs with a known NOD flaky test as the candidate victim because they do not have a reliable expected outcome (although Zhang et al. did not propose this filtering stage in their paper). For a given pair, if the second test yields an outcome different from expected, Pairwise classifies it as a victim and classifies the first test as one of its polluters.

Previous work has determined that an order-dependency can involve more than two test cases (Shi et al. 2019), though as part of their empirical study, Zhang et al found that 76% of order-dependencies did involve just two. Considering only pairs of test cases, the time complexity of Pairwise is already quadratic in the size of the test suite and hence very expensive, and so to consider longer permutations would quickly render the technique intractable.
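To illustrate the procedure and its quadratic cost, the sketch below enumerates the 2-permutations that Pairwise executes; run_in_isolation and run_pair are hypothetical stand-ins for running test sequences in separate interpreter processes.

```python
from itertools import permutations

def pairwise_detect(tests, run_in_isolation, run_pair, nod_flaky):
    """Return (polluter, victim) pairs found by executing every ordered pair of tests.

    run_in_isolation(t): outcome of test t executed alone in a fresh process.
    run_pair(p, v): outcome of v when executed immediately after p in a fresh process.
    nod_flaky: known NOD flaky tests, excluded as candidate victims.
    """
    expected = {t: run_in_isolation(t) for t in tests}
    detected = []
    for polluter, victim in permutations(tests, 2):  # quadratic in the number of tests
        if victim in nod_flaky:
            continue                                 # no reliable expected outcome
        if run_pair(polluter, victim) != expected[victim]:
            detected.append((polluter, victim))
    return detected
```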

2.2 The Flake16 Feature Set

Alshammari et al. (2021) introduced FlakeFlagger, a tool for detecting flaky tests using a machine learning model. To encode test cases for model training and evaluation, they used a feature set initially consisting of eight numerical test case features and eight boolean features indicating the presence of test smells (Garousi and Küçük 2018). However, having found the eight test smell features to offer very little information gain, they eventually discarded them. Their empirical evaluation involving 24 Java projects showed that their machine learning model achieved a Matthews correlation coefficient (MCC) of 0.65. In our previous work, we introduced the Flake16 feature set for encoding test cases (Parry et al. 2022a). It subsumes the feature set used by the FlakeFlagger tool and introduces additional metrics such as the number of times the filesystem performed input/output operations during test case execution and the peak memory usage. Our previous evaluation involving 26 Python projects showed that models based on Flake16 generally outperformed models based on the FlakeFlagger feature set for detecting both NOD and victim flaky tests.

3 The CANNIER Approach

CANNIER (maChine leArNiNg assIsted tEst Rerunning) is a high-level approach that combines a rerunning-based flaky test detection technique with one or more machine learning models. The models must provide a predicted probability that a given test case is flaky. The general concept behind CANNIER is to use the predicted probabilities as a heuristic to reduce the problem space for the rerunning-based technique. As attested by our later empirical evaluation (see Section 5), this approach can dramatically reduce the number of test case executions, and therefore time cost, at the expense of only a minor decrease in detection performance. The specifics of how CANNIER uses the predicted probabilities depend on the nature of the rerunning-based technique. Figure 1 provides a visual summary of the application of CANNIER to the three rerunning-based detection techniques introduced in Section 2.1.

Fig. 1 CANNIER uses the predicted probabilities from one or more machine learning models as a heuristic to reduce the problem space of a rerunning-based flaky test detection technique. A single machine learning model is suitable for Rerun and the Classification stage of iDFlakies (iDFClass) (a). Two machine learning models are required for Pairwise (b)

3.1 Motivating Example

We used the Airflow project, developed by the Apache Software Foundation, as one of the subjects in our empirical evaluation (apache/airflow at c743b95 2022). Its test suite contains 3,251 test cases as of version 1.10.14. We executed the test suite 2,500 times in its original order and identified 66 NOD flaky tests. Following our empirical evaluation, we found that the single-core time cost to detect these flaky tests, using Rerun with a maximum of 2,500 reruns per test case, is 1.69 × 10⁶ seconds. This is based on the time cost of each individual test case that we measured on a machine with a 24-core AMD Ryzen 5900X CPU. Having the same number of virtual cores and a comparable single-core performance, m5zn.6xlarge is arguably the most similar cloud instance offered by Amazon Web Services (New EC2 M5zn instances 2022). As of August 2022, Amazon offers this instance at the on-demand hourly rate of 1.982 USD. This means that to detect the NOD flaky tests in Airflow using Rerun would take ((1.69 × 10⁶) ÷ 24) ÷ 60² ≈ 19.56 hours and cost 19.56 × 1.982 ≈ 38.77 USD on this instance.
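The conversion from single-core seconds to instance-hours and USD is straightforward arithmetic; the sketch below reproduces it with the core count and hourly rate quoted above.

```python
def instance_cost(single_core_seconds, cores=24, usd_per_hour=1.982):
    """Convert a single-core time cost to wall-clock hours and USD on a cloud instance."""
    hours = single_core_seconds / cores / 60 ** 2
    return hours, hours * usd_per_hour

hours, usd = instance_cost(1.69e6)       # Rerun on Airflow
print(f"{hours:.2f} h, {usd:.2f} USD")   # approximately 19.56 h, 38.77 USD
```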

Given the cost in both time and money of using Rerun to detect flaky tests, a developer may instead opt to use a machine learning model. We trained an extra trees model, a variation of random forest (Geurts et al. 2006), to detect NOD flaky tests and evaluated it using stratified 10-fold cross validation. Within Airflow, we found that it misclassified 26 test cases that were flaky as non-flaky, and 26 test cases that were non-flaky as flaky. With only 40 of the 66 NOD flaky test cases actually classified as such, the model achieved a precision of 40 ÷ (40 + 26) ≈ 61% and a recall of 40 ÷ (40 + 26) ≈ 61%. In this context, precision is the percentage of detected flaky tests that are genuinely flaky and recall is the percentage of genuinely flaky tests that were detected. Therefore, machine learning-based detection offers a very approximate solution. Because the model uses dynamic features, the time cost of applying it is approximately equal to the time cost of a single test suite run to produce a feature vector for each test case. We observed a single-core time cost for this of 7.77 × 10² seconds for Airflow. This would require ((7.77 × 10²) ÷ 24) ÷ 60² ≈ 0.01 hours on an m5zn.6xlarge instance, costing 0.01 × 1.982 ≈ 0.02 USD. We do not consider the time cost associated with applying the extra trees model to each test case. This is because it is negligible relative to the time taken to execute the test suite (typically less than one second), as we found in our previous work (Parry et al. 2022a). We also do not consider the time taken to train the extra trees model. This is because the model only needs to be trained once and can then be applied any number of times. For this reason, we consider training to be an off-line stage that does not contribute to the time cost of applying the model to test cases.

Rerunning-based detection and machine learning-based detection represent opposite extremes. As shown in this example, Rerun is very expensive and the extra trees model is cheap but very approximate. By applying CANNIER to Rerun (CANNIER+Rerun), developers get a flaky test detection technique that is much cheaper than Rerun and much more accurate than the extra trees model. Following our empirical evaluation, we found that the single-core time cost to detect the 66 NOD flaky tests in Airflow using CANNIER+Rerun is 7.71 × 10⁵ seconds. This would require ((7.71 × 10⁵) ÷ 24) ÷ 60² ≈ 8.92 hours on an m5zn.6xlarge instance at a cost of 8.92 × 1.982 ≈ 17.68 USD. Therefore, CANNIER reduces the cost in USD of Rerun by 54%. We also found that it misclassified three flaky tests as non-flaky but correctly classified the remaining 63. This leads to a precision of 63 ÷ (63 + 0) = 100% and a recall of 63 ÷ (63 + 3) ≈ 95%. This is far more accurate than the extra trees model that only achieved a precision and recall of 61%.

Our empirical evaluation demonstrates that CANNIER is effective for multiple projects and the three rerunning-based detection techniques introduced in Section 2.1. For our whole dataset of 89,668 test cases from 30 projects, we found that CANNIER was able to reduce the time cost (and therefore monetary cost) by an average of 88% across the three techniques.

3.2 Single-Model CANNIER

Using CANNIER with a single machine learning model is suitable for reducing the time cost of Rerun and iDFClass (the Classification stage of iDFlakies). In the case of Rerun, the flaky test classification problem is that of distinguishing NOD flaky tests from the rest of the test cases. Since it is a binary problem, NOD flaky tests are the positive class and the rest of the test cases are the negative class. For iDFClass, the problem is distinguishing NOD flaky tests from victim flaky tests. In this case, NOD flaky tests are the positive class and victims are the negative class. For both Rerun and iDFClass, the machine learning model should provide a predicted probability of belonging to the positive class for each test case. CANNIER assigns a positive predicted label to a test case if this probability is above an upper threshold and a negative predicted label if it is below a lower threshold. This leaves an ambiguous region between the two thresholds. CANNIER delegates any test cases with predicted probabilities within this ambiguous region to the rerunning-based technique.
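A minimal sketch of this decision rule (our own illustration, with the thresholds as assumed inputs) might look as follows; rerun_based_label stands in for whichever rerunning-based technique, Rerun or iDFClass, CANNIER delegates to.

```python
def single_model_cannier(prob_positive, lower, upper, rerun_based_label):
    """Label one test case from a predicted probability and two thresholds.

    prob_positive: the model's predicted probability of the positive class.
    rerun_based_label: zero-argument callable that runs the rerunning-based
    technique and returns its label; only invoked in the ambiguous region.
    """
    if prob_positive >= upper:
        return 1                      # confidently positive, no reruns needed
    if prob_positive < lower:
        return 0                      # confidently negative, no reruns needed
    return rerun_based_label()        # ambiguous, delegate to rerunning
```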

3.3 Multi-Model CANNIER

Using two models, CANNIER can reduce the time cost of Pairwise. The first model is used to predict the probability of each test case being a victim. In other words, it addresses the classification problem of distinguishing victims from non-victims. The second is used to do the same but for being a polluter. For both models, CANNIER classifies every test case above a threshold as the positive class (a victim or a polluter) and every other test case as the negative (not a victim or not a polluter). In this way, CANNIER produces two non-mutually exclusive sets, one of victims, \(\mathcal {T}_{V}\), and one of polluters, \(\mathcal {T}_{P}\) (there is no reason why a test case cannot be both a victim and a polluter (Wei et al. 2022)). Then, CANNIER applies Pairwise with only the members of \(\mathcal {T}_{P}\) as the first test in each pair and only the members of \(\mathcal {T}_{V}\) as the second. Therefore, CANNIER can reduce the time complexity of Pairwise from \(O(|\mathcal {T}|^{2})\), where \(\mathcal {T}\) is the set of all test cases in the test suite, to \(O(|\mathcal {T}_{V}| \times |\mathcal {T}_{P}|)\), which is considerably faster even when \(\mathcal {T}_{V}\) and \(\mathcal {T}_{P}\) are only moderately smaller than \(\mathcal {T}\). For example, if each set contains half of the test cases in \(\mathcal {T}\), the number of pairs to execute falls by roughly 75%.
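The corresponding reduction of Pairwise's search space can be sketched as follows; prob_victim and prob_polluter are assumed to be mappings from test cases to predicted probabilities, and run_pair and expected are as in the Pairwise sketch of Section 2.1.3.

```python
def multi_model_cannier(tests, prob_victim, prob_polluter,
                        victim_threshold, polluter_threshold,
                        run_pair, expected):
    """Run Pairwise only over the model-selected candidate victims and polluters."""
    victims = {t for t in tests if prob_victim[t] >= victim_threshold}
    polluters = {t for t in tests if prob_polluter[t] >= polluter_threshold}
    detected = []
    for p in polluters:
        for v in victims - {p}:                 # |T_P| x |T_V| pairs instead of |T|^2
            if run_pair(p, v) != expected[v]:
                detected.append((p, v))
    return detected
```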

4 Tooling

To produce our dataset and facilitate our empirical evaluation, we developed our own suite of automated tools including a plugin for the Python testing framework pytest (Pytest 2022), named pytest-CANNIER (pytest-CANNIER 2022), and a command-line tool named CANNIER-Framework (CANNIER framework 2022). The purpose of pytest-CANNIER is to add the functionality to pytest necessary for our evaluation. This includes recording test case outcomes and measuring feature values. The purpose of CANNIER-Framework is to automate every aspect of our evaluation, including executing pytest-CANNIER on the subject test suites, collating raw data, and training and evaluating machine learning models.

4.1 pytest-CANNIER

We decided to target pytest due to its compatibility with test suites written for other frameworks such as unittest (Unittest 2022). pytest-CANNIER takes a test suite \(\mathcal {T}\) as input and offers four execution modes: Baseline, Shuffle, Features, and Victim. In the Baseline mode, the plugin executes the test suite as normal. For each test case \(t \in \mathcal {T}\), pytest-CANNIER records its outcome bt, that is either pass, bt = 0, or fail, bt = 1. In the Shuffle mode, the plugin randomizes the order of the test cases and records the outcome of every test case st.

In the Features mode, pytest-CANNIER produces a feature vector \(\mathbf {x}_{t} \in \mathbb {R}^{18}\), for each test case t. This contains the 16 features of Flake16 alongside two additional metrics. The first of these is Wait Time. This is the amount of time during test case execution spent waiting for input/output (I/O) operations to complete. Previous research identified I/O in test cases as being potentially associated with flakiness (Luo et al. 2014). The second additional feature is Max. Children. This measures the peak number of concurrently running child processes. A finding that many empirical studies have in common is that asynchronous operations and concurrency are very frequent causes of flaky tests (Eck et al. 2019; Lam et al. 2020; Luo et al. 2014; Romano et al. 2021). This was our rationale for the inclusion of Max. Threads into Flake16. However, due to the global interpreter lock implemented within the CPython interpreter (Glossary 2022), it may be necessary for developers to achieve concurrency with child processes. Table 1 offers a description of all 18 features. In the Victim mode, the plugin takes a test case v, executes the test sequence 〈v〉, and records the outcome of v, ov. This is to ascertain the expected outcome of v when executed in isolation from the rest of the test suite. Following this, pytest-CANNIER executes the sequences 〈p,v〉 for every test case p in \(\mathcal {T} - \{v\}\), while recording the outcome of v when executed immediately after each p, op,v. This is to identify the polluters of v where op,v ≠ ov. For isolation between sequence runs, the plugin executes them in separate Python processes (Bell et al. 2018; Zhang et al. 2014). This implements the Pairwise technique with respect to a single candidate victim v. Figure 2 provides a visual summary of pytest-CANNIER.
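Purely as an illustration of the mechanism (and not the actual pytest-CANNIER code), a stripped-down pytest plugin covering the Baseline and Shuffle modes could be written with standard pytest hooks roughly as follows.

```python
import json
import random

OUTCOMES = {}

def pytest_addoption(parser):
    parser.addoption("--cannier-mode", default="baseline",
                     choices=["baseline", "shuffle"])

def pytest_collection_modifyitems(config, items):
    # Shuffle mode: randomize the test run order in place.
    if config.getoption("--cannier-mode") == "shuffle":
        random.shuffle(items)

def pytest_runtest_logreport(report):
    # Record a pass/fail outcome for the call phase of every test case.
    if report.when == "call":
        OUTCOMES[report.nodeid] = 0 if report.outcome == "passed" else 1

def pytest_sessionfinish(session, exitstatus):
    with open("outcomes.json", "w") as outfile:
        json.dump(OUTCOMES, outfile)
```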

Table 1 The 18 features measured by pytest-CANNIER
Fig. 2 As input, pytest-CANNIER takes a test suite \(\mathcal {T} = \) (t1, t2, t3) and can be launched in four modes: Baseline, Shuffle, Features, or Victim. In the Baseline mode, the plugin runs the test suite in its original order and records the pass/fail outcome of every test case (b1, b2, b3). In the Shuffle mode, pytest-CANNIER executes the test suite in a random order and also records test case outcomes (s1, s2, s3). In the Features mode, the plugin produces a feature vector for each test case (x1, x2, x3). In the Victim mode, pytest-CANNIER takes a victim test case as an additional input (t1) and initially executes it in isolation to ascertain its expected outcome (o1). Then, the plugin executes every other test case in a separate process with the victim immediately following and records its outcome (o2,1, o3,1). This is to identify polluters of the victim

4.2 CANNIER-Framework

4.2.1 Model Training and Evaluation Data

As input, CANNIER-Framework takes a subject set of test suites \(\mathcal {U}\). With every test suite \(\mathcal {T} \in \mathcal {U}\) as input, the framework executes the plugin NB times in the Baseline mode, resulting in NB values of bt, (\(b_{t, 1}, b_{t, 2}, \dots , b_{t, {N_{B}}}\)), for each test case \(t \in \mathcal {T}\). Similarly, CANNIER-Framework runs every test suite NS times in the Shuffle mode, leading to NS values of st, (\(s_{t, 1}, s_{t, 2}, \dots , s_{t, {N_{S}}}\)). In both cases, the framework counts the number of times that every test case fails in the Baseline mode Bt, and the number of times in the Shuffle mode St. The definition of both values is given in the following equation.

$$ B_{t} = \sum\limits^{N_{B}}_{i=1} b_{t, i} \quad \quad S_{t} = \sum\limits^{N_{S}}_{i=1} s_{t, i} $$
(1)

CANNIER-Framework also executes each test suite NF times with pytest-CANNIER in the Features mode, resulting in NF feature vectors for every test case t (\(\mathbf {x}_{t, 1}, \mathbf {x}_{t, 2}, \dots , \mathbf {x}_{t, {N_{F}}}\)). As an additional input, the framework takes \(\mathcal {I}\), a random sample of nF indices ranging from 1 to NF inclusive without replacement. With this, the framework produces a mean feature vector \(\mathbf {X}_{t}(\mathcal {I})\), to encode each test case according to the following equation.

$$ \mathbf{X}_{t}(\mathcal{I}) = \frac{1}{n_{F}} \sum\limits_{i \in \mathcal{I}} \mathbf{x}_{t, i} $$
(2)
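Assuming the per-run feature vectors are stacked in a NumPy array, the mean feature vector of Eq. 2 is a single indexed mean; the array contents below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N_F, n_features = 30, 18                   # runs in the Features mode, features per test
x_t = rng.random((N_F, n_features))        # placeholder for the measured feature vectors

n_f = 5                                    # sample size n_F
sample = rng.choice(N_F, size=n_f, replace=False)  # random sample of run indices, I
X_t = x_t[sample].mean(axis=0)             # mean feature vector X_t(I), shape (18,)
```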

For each \(\mathcal {T} \in \mathcal {U}\), the framework runs the Victim mode of pytest-CANNIER with every test case that had a consistent outcome in the Baseline mode (Bv = 0 ∨ Bv = NB) and an inconsistent outcome in the Shuffle mode (Bv ≠ Sv) as the candidate victim v. The former condition is to ensure that every v has the reliable expected outcome that Pairwise requires. The latter is a time saving measure — if a test case is consistent in the Shuffle mode then it is very unlikely to be a victim and therefore would have no polluters. For the purposes of greater reproducibility and isolation, CANNIER-Framework executes the plugin in a separate Docker container for every run of a test suite (Docker documentation 2022). Our Dockerfile contains all the commands needed to reproduce our Docker image and is available as part of the replication package (CANNIER experiment 2022).

Once the plugin has finished performing the test suite runs, CANNIER-Framework determines a ground-truth label yt,ϕ, for every test case t in the whole subject set, \(t \in \bigcup _{\mathcal {T} \in \mathcal {U}} \mathcal {T}\), and flaky test classification problem ϕ. Recall from Section 3 that these problems are: NOD flaky tests versus the rest of the test cases (NOD-vs-Rest, ϕ = 1), NOD flaky tests versus victim flaky tests (NOD-vs-Victim, ϕ = 2), victim flaky tests versus the rest (Victim-vs-Rest, ϕ = 3), and polluters versus the rest (Polluter-vs-Rest, ϕ = 4). Each problem has a domain \(\mathcal {T}_{\phi } \subseteq \mathcal {T}\), that is the subset of test cases in a given test suite \(\mathcal {T}\) that are relevant. Since the problems are binary classifications, they also have a positive class, \(\mathcal {T}^{+}_{\phi } \subset \mathcal {T}_{\phi }\), and a negative class, \(\mathcal {T}^{-}_{\phi } = \mathcal {T}_{\phi } - \mathcal {T}^{+}_{\phi }\). The ground-truth label for a test case is positive if it is in the positive class of a problem (yt,ϕ = 1) and negative otherwise (yt,ϕ = 0). For a test case t belonging to test suite \(\mathcal {T}\), the following equation defines the ground truth label yt,ϕ.

$$ y_{t, \phi} = \left\{\begin{array}{cl} 0 & \quad \text{if } t \in \mathcal{T}^{-}_{\phi} \\ 1 & \quad \text{if } t \in \mathcal{T}^{+}_{\phi} \end{array}\right. $$
(3)

For the NOD-vs-Rest problem (ϕ = 1), the positive class is the set of NOD flaky tests, that we define as those with an inconsistent outcome in the Baseline mode (0 < Bt < NB). The only test cases that are relevant to the NOD-vs-Victim problem (ϕ = 2) are those that did not consistently fail during the runs in the Baseline mode (Bt < NB) and failed at least once in Shuffle mode (St > 0). The former condition corresponds to the Setup stage of iDFlakies where such test cases would be excluded from further analysis. The latter corresponds to the Running stage, where any test case that fails at least once goes on to the Classification stage. For this problem, the positive class is also the set of NOD flaky tests. For the Victim-vs-Rest problem (ϕ = 3), the positive class is the set of test cases with a consistent outcome in the Baseline mode (Bt = 0 ∨ Bt = NB) and an inconsistent outcome in the Shuffle mode (Bt ≠ St). This represents the set of victims. Finally, for the Polluter-vs-Rest problem (ϕ = 4), the positive class is the set of test cases that behaved as polluters in the Victim mode. Table 2 gives a definition of each problem.
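Given the failure counts from the Baseline and Shuffle modes, the ground-truth labelling for the first three problems reduces to a few comparisons; the sketch below is illustrative only (Polluter-vs-Rest labels come from the Victim mode instead), and None marks a test case outside a problem's domain.

```python
def ground_truth(B_t, S_t, N_B, behaved_as_polluter):
    """Return ground-truth labels per problem for one test case."""
    nod = 0 < B_t < N_B                           # inconsistent in the Baseline mode
    victim = B_t in (0, N_B) and B_t != S_t       # consistent baseline, inconsistent shuffle
    return {
        "NOD-vs-Rest": int(nod),
        # Domain: did not consistently fail in Baseline and failed at least once in Shuffle.
        "NOD-vs-Victim": int(nod) if (B_t < N_B and S_t > 0) else None,
        "Victim-vs-Rest": int(victim),
        "Polluter-vs-Rest": int(behaved_as_polluter),
    }
```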

Table 2 The four flaky test classification problems

4.2.2 Model Training and Evaluation Procedure

CANNIER-Framework follows a general machine learning pipeline for model training and evaluation. The pipeline leaves the specific model and data balancing technique unspecified, such that it can be instantiated with a choice for both of these components to create a concrete pipeline. The pipeline performs stratified 10-fold cross validation. This creates ten folds where 90% of the test cases in the whole subject set are for training and the other 10% are for evaluation. The class proportion of each fold roughly follows that of the whole subject set, and since that is highly imbalanced for every classification problem, the framework applies the data balancing technique to the training set only (Chawla et al. 2002). For each fold, the framework fits the machine learning model with the training set and applies it to every test case in the evaluation set. For a given problem ϕ, this results in a predicted probability \(P(y_{t, \phi } = 1 | \mathbf {X}_{t}(\mathcal {I}))\), of each test case in the evaluation set being of the positive class. Since the evaluation portion of every fold is unique, after ten folds each test case in the whole subject set has a prediction. Figure 3 offers an overview of the general pipeline. Given a lower-threshold ωl and an upper-threshold ωu on the predicted probability as further inputs, CANNIER-Framework assigns a predicted label \(z_{t, \phi }(\mathcal {I}, \omega _{l}, \omega _{u})\), to every test case, as previously shown in Fig. 1. The following equation defines the predicted label for a test case, denoted zt,ϕ.

$$ z_{t, \phi}(\mathcal{I}, \omega_{l}, \omega_{u}) = \left\{\begin{array}{cl} 0 & \quad \text{if } P(y_{t, \phi} = 1 | \mathbf{X}_{t}(\mathcal{I})) < \omega_{l} \\ 1 & \quad \text{if } P(y_{t, \phi} = 1 | \mathbf{X}_{t}(\mathcal{I})) \ge \omega_{u} \\ y_{t, \phi} & \quad \text{if } \omega_{l} \le P(y_{t, \phi} = 1 | \mathbf{X}_{t}(\mathcal{I})) < \omega_{u} \end{array}\right. $$
(4)
Fig. 3 CANNIER-Framework performs stratified k-fold cross validation upon the set of all test cases in the subject set, \(\bigcup _{\mathcal {T} \in \mathcal {U}} \mathcal {T}\). Following this, it applies a data balancing technique to the training portion of each fold. The framework then trains a machine learning model using the mean feature vectors \(\mathbf {X}_{t}(\mathcal {I})\), and ground-truth labels yt,ϕ, of every test case t in each training portion. Finally, for each fold, CANNIER-Framework applies the trained model to the feature vectors of every test case in the evaluation portion. Since the evaluation portion of each fold is unique, every test case ends up with a predicted probability of being in the positive class, \(P(y_{t,\phi } = 1 | \mathbf {X}_{t}(\mathcal {I}))\)

Using the ground-truth and predicted labels for each test case in a given test suite \(\mathcal {T}\), the framework calculates the frequencies of the four confusion matrix categories: true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN). From these, it calculates the Matthews correlation coefficient (MCC) to assess the detection performance of the machine learning model for a given problem ϕ. MCC takes values in the closed real range between -1 and 1, where 1 indicates a model with perfect agreement between the ground-truth labels and the predicted labels and 0 indicates a model that is no better than random guessing of the predicted labels. A model with an MCC of -1 indicates perfect disagreement between the ground-truth labels and the predicted labels, such that taking a model with an MCC of 1 and inverting the predicted labels would yield an MCC of -1. We selected MCC as the overall performance metric, as opposed to F1 score, because it only produces a high value if the model performs well in terms of all four confusion matrix categories, whereas F1 score ignores true-negatives (Chicco and Jurman 2020). See Fig. 4 for a summary of how CANNIER-Framework combines pytest-CANNIER and the general machine learning pipeline from Fig. 3 to produce this data. The following equation defines \(\textit {MCC}_{\mathcal {T}_{\phi }}\) with respect to the four confusion matrix categories respectively denoted as \(\textit {TP}_{\mathcal {T}_{\phi }}\), \(\textit {FP}_{\mathcal {T}_{\phi }}\), \(\textit {FN}_{\mathcal {T}_{\phi }}\), and \(\textit {TN}_{\mathcal {T}_{\phi }}\).

$$ \begin{array}{@{}rcl@{}} \textit{TP}_{\mathcal{T}_{\phi}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \sum\limits_{t \in \mathcal{T}_{\phi}} y_{t, \phi} z_{t, \phi}(\mathcal{I}, \omega_{l}, \omega_{u}), \\ \textit{FP}_{\mathcal{T}_{\phi}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \sum\limits_{t \in \mathcal{T}_{\phi}} [1 - y_{t, \phi}] z_{t, \phi}(\mathcal{I}, \omega_{l}, \omega_{u}), \\ \textit{FN}_{\mathcal{T}_{\phi}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \sum\limits_{t \in \mathcal{T}_{\phi}} y_{t, \phi} [1 - z_{t, \phi}(\mathcal{I}, \omega_{l}, \omega_{u})], \\ \textit{TN}_{\mathcal{T}_{\phi}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \sum\limits_{t \in \mathcal{T}_{\phi}} [1 - y_{t, \phi}] [1 - z_{t, \phi}(\mathcal{I}, \omega_{l}, \omega_{u})], \\ \textit{MCC}_{\mathcal{T}_{\phi}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \frac{ \textit{TP}_{\mathcal{T}_{\phi}} \textit{TN}_{\mathcal{T}_{\phi}} - \textit{FP}_{\mathcal{T}_{\phi}} \textit{FN}_{\mathcal{T}_{\phi}} }{ \sqrt{ (\textit{TP}_{\mathcal{T}_{\phi}} + \textit{FP}_{\mathcal{T}_{\phi}} ) (\textit{TP}_{\mathcal{T}_{\phi}} + \textit{FN}_{\mathcal{T}_{\phi}} ) (\textit{TN}_{\mathcal{T}_{\phi}} + \textit{FP}_{\mathcal{T}_{\phi}} ) (\textit{TN}_{\mathcal{T}_{\phi}} + \textit{FN}_{\mathcal{T}_{\phi}} )}}\\ \end{array} $$
(5)
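In practice, the MCC of Eq. 5 can be computed directly from the ground-truth and predicted labels, for example with scikit-learn (the label arrays here are illustrative):

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # ground-truth labels y_t
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]   # predicted labels z_t
print(matthews_corrcoef(y_true, y_pred))
```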
Fig. 4 An overview of how CANNIER-Framework combines pytest-CANNIER and the general machine learning pipeline, with subject set \(\mathcal {U}\), random sample \(\mathcal {I}\), and thresholds ωl and ωu as input. It references previously defined figures and equations (e.g., Eqs. 1 through 5 and Figs. 2 and 3)

4.2.3 Technique Evaluation Procedure

CANNIER-Framework evaluates the application of CANNIER to Rerun (CANNIER+Rerun), the Classification stage of iDFlakies (CANNIER+iDFClass), and Pairwise (CANNIER+Pairwise). We developed a mathematical model, which we implemented within the framework, to estimate the detection performance and single-core time cost associated with a set of parameters for the three techniques. CANNIER-Framework uses the ground-truth labels and predicted probabilities for each test from the NOD-vs-Rest problem (ϕ = 1) to model CANNIER+Rerun. It uses the data from the NOD-vs-Victim problem (ϕ = 2) to model CANNIER+iDFClass in an equivalent fashion. In both of these cases, the ground-truth labels represent the output from the “Rerun/iDFClass” block and the predicted probabilities represent the output from the “Model” block in Fig. 1a. The parameters of CANNIER+Rerun are the lower- and upper-thresholds on the model prediction, ωl and ωu, the sample size to produce the mean feature vectors for each test case, denoted nF, and the maximum number of times to execute a test case without observing an inconsistent outcome, written as Rmax. For CANNIER+iDFClass, the parameters are ωl, ωu, nF, and the percentage of additional failures to recheck, denoted γ. The framework uses the outcomes from the Victim mode and the predicted probabilities from the Victim-vs-Rest (ϕ = 3) and Polluter-vs-Rest (ϕ = 4) problems to model CANNIER+Pairwise. The outcomes represent the “Pairwise” block and the predicted probabilities represent the “Victim model” and “Polluter model” blocks in Fig. 1b. For CANNIER+Pairwise, the parameters are the threshold for the victim model ωV, the threshold for the polluter model ωP, and nF.

Given a random sample \(\mathcal {I}\) of size nF along with ωl and ωu, CANNIER-Framework estimates the detection performance of CANNIER+Rerun and CANNIER+iDFClass as an MCC value. For every test case t in a given test suite \(\mathcal {T}\), the framework needs its individual time cost Ct, and the number of times Rerun is expected to execute it \(R_{t}(\mathcal {I}, \omega _{l}, \omega _{u})\), to estimate the time cost of CANNIER+Rerun, \(C^{\mathit {Rerun}}_{\mathcal {T}}(\mathcal {I}, \omega _{l}, \omega _{u})\). It can find Ct from the output of pytest-CANNIER in the Features mode, since this is the third feature in Table 1. As for \(R_{t}(\mathcal {I}, \omega _{l}, \omega _{u})\), when \(P(y_{t, 1} = 1 | \mathbf {X}_{t}(\mathcal {I}))\) is not in the ambiguous region between ωl and ωu, CANNIER+Rerun does not delegate to Rerun and so it never executes t (\(R_{t}(\mathcal {I}, \omega _{l}, \omega _{u}) = 0\)). Otherwise, when yt,1 = 0, Rerun would execute t exactly Rmax times since t is not NOD flaky and therefore Rerun would never observe an inconsistent outcome (\(R_{t}(\mathcal {I}, \omega _{l}, \omega _{u}) = R_{\mathit {max}}\)). If yt,1 = 1, Rerun would execute t until it either observes an inconsistent outcome or reaches a limit of Rmax runs. We refer to the final run number where either of these conditions is met as rt. In this case, \(R_{t}(\mathcal {I}, \omega _{l}, \omega _{u})\) is the expected value of the discrete, finite distribution P(rt = x). The probability of t giving an inconsistent outcome after exactly x runs is Exact(t,x). This is the probability of t failing x − 1 times and then passing once, or passing x − 1 times and then failing once. When x < Rmax, P(rt = x) = Exact(t,x). However, when x = Rmax, P(rt = x) is the probability of t giving an inconsistent outcome after exactly Rmax runs, Exact(t,Rmax), or not giving an inconsistent outcome after reaching the limit of Rmax runs. Where \(\mathbb {E}[P]\) is the expected value of the distribution P, the definition of the time cost of CANNIER+Rerun, denoted \(C^{\textit {Rerun}}_{\mathcal {T}}\), is given by the following equation.

$$ \begin{array}{@{}rcl@{}} C_t &=& \frac{1}{N_F} \sum\limits^{N_{F}}_{i = 1} \mathbf{x}_{t, i, 3}, \\ \mathit{Exact}(t, x) &=& \left( \frac{B_t}{N_{B}} \right)^{x - 1} \left( 1 - \frac{B_{t}}{N_{B}} \right) + \left( 1 - \frac{B_{t}}{N_{B}} \right)^{x - 1} \frac{B_{t}}{N_{B}}, \\ P(r_t = x) &=& \left\{\begin{array}{ c l } \mathit{Exact}(t, x) & \quad \text{if } 1 < x < R_{\textit{max}} \\ \mathit{Exact}(t, R_{\textit{max}}) + (1 - {\sum}_{r = 2}^{R_{\textit{max}}} \mathit{Exact}(t, r)) & \quad \text{if } x = R_{\textit{max}} \end{array} \right., \\ R_{t}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \left\{ \begin{array}{ c l } 0 & \quad \text{if } P(y_{t, 1} = 1 | \mathbf{X}_{t}(\mathcal{I})) < \omega_l \vee P(y_{t, 1} = 1 | \mathbf{X}_{t}(\mathcal{I})) \ge \omega_{u} \\ R_{\textit{max}} & \quad \text{if } \omega_l \le P(y_{t, 1} = 1 | \mathbf{X}_{t}(\mathcal{I})) < \omega_{u} \wedge y_{t, 1} = 0 \\ \mathbb{E}[P(r_t = x)] & \quad \text{if } \omega_l \le P(y_{t, 1} = 1 | \mathbf{X}_{t}(\mathcal{I})) < \omega_u \wedge y_{t, 1} = 1 \end{array} \right., \\ C^{\textit{Rerun}}_{\mathcal{T}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \sum\limits_{t \in \mathcal{T}} C_{t} R_{t}(\mathcal{I}, \omega_{l}, \omega_{u}) \end{array} $$
(6)
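The expected number of reruns for a NOD flaky test, \(\mathbb {E}[P(r_{t} = x)]\), follows directly from its Baseline-mode failure rate; the sketch below evaluates the finite distribution of Eq. 6 (an illustration under the assumption that 0 < Bt < NB).

```python
def expected_reruns(B_t, N_B, R_max):
    """Expected runs of a NOD flaky test until an inconsistent outcome, capped at R_max."""
    p_fail = B_t / N_B

    def exact(x):
        # Probability of failing x-1 times then passing, or passing x-1 times then failing.
        return p_fail ** (x - 1) * (1 - p_fail) + (1 - p_fail) ** (x - 1) * p_fail

    probs = {x: exact(x) for x in range(2, R_max)}
    probs[R_max] = exact(R_max) + (1 - sum(exact(r) for r in range(2, R_max + 1)))
    return sum(x * p for x, p in probs.items())
```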

To estimate the time cost of CANNIER+iDFClass, \(C^{\mathit {iDFClass}}_{\mathcal {T}}(\mathcal {I}, \omega _{l}, \omega _{u})\), for a given test suite \(\mathcal {T}\), CANNIER-Framework requires the number of times that iDFClass is expected to attempt to classify each test case \(t \in \mathcal {T}_{2}\) as either NOD or a victim, \({\Gamma }_{t}(\mathcal {I}, \omega _{l}, \omega _{u})\). As before, when \(P(y_{t, 2} = 1 | \mathbf {X}_{t}(\mathcal {I}))\) is not in the ambiguous region, CANNIER+iDFClass does not delegate to iDFClass and so it never classifies t (\({\Gamma }_{t}(\mathcal {I}, \omega _{l}, \omega _{u}) = 0\)). Otherwise, iDFClass will classify a test case after its first failure during the Classification stage and will reclassify a percentage of the additional failures as determined by γ. We assume that any test case undergoing classification by iDFClass has a uniform probability of appearing at any position in the original and modified test run orders. Under this assumption, the mean length of the truncated original and modified orders would both be equal to half the size of the test suite. Therefore, the mean time cost of classifying a single test case is equal to that of one full test suite run, as given by the following equation.

$$ \begin{array}{@{}rcl@{}} {\Gamma}_t(\mathcal{I}, \omega_l, \omega_u) &=& \left\{\begin{array}{cl} 0 & \quad \text{if } P(y_{t, 2} = 1 | \mathbf{X}_t(\mathcal{I})) < \omega_l \vee P(y_{t, 2} = 1 | \mathbf{X}_t(\mathcal{I})) \ge \omega_u \\ 1 + \gamma (S_t - 1) & \quad \text{if } \omega_l \le P(y_{t, 2} = 1 | \mathbf{X}_t(\mathcal{I})) < \omega_u \end{array} \right., \\ C^{\textit{iDFClass}}_{\mathcal{T}}(\mathcal{I}, \omega_{l}, \omega_{u}) &=& \left( \sum\limits_{t \in \mathcal{T}_2} {\Gamma}_t(\mathcal{I}, \omega_l, \omega_u) \right) \sum\limits_{t \in \mathcal{T}} \textit{C}_t \end{array} $$
(7)

CANNIER-Framework estimates the detection performance of CANNIER+Pairwise as the ratio of victim-polluter pairs that would be detected by Pairwise to all such pairs in a given test suite \(\mathcal {T}\). We selected this simpler metric, as opposed to MCC, because we assume that CANNIER+Pairwise will never incorrectly label a pair of test cases as having a victim-polluter relationship when they do not (false-positive). Under this assumption, this metric is equivalent to true-positive rate (TPR), also known as sensitivity. It has a range between 0 and 1, where 0 indicates that CANNIER+Pairwise detected none of the victim-polluter pairs and 1 indicates that it detected all of them. Since we designed the framework to only consider non-NOD flaky tests as candidate victims, such that they all have a reliable expected outcome, we have sufficient assurance that the assumption holds. Recall from Section 3.3 that CANNIER+Pairwise builds a set of victims \(\mathcal {T}_{V}(\mathcal {I}, \omega _{V})\), and polluters \(\mathcal {T}_{P}(\mathcal {I}, \omega _{P})\), given victim- and polluter-thresholds ωV and ωP. It then executes Pairwise with only the pairs in \(\mathcal {T}_{P}(\mathcal {I}, \omega _{P}) \times \mathcal {T}_{V}(\mathcal {I}, \omega _{V})\). CANNIER-Framework builds these sets using the predicted probabilities from the Victim-vs-Rest (ϕ = 3) and Polluter-vs-Rest (ϕ = 4) problems. The framework calculates TPR by dividing the number of victim-polluter pairs in \(\mathcal {T}_{P}(\mathcal {I}, \omega _{P}) \times \mathcal {T}_{V}(\mathcal {I}, \omega _{V})\) by the number of such pairs in \(\mathcal {T} \times \mathcal {T}\). In other words, it divides the number of true-positives (TP) by the number of positives (P). To know how many pairs are in both sets, CANNIER-Framework relies on the outcomes recorded by pytest-CANNIER in the Victim mode. The framework estimates the time cost of CANNIER+Pairwise, \(C^{\mathit {Pairwise}}_{\mathcal {T}}(\mathcal {I}, \omega _{\mathcal {V}}, \omega _{\mathcal {P}})\), based on the sizes of both sets and the individual time costs of their members. The definition of \(\textit {TPR}_{\mathcal {T}}\) and the time cost of CANNIER+Pairwise, denoted \(C^{\textit {Pairwise}}_{\mathcal {T}}\), is provided by the following equation.

$$ \begin{array}{@{}rcl@{}} \mathcal{T}_V(\mathcal{I}, \omega_V) &=& \{v \mid v \in \mathcal{T}, P(y_{v, 3} = 1 | \mathbf{X}_v(\mathcal{I})) \ge \omega_V\}, \\ \mathcal{T}_P(\mathcal{I}, \omega_P) &=& \{p \mid p \in \mathcal{T}, P(y_{p, 4} = 1 | \mathbf{X}_p(\mathcal{I})) \ge \omega_P\}, \\ \textit{TP}_{\mathcal{T}}(\mathcal{I}, \omega_V, \omega_P) &=& \sum\limits_{p \in \mathcal{T}_P(\mathcal{I}, \omega_P)} |\{v \mid v \in \mathcal{T}_V(\mathcal{I}, \omega_V) - \{p\}, o_{p, v} \neq o_v\}|, \\ \textit{P}_{\mathcal{T}} &=& \sum\limits_{p \in \mathcal{T}} |\{v \mid v \in \mathcal{T} - \{p\}, o_{p, v} \neq o_v\}|, \\ \textit{TPR}_{\mathcal{T}}(\mathcal{I}, \omega_V, \omega_P) &=& \frac{\textit{TP}_{\mathcal{T}}(\mathcal{I}, \omega_V, \omega_P)}{\textit{P}_{\mathcal{T}}}, \\ C^{\textit{Pairwise}}_{\mathcal{T}}(\mathcal{I}, \omega_V, \omega_P) &=& \left( |\mathcal{T}_P(\mathcal{I}, \omega_P)| \sum\limits_{v \in \mathcal{T}_V(\mathcal{I}, \omega_V)} C_v \right) + \left( |\mathcal{T}_V(\mathcal{I}, \omega_V)| \sum\limits_{p \in \mathcal{T}_P(\mathcal{I}, \omega_P)} C_p \right) \end{array} $$
(8)

5 Empirical Evaluation

We conducted experiments to answer the following research questions:

RQ1. How effective is machine learning-based flaky test detection?

RQ2. What impact do mean feature vectors have on the performance of machine learning-based flaky test detection?

RQ3. What contribution do individual features have on the output values of machine learning models for detecting flaky tests?

RQ4. What impact does CANNIER have on the performance and time cost of rerunning-based flaky test detection?

5.1 Subject Set

For this paper’s subject set, we used the test suites of the 26 open-source Python projects studied in our previous work (Parry et al. 2022a). We selected these at random from a list of projects critical to open-source infrastructure created by the Open Source Security Foundation (Open source project criticality score (beta) 2022). For this paper, we randomly selected four more projects to improve the generalizability of the results. We used CANNIER-Framework to produce a dataset from these 30 test suites that contains 89,668 tests. We set the framework to perform 2,500 runs of each test suite in the Baseline mode of pytest-CANNIER (NB = 2500), 2,500 runs in the Shuffle mode (NS = 2500), and 30 runs in the Features mode (NF = 30). Table 3 shows each project’s GitHub repository; the total number of tests (\(|\mathcal {T}|\)); the number of NOD flaky tests (\(|\mathcal {T}^{+}_{1}|\)), victims (\(|\mathcal {T}^{+}_{3}|\)), and polluters (\(|\mathcal {T}^{+}_{4}|\)); the number of victim-polluter pairs (\({\sum }_{p \in \mathcal {T}} |\{v | v \in \mathcal {T} - \{p\}, o_{p, v} \neq o_{v}\}|\)); and the combined mean time cost of every test in seconds (\({\sum }_{t \in \mathcal {T}} \bigl [ \frac {1}{N_{F}} {\sum }^{N_{F}}_{i = 1} \mathbf {x}_{t, i, 3} \bigr ]\)).

Table 3 The 30 open-source Python projects examined in this paper’s study

The projects of our subject set cover a wide variety of topics. All are hosted on the Python Package Index (PyPI), which allows developers to associate them with zero or more “topic classifiers”. Topic classifiers are multi-level, for example: Software Development :: Libraries :: Python Modules. A developer may also specify a parent classifier on its own (e.g., just Software Development). Table 4 lists the topic classifiers of the 30 Python subjects. It also provides the frequencies of each classifier, taking into account their hierarchical nature.

Table 4 The topic classifiers of the subject projects and their frequencies

5.2 Methodology

5.2.1 RQ1. How Effective is Machine Learning-Based Flaky Test Detection?

The motivation behind this question is to establish a baseline for the performance of machine learning models for detecting flaky tests. While several studies have addressed this question for NOD flaky tests (Alshammari et al. 2021; Bertolino et al. 2021; Parry et al. 2022a; Pinto et al. 2020), and we addressed it for victims in our previous work (Parry et al. 2022a), no previous study has addressed it for polluters. It is important to consider polluters when answering RQ1 since they offer developers useful information when repairing victim flaky tests and are a necessary input to techniques for mitigating them (Lam et al. 2020; Parry et al. 2020; Shi et al. 2019).

We used CANNIER-Framework to evaluate 24 concrete machine learning pipelines for each of the four flaky test classification problems. We derived these from the combination of two choices of model type, four choices of model configuration, and three choices of data balancing technique. These choices form the concrete instantiations of the “Data Balancing” and “Model” blocks in our general pipeline from Fig. 3. The two model types we considered were random forest (Breiman 2001; Shi and Horvath 2006) and extra trees (Geurts et al. 2006) (the latter being a more randomized variant of the former). These are ensemble models that fit a number of decision trees (Safavian and Landgrebe 1991) on subsets of the training data. We selected these particular model types due to their success in our previous work (Parry et al. 2022a) and the related work of other authors (Alshammari et al. 2021). The choices of model configuration were four values for the number of decision trees used by the random forest or extra trees model. These values were 25, 50, 75, and 100. In our previous work, we only considered random forest and extra trees models with 100 decision trees — the default value of our selected implementation (Scikit-learn 2022). Finally, for the three choices of data balancing, we evaluated the synthetic minority oversampling technique (SMOTE) (Chawla et al. 2002), SMOTE combined with edited nearest-neighbors (SMOTE+ENN), and SMOTE with Tomek links (Tomek 1976) (SMOTE+Tomek). SMOTE performs oversampling, meaning it produces synthetic data points of the minority class via interpolation. The ENN and Tomek techniques on their own perform undersampling, meaning they remove data points of the majority class based on similarity with their neighbors. The combination of these with SMOTE produces a hybrid balancing approach.
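To make the pipeline components concrete, one such instantiation (an extra trees model with 100 trees, SMOTE balancing, and stratified 10-fold cross validation, implemented with scikit-learn and imbalanced-learn) might look roughly as follows; the feature matrix and labels are placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((200, 18))                          # placeholder mean feature vectors
y = (rng.random(200) < 0.1).astype(int)            # placeholder, highly imbalanced labels

probs = np.zeros(len(y))
for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # Balance the training portion only, then fit the model and predict probabilities.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X[train], y[train])
    model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
    probs[test] = model.predict_proba(X[test])[:, 1]   # P(positive class) per test case
```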

For each of the 24 × 4 = 96 concrete machine learning pipelines, we fixed the feature sample size at a single sample (nF = 1) and had the framework repeat the model training and evaluation procedure 30 times (see Fig. 3), using a different random sample \(\mathcal {I}\) to produce the mean feature vectors every time. In each instance, this resulted in 30 values of P(yt,ϕ = 1) for every test case t and problem ϕ. To evaluate the performance of the pipelines, CANNIER-Framework needed predicted labels to calculate the confusion matrix category frequencies and MCC against the ground-truth labels for each problem. To produce the predicted labels to address this research question, we replaced \(z_{t, \phi }(\mathcal {I}, \omega _{l}, \omega _{u})\) in Eq. 5 with the following definition of \(z_{t, \phi }(\mathcal {I})\) that assigns a test case to its most likely class:

$$ z_{t, \phi}(\mathcal{I}) = \left\{\begin{array}{cl} 0 & \quad \text{if } P(y_{t, \phi} = 1 | \mathbf{X}_{t}(\mathcal{I})) < 0.5 \\ 1 & \quad \text{if } P(y_{t, \phi} = 1 | \mathbf{X}_{t}(\mathcal{I})) \ge 0.5 \end{array} \right. $$
(9)

With these predicted labels, we used CANNIER-Framework to calculate the confusion matrix category frequencies and the MCC of the 96 pipelines with respect to each of the 30 subject test suites in turn. We also had the framework calculate this with respect to the whole subject set for each pipeline by summing the category frequencies for each project and calculating the overall MCC from this total. This is to provide an individual assessment with respect to each test suite as well as an overview for the whole subject set. For the per-project and overall evaluations, CANNIER-Framework calculated mean values for the category frequencies and the MCC over the 30 repeats of model training and evaluation. This is to offer an evaluation that is more reliable given the non-determinism inherent to the machine learning models, the data balancing techniques, and potentially the dynamic feature values.

5.2.2 RQ2. What Impact do Mean Feature Vectors have on the Performance of Machine Learning-Based Flaky Test Detection?

In previous studies on machine learning-based flaky test detection with dynamic test case features (Alshammari et al. 2021; Parry et al. 2022a), researchers performed only a single instrumented test suite run to create the feature vectors. The rationale for this question is to investigate the impact of using feature vectors that are the mean from multiple instrumented test suite runs. In the context of this study, that is multiple runs in the Features mode of pytest-CANNIER. This is to mitigate against the possible variance in the dynamic features. As an example, previous studies have found that the line coverage of test cases can vary across repeated executions (Hilton et al. 2018; Shi et al. 2019; Vysali et al. 2020). Since three features in Table 1 are based on line coverage, we expect there to be some degree of noise in their values for each test case that could impact the detection performance of the model.

We took the best machine learning pipeline (in terms of the overall MCC) for each classification problem from the previous research question and followed the same methodology for training and evaluation, except we gave CANNIER-Framework a range of values for nF to produce \(\mathcal {I}\) between 1 and 15 samples inclusive. With 30 repeats of model training and evaluation for each value of nF, this resulted in 15 × 30 = 450 rounds of stratified 10-fold cross validation for each problem. This process enabled us to investigate the correlation between the number of repeated measurements to produce the mean feature vectors and the MCC of the resultant model.

5.2.3 RQ3. What Contribution do Individual Features have on the Output Values of Machine Learning Models for Detecting Flaky Tests?

In the interest of model explainability, we set out to investigate the impact of each individual feature in Table 1. To address this question, we applied the Shapley Additive Explanations (SHAP) technique (Lundberg et al. 2020). It leverages concepts from game theory to quantify the contribution of an individual feature to the output value of a machine learning model for an individual data point. As inputs, SHAP takes a feature matrix and a model and returns a matrix of SHAP values in the same shape as the feature matrix. The SHAP value at (i,j) in the matrix represents the contribution of the jth feature on the model output for the ith data point relative to the mean output value over the dataset. This is such that summing the rows of the SHAP value matrix and adding the mean output value gives the original model output values.

In the context of this study, the features are those in Table 1, the data points are test cases, and the model output values are the predicted probabilities of each test case being in the positive class for a given flaky test classification problem. As the feature matrix, we used the mean feature vector for each test case over the 30 runs of pytest-CANNIER in the Features mode (nF = NF). As the machine learning model, we used CANNIER-Framework to train the best pipeline from RQ1 using the mean feature matrix. We did this for each of the four classification problems.

Once we had a SHAP value matrix for each problem, we ranked every feature in terms of their mean absolute SHAP value over every test case. A high value would indicate that the feature has a significant impact on the model’s decision (regardless of whether the impact is in favour of the negative class or the positive) and a low value would suggest the opposite. We then retrained the best pipeline for each problem with just the top 15, 12, 9, 6, and 3 features (with 30 repeats in each case). This is to observe the effect of dropping the less impactful features on the performance of the model.
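For a fitted tree ensemble, computing the SHAP value matrix and the mean absolute ranking takes only a few lines with the shap package; the sketch below assumes model and X come from a pipeline like the one sketched for RQ1, and the exact return shape of shap_values varies between shap versions.

```python
import numpy as np
import shap

# `model` is a fitted tree ensemble and `X` its mean feature matrix.
explainer = shap.TreeExplainer(model)
shap_values = np.asarray(explainer.shap_values(X))
# Binary classifiers may yield one matrix per class (or a 3-D array, depending on the
# shap version); reduce to the positive-class matrix of shape (n_tests, n_features).
if shap_values.ndim == 3:
    shap_values = shap_values[1] if shap_values.shape[0] == 2 else shap_values[..., 1]
mean_abs_shap = np.abs(shap_values).mean(axis=0)   # mean |SHAP| per feature
ranking = np.argsort(mean_abs_shap)[::-1]          # most impactful features first
```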

5.2.4 RQ4. What Impact Does CANNIER have on the Performance and Time Cost of Rerunning-Based Flaky Test Detection?

The motivation behind this research question is to investigate if CANNIER is able to reduce the time cost of rerunning-based flaky test detection techniques while maintaining good detection performance. For the application of CANNIER to the three techniques from Section 2.1, we used CANNIER-Framework to calculate the detection performance and single-core time cost associated with every point in a sample of their parameter spaces. For CANNIER+Rerun and CANNIER+iDFClass, the space represents the values of the 3-tuple (ωl,ωu,nF), that is, the lower-threshold, the upper-threshold, and the number of samples to produce the mean feature vectors. In the case of CANNIER+Rerun, since Rmax (the maximum number of times to execute a test case without observing an inconsistent outcome) is a parameter of the underlying Rerun technique, rather than a parameter introduced by CANNIER, we kept its value fixed at NB (the number of test suite runs in the Baseline mode: 2,500). Similarly, for CANNIER+iDFClass, we fixed the value of γ (the percentage of additional failures to recheck) to 20% because it is a parameter of iDFClass and not one introduced by CANNIER. This particular value was recommended by the authors of iDFlakies (Lam et al. 2019). For the detection performance and time cost of a given point for CANNIER+Rerun/CANNIER+iDFClass, CANNIER-Framework calculated the mean over the 30 sets of predicted probabilities for the NOD-vs-Rest/NOD-vs-Victim problem from the 30 repeats of model training and evaluation for the given value of nF from RQ2. For CANNIER+Pairwise, the parameter space represents (ωV,ωP,nF), the victim-threshold, the polluter-threshold, and the number of samples once more. In this case, the framework calculated the mean detection performance and time cost over 30 random pairs of the 30 sets of predicted probabilities for the Victim-vs-Rest problem and the 30 sets for the Polluter-vs-Rest problem for the given value of nF.

For the sample of points in (ωl,ωu,nF), we used values for ωl from 0 to 1 inclusive with a step of 0.01, values for ωu from ωl to 1.01 inclusive with a step of 0.01, and values for nF in the closed integer range from 1 to 15, except when ωl = 0 ∧ ωu = 1.01, in which case nF = 0. The reason for starting from ωl and going up to 1.01 for ωu is to ensure that \(\omega_{l} \leq \omega_{u}\) always holds and that CANNIER-Framework evaluates the points where there is no upper-threshold on \(P(y_{t, 1} = 1 | \mathbf{X}_{t}(\mathcal{I}))\) (see the second clause of Eq. 4). The reason that nF = 0 when ωl = 0 ∧ ωu = 1.01 is to indicate that the machine learning model, and therefore feature collection, is redundant, because under these conditions the ambiguous region covers the entire range of \(P(y_{t, 1} = 1 | \mathbf{X}_{t}(\mathcal{I}))\) and CANNIER+Rerun and CANNIER+iDFClass reduce to the original rerunning-based Rerun and iDFClass, respectively (see the third clause of Eq. 4). As the sample of points in (ωV,ωP,nF), we used values for both ωV and ωP from 0 to 1 inclusive with a step of 0.01. This excludes 1.01 because, when one or both thresholds are greater than 1, the set of victims and/or polluters is empty and Pairwise has nothing to do, since \(\mathcal{T}_{V}(\mathcal{I}, \omega_{V}) \times \mathcal{T}_{P}(\mathcal{I}, \omega_{P}) = \emptyset\). For nF, the framework considers values from 1 to 15, except when ωV = ωP = 0, in which case nF = 0. The reason that nF = 0 in this case is to indicate that the model is redundant, because \(\mathcal{T}_{V}(\mathcal{I}, \omega_{V}) = \mathcal{T}_{P}(\mathcal{I}, \omega_{P}) = \mathcal{T}\) and thus CANNIER+Pairwise reduces to the original Pairwise.
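To make the role of the two thresholds concrete, the following is a minimal sketch of the per-test-case decision rule that Eq. 4 formalizes for CANNIER+Rerun. The function and parameter names are illustrative and the exact boundary handling follows Eq. 4 in the paper rather than this sketch; `rerun` stands for a callback that applies the expensive rerunning-based technique to the test case.

```python
def cannier_rerun_label(p, omega_l, omega_u, rerun):
    """Label one test case given the model's predicted probability `p` of it
    being NOD flaky, the lower/upper thresholds, and a `rerun` callback that
    applies the original rerunning-based technique (the expensive path)."""
    if p < omega_l:
        return False   # confidently not NOD flaky: no reruns needed
    if p >= omega_u:
        return True    # confidently NOD flaky: no reruns needed
    return rerun()     # ambiguous region: defer to the rerunning-based technique
```

With ωl = 0 and ωu = 1.01, neither threshold clause can fire, so every test case is deferred and the technique reduces to the original Rerun; with ωl = ωu, the ambiguous region disappears and the model alone decides every label, which is the pure machine learning baseline considered in Section 5.2.4.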

We had CANNIER-Framework add the time taken to collect features to the overall time cost for each point. Since many features are dynamic, they require nF test suite runs to measure, making the time cost of doing so \(n_{F} \sum_{t \in \mathcal{T}} C_{t}\) for some test suite \(\mathcal{T}\). For the points where nF = 0, where the other parameters render the machine learning model redundant, this additional time cost is zero. We did not consider the time cost associated with applying the model to each test case because it is negligible relative to the time taken to execute the test suite (Parry et al. 2022a). We also did not consider the time taken to train the model as part of the time cost of applying it. This is because the model only needs to be trained once and can then be applied any number of times, making training an off-line stage with a cost that can be amortized across uses.

We used CANNIER-Framework to compute the two-dimensional Pareto fronts of detection performance and time cost, with respect to the whole subject set, for the sample of points for CANNIER+Rerun, CANNIER+iDFClass, and CANNIER+Pairwise. In this context, the Pareto front represents the subset of points such that, for each point, the detection performance is the greatest compared to all other points with the same time cost. To answer this research question, we compared the detection performance and time cost associated with the point representing the balanced application of CANNIER against those of the point where it reduces to the original rerunning-based detection technique, for each of the three fronts. As the point representing balanced CANNIER, we used the knee point, that is, the point on the Pareto front with the smallest Euclidean distance to the utopia point (Zavala and Flores-Tlacuahuac 2012). The utopia point represents a "perfect" solution that does not necessarily exist; in the context of this study, that would be the point with a detection performance of 1, for either MCC or true-positive rate (TPR), and a time cost of 0 seconds. For CANNIER+Rerun and CANNIER+iDFClass, we also considered the point where they reduce to pure machine learning-based detection as an additional baseline. For this special case, we used the point on the Pareto front with the greatest MCC that also satisfies ωl = ωu. For all points that satisfy this condition, the techniques never defer to Rerun or iDFClass because there is no ambiguous region between the two thresholds. For CANNIER+Pairwise, there is no such point, because it only limits the problem space for Pairwise but nevertheless always defers to it.
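The following sketch illustrates how such a two-dimensional Pareto front and its knee point might be computed from a list of (time cost, detection performance) points. The normalization of both axes before measuring the distance to the utopia point is an assumption of this sketch rather than a detail stated here.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated subset of `points`, a list of
    (time_cost, performance) tuples: keep a point only if no other point is
    at least as fast and strictly better performing."""
    ordered = sorted(points, key=lambda cp: (cp[0], -cp[1]))  # cheap first, best first on ties
    front, best = [], float("-inf")
    for cost, perf in ordered:
        if perf > best:
            front.append((cost, perf))
            best = perf
    return front

def knee_point(front):
    """Return the front point with the smallest Euclidean distance to the
    utopia point of zero time cost and perfect detection performance.
    Both axes are scaled to [0, 1] first (an assumption of this sketch)."""
    costs = np.array([c for c, _ in front], dtype=float)
    perfs = np.array([p for _, p in front], dtype=float)
    span_c = costs.max() - costs.min() or 1.0
    span_p = perfs.max() - perfs.min() or 1.0
    norm_c = (costs - costs.min()) / span_c
    norm_p = (perfs - perfs.min()) / span_p
    distances = np.sqrt(norm_c ** 2 + (1.0 - norm_p) ** 2)  # utopia at (0, 1)
    return front[int(np.argmin(distances))]
```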

5.3 Threats to Validity

When deciding the ground-truth labels, CANNIER-Framework could incorrectly label some flaky tests as non-flaky. We used the framework to execute every test suite 2,500 times in their original test run orders to identify NOD flaky tests and 2,500 times in shuffled orders to identify victims. Given the non-deterministic nature of flaky tests, it is generally not possible to label a test case as non-flaky with complete certainty (Harman and O’hearn 2018). We mitigated this issue by having CANNIER-Framework perform as many reruns as possible within the limits of our available computational resources. In total, this stage required over six weeks of computational time on a computer with a 24-core AMD Ryzen 5900X CPU. While confidence in the label increases with the number of reruns, so too does the computational cost. In our previous work (Parry et al. 2022a), we found the relationship between the number of detected flaky tests and the number of test suite reruns to be sublinear. This finding supports another previous study, the authors of which identified a similar relationship (Alshammari et al. 2021). This implies that continuing to re-execute a test suite gives diminishing returns with respect to the confidence of labelling a test case as non-flaky. It also gives us confidence that the overall results of this paper would have been the same had the plugin performed more reruns, because it is unlikely that it would have detected significantly more flaky tests. Furthermore, pytest-CANNIER is unlikely to detect certain flaky test categories by rerunning alone. For example, "implementation-dependent" flaky tests may require changes to standard library implementations to manifest (Shi et al. 2016; Zhang et al. 2021). The only categories we made specific arrangements to detect were victims and their polluters; other special categories are out of the scope of this paper's empirical study.

Our concrete machine learning pipelines, consisting of the random forest/extra trees model with SMOTE data balancing and the 18 features in Table 1, may unfairly represent machine learning-based flaky test detection. Numerous previous studies (Alshammari et al. 2021; Camara et al. 2021a; 2021b; Haben et al. 2021; Pinto et al. 2020; Pontillo et al. 2022) identified random forest as the most suitable type of machine learning model for detecting flaky tests. In our previous work (Parry et al. 2022a), we found that the extra trees model, a variant of random forest, was better suited for detecting flaky tests in some cases. Furthermore, the 18 features are based on the 16 features of Flake16, which we found to yield better detection performance when used to encode test cases compared to the previous state-of-the-art feature set (Alshammari et al. 2021). This suggests that our choice of pipeline and features is among the most suitable currently in the literature for detecting flaky tests.

There is a chance that CANNIER-Framework and pytest-CANNIER contain bugs that could influence the results of our evaluation. Naturally, it is impossible to be certain that any non-trivial software system is totally free of bugs. However, we used well-established Python libraries for the bulk of the framework's important functionality. These included Coverage.py (Coverage.py 2022) to measure line coverage, psutil (Psutil documentation 2022) to measure many other dynamic test case properties, Radon (Welcome to radon's documentation! 2022) to measure source code metrics, scikit-learn (Scikit-learn 2022) for implementations of the random forest and extra trees models, and shap (Welcome to the SHAP documentation! 2022) to calculate the SHAP value matrices for RQ3. These are all popular open-source projects with many contributors, giving us confidence that any bugs would be identified, documented, and patched in a timely manner. We also wrote unit tests for greater confidence in the bespoke elements of CANNIER-Framework and pytest-CANNIER.

It is possible that the results of our study would not generalize to other Python projects outside of the 30 that we sampled, or to projects written in other programming languages. We randomly sampled 30 Python projects from a list of the top-200 most critical to open-source infrastructure, as determined by the Open Source Security Foundation (Open source project criticality score (beta) 2022). Part of its metric for determining the criticality of a project is based on how many other projects declare a dependency on it. Therefore, any issues caused by flaky tests in these projects could potentially impact a wider portion of the Python ecosystem. Of course, this does not guarantee that our sample generalizes to all Python projects, but it does give us some assurance that the flaky tests we examined could represent a more serious problem than flaky tests in less critical projects. Without extending our subject set to include projects written in other languages, we cannot make any assurances that our results generalize outside of Python. Broadly speaking, however, our approach is language-agnostic. Considering Table 1, our 18 features could apply to almost any commonly used programming language. Therefore, we see no compelling reason to suggest that our results could not be reproduced with projects written in other languages, such as Java. In addition, it is possible that individual projects in our subject set with significantly more test cases than others could bias the overall results. For example, Airflow had the highest number of NOD flaky tests at 66, which is 264% of the second highest. To address this concern, CANNIER-Framework calculated performance metrics with respect to each individual project.

Given the empirical nature of this paper's study, it may be difficult to reproduce our results. We took steps to make our methodology as repeatable as possible. Firstly, we included all scripts and software that we developed to facilitate this study in the replication package (CANNIER experiment 2022). This includes our Dockerfile and the requirements files for generating Python virtual environments (Virtual environments and packages 2022). Secondly, we repeated 30 times any aspect of the study that could be impacted by non-determinism, such as producing the predicted probabilities of test cases being flaky. As such, the final results reported in this paper are the mean across these 30 repeats. Finally, wherever CANNIER-Framework relied on random number generators (such as when instantiating machine learning models), we set the seed to a constant value to ensure that the results are the same across repeated runs.
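For example, a minimal sketch of the kind of seeding we mean, assuming scikit-learn models and NumPy-based sampling (the constant value itself is arbitrary):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

SEED = 0  # arbitrary constant; the point is that it never changes between runs

rng = np.random.default_rng(SEED)                # deterministic sampling
model = ExtraTreesClassifier(n_estimators=100,
                             random_state=SEED)  # deterministic model training
```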

6 Results

6.1 RQ1. How Effective is Machine Learning-Based Flaky Test Detection?

Table 5 shows the top-12 concrete machine learning pipelines (out of 24) for each flaky test classification problem in terms of overall MCC. Recall from Section 5.2.1 that these MCC values are with respect to the entire subject set and are the mean over 30 repeats of model training and evaluation (see Eq. 5). Extra trees appears to be the best model for the NOD-vs-Rest and Victim-vs-Rest problems, with pipelines using extra trees consistently at the top of the respective tables. For NOD-vs-Victim and Polluter-vs-Rest, the most performant model appears to be random forest, though with less consistency. In terms of data balancing, the best pipelines for each problem used plain SMOTE. Unlike SMOTE+Tomek, SMOTE+ENN did not make it into the top-12 for any problem. In all cases, the decrease in detection performance going down the table is small, such that the difference in overall MCC between the best pipeline and the 12th-best pipeline is minor.

Table 5 The top-12 pipelines (out of 24) for each flaky test classification problem in terms of overall MCC

Tables 6 and 7 show the per-project and overall confusion matrix category frequencies (TN, FN, FP, TP) and MCC of the best pipeline for each flaky test classification problem. Table 6a shows the performance for the NOD-vs-Rest problem. The table lists relatively few projects with a defined value for MCC because many in the subject set contain zero or only very few NOD flaky tests (see Table 3). The overall MCC for this problem is 0.53 and the mean per-project MCC is close at 0.52. Recall that CANNIER-Framework calculated the overall MCC from the overall confusion matrix category frequencies, which are the sums of the per-project frequencies. An MCC of 1 indicates a perfect model and an MCC of 0 indicates a model no better than random guessing. Therefore, the detection performance of the best pipeline for this problem was fairly lackluster. Furthermore, the standard deviation of the per-project MCC is relatively high at 0.29, suggesting that the performance of the pipeline is quite variable between projects. This is further evident from the wide range of MCC values among the different projects. Table 6b shows the results for the NOD-vs-Victim problem. Once again, the table contains relatively few projects with an MCC value for the same reason as before. At 0.69, the overall MCC for this problem is greater than that for NOD-vs-Rest. Also, the standard deviation of the per-project MCC is lower at 0.22. However, the mean of 0.55 is considerably lower than the overall MCC.
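For reference, the following sketch shows this aggregation: the per-project confusion matrix frequencies are summed element-wise and MCC is computed from the totals. The dictionary structure is illustrative.

```python
import math

def mcc(tn, fn, fp, tp):
    """Matthews correlation coefficient from confusion matrix frequencies
    (treated as 0 when the denominator is 0)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator

def overall_mcc(per_project):
    """Overall MCC from a mapping of project name -> (tn, fn, fp, tp):
    sum the frequencies across projects, then apply the formula once."""
    totals = [sum(column) for column in zip(*per_project.values())]
    return mcc(*totals)
```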

Table 6 The per-project and overall results of the best pipelines from Table 5 for the NOD-vs-Rest (a) and NOD-vs-Victim (b) problems
Table 7 The per-project and overall results of the best pipelines from Table 5 for the Victim-vs-Rest (a) and Polluter-vs-Rest (b) problems

Table 7a gives the performance for the Victim-vs-Rest problem. At 0.52, the overall MCC is very close to the mean per-project MCC of 0.51 and is comparable to that of NOD-vs-Rest. Unlike the previous two problems, there are many more projects with a defined value for MCC, since most test suites in the subject set contained victim flaky tests. Finally, Table 7b gives the results for the Polluter-vs-Rest problem. While the overall MCC is very high at 0.95, the mean per-project MCC is much lower at 0.46 and the standard deviation is the greatest of all four problems at 0.34.

6.2 RQ2. What Impact do Mean Feature Vectors have on the Performance of Machine Learning-based Flaky Test Detection?

Figure 5 shows the relationship between the sample size used to produce the mean feature vectors (nF) and the overall detection performance (MCC) of the best pipeline for each classification problem. Recall from Section 4.2.2 that CANNIER-Framework encoded test cases with feature vectors that were the mean of a random sample (\(\mathcal{I}\)) of the output from 30 test suite runs in the Features mode of pytest-CANNIER. Figure 5a shows the relationship for the NOD-vs-Rest problem. At 0.86, the Spearman's rank correlation coefficient (ρ) indicates that the relationship is positive. However, the gradient (a) of the line of best fit (in red) is small at just 0.0014. The MCC when nF = 15 on the line of best fit is only 4% greater than the MCC when nF = 1. For the NOD-vs-Victim problem, Fig. 5b indicates that the relationship is weaker, with a correlation coefficient of 0.71. In this case, the gradient is even smaller (0.0007), with just a 1% increase in MCC from nF = 1 to nF = 15.

Fig. 5

Plots showing that the relationship between the number of samples to produce the mean feature vectors (nF) and the overall detection performance (MCC) of the best machine learning pipeline is positive but variable in terms of strength and gradient across the four problems. MCC values are the mean over 30 repeats. Captions give the coefficients of the red least-squares best-fit line (MCC = a × nF + b) and the Spearman’s rank correlation coefficient (ρ)

Figure 5c shows the relationship for Victim-vs-Rest. The correlation coefficient of 1.00 indicates a very strong positive correlation, as is clear from the plot. The gradient of the line of best fit is comparable to NOD-vs-Rest (0.0019). In the case of the Polluter-vs-Rest problem, Fig. 5d also shows a very strong positive relationship between nF and MCC with a corresponding correlation coefficient of 1. However, the gradient is very small (0.0008).
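For clarity, the statistics reported in Fig. 5 can be reproduced along the following lines, assuming `n_f` holds the sample sizes (1 to 15) and `mcc_values` the corresponding mean overall MCC values; the names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def fit_trend(n_f, mcc_values):
    """Least-squares best-fit line MCC = a * n_F + b and Spearman's rank
    correlation coefficient for the relationship between sample size and
    detection performance."""
    a, b = np.polyfit(n_f, mcc_values, deg=1)   # slope (gradient) and intercept
    rho, _ = spearmanr(n_f, mcc_values)         # rank correlation, ignoring the p-value
    return a, b, rho
```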

6.3 RQ3. What Contribution do Individual Features have on the Output Values of Machine Learning Models for Detecting Flaky Tests?

Figure 6 shows the SHAP values for the four flaky test classification problems as beeswarm plots. In each plot, every feature in Table 1 is represented by a row, with each value in its corresponding column of the SHAP value matrix plotted as a colored dot, one for every test case in the whole subject set. The horizontal position of each dot represents the SHAP value itself, with negative SHAP values towards the left and positive SHAP values towards the right, as indicated by the x-axis labels. Recall from Section 5.2.3 that a positive SHAP value means the contribution of the feature to the model output value from the best pipeline for a given test case and problem was positive (increased it). Conversely, a negative SHAP value means the contribution was negative (decreased it). In this context, the output value is the predicted probability of the test case belonging to the positive class of the problem. This means that if a feature contributes positively to the output, it "pushes" the model towards predicting the positive class, and if it contributes negatively, it pushes the model towards the negative class. The color of each dot represents the feature value relative to the mean feature value, with lower values colored blue and higher values colored red. For example, a blue dot on the left side of the x-axis indicates a test case with a relatively low feature value and a negative contribution. The vertical positions of the dots represent density, such that dots with similar SHAP values "swarm" around one another. From top to bottom, the features are in descending order of mean absolute SHAP value. In other words, the features closer to the top have a greater overall impact on the model output.

Fig. 6

SHAP values for the four flaky test classification problems as beeswarm plots. These are based on the models from the best pipelines for each problem from RQ1. Blue dots represent lower feature values and red dots represent higher feature values. Purple dots represent feature values closer to the mean value. The vertical positions of the dots represent density, such that dots with similar SHAP values "swarm" around one another. Features are in descending order of their mean absolute SHAP value, which each beeswarm plot gives in parentheses. This is a measure of their overall impact on the model's decision

For the NOD-vs-Rest problem, the contributions of AST Depth, Run Time, Read Count, Context Switches, Write Count, Wait Time, Max. Children, and Test Lines of Code appear positive (towards predicting NOD flaky) when their values are high and negative when their values are low. This is evident from how the dots on the left side of their rows in Fig. 6a are mostly blue and those on the right are mostly red. Conversely, the contribution of Assertions appears negative when its value is high and positive when low, as visualized by mostly red dots on the left and mostly blue dots on the right. The contribution of some features appears more nuanced. For example, when the contribution of Covered Change is negative, its value is mostly high. However, when its contribution is positive, its value is mixed. For the NOD-vs-Victim problem (Fig. 6b), Context Switches, Run Time, Max. Threads, Read Count, Write Count, Cyclomatic Complexity, External Modules, Max. Children, and Halstead Volume appear to contribute positively (towards predicting NOD flaky) when their values are high and negatively when low. The contributions of the individual features for this problem appear considerably less well-defined compared to NOD-vs-Rest. There are some similarities between the results for these two problems, such as Run Time, Read Count, Context Switches, Write Count, and Max. Children mostly contributing positively when high and negatively when low.

As shown by Fig. 6c, the contributions of the Maintainability, Write Count, Read Count, and Wait Time features appear broadly positive (towards predicting victim flaky) when their values are high and negative when low. On the other hand, Source Covered Lines, Cyclomatic Complexity, and Halstead Volume show the opposite behavior with moderate consistency. The impact of the features for this problem differs significantly from that for the NOD-vs-Victim problem. For example, the Maintainability and Cyclomatic Complexity features appear to have nearly the exact opposite contribution patterns. Finally, for the Polluter-vs-Rest problem (Fig. 6d), Run Time, Assertions, Halstead Volume, and Wait Time contribute positively when high. Covered Lines, Source Covered Lines, and Max. Children show the opposite contribution.

Figure 7 shows how the overall MCC of the best pipeline for each problem decreases as the number of features used by CANNIER-Framework to train the model is reduced, starting from the least impactful in terms of mean absolute SHAP value. For example, for the NOD-vs-Rest problem, the MCC value at 6 on the x-axis corresponds to a model that only considers AST Depth, Max. Threads, Run Time, Max. Memory, Read Count, and Context Switches. Initially, the detriment to detection performance is fairly small, as only the least important features are pruned. However, at around 9 features, the overall MCC begins to drop considerably for every problem.

Fig. 7

The relationship between the overall MCC of the best machine learning pipelines for each flaky test classification problem and the number of top features used by CANNIER-Framework to train the model in terms of mean absolute SHAP value. On the left side of the plot, only the less impactful features are removed, which has little effect on detection performance. Towards the right, the more impactful features are dropped, resulting in a significant reduction of MCC. MCC values are the mean over 30 repeats of model training and evaluation

6.4 RQ4. What Impact Does CANNIER have on the Performance and Time Cost of Rerunning-Based Flaky Test Detection?

Figure 8 shows the Pareto fronts of overall detection performance and time cost for the application of CANNIER to the three rerunning-based detection techniques (see Eqs. 5, 6, 7, and 8). From right to left, the first pin on each curve is at the point representing the original rerunning-based technique (where the machine learning model becomes redundant). The second is at the point representing the balanced application of CANNIER (the knee point). Tables 8, 9, and 10 give the per-project and overall results at this point. For CANNIER+Rerun and CANNIER+iDFClass, the third is at the point representing pure machine learning-based detection (greatest MCC where ωl = ωu). Above each pin, in square brackets, are the detection performance and time cost associated with the point (its coordinates on the axes). Below, in parentheses, are its parameters.

Fig. 8

The Pareto fronts of detection performance and time cost for the application of CANNIER to the three rerunning-based detection techniques. From right-to-left, the first pin on each curve is at the point representing the original rerunning-based technique. The second is at the point representing the balanced application of CANNIER. For CANNIER+Rerun (a) and CANNIER+iDFClass (b), the third is at the point representing pure machine learning-based detection. There is no third pin for CANNIER+Pairwise (c) because it is not possible to use a pure machine learning-based approach in this context (see Section 5.2.4). Above each pin in square brackets is the detection performance and time cost with respect to the whole subject set. Below in parentheses are the parameters

Table 8 The per-project and overall results for CANNIER+Rerun
Table 9 The per-project and overall results for CANNIER+iDFClass
Table 10 The per-project and overall results for CANNIER+Pairwise

Figure 8a and Table 8 give the results for CANNIER+Rerun. As shown by the figure, the time cost associated with the point representing balanced CANNIER+Rerun (middle pin) is 89% lower than the time cost associated with the point representing the original Rerun (right pin). At 0.92, the MCC at the balanced CANNIER+Rerun point is significantly greater than the MCC at the point representing pure machine learning-based detection (left pin), which is 0.55. As shown by the table, the per-project MCC is very consistent. Naturally, the MCC at the original Rerun point is exactly 1, since the predicted labels are the same as the ground-truth labels in this case (see Eq. 4). Furthermore, the time cost at the pure machine learning point is significantly lower than the time cost at the other points of interest. This is because the only time cost associated with this point is that of collecting feature data. These results demonstrate that applying CANNIER to Rerun can significantly reduce its time cost while maintaining a detection performance that is far greater than that of the extra trees model alone.

Figure 8b and Table 9 show the results for CANNIER+iDFClass. As shown by the figure, the general picture is similar to CANNIER+Rerun but somewhat attenuated. The reduction in time cost from the original iDFClass to balanced CANNIER+iDFClass is 84%, slightly less than that for CANNIER+Rerun. In addition, the difference in MCC between the balanced CANNIER+iDFClass point (0.97) and the pure machine learning point (0.71) is slightly less significant. The per-project MCC is broadly consistent, as shown by the table. The overall implications of these results are the same as before, namely that applying CANNIER to iDFClass sacrifices a minimal degree of detection performance for a considerable reduction in time cost.

Figure 8c and Table 10 give the results for CANNIER+Pairwise. Again, the overall story is similar to the two prior techniques. In this case, the drop in time cost between the original Pairwise and balanced CANNIER+Pairwise is the greatest at 92%. Furthermore, the true-positive rate (TPR) at the point representing balanced CANNIER+Pairwise is very high at 0.94. Yet, the table shows that the per-project detection performance varies significantly, far more than for the previous two techniques. This could be explained by the relatively high variance in the per-project detection performance of the machine learning pipeline for the Polluter-vs-Rest problem (see Table 7b).

7 Discussion

7.1 RQ1. How Effective is Machine Learning-based Flaky Test Detection?

As shown by Table 5, there is not much difference in terms of overall MCC between consecutive pipelines in the top-12 for each classification problem. Nonetheless, some patterns have emerged from our choice of pipeline configurations. For NOD-vs-Rest and Victim-vs-Rest, extra trees appears to be the clear winner for the type of model, consistently occupying the top positions in both tables. Extra trees is a more randomized variant of random forest, an ensemble model based on decision trees (Breiman 2001; Geurts et al. 2006; Safavian and Landgrebe 1991; Shi and Horvath 2006). Both fit individual trees on a random subset of the features from a random sample of the data points in the training data. The major difference between the two models is how nodes in the decision tree are split: random forest uses an optimal split, whereas extra trees uses a random split. The additional randomness introduced by extra trees trades increased bias for reduced variance. Increased bias means the model may fail to recognize relationships between the feature data and labels, known as underfitting. Reduced variance means the model may be less sensitive to noise and outliers, avoiding overfitting. The fact that extra trees was more performant with respect to NOD-vs-Rest and Victim-vs-Rest could suggest that this particular trade-off was more beneficial for these two problems than for NOD-vs-Victim and Polluter-vs-Rest. Establishing the reason for this, however, would require further investigation.
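In scikit-learn terms, the two models we compare here differ only in how splits are chosen; a minimal sketch follows (the hyperparameter values are illustrative).

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Both fit an ensemble of decision trees, each considering a random subset of
# features at every split. Random forest searches for the best threshold for
# each candidate feature, whereas extra trees draws thresholds at random,
# trading a little extra bias for reduced variance.
random_forest = RandomForestClassifier(n_estimators=100, random_state=0)
extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=0)
# Usage: random_forest.fit(X_train, y_train); extra_trees.fit(X_train, y_train)
```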

The pipelines with more trees tended to yield greater detection performance than those with the same model type and balancing but fewer trees. This is expected, since the motivation behind random forest and extra trees is to fit decision trees with decoupled prediction errors, such that taking an average of their individual predictions leads to some errors cancelling out. Therefore, it stands to reason that more trees would lead to greater performance. Of course, increasing the number of trees can only improve the model up to a point, and there are some instances in our results where more trees did not lead to better performance.

Plain SMOTE (without additional under-sampling) appeared to yield better pipelines than SMOTE+ENN and SMOTE+Tomek. Recall from Section 5.2.1 that SMOTE (Chawla et al. 2002) synthetically increases the number of data points in the minority class via interpolation. The combination of SMOTE with an additional under-sampling technique, however, not only produces synthetic members of the minority class but also discards some members of the majority class. It could be that the removal of real data points was detrimental to the performance of the pipelines that used these techniques, though further investigation would be required to be sure.
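The three balancing configurations can be expressed as imbalanced-learn pipelines, sketched below; the classifier and its settings are illustrative, and balancing is applied only when fitting, not when predicting.

```python
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier

# Plain SMOTE only adds synthetic minority-class points by interpolating
# between neighbours; the combined variants additionally remove some real
# samples after over-sampling, which is the cleaning step suspected of
# hurting performance here.
pipelines = {
    "SMOTE": Pipeline([("balance", SMOTE(random_state=0)),
                       ("model", ExtraTreesClassifier(random_state=0))]),
    "SMOTE+ENN": Pipeline([("balance", SMOTEENN(random_state=0)),
                           ("model", ExtraTreesClassifier(random_state=0))]),
    "SMOTE+Tomek": Pipeline([("balance", SMOTETomek(random_state=0)),
                             ("model", ExtraTreesClassifier(random_state=0))]),
}
# Usage: pipelines["SMOTE"].fit(X_train, y_train)
```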

Table 6b shows the per-project and overall results of the best pipeline for the NOD-vs-Victim problem. There is a fairly significant difference between the overall MCC of 0.69 and the per-project mean MCC of 0.55. Recall that CANNIER-Framework calculates the overall MCC from the sum of the per-project confusion matrix category frequencies. This disparity is probably caused by the individual results for IPython and Airflow having a disproportionate impact on the overall result since they have significantly more victim flaky tests than the other subject projects (see Table 3). This is also seen in Table 7b for Polluter-vs-Rest, though in this case the difference between the mean and overall MCC is much larger. Once again, this is likely due to the influence of individual projects with relatively many polluters.

The per-project MCC varies quite considerably, with a standard deviation ranging from 0.22 to 0.34 across the four problems. We would expect that projects with fewer flaky tests would have a poorer MCC than those with more, simply because they have fewer positive examples to train the model. However, our results do not appear to show this trend. Therefore, further investigation is required to fully understand why the MCC for some projects is so much greater than that of others.

7.2 RQ2. What Impact do Mean Feature Vectors have on the Performance of Machine Learning-based Flaky Test Detection?

Our conclusion for RQ2, as illustrated by Fig. 5, is that increasing the sample size used to produce the mean feature vectors increases the overall MCC of the best pipeline for the four flaky test classification problems. This is not surprising, given that the literature has already established a degree of non-determinism in some of the dynamic features in Table 1 (Hilton et al. 2018; Shi et al. 2019; Vysali et al. 2020). What is more interesting is how weak the effect on MCC appears to be, despite being clearly positive, as illustrated by the very small gradient of the line of best fit. Despite this, at the point representing balanced CANNIER for all three flaky test detection techniques in RQ4, the number of samples used to produce the mean feature vectors (nF) is fairly high (15, 14, and 9 for CANNIER+Rerun, CANNIER+iDFClass, and CANNIER+Pairwise, respectively). This suggests that the added time cost of performing the extra feature measurements may be a worthwhile trade-off for the increased detection performance.

7.3 RQ3. What Contribution do Individual Features have on the Output Values of Machine Learning Models for Detecting Flaky Tests?

Figure 6 gives the SHAP value beeswarm plots based on the best pipelines for the four flaky test classification problems. These visualize the contribution of the 18 features in Table 1 towards the output value of the model for a given test case. It is important to remember that random forest and extra trees are not causal models and therefore it is not appropriate to infer causality by applying SHAP without considering confounding (Dillon et al. 2021). Furthermore, as demonstrated by our results for RQ1, the detection performance of the models is limited and therefore the SHAP values may not even offer a reliable insight into the correlations between the feature values and the probability of a test case being flaky. Despite this, some of our findings support general intuition and the consensus of the flaky test literature.

For the NOD-vs-Rest problem, we found that Wait Time appears to contribute positively to the extra trees model output (towards predicting NOD flaky) when its value is high and negatively when low. This feature measures the elapsed wall-clock time spent waiting for input/output (I/O) operations to complete. Many empirical studies have pointed to "asynchronous waiting" as a leading cause of NOD flaky tests (Eck et al. 2019; Lam et al. 2020; Luo et al. 2014; Romano et al. 2021), where a test case waits for an insufficient amount of time for an asynchronous operation, such as I/O, to complete. We also found Context Switches and Max. Children to have a similar contribution pattern. Both of these features are associated with concurrency, another leading cause of flakiness as attested by the same studies. Furthermore, Read Count and Write Count, which measure the number of times the filesystem performed input and output respectively, also appear to contribute positively to the model output when high and negatively when low. Previous work has identified I/O itself as a cause of flaky tests (Luo et al. 2014), but this behavior could also be related to asynchronous waiting, since Wait Time is time spent waiting for I/O and could correlate with Read Count and Write Count.
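Several of these dynamic features can be sampled with psutil, which pytest-CANNIER uses for this purpose. The sketch below is illustrative only: the exact mapping from psutil counters to the features in Table 1, and the differencing of snapshots taken before and after each test case, are assumptions here, and some counters (e.g., I/O wait) are platform-dependent.

```python
import psutil

def dynamic_snapshot():
    """Snapshot some of the dynamic metrics discussed above for the current
    (test-running) process; a per-test feature would be the difference
    between snapshots taken before and after that test case executes."""
    proc = psutil.Process()
    ctx = proc.num_ctx_switches()            # voluntary + involuntary switches
    io = proc.io_counters()                  # read_count / write_count (not on every OS)
    return {
        "context_switches": ctx.voluntary + ctx.involuntary,
        "read_count": io.read_count,
        "write_count": io.write_count,
        "threads": proc.num_threads(),                    # Max. Threads takes the peak over the test
        "children": len(proc.children(recursive=True)),   # likewise for Max. Children
        "rss_memory": proc.memory_info().rss,
        "io_wait": getattr(proc.cpu_times(), "iowait", 0.0),  # Linux only; relates to Wait Time
    }
```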

For NOD-vs-Rest and NOD-vs-Victim, Run Time has a positive contribution when high and a negative contribution when low, and ranks highly in terms of overall contribution (i.e., the mean absolute SHAP value). In their evaluation of FlakeFlagger, Alshammari et al. (2021) also found the execution time of test cases to be correlated with the probability of being NOD flaky. However, they were unable to establish any causal link. For the Victim-vs-Rest problem, Write Count, Read Count, and Wait Time seem to have a similar contribution pattern, but to varying degrees of consistency. Since these features are associated with I/O, this correlation could be explained by the relationship between filesystem activity and victim flaky tests established in previous studies (e.g., (Bell et al. 2015; Biagiola et al. 2019; Gambi et al. 2018; Luo et al. 2014; Zhang et al. 2014)).

Seven of the 18 features are static, meaning they are based on the test case code and do not require a test case execution to measure. One of these is AST Depth, which measures the maximum depth of nested program statements. Figure 9 compares two test cases with different values for the AST Depth feature. In terms of mean absolute SHAP value, AST Depth was the most impactful feature for the NOD-vs-Rest problem. While no previous study has examined the relationship between AST Depth and flakiness, intuitively we might expect a high AST Depth to be associated with a higher chance of flakiness, simply because a test case with a higher AST Depth is likely to be more complex and therefore offer more opportunities for flakiness to arise. The beeswarm plot for NOD-vs-Rest appears to broadly support this notion, yet the plots for the other problems do not indicate a clear relationship. This suggests that AST Depth may be correlated with the probability of a test case being NOD flaky.

Fig. 9

Two test cases with different values for the AST depth feature. This feature measures the maximum depth of nested program statements
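One way to compute an AST-depth style metric with Python's standard ast module is sketched below; whether the feature in Table 1 counts all AST nodes or only statement nodes is an assumption here, so this is indicative rather than the exact definition used by pytest-CANNIER.

```python
import ast

def ast_depth(node):
    """Maximum nesting depth of the AST rooted at `node`."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 0
    return 1 + max(ast_depth(child) for child in children)

# Usage on an illustrative test case: deeper nesting yields a larger value.
source = (
    "def test_example():\n"
    "    for i in range(3):\n"
    "        if i % 2 == 0:\n"
    "            assert i < 3\n"
)
print(ast_depth(ast.parse(source)))
```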

There appear to be some tentative relationships between the contribution patterns of the features across the four problems. For NOD-vs-Rest and NOD-vs-Victim, the contributions of Run Time, Read Count, Context Switches, Write Count, and Max. Children are broadly positive when high and negative when low. This could be due to the positive class being the same for both problems and the negative class of NOD-vs-Victim being a subset of the negative class of NOD-vs-Rest. Moreover, the contribution pattern of the features for the NOD-vs-Victim problem differs significantly from that of the Victim-vs-Rest problem. As we saw in Section 6.3, the Maintainability and Cyclomatic Complexity features appear to have nearly opposite contribution patterns between the two problems. This is expected, because the positive class of Victim-vs-Rest is the negative class of NOD-vs-Victim, and the negative class of Victim-vs-Rest is a superset of the positive class of NOD-vs-Victim.

It is clear from Fig. 7 that dropping the less impactful features (in terms of mean absolute SHAP value) has little impact on the detection performance of the best pipeline for each problem. Since the time to fit a random forest/extra trees model grows linearly with the number of features, this is a useful result for expediting the training stage. This is not directly relevant to the conclusions of this paper’s study however, as we are not concerned with the time cost of model training since that is performed off-line from the perspective of a developer using the CANNIER approach.

7.4 RQ4. What Impact Does CANNIER have on the Performance and Time Cost of Rerunning-Based Flaky Test Detection?

We presented CANNIER+iDFClass as a drop-in replacement for the Classification stage of iDFlakies. In theory, the combination of the NOD-vs-Rest and Victim-vs-Rest models could be a substitute for the entire iDFlakies pipeline. This could be realized as CANNIER+iDFlakies, a multi-model approach with a multi-label output: NOD, Victim, or Rest (non-flaky). In practice, the difficulty arises when either of the models is ambiguous for a given test case. To delegate the prediction for such a test case to iDFlakies in this hypothetical scenario, CANNIER+iDFlakies would need to rerun the entire test suite in different orders until the test case fails or the upper limit is reached. This corresponds to the Running stage of iDFlakies. As with the single-model CANNIER+iDFClass given in the paper, it would then execute the prefix of the failing test order, representing the Classification stage of iDFlakies. Naturally, with even a handful of ambiguous cases, the hypothetical multi-model CANNIER+iDFlakies would be unlikely to noticeably reduce the time cost of the Running stage, but it would reduce the time cost of the Classification stage in the same way as the existing single-model CANNIER+iDFClass. Therefore, the benefit of CANNIER+iDFlakies is effectively the same as that of CANNIER+iDFClass, since the latter makes no attempt to expedite the Running stage. For these reasons, we opted to focus on CANNIER+iDFClass due to its simplicity and the fact that it would require fewer modifications to iDFlakies to implement.

As shown in Fig. 8a and Table 8, for the point representing balanced CANNIER+Rerun, the lower-threshold (ωl) is very low at 0.07 and the upper-threshold (ωu) is at its maximum value of 1.01. The latter means that there is effectively no upper-threshold on the predicted probability (see the second clause of Eq. 4). Figure 10 illustrates the distribution of predicted probabilities for test cases in, and gives the frequencies of, each confusion matrix category, for each of the four flaky test classification problems. We produced this figure from the results of RQ1, such that the figure for each classification problem corresponds to its respective table in Tables 6 and 7. Figure 10a focuses on the NOD-vs-Rest problem. The distribution for true-negatives (TN) is concentrated largely around 0 and represents the vast majority of test cases. Furthermore, the distribution for false-negatives (FN) appears highly separable from that for true-negatives. This might explain why ωl is so low: it means CANNIER+Rerun labels most true-negative test cases as negative and prevents them from being delegated to Rerun, significantly reducing time cost, while labelling only a handful of false-negatives as negative, limiting the reduction in detection performance. The distribution for true-positives (TP) is clearly different from that for false-positives (FP) but not as easily separable. However, there are few test cases in both categories relative to true-negatives. Therefore, by setting ωu to its maximum value, CANNIER+Rerun makes no false-positive predictions, ensuring no decrease in detection performance at the expense of a minor increase in time cost. This could explain why there are no false-positive predictions in Table 8.

Fig. 10

The distribution of predicted probabilities for test cases in, and the frequencies of, each confusion matrix category, for each of the four flaky test classification problems. The data is based on the best pipelines from RQ1. Whiskers represent the range from the 5th to the 95th percentile and boxes represent the 25th to the 75th. Middle lines represent the median (50th)

Figure 10b illustrates the distribution of predicted probabilities for NOD-vs-Victim. The situation for this problem and the thresholds for the point representing balanced CANNIER+iDFClass is very similar to NOD-vs-Rest and CANNIER+Rerun. The biggest difference is that the frequency of the true-negative category for NOD-vs-Victim is two orders of magnitude smaller than that for NOD-vs-Rest. The distribution for true-negatives also spreads much further into the distribution for false-negatives. This may explain why the lower-threshold for CANNIER+iDFClass is greater at 0.18 and why the reduction in time cost from iDFClass to CANNIER+iDFClass is smaller.

Figure 10c and d are for Victim-vs-Rest and Polluter-vs-Rest respectively. Once again, the overall picture is similar for both problems. That is, the true-negative category contains the vast majority of test cases and its distribution is broadly separable from the false-negative category. This explains why the victim-threshold (ωV) and polluter-threshold (ωP) for the balanced CANNIER+Pairwise point are low at 0.06 and 0.08 respectively. Uniquely for Polluter-vs-Rest, the true-positive distribution appears very distinct from the false-positive distribution. Perhaps because this problem has significantly more positive examples in the dataset compared to the other problems, the machine learning model can discern unseen positive cases with greater confidence.

7.5 Implications

7.5.1 Researchers

Our findings for RQ1 extend the existing body of work in machine learning-based flaky test detection into the detection of polluter test cases. Identifying polluters is vital for mitigating test-order dependencies (Lam et al. 2020; Parry et al. 2020; Shi et al. 2019), and so our results demonstrate the wider applicability of machine learning models for tackling flaky tests. Our results for RQ2 (supported by those for RQ4) demonstrate that using mean feature vectors can improve the detection performance of machine learning models. We therefore suggest that researchers consider the implications of this when evaluating machine learning-based techniques that use dynamic features. Our results for RQ3 tentatively identify correlations between test case metrics and the probability of a test case being flaky. This is an important foundation for future work in elevating flaky test detection techniques to comprehensive flaky test root-causing techniques, a vital intermediate step towards automated flaky test repair. While such root-causing and repair techniques exist (Lam et al. 2019; Terragni et al. 2020; Wei et al. 2022), they are expensive and limited in scope.

7.5.2 Developers

Our findings for RQ4 demonstrate that CANNIER is a “best of both worlds” approach between rerunning-based and machine learning-based flaky test detection. As shown by Fig. 8, CANNIER reduces time cost by an average of 88% across the three rerunning-based techniques while maintaining good detection performance. For developers, this means not having to trade high time cost for limited detection performance. Furthermore, while we used the knee-point of the Pareto front to represent CANNIER in our evaluation, developers could customize the approach towards lower time cost or greater detection performance by selecting a different point.

8 Related Work

Luo et al. (2014) performed one of the earliest empirical studies of test flakiness. Using 51 projects of the Apache Software Foundation as subjects, they classified 201 commits that repaired flaky tests into 10 categories based on the cause of the flakiness. The most common cause they identified was related to waiting for asynchronous operations. For example, a test case that launches a thread to perform input/output (I/O) and waits a fixed amount of time for it to finish may fail when it takes longer than expected. One of our findings for RQ3 was that the amount of time spent waiting for I/O operations to complete was positively correlated with the probability of a test case being NOD flaky.

Gruber et al. (2021) repeatedly executed the test suites of 22,352 open-source projects and automatically identified 7,571 flaky tests. Like our study, these projects were primarily written in the Python programming language. They randomly sampled 100 NOD flaky tests in their dataset to classify their causes using the categories introduced by Luo et al. (2014). Unlike Luo et al, they found causes related to networking and randomness to be the most prevalent.

Bell et al. (2018) presented an automated technique, called DeFlaker, for detecting NOD flaky tests. The key advantage of DeFlaker over Rerun is that it does not require repeated test case executions. Instead, the technique takes advantage of a project’s history in a version control system. When a test case that passed on a previous version of the software now fails, and does not cover modified code, DeFlaker labels it as flaky. Naturally, DeFlaker requires a test suite run with code instrumentation to measure coverage. Detecting flaky tests using extra trees models with CANNIER-Framework also requires an instrumented run to measure coverage and the other metrics in Table 1. In both cases, this test suite run introduces time overhead. However, DeFlaker requires a run every time a change is made, whereas CANNIER-Framework requires at least one to produce encodings for each test case that would likely remain relevant over a series of changes. Furthermore, DeFlaker can only detect flaky tests after they fail. In contrast, the models trained by the CANNIER-Framework can detect flaky tests preemptively.

Pinto et al. (2020) and Bertolino et al. (2021) both presented machine learning-based flaky test detection techniques based purely on static features of the test case code. Both techniques encoded test cases using a bag-of-words approach. This represents test cases as sparse vectors where each element corresponds to the frequency of a particular identifier or keyword in its source code. Pinto et al used additional static features such as the number of lines of code. Bertolino et al used a k-nearest neighbor classifier (Keller et al. 1985) for the machine learning model and Pinto et al evaluated a range of different models, including random forest. They found random forest to yield the best detection performance, of which we use the extra trees variant in this paper’s study, having found it to be the most effective in our prior work (Parry et al. 2022a). Alshammari et al. (2021) presented FlakeFlagger, a detection technique using a random forest model and encoding test cases with a feature set containing a mixture of static and dynamic test case metrics. Their evaluation showed that their feature set offered a 347% improvement in overall F1 score compared to Pinto et al’s purely static feature set at the cost of a single instrumented test suite run to measure the dynamic features. For this reason, we included both static and dynamic test metrics in our feature set instead of relying on purely static features.
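As a brief illustration of the bag-of-words encoding described above, the sketch below turns test case source code into sparse count vectors with scikit-learn; the test bodies and the token pattern are illustrative, not taken from either study.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative test case bodies; in practice each string would be the source
# code of one test case.
test_sources = [
    "def test_upload():\n    client = make_client()\n    assert client.upload(b'data')\n",
    "def test_retry():\n    with pytest.raises(TimeoutError):\n        fetch(url, retries=0)\n",
]

# Each element of the resulting sparse matrix is the frequency of one
# identifier or keyword in one test case.
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_][A-Za-z_0-9]*")
bow_matrix = vectorizer.fit_transform(test_sources)   # shape: (n_tests, n_tokens)
```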

Shi et al. (2019) presented iFixFlakies, a technique for automatically generating patches for victim flaky tests. Their approach uses delta-debugging (Zeller and Hildebrandt 2002) to identify a victim’s polluters and other test cases that may contain the statements needed to repair the victim, known as cleaners. CANNIER+Pairwise could provide a drop-in replacement for this aspect of iFixFlakies. However, we cannot say for certain if our approach would be faster than using delta-debugging because we have not yet evaluated it in this context.

Lam et al. (2019) presented iDFlakies, a technique for detecting flaky tests and classifying them as either NOD or Victim. The overall process involves repeatedly executing a test suite in a modified order (e.g., shuffled) to identify flaky test cases. Following this, the tool enters a Classification stage where it attempts to determine the category of each flaky test. In this paper’s study, we evaluated the application of CANNIER to the Classification stage of this tool (CANNIER+iDFClass). Our empirical results demonstrated that CANNIER was able to significantly reduce the execution time overhead of the Classification stage at minimal detriment to its detection performance.

9 Conclusion and Future Work

This paper expanded the existing work on machine learning-based flaky test detection and introduced CANNIER, an approach for significantly reducing the time cost of rerunning-based detection techniques by combining them with machine learning models. Initially, using a variety of machine learning pipelines and a feature set of 18 static and dynamic test case metrics, we performed a baseline evaluation of machine learning-based detection on our dataset of 89,668 test cases from 30 Python projects. We evaluated the performance of the pipelines with respect to detecting NOD flaky tests, victim flaky tests, and polluter test cases. Our results suggested that the performance of the machine learning models was lackluster and variable between projects. We then went on to investigate the impact of mean feature vectors on machine learning-based flaky test detection. We identified a positive relationship between the sample size used to produce the mean feature vectors and the detection performance of the machine learning model. In the interest of model explainability, we applied the SHAP technique (Lundberg et al. 2020) to quantify the contribution of each individual feature to the output value of the model. While this technique can only reveal correlations and is not appropriate for inferring causality, we made several findings that support both the general intuition of developers and results from the flaky test literature. Finally, we evaluated CANNIER's impact on three rerunning-based methods for flaky test detection: Rerun, the Classification stage of iDFlakies, and Pairwise. We found that CANNIER was able to significantly reduce time cost at the expense of only a minor decrease in detection performance.

As future work, we intend to further investigate the features associated with test flakiness. In doing so, we will consider applying causal inference techniques (Yao et al. 2021) to gain a deeper understanding of the processes that lead to test flakiness. We will also consider evaluating the performance of machine learning-based detection with respect to more specific categories of flaky tests, such as "implementation-dependent" flaky tests (Shi et al. 2016; Zhang et al. 2021). Finally, we plan to evaluate the efficiency and effectiveness of integrating CANNIER into a wider range of existing flaky test techniques, such as iFixFlakies (Shi et al. 2019).