
1 Introduction

Testing programs with concurrency is a challenging problem for developers. Concurrency introduces non-determinism in the program, making bugs hard to find, reproduce and debug [25, 43]. In fact, concurrency is one of the main reasons behind flaky tests [34] (tests that may pass or fail without any code changes), causing a significant engineering burden on development teams [31]. As concurrency, in the form of multi-threading or distributed systems, is fundamental to how we build modern systems, solutions are required to help developers test their concurrent code for correctness.

There are two important challenges with testing concurrent programs. First is the problem of reproducibility or control. By default, a programmer does not have control over how concurrent workers interleave during execution. The only programmatic control is through enforcing synchronization, but that is usually not enough to guarantee that certain interleavings can be reproduced. The second challenge is the state-space explosion problem. A concurrent program, even with a fixed test input, can have many possible behaviors; in fact, the number of interleavings can be exponential in the length of the execution.

One line of work that attempts to solve these challenges is controlled concurrency testing (CCT) [53]. This approach proposes taking over the scheduling of concurrent workers and then using algorithms, either randomized or systematic, for searching over the space of interleavings. The former (i.e., taking over scheduling) is typically an engineering challenge. It requires understanding the language runtime and building solutions that are efficient, robust and usable. The latter (i.e., searching over the space of interleavings) requires algorithmic and empirical insights on finding bugs, and it has been the main topic of many research publications (e.g., [10, 13, 16, 19, 32, 40,41,42,43, 48, 53,54,55,56]). Both these aspects are essential for industrial adoption.

In this paper, we describe the design and implementation of the open-source tool \(\textsc {Coyote} \) [7] for controlled concurrency testing of \(\textsc {C}{} \texttt {\#} \) programs. \(\textsc {Coyote} \) aims to make testing of concurrent programs as easy and natural as testing of sequential programs.

Usage \(\textsc {Coyote} \) was released on GitHub in March 2020, and since then its release binaries have been downloaded from nuget.org over a million times. The project has extensive documentation as well as tutorials for developers [8]. \(\textsc {Coyote} \) has been used internally at \(\textsc {Microsoft}\) for testing multiple different services of the \(\textsc {Azure}\) cloud infrastructure. Through the use of lightweight telemetry [9], we have consistently seen over three million seconds of testing each month for the last 12 months, peaking at roughly 13 million seconds in a month. \(\textsc {Coyote} \) testing has been invoked 71K times per month on average, reporting around 10K test failures per month on average.

\(\textsc {Coyote} \) is also a testing backend for the P language [15], currently used in Amazon for the analysis of several core distributed systems [5]. A P program is compiled to a \(\textsc {C}{} \texttt {\#} \) program and fed to \(\textsc {Coyote} \) for testing.

Contributions This paper covers the design decisions that were necessary for supporting industrial usage. It is unreasonable to support all programs in a language as broad as \(\textsc {C}{} \texttt {\#} \), so the focus of \(\textsc {Coyote} \) has been on the task asynchronous programming (TAP) model [38] that is the recommended and most common way of expressing concurrency and asynchrony in \(\textsc {C}{} \texttt {\#} \). \(\textsc {Coyote} \) encapsulates multiple state-space exploration techniques from the literature in order to provide state-of-the-art testing to its users. \(\textsc {Coyote} \) is also designed to be extensible, both in supporting other programming models (it already supports an actor programming model [4, 12] and support for threads is straightforward), as well as other exploration strategies. This paper also describes a novel search technique specifically for TAP and its evaluation on industrial benchmarks.

Historical journey The origin of the \(\textsc {Coyote} \) code base can be traced back to an earlier system called \(\textsc {P}{} \texttt {\#} \) [11] that defined a restricted (domain-specific) programming model for communicating state machines. The \(\textsc {P}{} \texttt {\#} \) system has since evolved into an actor framework that is still supported by \(\textsc {Coyote} \); however, \(\textsc {Coyote} \) itself has generalized to focus on TAP, making it a very different tool compared to \(\textsc {P}{} \texttt {\#} \). Prior work with \(\textsc {Coyote} \) has either focused on exploration strategies [39, 40, 48] or on applications [11,12,13], but not on the tool itself.

\(\textsc {Coyote} \) is useful for practitioners looking for industrial-strength tools (for \(\textsc {C}{} \texttt {\#} \)), as well as researchers interested in evaluating new exploration algorithms for concurrency testing. We hope this paper inspires and informs the reader towards contributing new ideas, features, and case studies to \(\textsc {Coyote} \).

2 The \(\textsc {Coyote} \) Tool

The \(\textsc {C}{} \texttt {\#} \) task asynchronous programming (TAP) model revolves around the Task type, which is used to encapsulate parallel computation. One can spawn a new task to execute in parallel with its parent, wait on an existing task to finish, or query for the result of a task once it has finished. Furthermore, the \(\textsc {C}{} \texttt {\#} \) language offers the async and await keywords that make it very convenient to write efficient (non-blocking) programs [37]. Similar features are also mainstream in other languages such as Rust, Python, JavaScript and Go, and even \(\textsc {C}{} \texttt {++} \) has support for them. Their semantics are fairly standard, so we omit them for space reasons and instead illustrate using an example.

Fig. 1 shows a typical concurrency test that we will use as a running example in this paper. The RunTest method creates two parallel tasks t1 and t2, waits for them to finish and asserts some condition. A programmer can run this test as-is with \(\textsc {Coyote} \) to find if the assertion can fail. There are two key points to note about this example. First, its behavior is interleaving dependent. The loop in SendMessages adds a string to the global list variable that is shared between the two tasks, so its final value will have a mix of strings of the form aN and bN, depending on the interleaving order. (This program has an unsynchronized access to list, but let us assume for simplicity that operations on List are atomic; in practice, one can guard these operations with locks.) Second, while this code seemingly only has two tasks, at runtime it can have up to 100 tasks created by the .NET runtime. The initial task created by SendMessages starts executing the async lambda code, but when it hits the await point, the runtime can (optionally) end the current task and spawn a new one to execute the rest of the code after the awaited expression finishes. (This “magic” happens when async methods get de-sugared by the \(\textsc {C}{} \texttt {\#} \) compiler into state machines [52]. This transformation is what allows the code to be non-blocking.) Note that the await in this code can be hit 100 times (50 for each call to SendMessages). We will revisit the complexity imposed by such implicit tasks, both for the tool to take control (§4.1) and for state-space exploration (§3.2); for now, we focus on the user experience.
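To make the interleaving dependence concrete, the following Python sketch (illustrative only, not \(\textsc {Coyote} \) code; the worker ids, the schedule encoding, and the 3-iteration bound are our own simplifications) replays the two loops of the test under an explicit schedule of worker choices:

```python
# Illustrative replay of the Fig. 1 test: two workers append "aN"/"bN"
# to a shared list; the final contents depend entirely on the schedule.

def run_test(schedule, n=3):
    """schedule: sequence of worker ids (0 or 1), one per scheduling point."""
    shared = []
    counters = [0, 0]            # next loop index for each worker
    prefixes = ["a", "b"]
    for w in schedule:
        i = counters[w]
        if i < n:                # worker still has loop iterations left
            shared.append(f"{prefixes[w]}{i}")
            counters[w] += 1
    return shared

serial = run_test([0, 0, 0, 1, 1, 1])   # worker 0 runs to completion first
mixed = run_test([0, 1, 0, 1, 0, 1])    # alternating interleaving
```

A schedule that runs one worker to completion first yields a fully ordered list, while an alternating schedule mixes the two prefixes; a controlled testing tool explores exactly this space of schedules.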

Fig. 1. Example test code in \(\textsc {C}{} \texttt {\#} \) with concurrency.

Fig. 2. Developer workflow when using \(\textsc {Coyote} \).

\(\textsc {Coyote} \) use is illustrated in Fig. 2. After the user compiles their \(\textsc {C}{} \texttt {\#} \) program containing one or more tests, they invoke the \(\texttt {coyote rewrite} \) command-line tool to rewrite their binaries. This automatic rewriting adds instrumentation to the original code to provide the necessary hooks and metadata for \(\textsc {Coyote} \) to control the (task-based) concurrency in the program (§3). Next, the user invokes the \(\texttt {coyote test} \) command-line tool to run their tests with the \(\textsc {Coyote} \) test engine. The engine runs each test repeatedly, for a user-specified number of iterations, until a bug (failed assertion or unhandled exception) is found. The engine uses the instrumented hooks to intercept the execution of all workers in the test, and controls them to allow only a single worker to execute at a time. The exact choice of which worker to enable in each step is left to an exploration strategy (§3.2).

When a bug is found, \(\textsc {Coyote} \) dumps the sequence of all scheduling decisions taken in that test iteration. The user can replay the test failure using the \(\texttt {coyote replay} \) command, as many times as they like, with the \(\textsc {C}{} \texttt {\#} \) debugger attached, to step through the test deterministically.

Fig. 3. The architecture of \(\textsc {Coyote} \).

Architecture, Extensibility The architecture of \(\textsc {Coyote} \) is illustrated in Fig. 3. The test engine exposes an instrumentation API for declaring the concurrency and synchronization used in the program (§3). For task-based programs, the experience is seamless because the rewriting engine takes care of adding calls to this API automatically (§4). One can also add a custom runtime to \(\textsc {Coyote} \). For instance, \(\textsc {Coyote} \) supports an actor-based programming model (to code at the level of actors instead of tasks) [12]. The actor runtime, in this case, performs the necessary calls into the \(\textsc {Coyote} \) test engine, again providing a seamless experience to users. For other programming models, say, a program using threads directly instead of tasks, these calls must either be inserted manually or a rewriting pass must be added to \(\textsc {Coyote} \) to insert them automatically. Exploration strategies are also defined by a simple interface that makes it easy to implement multiple techniques.

The test engine is roughly 11K lines of \(\textsc {C}{} \texttt {\#} \) code, the rewriting engine and the actor runtime are 12K lines each, and \(\textsc {Coyote} \) is overall 45K lines of code. \(\textsc {Coyote} \) is heavily tested for robustness, with an additional 38K lines of code of unit tests.

Limitations, Requirements \(\textsc {Coyote} \) requires a test to be deterministic modulo scheduling between workers. This implies that, for instance, the program should not branch based on the current system time, or read data from an external service or a file that may change outside the scope of the test. \(\textsc {Coyote} \) also requires that tests be idempotent, that is, running a test twice has the same effect as running it once. This is because \(\textsc {Coyote} \) runs a test multiple times without restarting the hosting process. Idempotence is easy to guarantee by avoiding static variables. Violating these requirements can cause replay to fail. These are minor requirements, and users have seldom complained about them in our experience so far.

A more significant requirement is that \(\textsc {Coyote} \) be able to control all the concurrency created by a test. This may not be possible when the program uses an unsupported programming model, or a library that cannot be rewritten because, say, it includes native code, which is outside the scope of \(\texttt {coyote rewrite} \). \(\textsc {Coyote} \) has partial defenses against this: when it detects concurrent activity outside its control, it tries to tolerate it by letting it finish on its own (§5); otherwise it throws an error to make the user aware.

\(\textsc {Coyote} \) does not currently support the detection of low-level data races, i.e., unsynchronized memory accesses, which can indicate concurrency bugs. Race detection requires instrumentation at the level of individual memory accesses, which \(\textsc {Coyote} \) avoids for engineering simplicity and lower maintenance costs. (\(\textsc {Coyote} \) only instruments at the level of task APIs and synchronization operations.) Nonetheless, \(\texttt {coyote rewrite} \) is extensible, and the door is open for any contributor to take on this responsibility and implement race detection [22, 23, 49,50,51].

3 \(\textsc {Coyote} \) Test Engine

Fig. 4. The \(\textsc {Coyote} \) test engine instrumentation API.

Fig. 5. Example wrappers for task creation (left) and waiting (right) that call into the \(\textsc {Coyote} \) test engine.

3.1 Instrumentation API

Fig. 4 lists the core instrumentation API that must be called from the user program to provide the \(\textsc {Coyote} \) test engine (CTE) with enough hooks for controlling its concurrency. CTE itself does not have a first-class understanding of TAP (or any programming model for that matter); all information about the program comes through this API, which allows us to keep CTE simple, and also allows easy addition of new programming models.

The instrumentation API takes inspiration from prior work [3] that demonstrated the generality of the API, even outside of \(\textsc {C}{} \texttt {\#} \), at capturing different programming models. Each worker created in the program must inform CTE when it is created (OnWorkerCreated), when it starts running (OnWorkerStarted), and when it completes (OnWorkerCompleted). A worker calls OnWorkerPaused with a predicate \(\mathcal {P}\) to notify CTE that it has paused its execution and will become unblocked when \(\mathcal {P}\) evaluates to true. For instance, when a worker pauses to acquire a lock, then \(\mathcal {P}\) becomes true when the lock is released by some other worker. A worker calls ScheduleNextWorker to ask CTE to consider running a different worker. A worker calls GetCurrentWorkerId to ask CTE for its unique identifier.

Fig. 5 shows wrapper methods for task creation (Run) and waiting on the completion of a set of tasks (WaitAll). These methods implement the original semantics, but additionally call the instrumentation APIs to notify CTE. We show this only to illustrate the instrumentation API; in practice, the developer does not have to add these calls. §4 demonstrates how the \(\textsc {Coyote} \) binary rewriting engine automatically inserts these calls to cover the broad TAP programming model. (An approach that creates a substitute method for each TAP method does not scale.) For actor-based programs, the \(\textsc {Coyote} \) actor runtime takes care of calling the CTE without the need for binary rewriting.

Any time the program invokes CTE via one of these APIs (referred to as a scheduling point or step), CTE blocks the current worker, then looks at the list of workers that are enabled (by inspecting their pause-predicates, if any). It then queries the exploration strategy to select one worker from this list. The selected worker is unblocked (all other workers remain blocked) and is allowed to execute until it hits a scheduling point again, at which point control returns to the CTE and the process repeats. This design of sequentializing workers to execute one at a time is fairly standard in CCT tools [3].
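As a rough illustration of this sequentialization loop, here is a minimal Python sketch of the scheduling-point logic (all names are hypothetical; the real CTE is written in \(\textsc {C}{} \texttt {\#} \) and blocks actual workers rather than maintaining a table):

```python
class TestEngine:
    """Minimal sketch of the CTE scheduling-point logic (hypothetical names)."""

    def __init__(self, strategy):
        self.strategy = strategy   # callable: list of enabled worker ids -> one id
        self.workers = {}          # worker id -> pause predicate (None if runnable)

    def on_worker_created(self, wid):
        self.workers[wid] = None

    def on_worker_completed(self, wid):
        del self.workers[wid]

    def on_worker_paused(self, wid, predicate):
        self.workers[wid] = predicate   # blocked until predicate() becomes True

    def schedule_next_worker(self):
        # A worker is enabled if it is not paused, or its pause predicate holds.
        enabled = [w for w, p in self.workers.items() if p is None or p()]
        if not enabled:
            raise RuntimeError("deadlock: no enabled worker")
        chosen = self.strategy(enabled)
        self.workers[chosen] = None     # the chosen worker resumes
        return chosen
```

For example, a worker waiting on a lock would pause with a predicate that holds once the lock is released, and it becomes eligible for scheduling again only at that point.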

3.2 Exploration Strategies

\(\textsc {Coyote} \) decouples the concern of how to control workers from how to explore their interleavings. The latter is the responsibility of the exploration strategy, which is defined by a common interface. At its core, the interface has a single method that accepts a list of enabled workers and must return one of them. With most of the heavy lifting performed by CTE, exploration strategies are easy to implement; the largest one is only 400 lines of code. Furthermore, at the time the exploration strategy is invoked, all workers are in a blocked state (blocked by the CTE). Some strategies (like QL and POS; see below) require inspection of the program state. This can be done safely by the strategy without worrying about racing with the program’s execution.

The random walk strategy (RW) picks an enabled worker uniformly at random in each step. This simple strategy has been shown to be effective in practice and argued as a necessary baseline for other strategies [53]. The PCT strategy [10] implements a priority-based scheduler. When a worker is created, it is assigned a new randomly-generated priority. At a scheduling point, PCT always picks the enabled worker with the highest priority. In addition, at d points during an execution (where d, called the bug depth, is supplied via a user-controlled configuration), PCT lowers the priority of the currently executing worker to be the smallest. These d priority-lowering points are picked uniformly across the entire program execution. This priority-based nature helps PCT induce long delays in workers, unlike RW, which switches back-and-forth between workers much more frequently.
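The core of PCT can be sketched in a few lines of Python (an illustrative adaptation, not \(\textsc {Coyote} \)'s implementation; the method names and the choice of random priorities in \([0,1)\) are ours):

```python
import random

class PCT:
    """Sketch of PCT: random per-worker priorities plus d priority-lowering
    points placed uniformly over an execution of at most max_steps steps."""

    def __init__(self, d, max_steps, seed=0):
        self.rng = random.Random(seed)
        self.priorities = {}
        self.step = 0
        self.lowering_points = set(self.rng.sample(range(max_steps), d))

    def next_worker(self, enabled, current=None):
        for w in enabled:            # newly seen workers get random priorities
            self.priorities.setdefault(w, self.rng.random())
        if self.step in self.lowering_points and current is not None:
            # Demote the currently running worker below all others.
            self.priorities[current] = min(self.priorities.values()) - 1
        self.step += 1
        return max(enabled, key=lambda w: self.priorities[w])
```

With d = 0 the highest-priority worker keeps getting picked while it stays enabled, which is exactly what induces the long delays in the other workers.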

Task-based PCT PCT was originally designed for multi-threaded programs. Later work observed its shortcomings for distributed systems and proposed the revised strategy called PCTCP [48]. We now discuss a novel adaptation of the idea behind PCTCP to TAP in a strategy called PCT \(^t\).

Consider again the program of Fig. 1, and suppose the assertion checks that the string a49 does not appear before b0 in list. For the assertion to fail, an interleaving must essentially execute t1 to completion before t2 gets a chance. The chance of RW producing this interleaving is tiny: around 1 in \(2^{50}\). If we imagine a thread-based scenario (the ideal setting for PCT), where RunTest created two threads instead of tasks, then PCT (with \(\textit{d}=0\)) has a \(50\%\) probability of hitting this bug: if the first thread is assigned the higher priority, it will execute to completion before the second thread gets a chance to execute. However, PCT, with priorities-per-task, is unable to find this bug because of all the implicit tasks that get created at the await point (recall §2). Each time a new task is created, it gets a new randomly-generated priority. In effect, for this program, PCT behaves like RW.

PCTCP addresses this problem by constructing a partial order between workers, where two workers \(w_1\) and \(w_2\) are ordered if the programming model enforces that \(w_2\) must only start after \(w_1\) finishes. This partial order, constructed on-the-fly during program execution, is then decomposed into chains, which are totally-ordered subsets of the partial order. PCTCP then maintains priorities per chain, not per worker. When a new worker starts, it gets assigned to a chain (existing or a new one) and inherits the priority of the chain. PCTCP’s effectiveness has only been demonstrated for distributed message-passing systems.

PCT \(^t\) adapts the concept of chains for TAP. On the explicit creation of a task (using Task.Run), it gets assigned to a new chain (hence, it gets a randomly-generated priority). If a task t yields control by executing Task.Yield, the continuation task is assigned to the same chain as t (hence, it inherits its priority). When a task t1 awaits another task t2 to complete, the continuation task of t1 is assigned to the chain of t2 because the continuation can only execute after t2 completes. (In reality, the continuation task is assigned to the chain of the task that completes t2, because t2 may have its own continuations created.) PCT \(^t\) recovers the benefits of PCT; in our running example, only two chains are created, and it can find the bug with a \(50\%\) probability.
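The chain bookkeeping of PCT \(^t\) can be sketched as follows (Python, illustrative only; the hook names mirror the three rules described above and are not \(\textsc {Coyote} \) API names):

```python
import random

class ChainTable:
    """Sketch of PCT^t chain bookkeeping: tasks inherit priorities via chains."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.chain_of = {}    # task -> chain id
        self.priority = {}    # chain id -> priority
        self.next_chain = 0

    def on_task_run(self, task):
        """Explicit creation (Task.Run): new chain with a fresh random priority."""
        c = self.next_chain
        self.next_chain += 1
        self.priority[c] = self.rng.random()
        self.chain_of[task] = c

    def on_yield(self, task, continuation):
        """Task.Yield: the continuation stays on the yielding task's chain."""
        self.chain_of[continuation] = self.chain_of[task]

    def on_await(self, continuation, awaited):
        """await: the awaiter's continuation joins the awaited task's chain."""
        self.chain_of[continuation] = self.chain_of[awaited]

    def priority_of(self, task):
        return self.priority[self.chain_of[task]]
```

Under these rules the implicit continuation tasks never receive fresh priorities of their own, which is how PCT \(^t\) keeps the chain count small in programs like the running example.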

Other strategies \(\textsc {Coyote} \) also implements a strategy based on reinforcement learning (QL) [40]. QL requires a partial hash (or fingerprint) of the program state and then learns a model that maximizes the number of unique fingerprints seen during a test run. Increased coverage helps uncover more bugs. The partial order sampling (POS) strategy [56] uses information about which workers are racing with each other, i.e., are about to access the same object (either a memory location or a synchronization object). POS uses a priority-based scheduler like PCT, but instead of lowering priorities at d chosen points, POS keeps shuffling (i.e., re-assigning) the priorities of racing workers at each step.

Other strategies available in \(\textsc {Coyote} \) are delay bounding (DB) [19] and variants of RW that use a biased coin. These strategies can also be combined either within the same test iteration (run one strategy for a certain number of steps, then switch to another) or across iterations (pick a different strategy, in a round-robin fashion, for each iteration).

Data non-determinism Exploration strategies also offer a means to generate unconstrained boolean or integer values. \(\textsc {Coyote} \) exposes these APIs to developers, who can use them to express non-determinism in their program. An example is testing the robustness of a program against faults. In this case, the developer can non-deterministically choose to raise a fault (like throwing an exception or returning an error code) and check that their code handles the fault correctly. Other examples are non-deterministically firing timeouts, non-deterministically choosing which method to call from a set of equivalent library methods, etc. Most exploration strategies resolve this non-determinism uniformly at random, with the exception of QL, which tries to learn, alongside scheduling decisions, which return values maximize program coverage.
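The following Python sketch illustrates the idea of strategy-resolved non-determinism used for fault injection (all names are hypothetical; \(\textsc {Coyote} \)'s actual APIs are \(\textsc {C}{} \texttt {\#} \) methods on its runtime):

```python
import random

class Nondet:
    """Sketch: non-deterministic choices resolved by the exploration strategy
    (here: uniformly at random) and recorded so a failure can be replayed."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.trace = []               # recorded choices enable deterministic replay

    def next_bool(self):
        b = self.rng.random() < 0.5
        self.trace.append(b)
        return b

def flaky_write(nd):
    """Test double: non-deterministically inject an I/O fault."""
    if nd.next_bool():
        raise IOError("injected fault")
    return "ok"

def robust_write(nd):
    """Code under test: must tolerate the injected fault."""
    try:
        return flaky_write(nd)
    except IOError:
        return "recovered"
```

Across iterations, the strategy exercises both the fault and no-fault paths, so the recovery logic gets tested alongside the happy path.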

Liveness checking In addition to catching safety violations (assertion failures and uncaught exceptions), \(\textsc {Coyote} \) can also check liveness properties where, essentially, one asserts that every program run eventually makes progress. The definition of progress is programmable, using the concept of liveness monitors (a variant of deterministic Büchi automata) borrowed from the P modeling language [15]. A violation of a liveness property is an infinite run in which no progress is made. Testing cannot produce an infinite run, so instead \(\textsc {Coyote} \) looks for a sufficiently long execution based on user-set thresholds [27, 39]. Liveness properties are not rare. In fact, they are commonly asserted when testing distributed services, to check that the service eventually completes every user request [12].

Any exploration strategy can be used for liveness checking, as long as it is fair, i.e., it does not starve an enabled worker for arbitrarily long. Unfairness can easily lead to liveness violations, but such violations are considered false positives because they cannot happen in practice, as system scheduling is generally fair. RW is (probabilistically) fair, but PCT is not. \(\textsc {Coyote} \) converts unfair strategies to fair ones by running them up to a certain number of scheduling steps and then switching to RW.
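A minimal sketch of this fairness conversion, assuming a strategy is just a function from the list of enabled workers to a chosen worker (as in §3.2); the names and cutoff mechanism are illustrative:

```python
import random

def make_fair(strategy, fair_after, seed=0):
    """Wrap an unfair strategy: after fair_after scheduling steps, fall back
    to a uniformly random (probabilistically fair) choice."""
    rng = random.Random(seed)
    state = {"step": 0}

    def fair_strategy(enabled):
        state["step"] += 1
        if state["step"] <= fair_after:
            return strategy(enabled)     # unfair phase (e.g., PCT)
        return rng.choice(enabled)       # fair phase: random walk
    return fair_strategy
```

The wrapped strategy keeps the bug-finding behavior of the unfair phase early in an execution, while the RW tail ensures every enabled worker is eventually scheduled.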

4 Automation for \(\textsc {C}{} \texttt {\#} \) Task Asynchronous Programs

The style of instrumentation shown in Fig. 5 is not practical because there are many ways in which lambdas and tasks can be created (some return a result on completion, some do not, and there are optimized variants of tasks like ValueTask [45], etc.). Imposing directly on the creation process would be very cumbersome. One must also be able to handle both explicit creation of tasks, as well as the implicit creation that happens at await points. After much trial-and-error, we arrived at an efficient solution that is simple and easy to maintain, even as \(\textsc {C}{} \texttt {\#} \) itself evolves. We crucially rely on controlling task execution through a narrow lower layer of abstraction in the .NET runtime called the TaskScheduler [44]. We observed that whenever a task is created, it goes to the \(\textsc {.NET} \) default task scheduler, which is then responsible for executing the task on the \(\textsc {.NET} \) thread pool. This task scheduler can be subclassed, which we do as shown in Fig. 6 (right). Coyote.TaskScheduler offers a convenient place to call into the test engine, without requiring imposition on the creation of the task or its lambda. The job of rewriting then is to route tasks to this scheduler instead of the default task scheduler. We do this by defining simple wrapper methods for Task APIs, and rewriting the user \(\textsc {C}{} \texttt {\#} \) binaries to call the wrapper methods instead of the original ones.

Fig. 6 (left) illustrates static wrapper methods for Task.Run and Task.Wait. Notice that in TaskWrapper.Run, no modification to the lambda (func) is required. A task gets created as usual, then gets enqueued to the \(\textsc {Coyote} \) task scheduler, which, in turn, executes the task with appropriate calls to the test engine (ExecuteTask). This solution piggybacks on the RunInline functionality that the default scheduler also uses. The TaskWrapper.Wait method adds the call to OnWorkerPaused.

What about implicitly created tasks? This required more digging into the \(\textsc {C}{} \texttt {\#} \) compiler to understand the compilation of async methods into state machines [52]. Fortunately, all we needed was to identify the point where continuation tasks are created by these state machines, and instead call a wrapper method (similar to TaskWrapper.Run) that enqueues the task to the \(\textsc {Coyote} \) task scheduler.

Fig. 6. Wrapper methods for Task APIs (left) and the implementation of the \(\textsc {Coyote} \) task scheduler (right).

4.1 Binary Rewriting for \(\textsc {C}{} \texttt {\#} \) Tasks

Binary rewriting is necessary to provide a push-button experience for \(\textsc {Coyote} \) on TAP programs. In \(\textsc {C}{} \texttt {\#} \), code gets compiled into the Common Intermediate Language (CIL) [17], an object-oriented, machine-independent bytecode language that runs on top of the \(\textsc {.NET} \) runtime on any supported operating system (Windows, Linux and macOS). Each compiled \(\textsc {C}{} \texttt {\#} \) program consists of one or more CIL binaries. Each binary contains an assembly, which is a unit of functionality implemented as a set of types (these can be exposed publicly to be consumed by other assemblies). Each type can contain members such as fields, methods, and so on.

Fig. 7. The architecture of the \(\textsc {Coyote} \) rewriting engine (left). The interface of a CIL rewriting pass (right).

We implemented the binary rewriting engine on top of Cecil [46], an open-source \(\textsc {.NET} \) library that provides a rich API for rewriting CIL code. The rewriting engine architecture is illustrated in Fig. 7. The engine loads all program binaries from disk to access the CIL assemblies in-memory, topologically sorts them (to ensure that dependencies are processed first), and then traverses each assembly (using the visitor pattern) to apply a sequence of CIL rewriting passes, where each pass focuses on a different type of instrumentation.

Each rewriting pass implements the \(\textsc {Coyote} \) Pass interface, which is listed in Fig. 7. The rewriting engine visitor traverses the CIL assembly and invokes the corresponding pass method for each encountered type, field, and method signature, as well as for each variable and instruction in each method body.
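The pass pipeline can be sketched as follows (Python, with a toy assembly represented as nested dictionaries; the method names on Pass are illustrative, not Cecil's or \(\textsc {Coyote} \)'s actual interface):

```python
class Pass:
    """Sketch of a rewriting pass: callbacks invoked by the engine's visitor."""
    def visit_type(self, type_name): pass
    def visit_method(self, method_name): pass
    def visit_instruction(self, instruction):
        return instruction                # may return a rewritten instruction

class RewritingEngine:
    """Applies each pass to every type, method, and instruction in order."""
    def __init__(self, passes):
        self.passes = passes

    def rewrite(self, assembly):
        # assembly: {type name: {method name: [instructions]}}
        for type_name, methods in assembly.items():
            for p in self.passes:
                p.visit_type(type_name)
            for method_name, body in methods.items():
                for p in self.passes:
                    p.visit_method(method_name)
                for i in range(len(body)):
                    for p in self.passes:
                        body[i] = p.visit_instruction(body[i])
        return assembly

class TaskApiPass(Pass):
    """Toy example pass: route Task.Run calls to the wrapper type."""
    def visit_instruction(self, instruction):
        return instruction.replace("call Task.Run", "call TaskWrapper.Run")
```

A new pass only needs to override the callbacks it cares about and be appended to the pipeline, which is what makes the engine easy to extend.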

Built-in Rewriting Passes \(\textsc {Coyote} \) implements and invokes, in order, the following four passes: the type rewriting pass, the task API rewriting pass, the async rewriting pass, and the inter-assembly invocation rewriting pass. The type rewriting pass is responsible for replacing certain \(\textsc {C}{} \texttt {\#} \) system library types in the user program with corresponding drop-in-replacement types implemented by \(\textsc {Coyote} \). The replacement types implement exactly the same interface as the original types, and invoke the original methods to maintain semantics, but are instrumented with callbacks to the \(\textsc {Coyote} \) test engine. Some examples of replaced types are: (1) the System.Threading.Monitor type, which implements the lock statement in \(\textsc {C}{} \texttt {\#} \), and (2) the System.Threading.Semaphore type, another variant of a lock. The \(\textsc {Coyote} \) versions of these types notify the test engine when a worker acquires or releases a lock. These two are the synchronization primitives that \(\textsc {Coyote} \) supports by default, in addition to the Task APIs. Adding support for more synchronization primitives requires adding another type rewriting pass.

The task API rewriting pass inserts calls to the Coyote.TaskWrapper wrapper type, as discussed earlier. The async rewriting pass is similar, except for wrapping APIs that create implicit tasks. Finally, the inter-assembly invocation rewriting pass is responsible for identifying invocations in the code that are made across CIL assembly boundaries, where the target assembly is not rewritten by \(\textsc {Coyote} \). \(\textsc {Coyote} \) adds instrumentation to detect (and tolerate) uncontrolled concurrency (see §5).

New passes that implement the Pass interface can be easily integrated into the current pipeline of passes, allowing power users to extend \(\texttt {coyote rewrite} \) for custom rewriting (e.g., to support controlling a new synchronization type without having to manually use the \(\textsc {Coyote} \) instrumentation API).

Design Considerations We decided to target CIL for instrumentation instead of doing it at the level of ASTs. This helps reduce the instrumentation scope because the CIL instruction set is much smaller than C# surface syntax. Furthermore, CIL changes infrequently (last update was in 2012 [17]), and we can target pre-compiled binaries without access to their source code.

5 Additional Features

Partially-Controlled Exploration As mentioned in §2, \(\textsc {Coyote} \) requires tests to be deterministic modulo the concurrency that it controls. This requirement is broken when the test creates a worker without reporting it to the \(\textsc {Coyote} \) test engine, which impacts the ability of \(\textsc {Coyote} \) to reproduce an execution. This can happen when using APIs outside of the TAP programming model or when calling into a library that has not been rewritten. Partially-controlled exploration allows the controlled part of a program to be tested with high coverage, even when it interacts with an uncontrolled part. In fact, \(\textsc {Coyote} \) recommends that developers rewrite only their test binaries and the binaries of their production code, leaving the binaries of any external dependencies unmodified (to be handled by partially-controlled exploration).

During partially-controlled exploration, \(\textsc {Coyote} \) treats any un-rewritten binaries as “pass-through”: their methods are invoked atomically from the perspective of the tool. In this mode, \(\textsc {Coyote} \) sequentializes the execution of the controlled workers as usual, and if a controlled worker invokes a method in an un-rewritten binary, waits on a task that was earlier returned by such a method, or invokes an unsupported low-level \(\textsc {C}{} \texttt {\#} \) concurrency API, then \(\textsc {Coyote} \) detects this and invokes ScheduleNextWorker to explore a scheduling decision. Instead of immediately choosing a controlled worker to schedule, \(\textsc {Coyote} \) uses a (tunable) heuristic that first gives the uncontrolled task or invocation a chance to complete before resolving the scheduling decision. This is important because, rather than regressing coverage, it allows \(\textsc {Coyote} \) to cover scenarios where completing the uncontrolled work first makes new parts of the state space available for exploration.

Setting max-steps Some tests can be potentially non-terminating, i.e., some executions of the test go on forever. Non-termination arises naturally when a program has spinloops or polling loops (loops that keep going until some condition is met), or when infinite executions are inherent to the design, as in consensus protocols like Paxos or Raft. \(\texttt {coyotetest} \) provides the option of bounding the length of a test iteration in terms of the number of scheduling points that it hits; this bound is supplied with the max-steps flag. The test engine keeps a count of the number of scheduling points in the current iteration. When the count hits the max value, the test engine throws an exception in each of the workers (which would all currently be blocked by the engine). This exception essentially kills the worker by propagating all the way up to the test harness, where it is caught by the engine. Once all workers are killed, the engine starts the next iteration.
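The mechanism can be sketched as follows (a simplified Python model of the idea, not Coyote's actual code; the class and exception names are ours): the engine counts scheduling points and raises a dedicated exception once the bound is exceeded, which the harness catches to end the iteration cleanly.

```python
class IterationCanceledException(Exception):
    """Raised inside a worker to unwind it when max-steps is exceeded."""

class TestEngine:
    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.steps = 0

    def scheduling_point(self):
        # Called at every scheduling point; enforces the max-steps bound.
        self.steps += 1
        if self.steps > self.max_steps:
            raise IterationCanceledException()

    def run_iteration(self, test_body):
        self.steps = 0
        try:
            test_body(self)
        except IterationCanceledException:
            pass  # iteration ended by the bound, not a test failure

# A polling loop that would otherwise never terminate:
engine = TestEngine(max_steps=100)
engine.run_iteration(lambda e: [e.scheduling_point() for _ in iter(int, 1)])
```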

This solution, of throwing an exception to kill a worker, only works when the worker does not catch the exception and resume execution. All exceptions in \(\textsc {C}{} \texttt {\#} \) must derive from the System.Exception type, so a construct like catch(Exception) will catch all exceptions, including \(\textsc {Coyote} \)’s. \(\textsc {Coyote} \) gets around this problem with a binary rewriting pass that edits all catch statements to disallow catching of \(\textsc {Coyote} \) exceptions.
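The effect of this rewriting pass can be illustrated as follows (a Python analogue for exposition; the real pass edits C# catch blocks in the binary, and the exception name below is our stand-in): every catch-all handler is augmented to re-raise the engine's own exceptions instead of swallowing them.

```python
class CoyoteCanceledException(Exception):
    """Stand-in for the exception the engine throws to kill a worker."""

def worker_body_rewritten(action):
    try:
        return action()
    except Exception as e:
        # Inserted by the rewriting pass: the engine's exceptions must
        # propagate to the harness, so they are never swallowed here.
        if isinstance(e, CoyoteCanceledException):
            raise
        # Original user-written handler logic runs for all other exceptions.
        return "recovered"
```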

Thread-safety violations A thread-safety violation occurs when a program concurrently invokes a library API that is not designed to be thread-safe. Prior work showed the prevalence of such errors in .NET programs when accessing data structures such as dictionaries and lists in the System.Collections.Generic namespace [33]. These data structures do not offer thread-safe APIs. (In concurrent scenarios, one should instead use the data structures in the System.Collections.Concurrent namespace.)

\(\textsc {Coyote} \) offers the ability to catch such errors. It implements a rewriting pass that replaces such a data structure, say Dictionary, with a drop-in replacement type WrapperDictionary. The latter keeps track of concurrent (write-write or write-read) accesses and throws an exception when two such accesses overlap. The exception causes \(\textsc {Coyote} \) to report a test failure.
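The detection logic behind such a wrapper can be sketched as follows (our own simplified Python model, not the actual WrapperDictionary implementation): each operation marks itself as in flight, and a write that overlaps with any other in-flight access is flagged as a violation. Under controlled scheduling, a scheduling point inside an operation is what allows another worker's access to interleave between the begin and end markers.

```python
class ThreadSafetyViolation(Exception):
    pass

class MonitoredDict:
    """Sketch: counts in-flight accesses to detect unsafe overlap."""

    def __init__(self):
        self._readers = 0
        self._writers = 0

    def _check(self, writing):
        # A write overlapping any access, or any access overlapping a
        # write, is a write-write or write-read violation.
        if self._writers > 0 or (writing and self._readers > 0):
            raise ThreadSafetyViolation("concurrent dictionary access")

    def begin_read(self):
        self._check(writing=False)
        self._readers += 1

    def end_read(self):
        self._readers -= 1

    def begin_write(self):
        self._check(writing=True)
        self._writers += 1

    def end_write(self):
        self._writers -= 1
```

Note that concurrent reads are permitted; only overlaps involving a write trigger the failure, matching the write-write/write-read condition above.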

Actor runtime \(\textsc {Coyote} \) offers a library, inspired by the \(\textsc {P}{} \texttt {\#} \) [11] line of work, that allows a developer to use actors to express concurrency in their program. Actors, when created, run concurrently with respect to other actors, and remain alive until explicitly halted. Each actor has an inbox where it listens for messages from other actors and processes them in FIFO order. Several production systems have been built with \(\textsc {Coyote} \)’s actor framework [12]. The actor runtime takes care of calling the test engine instrumentation APIs at the appropriate points, such as when creating an actor or sending a message to another actor; hence, no rewriting is required. The \(\textsc {Coyote} \) test engine treats tasks and actors the same way, allowing a developer to freely mix the two programming models, i.e., to test programs that use both actors and tasks.
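The actor model just described can be sketched with a toy runtime (a minimal Python analogue of our own devising, not Coyote's C# actor API): each actor owns a FIFO inbox, and the runtime delivers one message at a time, which is exactly where a controlled test engine would insert its scheduling points.

```python
from collections import deque

class Actor:
    def __init__(self, handler):
        self.inbox = deque()   # FIFO inbox
        self.handler = handler
        self.halted = False

    def send(self, msg):
        self.inbox.append(msg)

class ToyRuntime:
    def __init__(self):
        self.actors = []

    def create(self, handler):
        actor = Actor(handler)
        self.actors.append(actor)
        return actor

    def run_until_quiescent(self):
        # Deterministic round-robin delivery for illustration; a controlled
        # engine would instead ask its exploration strategy which actor to
        # step next at each delivery.
        progress = True
        while progress:
            progress = False
            for a in self.actors:
                if a.inbox and not a.halted:
                    a.handler(a, a.inbox.popleft())
                    progress = True
```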

6 Evaluation

Our evaluation covers three experiments, each on a different set of benchmarks. Each benchmark is a concurrent program with a known bug. We measure the effectiveness of \(\textsc {Coyote} \) by the number of times that it is able to hit the bug within a fixed number of test iterations. For each benchmark, we report its degree of concurrency (DoC), defined as the maximum number of simultaneously enabled workers, and the number of scheduling decisions (#SD), i.e., the average number of times the exploration strategy is invoked per test iteration.

Table 1. Results on \(\textsc {ProdService} \) tests. Degree of concurrency varied from 5 to 16, and the number of scheduling decisions varied from 94 to 1054.

The first experiment compares the performance of PCT \(^t\) against PCT on task-heavy programs. We took a proprietary production service of \(\textsc {Microsoft}\), which we call \(\textsc {ProdService} \). The service runs as part of the Azure platform; it is roughly 54K lines of \(\textsc {C}{} \texttt {\#} \), and is designed to be highly concurrent for high throughput. The owning engineering team was routinely running \(\textsc {Coyote} \) on multiple concurrency tests. We took an intermediate version of this service and ran all tests with \(\textsc {RW} \), PCT and PCT \(^t\), with 1000 iterations each. There were a total of 111 tests, out of which 21 tests reported a failure (i.e., bug) with some strategy. The comparison is shown in Table 1. (We actually ran both PCT and PCT \(^t\) with multiple different values of the d parameter, and selected the best among them for each strategy; this value turned out to be \(\textit{d}=10\) for both.)

Table 1 shows superior performance of PCT \(^t\). It is able to find 17 test failures, compared to 13 for PCT and 9 for \(\textsc {RW} \). Furthermore, on tests that failed with both PCT and PCT \(^t\), the latter found the bug 9 times more often (geo mean). We observe that these tests created many tasks, roughly 277 tasks (geo mean) in each test iteration, which throws off PCT. With PCT \(^t\), the number of chains was 6 times smaller (geo mean). Running these 21 tests for 1000 iterations each takes roughly 50 min (wall clock) on a 16-core AMD EPYC (2.6 GHz) VM, running Ubuntu 20.04 on Azure, when utilizing 14 threads on the machine to run tests in parallel.

Table 2. Results from testing buggy protocol implementations. Number of test iterations was set to 10K, except for FailureDetector and Paxos that used 100K iterations. PCT, PCT \(^t\) and DB use the bound \(d=10\).

The second experiment is on buggy protocol implementations from prior work [40, 48], shown in Table 2. This experiment evaluates a wider range of strategies. Three schedulers (PCT, PCT \(^t\) and DB) find all the bugs, but none is a clear winner. A combination of schedulers is likely required for reliably finding bugs in a small number of iterations.

Table 3. Results on SCTBench with 10K test iterations. PCT uses the \(d=3\) bound and DB uses the \(d=5\) bound. Numbers in parenthesis report performance on the same benchmark-strategy pair from a different CCT tool (Maple) [56].

The final experiment shows that \(\textsc {Coyote} \) is indeed state-of-the-art by comparing it against other tools. We did not find any other CCT tool for \(\textsc {C}{} \texttt {\#} \), so we instead took the established benchmark suite SCTBench [53] of C/\(\textsc {C}{} \texttt {++} \) programs that use pthreads for concurrency, and manually ported some of them to \(\textsc {C}{} \texttt {\#} \) (Table 3), replacing pthreads APIs with Task APIs. These benchmarks have potentially racy shared variables, so we implemented an experimental binary rewriting pass in \(\textsc {Coyote} \) that adds scheduling points on heap accesses, to ease the porting exercise. A direct comparison with prior tools is difficult because there can still be subtle differences in how scheduling points get inserted. Regardless, we note that the numbers for POS are roughly in agreement with its original paper [56] and the numbers for PCT and \(\textsc {RW} \) are in agreement with a prior empirical study [53]. (Note that PCT \(^t\) is identical to PCT on these benchmarks because there are no task continuations.) Our implementation of POS performs better than the original one, but the original implementation is unavailable for us to make a more accurate assessment. This comparison is useful to ground \(\textsc {Coyote} \) with respect to related work.

The code and scripts to run all the non-proprietary experiments from this paper are available as an artifact on Zenodo [14].

7 Related Work

The term controlled concurrency testing (CCT) was coined only recently [53], but it has its roots in stateless model checking (SMC), popularized by VeriSoft [24]. Stateful approaches require the ability to record the state of an executing program; this is hard to achieve for production code, so stateful checking tools [6, 26] are often applied to models of code written in custom languages. SMC/CCT, on the other hand, only records the sequence of actions taken during an execution, making it the technique of choice for directly testing code written in commercial languages (like \(\textsc {C}{} \texttt {\#} \)).

Research in SMC/CCT can further be classified into two categories. One category is that of exhaustive techniques, where the goal is to explore the entire state space of a program (in reality, the state space of a fixed test that invokes a bounded workload on the program) and obtain a verified verdict. Exhaustive techniques are based on the notion of partial order reduction (POR) [24], which constructs equivalence classes of executions so that only one exploration per equivalence class is required [35]. Recently, this line of work has produced several tools, such as CDSChecker [47], GenMC [30], and Nidhugg [2], that have demonstrated value in verifying concurrency primitives (e.g., latches, mutex implementations) and concurrent data structures, especially when considering weak memory behaviors [1, 28, 29].

The other category of SMC/CCT comprises techniques aimed at bug-finding. These techniques are bounded (i.e., they explore only a subset of the executions), randomized, or both. By lowering expectations (i.e., not insisting on covering the entire state space), these techniques can be applied to larger systems. We have discussed several instances of these techniques throughout this paper. Bug-finding was first popularized by the notion of context-bounded exploration [41]. \(\textsc {Coyote} \) borrows heavily from this line of work on bug-finding techniques, which is evident in the set of exploration strategies that it supports. Implementing POR-based strategies is possible; the POS strategy already takes \(\textsc {Coyote} \) in this direction. The absence of exhaustive techniques has (so far) not been felt by users of \(\textsc {Coyote} \), likely because the usage scenarios have neither focused on weak memory behaviors (more present in C/\(\textsc {C}{} \texttt {++} \) than in \(\textsc {C}{} \texttt {\#} \)), nor on verifying concurrent data structures. Nonetheless, supporting POR-based techniques remains an important direction for future work.

Related to the idea of CCT for bug-finding are noise-injection-based techniques [18, 20, 21]. These techniques perturb the execution of a concurrent program by injecting noise, such as sleep statements, which forces the execution to explore alternative interleavings. Unlike CCT, no control over concurrent workers is required, hence these techniques have simpler engineering requirements. The tradeoff, however, is that the loss of control reduces the ability to explore specific interleavings, such as what PCT requires. The ANaConDA tool has successfully demonstrated noise injection in an industrial setting [21]. It could be interesting to explore the use of noise injection to provide coverage in portions of code that \(\textsc {Coyote} \) does not control.

The CHESS tool [41] was, to the best of our knowledge, the only other CCT tool to support \(\textsc {C}{} \texttt {\#} \). CHESS is currently not in a usable state. It was designed prior to the popularity of TAP in \(\textsc {C}{} \texttt {\#} \), and thus had no special support for tasks. In terms of implementation, it occupied a different design space than \(\textsc {Coyote} \): it relied on intercepting \(\textsc {C}{} \texttt {\#} \) threading APIs and redirecting them to custom mocks. Maintenance of these mocks was an engineering cost. Furthermore, the interception technology relied on a framework [36] that also went out of support. This showcases that the complexity of supporting \(\textsc {C}{} \texttt {\#} \) must be met with good engineering, built on stable frameworks. \(\textsc {Coyote} \) is also more extensible, both in terms of programming frameworks and exploration strategies.