GenMC: A Model Checker for Weak Memory Models

GenMC is an LLVM-based state-of-the-art stateless model checker for concurrent C/C++ programs. Its modular infrastructure allows it to support complex memory models, such as RC11 and IMM, and makes it easy to extend to support further axiomatic memory models. In this paper, we discuss the overall architecture of the tool and how it can be extended to support additional memory models, programming languages, and/or synchronization primitives. To demonstrate the point, we have extended the tool with support for the Linux kernel memory model (LKMM), synchronization barriers, POSIX I/O system calls, and better error detection capabilities.


Introduction
For any software developer or verification engineer, it is no news that concurrent programming is difficult, that concurrent software is often buggy, and that therefore verification of concurrent programs has attracted a lot of research interest. Within the verification community at least, it is also common knowledge that verification of concurrent programs is challenging because of the huge number of interleavings of the threads comprising a concurrent program.
What has changed in the last decade, however, is the importance of weak memory consistency [6,40,11,21,36,13,32,41,25,14] as a key factor contributing to the complexity of concurrent programming. Weak memory models do not simply increase the number of thread interleavings; they also confound programmers, who typically have little intuition about how to reason about the behaviors induced by these additional interleavings.
GenMC is a fully automatic verification tool meant for such programmers. It is a stateless model checker (SMC) [23] that can be used to verify bounded clients of intricate concurrent algorithms, such as implementations of synchronization primitives and shared data structures (e.g., queues, sets, and maps). It accepts as input a C/C++ program using C/C++11 atomics and/or the concurrency primitives from the pthread library, and reports any data races, assertion violations, or other errors encountered. By default, verification is performed with respect to the RC11 memory model [32], but there are command line options for selecting other models, such as IMM [41] and LKMM [10].
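As an illustration of the kind of input described above, the following is a minimal sketch of a message-passing client written with C11 atomics and pthreads. The function and variable names (`run_message_passing`, `data`, `flag`) are chosen for this example only and are not taken from GenMC's test suite.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int data;
static atomic_int flag;

/* Publishes data, then raises the flag with release semantics. */
static void *writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* If the flag is observed with acquire semantics, data must be visible. */
static void *reader(void *arg)
{
    (void)arg;
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
        int d = atomic_load_explicit(&data, memory_order_relaxed);
        assert(d == 42); /* the property a model checker would verify */
        (void)d;
    }
    return NULL;
}

/* Spawns both threads once; returns 0 on success, -1 if a thread
 * could not be created. */
int run_message_passing(void)
{
    pthread_t w, r;
    atomic_store(&data, 0);
    atomic_store(&flag, 0);
    if (pthread_create(&w, NULL, writer, NULL) != 0)
        return -1;
    if (pthread_create(&r, NULL, reader, NULL) != 0) {
        pthread_join(w, NULL);
        return -1;
    }
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```

To use this as a verification client, one would call `run_message_passing()` from `main`. With the release/acquire pair shown, the assertion holds in every execution; weakening both flag accesses to `memory_order_relaxed` permits an execution under RC11 in which the reader sees the flag but stale data, which is the kind of assertion violation a weak-memory model checker is meant to report.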
Since the theory underlying GenMC has already been published elsewhere [28,29,31], this paper focuses on the overall design of the tool and on various enhancements implemented in it. Our main design goals for GenMC were:
- Generality: The tool should be able to verify programs written in a variety of programming languages with respect to a variety of memory models.
- Efficiency: The tool should implement a state-of-the-art SMC algorithm and incorporate further optimizations for common programming patterns.
- Usability: The tool should provide useful and readable error messages.
- Extensibility: The tool should be easily adaptable to support additional models and synchronization primitives, and to tweak its performance.
Extensibility is key to achieving the other goals, since it allows gradual improvements to the tool in terms of coverage, performance, and error detection/reporting.
These goals are achieved by a combination of techniques:
- GenMC's core SMC algorithm [29,31] is parametric in the choice of the memory model, subject to a few minimal constraints (see §2).
- The implementation is based on LLVM, a versatile intermediate language for multiple programming languages.
- GenMC follows a modular architecture minimizing dependencies across components (see §3), which makes it easy to extend with support for additional memory models (§4) and synchronization primitives (§5).
- Its architecture contains hooks to provide fast approximate consistency checks, which are exploited by the memory model implementations (see §4).
- GenMC contains a number of optimizations that provide noticeable performance benefits on common workloads (§7).
- GenMC keeps additional metadata so as to present error messages in terms of variable names appearing in the source code (§6).
GenMC has been applied to a few industrial settings, where it has found bugs and/or verified bounded correctness of concurrent libraries [39].
Related Work. There has been extensive work on SMC, with most tools focusing on sequential consistency [23,37,7,8,15]. Tools that support weak memory models include CDSChecker [38], which verifies C/C++11 programs under the original C11 memory model; Tracer [5], which verifies C/C++11 programs under the RA model; RCMC [27], which verifies C programs under RC11 [32]; and Nidhugg [2,1,4,12,3], which supports SC, TSO, and PSO, and provides limited support for the POWER and ARMv7 memory models. In contrast to GenMC, which uses the same core algorithm for all memory models, Nidhugg uses multiple different algorithms depending on the memory model. There has also been work on adapting SAT/SMT-based bounded model checking (BMC) techniques for weak memory models [9,17,22]. Dartagnan [22] is a BMC tool that is parametric in the choice of the memory model, as it accepts the memory model as input in the CAT format [11].

Memory Model Requirements
GenMC's core algorithm is parametric in the choice of the memory model provided that it can be expressed in an axiomatic way and satisfies a few basic requirements that we describe below.
Axiomatic memory models represent the executions of a concurrent program as execution graphs [11] that satisfy a certain consistency predicate. Execution graphs comprise a set of events (nodes) that represent the individual memory accesses performed by the program, and some relations on these events (edges). Example relations included in all memory models are the preserved program order (ppo) and reads-from (rf) relations: ppo relates events in the same thread that are ordered (e.g., by a chain of dependencies or a fence), while rf relates writes to reads reading from them.
GenMC can be used to verify programs under such a model as long as the model's consistency predicate fulfills the following requirements:
- No-Thin-Air: In consistent graphs, ppo ∪ rf should be acyclic. This intuitively means that an event cannot circularly depend on itself.
- Prefix-Closedness: Restricting a consistent graph to any (ppo ∪ rf)-prefix-closed subset of its events yields a consistent graph. Prefix-closedness enables the algorithm to construct a consistent graph incrementally.
- Extensibility: Adding a (ppo ∪ rf)-maximal event to a consistent graph for some choice of an incoming rf-edge preserves consistency. This captures the intuitive idea that executing a program should never get stuck if a thread has more statements to execute. In particular, a read of x should always be able to return the value written by the most recent write to x.
Although these requirements cannot be satisfied by more advanced memory models that cannot be defined in an axiomatic fashion (e.g., [25,33,14,24]), there is ongoing work to support such models.
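The No-Thin-Air requirement can be made concrete with a small sketch. The code below stores ppo ∪ rf as an adjacency matrix and checks acyclicity with a depth-first search, using the classic load-buffering shape as the test graph. The `Graph` type and all function names are assumptions made for this illustration and do not reflect GenMC's internal data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_EVENTS 8

/* edge[i][j] is true when event i precedes event j in ppo or rf. */
typedef struct {
    int n;
    bool edge[MAX_EVENTS][MAX_EVENTS];
} Graph;

enum { WHITE, GREY, BLACK };

/* Depth-first search; reaching a GREY node again means a cycle. */
static bool dfs_finds_cycle(const Graph *g, int v, int *color)
{
    color[v] = GREY;
    for (int w = 0; w < g->n; w++) {
        if (!g->edge[v][w])
            continue;
        if (color[w] == GREY)
            return true;
        if (color[w] == WHITE && dfs_finds_cycle(g, w, color))
            return true;
    }
    color[v] = BLACK;
    return false;
}

bool is_acyclic(const Graph *g)
{
    int color[MAX_EVENTS] = { WHITE };
    assert(g->n >= 0 && g->n <= MAX_EVENTS);
    for (int v = 0; v < g->n; v++)
        if (color[v] == WHITE && dfs_finds_cycle(g, v, color))
            return false;
    return true;
}

/* Load buffering:  T1: a = x; y = a;   T2: b = y; x = b;
 * Events: 0 = R x, 1 = W y (thread 1), 2 = R y, 3 = W x (thread 2).
 * Reads from the initial state contribute no edges in this toy graph. */
bool load_buffering_is_acyclic(bool reads_from_other_thread)
{
    Graph g;
    memset(&g, 0, sizeof g);
    g.n = 4;
    g.edge[0][1] = true;      /* ppo: data dependency in thread 1 */
    g.edge[2][3] = true;      /* ppo: data dependency in thread 2 */
    if (reads_from_other_thread) {
        g.edge[1][2] = true;  /* rf: W y -> R y */
        g.edge[3][0] = true;  /* rf: W x -> R x */
    }
    return is_acyclic(&g);
}
```

When each read takes its value from the other thread's dependent write, ppo ∪ rf contains the cycle 0 → 1 → 2 → 3 → 0, so a model satisfying No-Thin-Air must reject that graph.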

Architecture
GenMC's verification procedure consists of three stages. The first stage invokes clang to compile the source C/C++ program to LLVM-IR. To accommodate programs written in different languages, GenMC also accepts LLVM-IR as its input, provided that it adheres to certain conventions about thread creation.
The second stage transforms the LLVM-IR code to make verification more effective by replacing spinloops by assume statements, bounding infinite loops, and performing sound optimizations, such as dead allocation elimination. It also collects additional debugging information to enable better error reporting.
The third stage invokes the verification procedure, which explores all the executions of the program. If an error is found during this stage, the execution is halted and an error report is produced (see §6). The architectural subcomponents of this stage are depicted in Fig. 1 (right). At the center lies the verification driver, which owns three independent components: an execution graph, a work set, and an interpreter.
The execution graph records the visited execution trace, and has routines for calculating various relations on the graph, such as the happens-before relation. As each memory model comprises different relations, the execution graph contains multiple calculators that are dynamically populated when the graph is created, and the consistency predicate is calculated as a fixpoint of all the selected relations, whenever this is requested by the driver.
The work set records alternate options for later exploration, the precise definition of which can depend on the memory model.
The interpreter merely executes the user program, notifying the driver each time a "visible" action (e.g., a load/store to shared memory) is encountered. It is directly based on the LLVM interpreter lli [35], and is the only part of our code base that heavily depends on LLVM. In turn, the driver modifies the execution graph accordingly, possibly pushes some items to the work set, and returns control back to the interpreter, along with a value that will be used by the interpreter, if necessary (e.g., in the case of a load). In effect, the driver and the interpreter can be thought of as coroutines [18]. The interpreter calls the driver whenever it encounters a visible action or finishes running a thread, while the driver monitors execution consistency, schedules the program threads, and discovers alternative exploration options, which are pushed to the work set.
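The interplay between interpreter and driver can be pictured with a deliberately simplified sketch. In the toy below, the "interpreter" reports stores and loads to a driver, which appends events to a graph, returns a value for each load, and records every other same-address store as an alternative in a work set. All types and functions (`Driver`, `driver_on_load`, and so on) are hypothetical and far simpler than GenMC's actual interfaces; in particular, no consistency checking or scheduling is modeled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 32

typedef enum { EV_STORE, EV_LOAD } EventKind;

typedef struct {
    EventKind kind;
    int addr;   /* toy address */
    int value;  /* value written, or value handed back to the load */
    int rf;     /* loads only: index of the store read from, or -1 */
} Event;

typedef struct {
    int load;   /* index of the load event */
    int store;  /* another store the load could read from instead */
} WorkItem;

typedef struct {
    Event graph[MAX_EVENTS];
    int n_events;
    WorkItem work[MAX_EVENTS];
    int n_work;
} Driver;

void driver_init(Driver *d)
{
    d->n_events = 0;
    d->n_work = 0;
}

/* Called by the interpreter on a visible store. */
void driver_on_store(Driver *d, int addr, int value)
{
    assert(d->n_events < MAX_EVENTS);
    d->graph[d->n_events++] = (Event){ EV_STORE, addr, value, -1 };
}

/* Called by the interpreter on a visible load. The driver picks the
 * latest same-address store, queues the earlier ones as alternatives,
 * and returns the value the interpreter should continue with. */
int driver_on_load(Driver *d, int addr)
{
    assert(d->n_events < MAX_EVENTS);
    int load = d->n_events;
    int chosen = -1;
    for (int i = 0; i < load; i++) {
        if (d->graph[i].kind != EV_STORE || d->graph[i].addr != addr)
            continue;
        if (chosen >= 0) {
            assert(d->n_work < MAX_EVENTS);
            d->work[d->n_work++] = (WorkItem){ load, chosen };
        }
        chosen = i;
    }
    int value = chosen >= 0 ? d->graph[chosen].value : 0;
    d->graph[d->n_events++] = (Event){ EV_LOAD, addr, value, chosen };
    return value;
}

/* Toy "interpreter" for the program: x = 1; x = 2; r = x. Returns r. */
int toy_interpreter_run(Driver *d)
{
    driver_init(d);
    driver_on_store(d, 0, 1);
    driver_on_store(d, 0, 2);
    return driver_on_load(d, 0);
}
```

After `toy_interpreter_run`, the graph holds three events and the work set holds one item, namely the option of letting the load read from the first store. A real driver would later pop such items and re-run the interpreter to explore them.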
The aforementioned components are all parameterized by the user's configuration options. The most important of these options is the memory model, which also determines whether dependencies between instructions should be tracked by the interpreter and stored in the execution graph. Another important option is when and how consistency is to be calculated. Since checking consistency at each step can be expensive for some memory models, it is possible to provide an approximate consistency check to be applied at each step and only perform the full consistency check once an error is detected.
To facilitate memory-model-specific optimizations, the driver is overridden for each memory model. Each instance sets up the (approximate) consistency checks and can provide specialized methods for crucial verification components.

Supporting New Memory Models
Adding support for a new memory model entails three basic steps.
First, one has to provide definitions for any memory model primitives that the interpreter should intercept beyond those already supported (i.e., plain memory accesses and C/C++11 atomics). One can either provide a header file mapping these primitives to LLVM-IR instructions or create special event types for them.
Second, one has to provide calculators for the memory model's relations that are not already supported by GenMC. Depending on the memory model, this step may require a variable amount of effort, but it effectively boils down to translating relational calculations into matrix operations.
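To illustrate what "relational calculations as matrix operations" can look like, the sketch below represents relations over four events as boolean matrices and computes the transitive closure of po ∪ rf with Warshall's algorithm. Treating every rf edge as ordering is a simplification (it roughly corresponds to release-acquire accesses), and the names `Rel`, `rel_union`, and `rel_transitive_closure` are assumptions for this example rather than GenMC's calculator API.

```c
#include <assert.h>
#include <stdbool.h>

#define N 4 /* number of events in the toy graph */

typedef bool Rel[N][N];

/* out = a ∪ b */
void rel_union(Rel out, Rel a, Rel b)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = a[i][j] || b[i][j];
}

/* In-place transitive closure (Warshall's algorithm). */
void rel_transitive_closure(Rel r)
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            if (r[i][k])
                for (int j = 0; j < N; j++)
                    if (r[k][j])
                        r[i][j] = true;
}

bool rel_is_irreflexive(Rel r)
{
    for (int i = 0; i < N; i++)
        if (r[i][i])
            return false;
    return true;
}

/* Message passing. Events: 0 = W data, 1 = W flag (thread 1),
 * 2 = R flag, 3 = R data (thread 2). Returns whether the write of
 * data is ordered before the read of data. */
bool mp_write_ordered_before_read(bool reader_sees_flag)
{
    Rel po = {{false}}, rf = {{false}}, ord;
    po[0][1] = true;
    po[2][3] = true;
    if (reader_sees_flag)
        rf[1][2] = true;
    rel_union(ord, po, rf);
    rel_transitive_closure(ord);
    assert(rel_is_irreflexive(ord)); /* the ordering must be acyclic */
    return ord[0][3];
}
```

The closure costs O(N³) per computation, which hints at why recomputing such relations at every step can be expensive.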
Third, one can also provide approximations for the consistency checks. Such approximations entail storing crucial information about a memory model's relations as vector clocks (e.g., causally preceding events, for some notion of causality), but it is up to the user to decide what to store and how to encode it. Importantly, GenMC's performance depends not only on the calculators provided in the previous step, but also on the effectiveness of the approximations, which quickly filter out inconsistent exploration options. For instance, GenMC's current RC11 driver treats SC accesses as release-acquire (RA) accesses (the consistency of which can be quickly determined), and only checks for full RC11 consistency when an error has been triggered, a heuristic that seems to work well in practice for programs that have both SC and non-SC accesses.
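A vector clock of the kind mentioned above can be sketched as follows: each event carries, per thread, the number of that thread's events known to precede it, so that an "is ordered before" query becomes a single array lookup. The layout and the names `VClock`, `vc_join`, and `vc_contains` are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 4

/* upto[t] = number of events of thread t known to causally precede
 * (or be) the event that owns this clock. */
typedef struct {
    int upto[MAX_THREADS];
} VClock;

/* An event is identified by its thread and 1-based index in it. */
typedef struct {
    int thread;
    int index;
} EventId;

/* dst = pointwise maximum of dst and src (used on synchronization). */
void vc_join(VClock *dst, const VClock *src)
{
    for (int t = 0; t < MAX_THREADS; t++)
        if (src->upto[t] > dst->upto[t])
            dst->upto[t] = src->upto[t];
}

/* Constant-time check: is event e covered by the clock? */
bool vc_contains(const VClock *vc, EventId e)
{
    assert(e.thread >= 0 && e.thread < MAX_THREADS);
    return e.index <= vc->upto[e.thread];
}

/* Thread 0 performs a store; thread 1 performs a load. If the load
 * synchronizes with the store, its clock absorbs the store's clock. */
bool reader_ordered_after_writer(bool synchronizes)
{
    EventId store = { 0, 1 };
    VClock store_clock = { { 1, 0, 0, 0 } };
    VClock load_clock = { { 0, 1, 0, 0 } };
    if (synchronizes)
        vc_join(&load_clock, &store_clock);
    return vc_contains(&load_clock, store);
}
```

Compared with recomputing a transitive closure, the join costs O(number of threads) per synchronization and the query is O(1), which is what makes such approximations cheap enough to apply at every step.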
All in all, adding support for a memory model largely depends on the complexity of the model. Adding support for models like SC or RA is trivial, since such accesses are already supported as part of RC11 and IMM. In contrast, adding support for LKMM involved much more work, as we describe below.

Supporting the Linux Kernel Memory Model (LKMM)
LKMM [10] is a memory model that encompasses a variety of different architectures supported by the Linux kernel. As LKMM differs substantially from RC11 and IMM, supporting it required all steps described above as well as a few other engineering decisions, the most important of which are discussed below.
First, LKMM uses complex constraints for checking consistency of an execution graph. As repeatedly calculating these constraints can be expensive, we designed approximations for them. Unlike most other memory models, LKMM does not define a suitable happens-before relation for checking coherence and detecting races. (Its hb relation cannot be used for this purpose.) We thus defined a custom happens-before relation that can rule out inconsistent executions very quickly, and use it to approximate coherence and race detection checks.
Second, although LKMM dictates that non-atomic accesses (called plain in LKMM's jargon) only conditionally contribute to ppo, we incorporate such accesses in GenMC's ppo (thus arriving at a stronger notion of ppo), mostly for technical reasons. Specifically, calculating dependencies between non-plain accesses only is difficult because each non-plain access at the source-code level may map to several plain and non-plain accesses at the LLVM-IR level.
To increase confidence in our implementation, we ran all litmus tests distributed along with LKMM as part of the Linux kernel (32 tests in total), and compared our results with the results of the Herd [11] memory model simulator. Both tools explored the same number of executions for all tests.
In addition, we extracted some manually written tests from LKMM's supplementary repository [34] (categories atomic and kernel). We picked these categories as they contain tests written in C pseudocode (thus easily translatable to C) and do not contain tests with plain accesses, which, as described, GenMC treats slightly differently from what LKMM dictates. In total, these categories amount to another 84 tests, from which we excluded two tests containing unsupported primitives, one test for which Herd did not terminate within 42 hours, and three tests that cannot be cleanly translated to C. Out of the remaining 78 tests, GenMC explores the same number of executions for 75 tests. The discrepancies observed in the three remaining tests are due to the different way the two tools produce and calculate dependencies. (In GenMC, control dependencies extend to all subsequent memory accesses of the same thread, whereas in Herd they extend only to the merge point of a conditional statement.) We note that Herd took about 18 minutes to run all the above tests, while GenMC needed less than 2 seconds.

Supporting New Languages and Libraries
Supporting additional programming languages is straightforward as long as they can be compiled to LLVM. This was, for example, the case when we extended GenMC to accept C++ (the initial version accepted only C input). All we had to do was to create stub header files for the C++ library, and to extend the interpreter to recognize the memory (de)allocation calls generated by clang.
Supporting different runtime environments (e.g., JVM bytecode) requires constructing a new interpreter for the desired runtime system that calls the driver whenever a visible action is encountered. In addition, since the driver and the interpreter communicate using the LLVM type information, it may be necessary to add a translation layer between the interpreter(s) and the driver.
Supporting new concurrency libraries requires localized changes. If the library's semantics can be implemented in terms of memory accesses, one has to construct an appropriate header file or extend the interpreter to provide the mapping from library calls to the relevant memory access events. If this is not possible and/or if native support for a library is desirable (e.g., for performance reasons), then the execution graph has to be extended with new kinds of events and the consistency checks have to be adapted accordingly.
Next, we present two such library extensions, one mapping its calls to individual memory accesses, and the other creating new kinds of events.
System Calls. As part of [26], we extended GenMC with support for system calls, such as open(), close(), read() and write(), which can be modeled by making multiple primitive calls (reads and writes) to a different address space.
There are two ways one could implement these system calls: either by providing an actual implementation (which would then be compiled to LLVM-IR) or by adding support in the interpreter to internally implement those calls and communicating multiple times with the driver.
We preferred the latter solution because it is more portable. An external implementation would have to be manually ported whenever support for more languages is added. In contrast, the internal implementation needs no change. Further, even if a new interpreter for a different runtime system is added, it should be simple to decouple the system calls from the interpreter, and have the different runtime systems share the infrastructure that handles system calls.
Barriers. N-way barriers are a widely-used synchronization primitive. They have two functions: barrier_init and barrier_wait. The former initializes a barrier object with the number of threads that will rendezvous at the barrier, while the latter is called every time a thread reaches the barrier. A thread that is calling barrier_wait blocks until the initially specified number of threads reaches the barrier, at which point all threads will be simultaneously unblocked, and the barrier value will reset to the one specified with barrier_init.
Barriers can be straightforwardly implemented with a shared variable counting the number of threads that have called barrier_wait. But doing so yields poor model checking performance. For N threads calling barrier_wait, there are N! possible orders in which they can update the shared counter, thus crippling the performance of the tool. Tracking the order of these updates is not only expensive but also completely unnecessary. For many real-world use cases of barriers (e.g., scatter-gather workloads), the order in which different threads reach the barrier is irrelevant, as is the identity of the thread that reaches it last.
We leverage this intuition and provide built-in support for barrier_init and barrier_wait calls that does not track the relative ordering among barrier_wait calls synchronizing with one another, thereby achieving an exponential reduction in verification time. Concretely, in the simple program below where N threads execute barrier_wait concurrently, GenMC explores only one execution instead of N! executions:
barrier_wait(); ∥ ... ∥ barrier_wait();
Our extension is called BAM (Barrier-Aware Model-checking) and is detailed and evaluated in a companion paper [30].
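For contrast with the built-in support, the counter-based implementation discussed above can be sketched as a one-shot barrier built on an atomic counter. Each thread records the "ticket" it obtained from the counter, which makes visible the arrival order that a model checker would otherwise have to enumerate. The names (`naive_barrier_wait`, `run_naive_barrier`) are made up for this example, and the barrier is not reusable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_THREADS 4

static atomic_int arrived;

/* Naive one-shot barrier: bump the shared counter, then spin until
 * all N_THREADS threads have arrived. Returns the arrival position. */
static int naive_barrier_wait(void)
{
    int ticket = atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < N_THREADS)
        ; /* spin */
    return ticket;
}

static void *worker(void *arg)
{
    int *my_ticket = arg;
    *my_ticket = naive_barrier_wait();
    return NULL;
}

/* Runs N_THREADS workers through the barrier and returns the sum of
 * their tickets, or -1 if a thread could not be created. */
int run_naive_barrier(void)
{
    pthread_t t[N_THREADS];
    int ticket[N_THREADS];
    int sum = 0;

    atomic_store(&arrived, 0);
    for (int i = 0; i < N_THREADS; i++) {
        int rc = pthread_create(&t[i], NULL, worker, &ticket[i]);
        assert(rc == 0);
        if (rc != 0)
            return -1;
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N_THREADS; i++)
        sum += ticket[i];
    return sum;
}
```

The tickets always form a permutation of 0..N-1, so their sum is fixed, but which thread gets which ticket is one of N! possibilities. For four threads that is already 24 orderings of the counter updates, none of which matters to a client that only needs everyone to have arrived.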

Error Detection and Reporting
GenMC detects a number of different kinds of errors: violations of user-supplied regular and persistency assertions, data races, memory errors and simple cases of termination errors. It reports errors by printing an offending execution graph and highlighting the event(s) that caused the violation. Upon request, GenMC can also print a total ordering of the instructions that led to the violation, or produce the offending execution in the DOT graph description language.
Persistency Errors. To verify persistency properties of programs performing file I/O, we allow user programs to contain a special recovery routine [26], which would typically check some invariant over the persisted state.
When such a routine is present, GenMC simulates all possible ways in which the program could have crashed because of a power failure, executing the recovery routine at the end of every such execution. Of course, to avoid the obvious state-space explosion, the simulation of all the possible failures is done in an optimized fashion, driven by the memory accesses of the recovery routine.
The performance of GenMC when verifying persistency properties of programs under the ext4 filesystem has been evaluated in [26].
Memory Errors. Memory errors refer to accessing uninitialized, unallocated or deallocated memory. In models like RC11 [32], reasoning about memory safety can be tricky at times. Consider, for example, a program in which one thread allocates memory and publishes its address in a shared pointer p, while a second thread reads p and dereferences it, with all accesses to p being relaxed. Such a program is erroneous under RC11 because the allocation of p is not guaranteed to have propagated to the second thread by the time it is dereferenced. (Since all accesses are relaxed, there is no synchronization between the threads.) GenMC also accounts for more complicated scenarios such as p being concurrently freed when accessed, p being freed twice, or p being the address of a local (stack) variable that might not be alive when accessed.
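A minimal sketch of a program with this kind of error is given below. It is a reconstruction under the assumption that one thread publishes a freshly allocated pointer with a relaxed store and another thread dereferences whatever it reads with a relaxed load; it is not the paper's original listing, and all names are chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(int *) p;

/* Allocates a cell and publishes its address without synchronization. */
static void *allocator(void *arg)
{
    (void)arg;
    int *fresh = malloc(sizeof *fresh);
    atomic_store_explicit(&p, fresh, memory_order_relaxed);
    return NULL;
}

/* Dereferences the pointer if it has been published. */
static void *user(void *arg)
{
    (void)arg;
    int *q = atomic_load_explicit(&p, memory_order_relaxed);
    if (q != NULL)
        *q = 42; /* erroneous under RC11: the allocation is not ordered
                    before this access */
    return NULL;
}

/* Runs both threads once; returns 1 if a cell was allocated, 0 if
 * malloc failed, and -1 if a thread could not be created. */
int run_relaxed_publication(void)
{
    pthread_t a, u;
    atomic_store(&p, (int *)NULL);
    if (pthread_create(&a, NULL, allocator, NULL) != 0)
        return -1;
    if (pthread_create(&u, NULL, user, NULL) != 0) {
        pthread_join(a, NULL);
        free(atomic_load(&p));
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(u, NULL);
    int *q = atomic_load(&p);
    int allocated = (q != NULL);
    free(q);
    return allocated;
}
```

On common hardware this program typically runs without visible misbehavior, which is why such errors are easy to miss in testing. Publishing the pointer with `memory_order_release` and reading it with `memory_order_acquire` would order the allocation before the dereference and remove the error.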
Refining Error Reports. It is often useful to refine the error reporting. For example, in memory models that treat data races as errors (such as RC11), GenMC by default detects data races and reports them as errors. This, however, can be costly in terms of verification time or even prohibit the verification of programs that use compiler/custom primitives to access shared memory, as such programs would almost certainly be considered racy.
To deal with such cases, GenMC provides switches that disable race detection and refine the range of errors that will be reported to the user. Switches of the latter kind are especially useful when dealing with programs that contain system calls. By default, when such system calls fail, GenMC reports an error, which is inconvenient for programs that contain proper error handling, as some system errors are rather benign (e.g., a file not existing). With the appropriate switch, in case of system errors, an appropriate value is written in errno, as dictated by the POSIX standard.
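The kind of client-side error handling that benefits from such a switch can be sketched as follows: the program inspects errno after a failed open() and treats a missing file as an ordinary outcome rather than a fatal condition. The helper name `try_open_readonly` is hypothetical; only standard POSIX calls are used.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if path could be opened for reading, or the errno value
 * reported by open() otherwise. */
int try_open_readonly(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno; /* e.g., ENOENT when the file does not exist */

    int rc = close(fd);
    assert(rc == 0);
    (void)rc;
    return 0;
}
```

A caller can then branch on the result, for instance creating the file when the result is `ENOENT`. If every failing system call were reported as a verification error, this perfectly valid path could not be explored.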
Case Study. We demonstrate the error reporting capabilities of GenMC with a real use case. We consider a flat-combining queue [19] that has been proposed for inclusion in Rust's crossbeam library.
This queue serves as a nice case study for several reasons. First, it contains loops that can diverge, and so its verification requires loop bounding, which GenMC can do automatically. Second, it is implemented using compiler primitives for concurrent accesses, and so its verification requires disabling race detection. Third, while experimenting with it, we found it to be buggy.
[Fig. 2: error report produced by GenMC, beginning "Error detected: Attempt to read from uninitialized memory! Event (3,63)"]
The error report produced by GenMC can be seen in Fig. 2. The error is quite intricate: it requires three threads to manifest, each of which executes a large number of instructions. The error is due to an ordering bug (relaxed accesses are used instead of release/acquire), which demonstrates the need for model checking tools that handle weak memory models.
We note that the error report contains helpful debugging information, such as the names of variables accessed (e.g., m.msg. meta.next) and the values read/written. To display this information, GenMC maintains a mapping from addresses to program variables using the additional debugging information collected in the "Transformation" phase.

Other Performance Enhancements to GenMC
In this section, we briefly discuss two recent changes to the driver to optimize its performance for certain kinds of programs.
Symmetry Reduction. Many programs, such as the flat-combining queue of §6, have a symmetric structure: each thread runs the same code. In such cases, many execution graphs are equivalent up to some thread relabeling, a property that is exploited by symmetry reduction (SR) [16,20].
We implemented a simple SR algorithm that detects whether multiple threads with the same code are spawned with no intervening memory accesses, and avoids exploring executions for which a symmetric one (by relabeling such threads) has already been explored. This can yield exponential improvements.
To demonstrate the benefits of SR, we measured the performance of GenMC with and without SR on some realistic lock implementations adapted from the literature. The results can be seen in Table 1. All reported times are in seconds, unless mentioned otherwise. We ran both GenMC versions three times for each benchmark, with an increasing number of threads each time (the initial thread number for each benchmark is provided in the second column). As can be seen, SR leads to a significant performance improvement in all cases.
Lock-Aware Partial Order Reduction. A common problem with locking is that of false sharing, where N threads contend to acquire the same lock even if it is unnecessary for correctness. In such cases, GenMC's partial order reduction algorithm [29] will explore all N! orders in which the lock can be acquired even though they all lead to the same outcome.
We have implemented lock-aware partial order reduction (LAPOR) [28], an enhancement to partial order reduction that does not track ordering among locks unless their critical regions have conflicting accesses, in which case the lock ordering is induced from the ordering among those accesses. With LAPOR, GenMC achieves exponential improvements in lock-based implementations of concurrent libraries that have false sharing, such as search trees with coarse-grained or hand-over-hand locking. LAPOR has been evaluated in [28].
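False sharing of a lock can be illustrated with a small sketch in which two threads take the same mutex but touch disjoint variables, so the order of lock acquisition cannot influence the result. The program and its names (`run_false_sharing`, `inc_x`, `inc_y`) are an assumption-laden toy, not one of the benchmarks from [28].

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int x, y;

/* Critical section touching only x. */
static void *inc_x(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    x++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Critical section touching only y. */
static void *inc_y(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    y++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Runs both threads once and returns x + y, or -1 if a thread could
 * not be created. */
int run_false_sharing(void)
{
    pthread_t a, b;
    x = 0;
    y = 0;
    if (pthread_create(&a, NULL, inc_x, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, inc_y, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    assert(x == 1 && y == 1);
    return x + y;
}
```

Both acquisition orders end in the same state, so an exploration that distinguishes them does redundant work. With N such threads the redundancy grows to N! orders, which is the case a lock-aware reduction is designed to collapse.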

Conclusion
We presented GenMC, a state-of-the-art stateless model checker that can be used to verify consistency and persistency properties of C/C++ programs. We described its architecture, and how its modular design can be leveraged to account for new features and memory models. To widen the applicability of GenMC, we have extended it with support for LKMM, basic system calls and additional synchronization primitives. We have also improved its performance with optimizations, such as symmetry reduction and lock-aware partial order reduction, that can exponentially decrease its search space.
In the future, we plan to implement a DSL for memory models, so as to make it easier to extend GenMC with new models and quickly tweak their approximation strategies. We are also planning to incorporate further optimizations into the tool to enable more effective verification of lock-free algorithms.