
1 Introduction

Background. With the advent of multicore processors, multithreaded programming—a notoriously error-prone enterprise—has become increasingly important.

Because of this, mainstream languages have started to offer core support for higher-level communication primitives besides lower-level synchronisation primitives (e.g., Clojure, Go, Kotlin, Rust). The idea has been to add message passing as an abstraction on top of shared memory, for—supposedly—channels are easier to use than locks. However, empirical research shows that, actually, “message passing does not necessarily make multithreaded programs less error-prone than shared memory” [36]. One of the core challenges is as follows: given a specification S of the communication protocols that an implementation I should fulfil, how to prove that I is safe and live relative to S? Safety means that “bad” channel actions never happen: if a channel action happens in I, then it is allowed to happen by S (protocol compliance). Liveness means that “good” channel actions eventually happen (communication deadlock freedom).

Multiparty Session Typing (MPST). MPST [17] is a formal method to automatically prove safety and liveness of implementations relative to specifications. The idea is to implement communication protocols as sessions (of communicating threads), specify them as behavioural types [1, 21], and verify the former against the latter using behavioural type checking. Formally, the central theorem is that well-typedness implies safety and liveness. Over the past fifteen years, much progress has been made, including the development of many tools to combine MPST with mainstream languages (e.g., F# [31], F\(^\star \) [37], Go [9], Java [19, 20], OCaml [22], Rust [26, 27], Scala [3, 10, 11, 34], and TypeScript [29]).

Behavioural type checking can be done statically at compile-time or dynamically at run-time. The disadvantage of static MPST is that it is conservative: statically checking each possible run of a session is often prohibitively complicated (if computable at all), so sessions are often unnecessarily rejected. In contrast, the advantage of dynamic MPST is that it is liberal: dynamically checking one actual run of a session is much simpler, so sessions are never unnecessarily rejected.

This Work. Discourje (pronounced “discourse”) [13, 14, 18] is a library that adds dynamic MPST to Clojure. It has a specification language to write behavioural types (embedded as an internal DSL in Clojure) and a verification engine to dynamically type-check sessions against them. The key design goals have been to achieve high expressiveness (cf. static MPST) and to be particularly mindful of ergonomics (i.e., make Discourje’s usage as frictionless as possible).

In a nutshell, at run-time, Discourje’s dynamic type checker simulates behavioural type S, as if it were a state machine, alongside session I. Each time a channel action is about to happen in I, the dynamic type checker intervenes and first verifies whether a corresponding transition can happen in S. If so, both the channel action and the transition happen. If not, an exception is thrown.
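To make the monitoring idea concrete, the following is a minimal, language-neutral sketch in Python, not Discourje's actual implementation. It assumes the behavioural type has already been compiled into a finite table of transitions; the names `Monitor`, `ProtocolViolation`, and the `(sender, receiver, label)` encoding of channel actions are hypothetical and chosen only for illustration.

```python
class ProtocolViolation(Exception):
    """Raised when a channel action has no matching transition in the type."""


class Monitor:
    """Simulates a behavioural type that is given as a finite state machine."""

    def __init__(self, initial, transitions):
        # transitions maps (state, action) -> next state. An action is a
        # hypothetical (sender, receiver, label) triple.
        self.state = initial
        self.transitions = transitions

    def step(self, action):
        """Advance the type if the action is allowed; otherwise raise."""
        key = (self.state, action)
        if key not in self.transitions:
            raise ProtocolViolation(
                f"{action} is not allowed in state {self.state}")
        self.state = self.transitions[key]


# Fragment of Two-Buyer: Buyer1 sends the title, then Seller sends a quote.
two_buyer = {
    (0, ("buyer1", "seller", "title")): 1,
    (1, ("seller", "buyer1", "quote")): 2,
}

monitor = Monitor(0, two_buyer)
monitor.step(("buyer1", "seller", "title"))  # allowed: state 0 -> state 1
```

In this sketch, the call to `step` would sit in front of every real channel action: the action proceeds only if `step` returns normally, and a disallowed action surfaces as an exception without changing the monitor's state.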

However, while safety violations are detected in this way (protocol incompliance), liveness violations are not (communication deadlocks: threads cyclically depend on each other’s channel actions, and so, they collectively get stuck). This is a serious limitation relative to static MPST. In this paper, we present an extension of Discourje to also detect liveness violations. Achieving this, without compromising the key design goals, has been an elusive problem that for years we did not know how to solve (e.g., we could not reuse variants of existing techniques for static MPST at run-time, as this would negatively affect expressiveness).

Section 2 of this paper demonstrates that it can be done, while Sect. 3 outlines how. The key idea is to use “mock” channels, which mimic “real” channels, to track ongoing communications: before any channel action happens on a real channel, it is first tried on a corresponding mock channel, allowing us to check if all threads would get stuck in a total communication deadlock as a result.

Fig. 1. Discourje and Clojure in a nutshell

2 Demonstration

We demonstrate the extension to detect liveness violations with two examples. For reference, Fig. 1 summarises the main elements of Discourje and Clojure.

Fig. 2. Two-Buyer (Example 1)

Example 1

The Two-Buyer protocol consists of Buyer1, Buyer2, and Seller [17]: “Buyer1 and Buyer2 wish to buy an expensive book from Seller by combining their money. Buyer1 sends the title of the book to Seller, Seller sends to both Buyer1 and Buyer2 its quote, Buyer1 tells Buyer2 how much she can pay, and Buyer2 either accepts the quote or rejects the quote by notifying Seller.”

Figure 2 shows a behavioural type and a session. The session is safe and live. In contrast, if line 11 had accidentally been written such that Buyer1 tries to receive from Buyer2 instead of Seller, then the session deadlocks. The original Discourje does not detect this liveness violation, but with the extension, an exception is thrown.    \(\square \)

Fig. 3. Load Balancing (Example 2)

Example 2

The Load Balancing protocol consists of Client, Server1, Server2, and LoadBalancer. First, a request is communicated synchronously from Client to LoadBalancer, and asynchronously from LoadBalancer to Server1 or Server2. Next, the response is communicated synchronously from that server to Client.

Figure 3 shows a behavioural type and a session. The session is safe but not live. There are two deadlocks. The first one occurs because, on lines 19 and 23, Server1 and Server2 try to receive through channels other than the ones they should use. The second deadlock occurs because one of the servers will never receive a value and, as a result, blocks the entire program from terminating. The original Discourje does not detect these liveness violations, but with the extension, exceptions are thrown.    \(\square \)

3 Technical Details

Requirements. In this section, we outline how the extension to detect liveness violations works, focussing on the core deadlock detection algorithm. We begin by stating the rather complicated requirements for this algorithm, as entailed by Discourje’s key design goals regarding expressiveness and ergonomics (Sect. 1):

  • Expressiveness: The algorithm must be applicable to any combination of buffered and unbuffered channels, and to all three functions of Fig. 1 for sending, receiving, and selecting. Thus, the programmer can continue to freely mix synchronous and asynchronous sends/receives, possibly selected dynamically.

  • Ergonomics: The algorithm must call only into the public API of Clojure’s standard libraries, without modifying the internals, and without relying on JVM interoperability. Thus, the programmer can write portable code that runs on different versions of Clojure and on different architectures.

The combination of these requirements has made the design of the algorithm elusive. For instance, the expressiveness requirement means that we cannot simply reuse existing distributed algorithms for deadlock detection (e.g., [6, 16, 25, 35]), as they typically do not support mixing of synchrony and asynchrony. The ergonomics requirement means that we cannot instrument Clojure’s internal code to manage threads, nor can we use Java’s thread monitoring facilities.

Terminology. A channel action is either a send of v through channel ch, represented as [ch v], or a receive through channel ch, represented as just ch (cf. Fig. 1). A channel action is pending if it has been initiated but not yet completed. A pending channel action is either enabled or disabled, depending on ch:

  • when ch is a buffered channel, a pending send is enabled iff ch is non-full, while a pending receive is enabled iff ch is non-empty;

  • when ch is an unbuffered channel, a pending send is enabled iff a corresponding receive is pending, and vice versa.

When a thread initiates channel actions, but they are disabled, it is suspended. When a disabled channel action becomes enabled, the suspended thread is resumed. A communication deadlock is a situation where each thread is suspended.
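The enabledness rules above can be written down as two small predicates. The following Python sketch is only a model of the terminology, under the assumption that a channel is described by its capacity, its buffer, and counters of pending actions; the `Channel` class and its field names are hypothetical and do not correspond to Clojure's or Discourje's data structures.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Channel:
    capacity: int = 0                 # 0 models an unbuffered channel
    buffer: deque = field(default_factory=deque)
    pending_sends: int = 0            # sends initiated but not yet completed
    pending_receives: int = 0         # receives initiated but not yet completed


def send_enabled(ch):
    """A pending send is enabled iff the buffer is non-full (buffered)
    or a corresponding receive is pending (unbuffered)."""
    if ch.capacity > 0:
        return len(ch.buffer) < ch.capacity
    return ch.pending_receives > 0


def receive_enabled(ch):
    """A pending receive is enabled iff the buffer is non-empty (buffered)
    or a corresponding send is pending (unbuffered)."""
    if ch.capacity > 0:
        return len(ch.buffer) > 0
    return ch.pending_sends > 0


buffered = Channel(capacity=1)        # empty buffer of size 1
unbuffered = Channel()                # no pending actions yet
```

With these definitions, a send on the empty `buffered` channel is enabled while a receive is not, and both actions on `unbuffered` are disabled until the opposite action becomes pending.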

Setting the Stage. Normally, channel actions are initiated via the functions for sending, receiving, and selecting (Fig. 1). When these functions are called using the extension, the dynamic type checker intervenes and first calls a dedicated detection function to initiate corresponding “mock” channel actions on “mock” channels. Each mock channel mimics a “real” channel and is used only by the dynamic type checker.

The mock channels have the same un/buffered properties and contents as the real channels, except that values are replaced with tokens. So, if a deadlock is detected on the mock channels, then a deadlock will occur on the real channels, too. (Mock channels are also essential to detect safety violations.)

To initiate the mock channel actions, a separate function in the public API of Clojure’s standard libraries is used. It resembles its blocking counterpart, except that it never suspends the calling thread. Instead, a call of this function immediately returns and, asynchronously, initiates the channel actions in \( acts \) and calls \( f \) when one is completed. In this way, initiation of mock channel actions can be decoupled from suspension of threads (demonstrated below).

Algorithm. Consider the total number of threads. The idea to detect deadlocks is to identify the situation when all threads but one are already suspended, while the last thread is about to be suspended. In that situation, instead of suspending the last thread, an exception is thrown to flag the liveness violation. In code:

figure t

The first helper function checks if any of the channel actions is enabled. If so, it immediately initiates and completes it, and returns the result (of the form [v ch]). If not, the function returns a special value to indicate that the current thread would indeed be suspended if the channel actions were to be initiated. In code:

figure y

On line 7, an optional parameter configures the call such that it immediately returns when all channel actions are disabled.

The second helper function increments the number of suspended threads and checks if that number is less than the total number of threads. If so, it initiates the channel actions and actually suspends the current thread. If not, the function returns a special value to indicate that the current thread is indeed the last one, so a deadlock is detected. In code:

figure ah
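As an illustration of the general idea only, the two steps can be modelled in a few lines of Python. This is a single-threaded sketch under simplifying assumptions, not the authors' Clojure code: enabledness is supplied as a hypothetical predicate `enabled`, actual blocking and resumption are omitted, and the names `Detector`, `try_now`, `suspend`, and `DeadlockDetected` are invented for this sketch.

```python
class DeadlockDetected(Exception):
    """Raised instead of suspending the last running thread."""


class Detector:
    def __init__(self, n_threads):
        self.n_threads = n_threads    # total number of threads
        self.suspended = 0            # number of threads currently suspended

    def try_now(self, actions, enabled):
        """Step 1: return the first enabled action, or None if all are disabled."""
        for act in actions:
            if enabled(act):
                return act
        return None

    def suspend(self):
        """Step 2: count the caller as suspended.
        Returns False if the caller is the last thread (deadlock)."""
        self.suspended += 1
        return self.suspended < self.n_threads

    def initiate(self, actions, enabled):
        act = self.try_now(actions, enabled)
        if act is not None:
            return act                # completed immediately, no suspension
        if not self.suspend():
            raise DeadlockDetected("every thread would be suspended")
        return None                   # a real thread would block here


# Two threads whose channel actions are all disabled.
detector = Detector(n_threads=2)
first = detector.initiate(["recv-a"], enabled=lambda act: False)
try:
    detector.initiate(["recv-b"], enabled=lambda act: False)
    outcome = "no deadlock"
except DeadlockDetected:
    outcome = "deadlock"
```

The first call models a thread that gets suspended; the second call models the last thread, for which an exception is raised instead of a suspension. The sketch deliberately runs the two steps non-atomically, as in the presentation above.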

The code shown so far explains the general idea behind the algorithm. However, the details are more involved: our presentation does not yet account for data races, several of which are possible. For instance, suppose that there are two threads (Alice and Bob), that they initiate corresponding channel actions (no deadlock), and that calls of the two helper functions are scheduled as follows:

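(1) Alice executes the first helper function. It returns without completing a channel action. (2) Bob executes the first helper function. It, again, returns without completing a channel action, as Alice has not yet executed the second helper function. (3) Bob executes the second helper function. It increments the number of suspended threads to 1 and suspends Bob. (4) Alice executes the second helper function. It increments the number of suspended threads to 2, detects that Alice is last, and immediately returns to signal a deadlock.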

At this point, mistakenly, an exception is thrown. There are more subtle data races, too. The core issue is that the two helper functions should be run atomically to avoid problematic schedules (e.g., the one above). Details appear in the technical report [23, Sect. A]. The actual source code was validated using both unit tests and whole-program tests.
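The problematic schedule can be replayed deterministically. The Python sketch below first replays the four steps with the check and the registration as separate steps, and then shows one possible remedy, namely guarding both steps with a single lock. The lock-based variant is an assumption made for illustration; the paper's actual solution is described in the technical report and may differ. All names (`replay_naive_schedule`, `AtomicDetector`) are hypothetical.

```python
import threading


def replay_naive_schedule():
    """Replays steps (1)-(4); returns True if a spurious deadlock is reported."""
    n_threads, suspended = 2, 0
    pending = set()                         # threads whose action is registered
    alice_found = "bob" in pending          # (1) Alice checks: nothing enabled
    bob_found = "alice" in pending          # (2) Bob checks: Alice not registered yet
    suspended += 1                          # (3) Bob registers and is suspended
    pending.add("bob")
    suspended += 1                          # (4) Alice registers ...
    alice_is_last = not (suspended < n_threads)   # ... and appears to be last
    return (not alice_found) and (not bob_found) and alice_is_last


class AtomicDetector:
    """Check and registration happen under one lock (assumed remedy)."""

    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.pending = set()                # names of suspended threads
        self.lock = threading.Lock()

    def initiate(self, me, peer):
        with self.lock:
            if peer in self.pending:        # matching action is already pending
                self.pending.discard(peer)  # the peer is resumed
                return "completed"
            self.pending.add(me)
            if len(self.pending) < self.n_threads:
                return "suspended"
            return "deadlock"


atomic = AtomicDetector(n_threads=2)
alice_result = atomic.initiate("alice", peer="bob")
bob_result = atomic.initiate("bob", peer="alice")
```

In the naive replay, Alice is wrongly identified as the last thread although Bob's matching action exists. With the check and the registration made atomic, whichever thread arrives second observes the first thread's pending action and completes the communication.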

4 Conclusion

Closest to the work in this paper is existing work on dynamic MPST [4, 15, 30, 31, 32] and alternative forms of dynamic behavioural typing [7, 8, 12, 28]. However, none of these tools can check for liveness at run-time. Also closely related is existing work on dynamic deadlock detection in distributed systems (e.g., [6, 16, 25, 35]). However, as stated in Sect. 3, these algorithms do not fit our requirements. Finally, we are aware of two other works that use formal techniques to reason about Clojure programs: the formalisation of an optional type system for Clojure [5], and a translation from Clojure to Boogie [2, 33]. In future work, we aim to study and optimise the performance overhead of our deadlock detection algorithm.