A Case Study for Reversible Computing: Reversible Debugging of Concurrent Programs

Reversible computing allows one to run programs not only in the usual forward direction, but also backward. A main application area for reversible computing is debugging, where one can use reversibility to go backward from a visible misbehaviour towards the bug causing it. While reversible debugging of sequential systems is well understood, reversible debugging of concurrent and distributed systems is less settled. We present here two approaches for debugging concurrent programs, one based on backtracking, which undoes actions in reverse order of execution, and one based on causal consistency, which allows one to undo any action provided that its consequences, if any, are undone beforehand. The first approach tackles an imperative language with shared memory, while the second one considers a core of the functional message-passing language Erlang. Both approaches are based on solid formal foundations.

One of the oldest and most explored application areas for reversible computing is program debugging. This can be explained by looking, on the one hand, at the relevance of the problem, and, on the other hand, at how naturally reversible computing fits in the picture. Concerning the former, finding and fixing bugs in software has always been a main activity in the software development life cycle. Indeed, according to a 2014 study [47], the cost of debugging amounts to $312 billion annually. Another recent study [3] estimates that the time spent in debugging is 49.9% of the total programming time. Concerning how naturally reversible computing fits in this context, consider that debugging means finding a bug, i.e., some wrong line of code, causing some visible misbehaviour, i.e., a wrong effect of a program, such as a wrong message printed on the screen. In general, the execution of the wrong line precedes the wrong visible effect. For instance, a wrong assignment to a variable may cause a misbehaviour later on, when the value of the variable is printed on the screen. Usually, the programmer has a very precise idea about which line of code makes the misbehaviour visible, but a non-trivial debugging activity may be needed to find the bug. Indeed, standard debugging practice requires putting a breakpoint before the line of code where the programmer thinks the bug is, and using step-by-step execution from there to find the wrong line of code. However, the guess of the location of the bug is frequently wrong, causing the breakpoint to occur too late (after the bug), and a new execution with an updated guess is often needed. Reversible debugging practice is more direct: first, run the program and stop when the visible misbehaviour is reached; then, execute backwards (possibly step-by-step) looking for the causes of the misbehaviour until the bug is found.
With these premises, it is no surprise that reversible debugging has been explored in depth, as shown for instance by the survey in [11]. Indeed, many debuggers provide features for reversible execution, including popular open source debuggers such as GDB [8] as well as tools from big corporations such as Microsoft, as in the case of WinDbg [34].
However, the problem is far less settled for concurrent and distributed programs. We remark that nowadays most software is concurrent, either because the platform is distributed, as in the case of the Internet or the Cloud, or to overcome the advent of the power wall [46]. Finding bugs in concurrent and distributed software is more difficult than in sequential software [33], since faults may appear or disappear according to the speed of the different processes and of the network communications. The bugs generating these faults, called Heisenbugs, are thus particularly challenging because they are rather difficult to reproduce.

Two approaches to reversible debugging of concurrent systems have been proposed. Using backtracking, actions are undone in reverse order of execution, while using causal-consistent reversibility [25] actions can be undone in any order, provided that the consequences of a given action, if any, are undone beforehand. Note that, by exploring a computation back and forth using either backtracking or causal-consistent reversibility, one is guaranteed that Heisenbugs that occurred in the computation will not disappear.

This paper presents two lines of research on debugging for concurrent systems developed within the European COST Action IC1405 on "Reversible Computation - Extending Horizons of Computing" [23]. They share the use of state saving to enable backward computation (this is called a Landauer embedding [24], and it is needed to tackle languages which are irreversible) and a formal approach aiming at supporting debugging tools with a theory guaranteeing the desired properties. The first line of research [20][21][22] (Sect. 3) supports backtracking (apart from some non-relevant actions) for a concurrent imperative language with shared memory, while the second line of research [28][29][30][36] (Sect. 4) supports causal-consistent reversibility for a core subset of the functional message-passing language Erlang.
We will showcase both approaches on the same airline booking example (Sect. 2), coded in the two languages. Related work is discussed in Sect. 5 and final remarks are presented in Sect. 6.

Airline Booking Example
In this section we introduce an example program that contains a bug, and discuss a specific execution leading to a corresponding misbehaviour. This example will be used as a running example throughout the paper. We will show it in the two programming languages needed for the two approaches mentioned above. We begin by introducing each of these languages.

Imperative Concurrent Language
Our first language is much like any while language, consisting of assignments, conditional statements and while loops. Support has also been added for block statements containing the declaration of local variables and/or procedures, as well as procedure call statements. Further to this, removal statements are introduced to "clean up" at the end of a block, where any variables or procedures declared within the block are removed. Our language also contains unique names given to each conditional, loop, block, procedure declaration and call statement, named construct identifiers (represented as i1.0, w1.0, b1.0, etc.), and sequences of block names in which a given statement resides, named paths (represented as pa). Both of these are used to handle variable scope, allowing one to distinguish different variables with the same name. The final addition to our language is interleaving parallel composition. A parallel statement, written P par Q, allows the execution of the programs P and Q to interleave. All statements except blocks contain a stack A that is used to store identifiers (see below). The syntax of our language follows, where ε represents an empty program. Note that ε is the neutral element of sequential and parallel composition. We write (pa,A)? to denote the fact that (pa,A) is optional. We also write In, Wn, Bn, Cn to range, respectively, over identifiers for conditionals, while loops, blocks and call statements. Also, n refers to the name of a procedure.

    P  ::= ε | S | P; P | P par P
    S  ::= skip (pa,A)? | X = E (pa,A) | if In B then P else Q end (pa,A)
         | while Wn B do P end (pa,A) | begin Bn BB end | call Cn n (pa,A)
    BB ::= DV; DP; P; RP; RV
    DV ::= ε | var X = v (pa,A); DV
    DP ::= ε | proc Pn n is P end (pa,A); DP
    RV ::= ε | remove X = v (pa,A); RV
    RP ::= ε | remove Pn n is P end (pa,A); RP

Operational Semantics. Our approach (see [20] for a detailed explanation) to reversing programs starts by producing two versions of the original program.
The first one, named the annotated version, performs forward execution and saves any information that would be lost in a normal computation but is needed for inversion (named reversal information and saved into our auxiliary store δ).
Identifiers are assigned to statements as we execute them, capturing the interleaving order needed for correct inversion. The second one, named the inverted version, executes forwards but simulates reversal using the reversal information as well as the identifiers to follow backtracking order. We comment here that we use 'inversion' to refer to both the process of producing the program code of the inverted version (program inverter [1]), and to the process of executing the inverted version of a program. A reverse execution computes all parallel statements as in a forward execution, but it uses identifiers to determine which statement to invert next (instead of nondeterministically deciding). For programs containing many nested parallel statements, the overhead of determining the correct interleaving order increases, though we still deem this as reasonable [19]. Note that using a nondeterministic interleaving for the reverse execution is not possible, since it is not guaranteed to behave correctly (e.g., requiring information from the auxiliary store that is not there may cause an execution to be stuck). However, a small number of execution steps, including closing a block and removing a skip, do not use an identifier and can therefore be interleaved nondeterministically during an inverse execution. Forward and reverse execution are each defined in terms of a non-standard, small step operational semantics. Our semantics perform both the expected execution (forward and reverse respectively) and all necessary saving/using of the reversal information. Consider the example rule [D1a] for assignments, which is a reversibilisation of the traditional irreversible semantics of an assignment statement [51].
This rule consists of the evaluation of the expression e to the value v, the evaluation of the variable X to a memory location l, and finally the assignment of the value v to the memory location l, as expected. Alongside this, the rule also pushes the old value of the variable (the current value held at the memory location, namely σ(l)) onto the stack for this variable name within δ (written as an update of δ in which the pair (m,σ(l)) is pushed onto the stack for X). This old value is saved alongside the next available identifier m, returned via the function next() and used within the rule to record the interleaving order (represented using the labelled transition).

Now consider the rule [D1r] from our inverse semantics for reversing assignments (that executed forwards via [D1a]).
This rule first ensures that this is the next statement to invert using the identifier m, which must match the last used identifier (previous()) and be present in both the statement's stack (A = m:A') and the auxiliary store alongside the old value (δ(X) = (m,v):X'). Provided this is satisfied, the rule then removes all occurrences of m, and assigns the old value v retrieved from δ to the corresponding memory location.
Note that e appears exactly as in the original version but it is not evaluated, and that the functions next() and previous() both update the next and previous identifiers respectively as a side effect.
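The interplay between the forward rule and its inverse can be illustrated with a small Python sketch. This is only an illustration under simplifying assumptions, not the semantics of [20]: the class ReversibleStore and its method names are hypothetical, variables map directly to values (memory locations are skipped), and a single counter plays the role of both next() and previous().

```python
class ReversibleStore:
    """Toy state-saving store (hypothetical names, not the rules of [20]).

    sigma maps variables to values; delta keeps, per variable, a stack of
    (identifier, old value) pairs, i.e. the reversal information."""

    def __init__(self, **initial):
        self.sigma = dict(initial)
        self.delta = {}
        self.last_id = 0  # last identifier handed out

    def assign(self, var, value, stmt_stack):
        """Forward step: save the old value under a fresh identifier."""
        self.last_id += 1                      # plays the role of next()
        m = self.last_id
        self.delta.setdefault(var, []).append((m, self.sigma[var]))
        stmt_stack.append(m)                   # the statement records m too
        self.sigma[var] = value

    def unassign(self, var, stmt_stack):
        """Reverse step: allowed only for the most recently executed step."""
        saved = self.delta.get(var, [])
        if not saved or not stmt_stack:
            raise RuntimeError("nothing to invert")
        m, old = saved[-1]
        if not (stmt_stack[-1] == m == self.last_id):
            raise RuntimeError("not the next statement to invert")
        saved.pop()                            # remove all occurrences of m
        stmt_stack.pop()
        self.last_id -= 1                      # plays the role of previous()
        self.sigma[var] = old                  # restore the old value


store = ReversibleStore(seats=3)
first, second = [], []                 # identifier stacks of two statements
store.assign("seats", 2, first)        # receives identifier 1
store.assign("seats", 1, second)       # receives identifier 2
store.unassign("seats", second)        # must be inverted first
store.unassign("seats", first)
print(store.sigma)
```

Attempting to invert the first assignment before the second raises an error in this sketch, which mirrors the requirement that the identifier at the top of the statement's stack matches the last used identifier.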

Erlang
Our second approach deals with a relevant fragment of the functional and concurrent language Erlang. We show in Fig. 1 the syntax of its main constructs, focusing on the ones needed in our running example. We drop from the syntax some declarations related to module management, which are orthogonal to our purpose in this paper. A program is a sequence of function definitions, where each function has a name (an atom, denoted by a) and is defined by a number of equations of the form a_i(p_{i1}, ..., p_{in_i}) when g_i → e_i, where p_{i1}, ..., p_{in_i} are patterns (i.e., terms built from variables and data constructors), g_i is a guard (typically an arithmetic or relational expression only involving built-in functions), and e_i is an arbitrary expression. As is common, the variables in p_{i1}, ..., p_{in_i} are the only variables that may occur free in g_i and e_i. The body of a function is an expression, which can include variables, literals (i.e., atoms, integers, floating point numbers, the empty list [ ], etc.), lists (using Prolog-like notation, i.e., [e_1|e_2] is a list with head e_1 and tail e_2), tuples (denoted by {e_1, ..., e_n}), function applications (we do not consider higher order functions in this paper for simplicity), pattern matching, sequences (denoted by comma), receive expressions, spawn (for creating new processes), "!" (for sending a message), and self. Note that some of these functions are actually built-ins in Erlang.
In contrast to expressions, patterns are built from variables, literals, lists, and tuples. Patterns can only contain fresh variables. In turn, values are built from literals, lists, and tuples (i.e., values are ground patterns). In Erlang, variables start with an uppercase letter.
Let us now informally introduce the semantics of Erlang constructions. In the following, substitutions are denoted by Greek letters σ, θ, etc. A substitution σ denotes a mapping from variables to expressions, where Dom(σ) is its domain. Substitution application σ(e) is also denoted by eσ.
Given the pattern matching p = e, we first evaluate e to a value, say v; then, we check whether v matches p, i.e., there exists a substitution σ for the variables of p with v = pσ (otherwise, an exception is raised). Then, the expression reduces to v, and variables are bound according to σ. Roughly speaking, a sequence (p = e_1, e_2) is equivalent to the expression let p = e_1 in e_2 in most functional programming languages.
A similar pattern matching operation is performed during a function application a(e_1, ..., e_n). First, one evaluates e_1, ..., e_n to values, say v_1, ..., v_n. Then, we scan the left-hand sides of the equations defining the function a until we find one, say with guard g and body e, that matches a(v_1, ..., v_n) with some substitution σ. Here, we should also check that the guard, gσ, reduces to true. In this case, execution proceeds with the evaluation of the function's body, eσ.
Let us now consider the concurrent features of our language. In Erlang, a running system can be seen as a pool of processes that can only interact through message sending and receiving (i.e., there is no shared memory). Received messages are stored in the queues of processes until they are consumed; namely, each process has one associated local (FIFO) queue. A process is uniquely identified by its pid (process identifier). Message sending is asynchronous, while receive instructions block the execution of a process until an appropriate message reaches its local queue (see below).
We consider the following functions with side-effects: self, "!", spawn, and receive. The expression self() returns the pid of a process, while p ! v evaluates to v and, as a side-effect, sends message v to the process with pid p, which will eventually be stored in p's local queue. New processes are spawned with a call of the form spawn(mod, a, [v_1, ..., v_n]), where mod is the name of the module declaring function a, and the new process begins with the evaluation of the function application a(v_1, ..., v_n). The expression spawn(mod, a, [v_1, ..., v_n]) returns the (fresh) pid assigned to the new process.
Finally, an expression "receive p_1 when g_1 → e_1; ...; p_n when g_n → e_n end" should find the first message v in the process' queue (if any) such that v matches some pattern p_i (with substitution σ) and the instantiation of the corresponding guard g_iσ reduces to true. Then, the receive expression evaluates to e_iσ, with the side effect of deleting the message v from the process' queue. If there is no matching message in the current queue, the process suspends until a matching message arrives.
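The selective behaviour of receive described above can be sketched in Python. The sketch is illustrative only: the function receive, the clause representation and the example messages are assumptions made here, a list stands in for the process queue, and "suspending" is modelled by returning a flag instead of blocking.

```python
def receive(queue, clauses):
    """Scan `queue` from the oldest message; for each message try the
    clauses in order. Each clause is a pair (match, body), where
    match(msg) returns a dict of bindings (the substitution) or None
    when the pattern or the guard fails.

    Returns (value, True) after removing the consumed message, or
    (None, False) if nothing matches (a real process would suspend)."""
    for index, msg in enumerate(queue):
        for match, body in clauses:
            bindings = match(msg)
            if bindings is not None:
                del queue[index]          # the message is consumed
                return body(**bindings), True
    return None, False


def seats_positive(msg):
    # Pattern {seats, Num} with guard Num > 0
    if len(msg) == 2 and msg[0] == "seats" and msg[1] > 0:
        return {"num": msg[1]}
    return None


def seats_any(msg):
    # Pattern {seats, Num} without guard
    if len(msg) == 2 and msg[0] == "seats":
        return {"num": msg[1]}
    return None


clauses = [
    (seats_positive, lambda num: f"sell {num}"),
    (seats_any, lambda num: "done"),
]

queue = [("noise",), ("seats", 0), ("seats", 2)]
print(receive(queue, clauses))   # consumes ("seats", 0), skipping ("noise",)
print(queue)
```

Note that the unmatched message stays in the queue, so later receive expressions with other patterns can still consume it.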

Airline Code
We are now ready to describe the example. Consider a model of an airline booking system, where multiple agents sell tickets for the same flight. In order to keep the example concise, we consider only two agents selling tickets in parallel, with three seats initially available. The code of the example is shown in Listing 1.1, written in the concurrent imperative programming language described in Sect. 2.1.
The code contains two while loops operating in parallel (lines 10-16 and 18-24), where each loop models the operation of a single agent. Let us consider the first loop. For each iteration, the agent checks whether any seat remains (line 11). As long as the number of currently available seats is greater than zero, the agent is free to sell a ticket via the procedure named sell (called at line 12). Once the number of available tickets has reached zero, each agent will then close, terminating its loop.
As previously mentioned, this program can show a misbehaviour under certain execution paths. Recall the simplified setting of three initially available seats. Consider an execution that begins with each agent selling a single ticket (allocating one seat) via one full iteration of each while loop (the interleaving between the two iterations is not relevant). At this point, both agents remain open (since agent1 = 1 and agent2 = 1), and the current number of seats is 1. Now assume that the execution continues with the following interleaving. The condition of each while loop is checked, both of which will evaluate to true as each agent is open. Next, the execution of each loop body begins with the evaluation of the guard of each conditional statement. They will both evaluate to true, as there is at least one seat available. At this point, each agent is committed to selling one more ticket, even if only one seat is available. The rest of the execution can then be finished under any interleaving. The important thing to note here is that the final number of free seats is -1. This is an obvious misbehaviour, as the two agents allocated four tickets when only three seats were available. This misbehaviour occurs since the programmer assumed that the checking for an available seat and its allocation were atomic, but there is no mechanism enforcing this.

Listing 1.2 shows the same example coded in Erlang. A call to the initial function, main, spawns two processes (the agents) that start with the execution of function calls agent(1,Main) and agent(2,Main), respectively. Here, Main is a variable with the pid of the main process, which is obtained via a call to the predefined function self.
Then, at line 8, the main process calls function seats with argument 3 (the initial number of available seats). From this point on, the main process behaves as a server that executes a potentially infinite loop that waits for requests and replies to them. Here, the state of the process is given by the argument Num, which represents the current number of available seats. The server accepts two kinds of messages: {numOfSeats,Pid}, a request to know the current number of available seats, and {sell,Pid}, to decrease the number of available seats (analogously to the procedure sell in Listing 1.1). In the first case, the number of available seats is sent back to the agent that performed the request (Pid ! Num); in the second case, the number of the booked seat is sent.

The behaviour of the agents (lines 17-23) is simple. An agent first sends a request to know the number of available seats, Pid ! {numOfSeats,self()}, where self() is required for the main process to be able to send a reply back to the sender. Then, the agent suspends its execution waiting for an answer {seats,Num}: if Num is greater than zero, the agent sends a new message to sell a seat (Pid ! {sell,self()}) and receives the confirmation ({booked,_}); otherwise, it terminates the execution with the message "AgentN done!", where N is either 1 or 2.
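To make the problematic interleaving concrete, the following Python sketch replays a fixed schedule of agent steps against a single seat counter. It is a deliberately coarse model and not a translation of either listing: the function run, the step names "ask" and "act", and the assumption that the server answers a sell request with the current value of the counter before decrementing it are all choices made here for illustration.

```python
def run(schedule, seats=3):
    """Replay `schedule`, a list of (agent, step) pairs.

    "ask": the agent requests the number of seats and the server replies
           with its current counter (request and reply are merged).
    "act": the agent inspects the reply it holds; if positive it sells,
           and the server books the seat numbered by its current counter.
    """
    num = seats
    reply = {}       # agent -> last answer received and not yet used
    output = []
    for agent, step in schedule:
        if step == "ask":
            reply[agent] = num
        elif step == "act":
            if reply.pop(agent) > 0:
                output.append(f"Agent{agent} booked seat {num}")
                num -= 1
            else:
                output.append(f"Agent{agent} done!")
    return num, output


# Both agents ask while one seat is left, then both sell.
buggy = [(1, "ask"), (1, "act"), (2, "ask"), (2, "act"),
         (1, "ask"), (2, "ask"), (1, "act"), (2, "act"),
         (1, "ask"), (1, "act"), (2, "ask"), (2, "act")]

# Check and sale are never separated by another agent's step.
atomic = [(1, "ask"), (1, "act"), (2, "ask"), (2, "act"),
          (1, "ask"), (1, "act"), (2, "ask"), (2, "act"),
          (1, "ask"), (1, "act")]

for name, schedule in (("buggy", buggy), ("atomic", atomic)):
    left, lines = run(schedule)
    print(name, "seats left:", left)
    for line in lines:
        print("  ", line)
```

Under the first schedule four seats are booked and the counter ends at -1, while the second schedule, where the check and the sale of an agent are never interleaved with the other agent, books exactly three seats.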

Backtracking in a Concurrent Imperative Language
In this section we describe a state-saving approach to reversibility in the concurrent imperative programming language described in Sect. 2.1. We begin by discussing our approach and its use within the debugging of the airline example (see Sect. 2.3), along with our simulation tool [20,21]. As described in more detail in [21], we have produced a simulator implementing the operational semantics of our approach. This simulator is capable of parsing a program, automatically inserting removal statements, construct identifiers and paths, and simulating both forward and reverse execution. Each execution can be either end-to-end, or step-by-step.
We first execute the forward version of our airline example completely. This execution produces the annotated version in Fig. 2a, where the identifier stack for each statement has been populated, capturing an interleaving order that experiences the bug as outlined in Sect. 2.3. The inverted version of the airline example is shown in Fig. 2b, where the overall statement order has been inverted. Note that some annotations are omitted to keep this source code concise (e.g., no paths [...] seat. This produces the state where seats = 1, and where the next available step is to close either of the inverse conditional statements. Though the identifiers ensure we must start by closing the conditional with identifier 19, the fact that both can be closed implies that both are open at the same time. This current position within the inverse execution is shown in Fig. 3, where the command 'display loops' outputs all current while loops (agents) with arrows indicating the next statement to be executed.

It is clear from our semantics (see [20]) that the closing of an inverted conditional is the reverse of opening its forward version. Since the two conditionals have been opened using consecutive identifiers, one can see that each committed to selling a ticket. Given that the current state has seats = 1, this execution commits to selling two tickets when only one remains. It is therefore clear that this is an atomicity violation, since interleaving of other actions is allowed between checking for at least one free seat and allocating it. We have therefore shown how the simulator implementing our approach to reversibility can be used during the debugging process of an example bug.
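The debugging session above can be mimicked with a small Python sketch of backtracking over a recorded trace. It does not reproduce the simulator of [20,21]: the class Trace, the helper functions and the identifiers 1 to 4 are hypothetical, and only the last four steps of the faulty execution (two guard evaluations followed by two sales, starting from seats = 1) are modelled.

```python
class Trace:
    """Records executed steps with consecutive identifiers and undoes them
    strictly in reverse order (backtracking)."""

    def __init__(self, seats):
        self.state = {"seats": seats, "open": set()}
        self.log = []  # entries: (identifier, description, undo function)

    def step(self, description, do, undo):
        do(self.state)
        self.log.append((len(self.log) + 1, description, undo))

    def back(self):
        identifier, description, undo = self.log.pop()
        undo(self.state)
        return identifier, description


def open_conditional(agent):
    def do(state):
        assert state["seats"] > 0       # the guard holds at this point
        state["open"].add(agent)        # the agent is now committed
    def undo(state):
        state["open"].remove(agent)
    return do, undo


def sell():
    saved = []                          # reversal information for this step
    def do(state):
        saved.append(state["seats"])
        state["seats"] -= 1
    def undo(state):
        state["seats"] = saved.pop()
    return do, undo


t = Trace(seats=1)
t.step("agent1: guard seats > 0", *open_conditional(1))
t.step("agent2: guard seats > 0", *open_conditional(2))
t.step("agent1: seats = seats - 1", *sell())
t.step("agent2: seats = seats - 1", *sell())
print("final state:", t.state["seats"])

print("undo", t.back())
print("undo", t.back())
print("seats:", t.state["seats"], "open conditionals:", sorted(t.state["open"]))
```

After undoing the two sales, the state shows one remaining seat together with two open conditionals, which is the situation that reveals the atomicity violation.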

Causal-Consistent Reversibility in Erlang
In this section we will discuss how to apply causal-consistent reversible debugging to the airline booking example in Sect. 2.3. Our approach to reversible debugging is based on the following principles [29,30]:

- First, we consider a reduction semantics for the language (a subset of Core Erlang [5], which is an intermediate step in Erlang compilation). Our semantics includes two transition relations, one for expressions (which is mostly a call-by-value semantics for a functional language) and one for systems, i.e., collections of processes, possibly interacting through message passing. An advantage of this modular design is that only the transition relation for systems needs to be modified in order to produce a reversible semantics.
- Then, we instrument the standard semantics in different ways. On the one hand, we instrument it to produce a log of the computation; namely, by recording all actions involving the sending and receiving of messages, as well as the spawning of new processes (see [30] for more details). On the other hand, one can instrument the semantics so that the configurations now carry enough information to undo any execution step, i.e., a typical Landauer embedding. Producing then a backward semantics that proceeds in the opposite direction is not difficult. Here, the configurations may include both a log (to drive forward executions) and a history (to drive backward executions).
- It is worthwhile to note that forward computations need not follow exactly the same steps as in the recorded computation (indeed the log does not record the total order of steps). However, it is guaranteed that the admissible computations are causally equivalent to the recorded one; namely, they differ only for swaps of concurrent actions. Analogously, backward computations need not be the exact inverse of the considered forward computation, but ensuring that backward steps are causal-consistent suffices. This degree of freedom is essential to allow the user to focus on the process and/or actions of interest during debugging, rather than inspecting the complete execution (which is often impractical).
- Finally, we define another layer on top of the reversible semantics in order to drive it following a number of requests from the user, e.g., rolling back up to the point where a given process was spawned, going forward up to the point where a message is sent, etc. This layer essentially implements a stack of requests that follows the causal dependencies of the reversible semantics.
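The idea of undoing an action together with all and only its consequences can be sketched in a few lines of Python. This is not the semantics of [29,30] nor the implementation of CauDEr: the function rollback, the action names and the dependency table are hypothetical, each send is merged with the corresponding receive, and dependencies are given explicitly instead of being derived from histories.

```python
def rollback(history, deps, target):
    """Undo `target` with all and only its consequences.

    history: actions in the order in which they were executed
    deps:    maps an action to the set of actions it directly depends on
    Returns (remaining, undone); `undone` lists the actions in the order
    in which they are undone, consequences first."""
    to_undo = {target}
    # Causes always precede their effects in `history`, so one forward
    # pass collects every direct or indirect consequence of `target`.
    for action in history:
        if deps.get(action, set()) & to_undo:
            to_undo.add(action)
    undone = [a for a in reversed(history) if a in to_undo]
    remaining = [a for a in history if a not in to_undo]
    return remaining, undone


# Hypothetical fragment of the airline run with one seat left.
history = ["67:ask", "main:reply67", "68:ask", "main:reply68",
           "67:sell", "main:book67", "68:sell", "main:book68"]
deps = {
    "main:reply67": {"67:ask"},
    "main:reply68": {"68:ask", "main:reply67"},   # main is sequential
    "67:sell": {"main:reply67"},
    "main:book67": {"67:sell", "main:reply68"},
    "68:sell": {"main:reply68"},
    "main:book68": {"68:sell", "main:book67"},
}

remaining, undone = rollback(history, deps, "main:reply68")
print("undone:   ", undone)
print("remaining:", remaining)
```

In this fragment, rolling back the reply to agent 68 also undoes the booking performed for agent 67, because the main process handled it afterwards, whereas the sell request sent by agent 67 is independent and is kept.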
In the following, we consider the causal-consistent reversible debugger CauDEr [27,28] which follows the principles listed above. CauDEr first translates the airline example into Core Erlang [5]. Then one can execute the program, either using a built-in scheduler, or using the log of an actual execution [30].
Here, if we compile the program in the standard environment and execute the call main(), we get the following output:

    Seat sold!
    Seat sold!
    Seat sold!
    Seat sold!
    Agent1 done!
    Agent2 done!

which is clearly incorrect since we only had three seats available.
By using the logger and, then, loading both the program and the log into CauDEr (as described in [30]), we can replay the entire execution and explore the sequence of concurrent actions. Figure 4 shows the final state (on the left) and the sequence of concurrent actions (on the right), where process 63 is the main process, and processes 67 and 68 are the agents. Now, we can look at the sequence of concurrent actions, where messages are labelled with a unique identifier, added by CauDEr, which is shown in brackets to the right of the corresponding line. [Listing of concurrent actions omitted.] One can see that seat number 0 (which does not exist!) has been booked by process 68, and the notification has been provided via message number 16.
A good state to explore is the one where message number 16 has been sent. Here a main feature of causal-consistent reversible debugging comes in handy: the possibility of going to the state just before a relevant action has been performed, by undoing it, including all and only its consequences. This is called a causal-consistent rollback. CauDEr provides causal-consistent rollbacks for various actions, including send actions. Thus, the programmer can invoke a Roll send command with message identifier 16 as a parameter.
In this way, one discovers that the message has been sent by process 63 (as expected, since process 63 is the main process). By exploring its state one understands that, from the point of view of process 63, sending message 16 is correct, since it is the only possible answer to a sell message. The bug must therefore lie earlier in the execution.
From the program code, the programmer knows that whether seat Num is available or not is checked by a message of the form {numOfSeats,Pid}, which is answered with a message of the form {seats,Num}, where Num is the number of available seats.
Looking again at the concurrent actions, the programmer can see that process number 68 was indeed notified of the availability of a seat by message number 12.
We can again use Roll send, now with parameter 12, to check whether this send is correct or not. We discover that the send is indeed correct since, when the message is sent, there is one available seat. However, here another window comes in handy: the Roll log window, which shows which actions (causally dependent on the one undone) have been undone during a rollback. [Roll log listing omitted.] By checking it, the programmer sees that the interactions between process 67 and process 63 booking seat 1 are also undone. Hence the problem is that, in between the check for availability and the booking, another process may interact with main, stealing the seat; thus, the error is an atomicity violation.
Of course, given the simplicity of the system, one could have spotted the bug directly by looking at the code or at the full sequence of message exchanges, but the technique above is driven by the visible misbehaviour, hence it will scale better to larger systems (e.g., with more seats and agents, or with additional functionalities).
We remark that, while the presentation above concentrates on the debugger and its practical use, this line of research also deeply considered its theoretical underpinning, as briefly summarised at the beginning of the section. Thanks to this, relevant properties have been proved, e.g., that if a misbehaviour occurs in a computation then the same misbehaviour will occur also in each replay [30].

Related Work
Reversible computation in general, and reversible debugging in particular, have been deeply explored in the literature.
A line of research considers naturally reversible languages, that is, languages where only reversible programs can be written. Such approaches include the imperative languages Janus [49,50], R-CORE [17] and R-WHILE [16], and the object-oriented languages Joule [43] and ROOPL [18]. These approaches require dedicated languages, and cannot be applied to mainstream languages like Erlang or a classic imperative language, as we do in this paper.
The backtracking approach has been applied, e.g., in the Reverse C Compiler (RCC) defined by Perumalla et al. [6,37]. It supports the entire programming language C, but lacks a proof of correctness, which is instead provided by our approaches. The Backstroke framework [48] is a further example, supporting the vast majority of the programming language C++. This framework has been used to provide reverse execution in the field of Parallel Discrete Event Simulation (PDES) [13], as described in more recent works by Schordan et al. [40][41][42]. Similar approaches have been used for debugging, e.g., based on program instrumentation techniques [7]. Identifiers and keys are used to control execution in the work by Phillips and Ulidowski [38,39]. Another related work is omniscient debugging, where each assignment and method call is stored in an execution history, which can be used to restore any desired program state. An example of such a debugger written for Java was proposed by Lewis [32].
Causal-consistent reversibility has been mainly studied in the area of foundational process calculi such as CCS [10] and its variants [35,38], π-calculus [9], and higher-order π-calculus [26], and of coordination languages such as Klaim [15]. The application to debugging was first proposed in [14] in the context of the toy functional language μOz. A related approach is Actoverse [44], for Akka-based applications. It provides many relevant features complementary to ours, such as a partial-order graphical representation of message exchanges. On the other hand, Actoverse allows one to explore only some states of the computation, such as the ones corresponding to message sending and receiving. We also mention Causeway [45], which however is not a full-fledged debugger, but just a post-mortem trace analyser.

Conclusion
We presented two approaches to reversible debugging of concurrent systems, and we now briefly compare them. Beyond the language they consider, the main difference between the two approaches is in the order in which execution steps can be reversed. The backtracking approach undoes them in reverse order of execution. This means that there is no need to track dependencies, and the user of the debugger can easily anticipate which steps will be undone by looking at identifiers. The causal-consistent approach instead allows independent steps of an execution to be reversed in any order, hence tracking dependencies between steps is crucial. This offers the benefit that only the steps strictly needed to reach the desired point of an execution need to be reversed, and steps which happened in between but were actually independent are disregarded.
Debugging is a relevant application area for reversible computation, but reversible debugging for concurrent and distributed systems is still in its infancy. While different techniques have been put forward, they are not yet able to deal with real, complex systems. A first reason is that they do not tackle mainstream languages (Erlang could be considered mainstream, but only part of the language is currently covered). Once this first step is completed, runtime overhead and the size of the logs will become relevant problems, as they are now in the setting of sequential reversible debugging.