
1 Introduction

Classifying states in a transition system as live or dead is a recurring problem in formal verification. For example, given an expression, can it be simplified to the identity? Given an input to a nondeterministic program, can it reach a terminal state, or can it reach an infinitely looping state? Given a state in an automaton, can it reach an accepting state? State classification is relevant to satisfiability modulo theories (SMT) solvers [8, 9, 24, 51], where theory-specific partial decision procedures often work by exploring the state space to find a reachable path that corresponds to a satisfying string or, more generally, a sequence of constructors. To a first approximation, the core problem in all of these cases amounts to classifying each state u in a directed graph as live, meaning that a feasible, accepting, or satisfiable state is reachable from u; or dead, meaning that all states reachable from u are infeasible, rejecting, or unsatisfiable.

Motivating Applications. We originally encountered the problem of incremental state classification during prior work building Z3’s regex solver [61] for the SMT theory of string and regex constraints [4, 13, 15]. Our solver leveraged derivatives (in the sense of Brzozowski [18] and Antimirov [5]) to explore the states of the finite state machine corresponding to the regex incrementally (as the graph is built), to avoid the prohibitive cost of expanding all states up front. This turns out to require solving the live and dead state detection problem in a finite state machine presented as an incremental directed graph. Concretely, consider the regex \((\centerdot ^{\texttt {*}}\alpha \centerdot ^{100})^C \cap (\centerdot \alpha )\), where \(\centerdot \) matches any character, \(\cap \) is regex intersection, \(^C\) is regex complement, and \(\alpha \) matches any digit (0-9). A traditional solver would expand the left and right operands as state machines, but the left operand \((\centerdot ^{\texttt {*}}\alpha \centerdot ^{100})^C\) is astronomically large as a DFA, causing the solver to hang. The derivative-based technique instead constructs the derivative regex \((\centerdot ^{\texttt {*}}\alpha \centerdot ^{100})^C \cap (\centerdot ^{100})^C \cap \alpha \). At this stage we have a graph of two states and one edge, where the states are the two regexes just described and the edge is the derivative relation. After one more derivative operation, the regex is reduced to one that is clearly nonempty, as it accepts the empty string.

It is important that a derivative-based solver identify nonempty (live) and empty (dead) regexes incrementally because it does not generally construct the entire state space before terminating (see the graph update rule Upd, p. 626 [61]). Moreover, the nonemptiness problem for extended regexes is non-elementary [62] — and still PSPACE-complete for more restricted fragments — which strongly favors a lazy approach over brute-force search.

Regexes are just one possible application; the algorithms we present here apply broadly to any context where each state has bounded out-degree. For example, they could be applied in LTL model checking when lazily exploring the state space of a nondeterministic Büchi automaton (NBA), where the NBA is too expensive to construct up front. The important fact is that each state of the automaton has only finitely many outgoing edges, and once all of these have been added, we can check for dead states incrementally.

Prior Work. Traditionally, while live state detection can be done incrementally, dead state detection is often done exhaustively (i.e., after the entire state space is explored). For example, bounded and finite-state model checkers based on translations to automata [20, 43, 58], as well as classical dead-state elimination algorithms [12, 16, 37], typically work on a fixed state space after it has been fully enumerated. However, we reiterate that exhaustive exploration is prohibitive for large (e.g., exponential or infinite) state spaces which arise in an SMT verification context. We also have good evidence that incremental feedback can improve SMT solver performance: a representative success story is the e-graph data structure [23, 67], which maintains an equivalence relation among expressions incrementally; because it applies to general expressions, it is theory-independent and reusable. Incremental state space exploration could lead to similar benefits if applied to SMT procedures which still rely on exhaustive search.

However, in order to perform incremental dead state detection, we currently lack algorithms which match offline performance. As we discuss in Sect. 2, the best-known existing solutions would require maintaining strongly connected components (SCCs) incrementally. For SCC maintenance and the related simpler problem of cycle detection, amortized algorithms are known with \(O(m^{3/2})\) total time for m edge additions [10, 33], with some recently announced improvements [11, 14]. This is in sharp contrast to O(m) for the offline variants of these problems, which can be solved by breadth-first or depth-first search. More generally, research suggests there are computational barriers to solving unconstrained reachability problems in incremental and dynamic graphs [1, 29].

Fig. 1. GID consisting of the sequence of updates \({ \texttt {E}}(1, 2)\), \({ \texttt {E}}(1, 3)\), \({ \texttt {T}}(2)\). Terminal states are drawn as double circles. After the update \({ \texttt {T}}(2)\), states 1 and 2 are known to be live. State 3 is not dead in this GID, as a future update may cause it to be live.

Fig. 2. GID extending Fig. 1 with additional updates \({ \texttt {E}}(4, 3)\), \({ \texttt {E}}(4, 5)\), \({ \texttt {C}}(4)\), \({ \texttt {C}}(5)\). Closed states are drawn as solid circles. After the update \({ \texttt {C}}(5)\) (but not earlier), state 5 is dead. State 4 is not dead because it can still reach state 3.

This Paper. To improve on prior algorithms, our key observation is that in many applications (including our motivating applications above), edges are not added adversarially, but one state at a time as the states are explored. As a result, we know when a state will have no further outgoing edges. This enables us to (i) identify dead states incrementally, rather than only after the whole state space is explored; and (ii) obtain more efficient algorithms than currently exist for general graph reachability.

We introduce guided incremental digraphs (GIDs), a variation on incremental graphs. Like an incremental directed graph, a guided incremental digraph may be updated by adding new edges between states, or a state may be labeled as closed, meaning it will receive no further outgoing edges. Some states are designated as terminal, and we say that a state is live if it can reach a terminal state and dead if it will never reach a terminal state in any extension – i.e., if all states reachable from it are closed and none is live (see Figs. 1 and 2). To our knowledge, the problem of detecting dead states in such a system has not been studied by existing work in graph algorithms. Our problem can be solved via SCC maintenance, but not obviously the other way around (Sect. 2, Proposition 1). We provide two new algorithms for dead-state detection in GIDs.

First, we show that the dead-state detection problem for GIDs can be solved in time \(O(m \cdot \log m)\) for m edge additions, within a logarithmic factor of the O(m) cost for offline search. The worst-case performance of our algorithm thus strictly improves on the \(O(m^{3/2})\) upper bound for SCC maintenance in general incremental graphs. Our algorithm is technically sophisticated, and utilizes several data structures and existing results in online algorithms: in particular, Union-Find [63] and Henzinger and King’s Euler Tour Trees [35]. The main idea is that, rather than explicitly computing the set of SCCs, for closed states we maintain a single path to a non-closed (open) state. This reduces the problem to quickly determining whether two states are currently assigned a path to the same open state. Euler Tour Trees can solve undirected reachability in logarithmic time for graphs that are forests. The challenge then lies in reducing directed connectivity in the graph of paths to an undirected forest connectivity problem. At the same time, we must maintain this reduction under Union-Find state merges, in order to deal with cycles that are found in the graph along the way.

While as theorists we would like to believe that asymptotic complexity is enough, the truth is that the use of complex data structures (1) can be prohibitively expensive in practice due to constant-factor overheads, and (2) can make algorithms substantially more difficult to implement, leading practitioners to prefer simpler approaches. To address these needs, in addition to the logarithmic-time algorithm, we provide a second lazy algorithm which avoids the use of Euler Tour Trees and relies only on Union-Find. This algorithm is based on adding shortcut jump edges along long paths in the graph to quickly determine reachability. It aims to perform well in practice on typical graphs, and is evaluated in Sect. 4 alongside the logarithmic-time algorithm, though we do not prove a bound on its asymptotic complexity.

Finally, we implement and empirically evaluate both of our algorithms for GIDs against several baselines, in 5.5k lines of Rust [47]. Our evaluation focuses on the performance of the GID data structure itself, rather than its end-to-end performance in applications. To ensure an apples-to-apples comparison with existing approaches, we put particular focus on providing a directed graph data structure backend shared by all algorithms, so that the cost of graph search as well as state and edge merges is identical across algorithms. We implement two naïve baselines, as well as an implementation in our framework of BFGT [10], the state-of-the-art solution based on maintaining SCCs. To our knowledge, the latter is the first implementation of BFGT specifically for SCC maintenance. On a collection of generated benchmark GIDs, random GIDs, and GIDs pulled directly from the regex application, we demonstrate a substantial improvement over BFGT for both of our algorithms. For example, for larger GIDs (those with over 100K updates), we observe a 110-530x speedup over BFGT.

Contributions. Our primary contributions are:

  • Guided incremental digraphs (GIDs), a formalization of incremental live and dead state detection which supports labeling closed states. (Section 2)

  • Two algorithms for the state classification problem in GIDs: first, an algorithm that works in amortized \(O(\log m)\) time per update, improving upon the state-of-the-art amortized \(O(\sqrt{m})\) per update for incremental graphs; and second, a simpler algorithm based on lazy heuristics. (Section 3)

  • An open-source implementation of GIDs in Rust, and an evaluation which demonstrates up to two orders of magnitude speedup over BFGT. (Section 4)

Following the above, we expand on the application of GIDs to regex solving in SMT (Sect. 5) and survey related work (Sect. 6).

2 Guided Incremental Digraphs

2.1 Problem Statement

An incremental digraph is a sequence of edge updates \({ \texttt {E}}(u, v)\), where the algorithmic challenge is to produce some output after each edge is received (e.g., whether or not a cycle exists). If the graph also contains updates \({ \texttt {T}}(u)\) labeling a state as terminal, then we say that a state is live if it can reach a terminal state in the current graph. In a guided incremental digraph, we also include updates \({ \texttt {C}}(u)\) labeling a state as closed, meaning that it will not receive any further outgoing edges.

Definition 1

Define a guided incremental digraph (GID) to be a sequence of updates, where each update is one of the following:

  (i) a new directed edge \({ \texttt {E}}(u, v)\);

  (ii) a label \({ \texttt {T}}(u)\) which indicates that u is terminal; or

  (iii) a label \({ \texttt {C}}(u)\) which indicates that u is closed, i.e. no further edges will be added going out from u (nor labels applied to u).

The GID is valid if the closed labels are correct: there are no instances of \({ \texttt {E}}(u, v)\) or \({ \texttt {T}}(u)\) after an update \({ \texttt {C}}(u)\). The denotation of G is the directed graph (V, E) where V is the set of all states u which have occurred in any update in the sequence, and E is the set of all (u, v) such that \({ \texttt {E}}(u, v)\) occurs in G. An extension of a valid GID G is a valid GID \(G'\) such that G is a prefix of \(G'\). In a valid GID G, we say that a state u is live if there is a path from u to a terminal state in the denotation of G; and a state u is dead if it is not live in any extension of G. Notice that in a GID without any \({ \texttt {C}}(u)\) updates, no states are dead, as an edge may be added in an extension which makes them live.
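To make the update language concrete, it can be written down directly as a small data type. The following Rust sketch is ours (not code from the paper's implementation), with states represented as integer identifiers:

```rust
use std::collections::HashSet;

/// One GID update, following Definition 1.
#[derive(Debug, Clone, Copy)]
enum Update {
    Edge(u64, u64), // E(u, v): a new directed edge from u to v
    Terminal(u64),  // T(u): u is terminal
    Closed(u64),    // C(u): u receives no further outgoing edges or labels
}

/// Validity check: no E(u, _) or T(u) may occur after C(u).
fn is_valid(updates: &[Update]) -> bool {
    let mut closed = HashSet::new();
    for up in updates {
        match *up {
            Update::Edge(u, _) | Update::Terminal(u) if closed.contains(&u) => {
                return false;
            }
            Update::Closed(u) => {
                closed.insert(u);
            }
            _ => {}
        }
    }
    true
}
```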

We provide an example of a valid GID in Figs. 1 and 2, consisting of the following sequence of updates: E(1, 2), E(1, 3), T(2), E(4, 3), E(4, 5), C(4), C(5). Terminal states \({ \texttt {T}}(u)\) are drawn as double circles; closed states \({ \texttt {C}}(u)\), as solid circles; and states that are not closed, as dashed circles.

Definition 2

Given as input a valid GID, the GID state classification problem is to output, in an online fashion after each update, the set of new live and new dead states. That is, output \({ \texttt {Live}}(u)\) or \({ \texttt {Dead}}(u)\) on the smallest prefix of updates such that u is live or dead on that prefix, respectively.

2.2 Existing Approaches

In many applications, one might choose to classify dead states offline, after the entire state space is enumerated. This leads to a linear-time algorithm via either DFS or BFS, but it does not solve our problem (Definition 2) because it is not incremental. Naïve application of this idea leads to O(m) per update for m updates (\(O(m^2)\) total), as we may redo the entire search after each update.

For acyclic graphs, there exists an amortized O(1)-time per update algorithm for the problem (Definition 2): maintain the graph as a list of forward- and backward-edges at each state. When a state v is marked terminal, do a DFS along backward-edges to find all states u that can reach v and are not already marked live, and mark them live. When a state v is marked closed, visit all forward-edges from v; if all of their targets are dead, mark v as dead and recurse along all backward-edges from v. As each edge is visited only when marking a state live or dead, it is visited only a constant number of times overall (though we may use more than O(1) time on some particular update). Additionally, the live state detection part of this procedure still works for graphs containing cycles.
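The following self-contained Rust sketch (ours, not the paper's implementation) realizes this procedure. To keep the dead-check constant-time it counts not-yet-dead successors instead of re-scanning forward-edges, a small variation on the description above:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Status { Live, Dead, Unknown, Open } // Unknown = closed, but not live or dead

/// Amortized O(1)-per-update state classification for *acyclic* GIDs
/// over a fixed state set 0..n.
struct AcyclicGid {
    bck: Vec<Vec<usize>>, // backward-edges, consumed when propagating live/dead
    undead: Vec<usize>,   // per state: number of successors not yet known dead
    status: Vec<Status>,
}

impl AcyclicGid {
    fn new(n: usize) -> Self {
        AcyclicGid { bck: vec![Vec::new(); n], undead: vec![0; n], status: vec![Status::Open; n] }
    }

    fn on_edge(&mut self, u: usize, v: usize) {
        match self.status[v] {
            Status::Live => self.mark_live(u), // u reaches a live state
            Status::Dead => {}                 // dead targets never revive; drop the edge
            _ => {
                self.undead[u] += 1;
                self.bck[v].push(u);
            }
        }
    }

    fn on_terminal(&mut self, u: usize) {
        self.mark_live(u);
    }

    fn on_closed(&mut self, u: usize) {
        if self.status[u] == Status::Open {
            self.status[u] = Status::Unknown;
            self.check_dead(u);
        }
    }

    // DFS along backward-edges: every state that reaches a live state is live.
    // (This part remains correct in the presence of cycles.)
    fn mark_live(&mut self, u: usize) {
        if self.status[u] == Status::Live { return; }
        self.status[u] = Status::Live;
        for p in std::mem::take(&mut self.bck[u]) {
            self.mark_live(p);
        }
    }

    // In an acyclic graph, a closed non-live state whose successors are all dead is dead.
    fn check_dead(&mut self, u: usize) {
        if self.status[u] != Status::Unknown || self.undead[u] > 0 { return; }
        self.status[u] = Status::Dead;
        for p in std::mem::take(&mut self.bck[u]) {
            if self.status[p] != Status::Live {
                self.undead[p] -= 1;
                self.check_dead(p); // recursion depth bounded by the longest path
            }
        }
    }
}
```

Each edge is touched at most twice (once when stored, once when its target is classified), giving the amortized O(1) bound.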

The challenge, therefore, lies primarily in detecting dead states in graphs which may contain cycles. For this, the breakthrough approach from [10] maintains a condensed graph which is acyclic, where the vertices in the condensed graph represent strongly connected components (SCCs) of states. The mapping from states to SCCs is maintained using a Union-Find [63] data structure. Maintaining the condensed graph requires \(O(\sqrt{m})\) time per update. To avoid confusing closed and non-closed states, we also have to make sure that they are not merged into the same SCC; the easiest solution is to withhold all edges out of each state u until u is closed, which ensures that an open u is always in an SCC on its own. Once we have the condensed graph with these modifications, the same algorithm as in the previous paragraph identifies live and dead states. Since each edge is visited only when a state is marked closed or live, each edge is visited only once throughout the algorithm, so we use only amortized O(1) additional time to calculate live and dead states. While this SCC maintenance algorithm ignores the fact that no edges are added out of closed states \({ \texttt {C}}(u)\), it still proves the following result:

Proposition 1

GID state classification reduces to SCC maintenance. That is, suppose we have an algorithm over incremental graphs that maintains the set of SCCs in O(f(m, n)) total time given n states and m edge additions. Then there exists an algorithm to solve GID state classification in O(f(m, n)) total time.

Despite this reduction in one direction, there is no obvious reduction in the other – from cycle detection or SCC maintenance to Definition 2. This is because, while the existence of a cycle of non-live states implies bi-reachability between all states in the cycle, it does not necessarily imply that all of the bi-reachable states are dead.
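The edge-withholding wrapper used in the reduction above can be placed in front of any incremental SCC structure. Here is a minimal Rust sketch, where the SccGraph trait and all names are ours, standing in for an implementation such as BFGT [10]:

```rust
use std::collections::HashMap;

/// Stand-in for an incremental SCC maintenance structure, e.g. BFGT [10].
trait SccGraph {
    fn add_edge(&mut self, u: u64, v: u64);
}

/// Buffers the out-edges of each state and releases them only at C(u),
/// so that a state which is still open is always an SCC by itself.
struct Withhold<G: SccGraph> {
    inner: G,
    pending: HashMap<u64, Vec<u64>>, // out-edges of still-open states
}

impl<G: SccGraph> Withhold<G> {
    fn on_edge(&mut self, u: u64, v: u64) {
        self.pending.entry(u).or_default().push(v);
    }
    fn on_closed(&mut self, u: u64) {
        for v in self.pending.remove(&u).unwrap_or_default() {
            self.inner.add_edge(u, v);
        }
    }
}
```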

3 Algorithms

This section presents Algorithm 2, which solves the state classification problem in logarithmic time (Theorem 3); and Algorithm 3, an alternative lazy solution. Both algorithms are optimized versions of Algorithm 1, a first-cut algorithm which establishes the structure of our approach. We begin by establishing some basic terminology shared by all of the algorithms (see Fig. 3).

Fig. 3. Top: Basic classification of GID states into four disjoint categories. Bottom: Additional terminology used in this paper.

States in a GID can be usefully classified into exactly one of four statuses: live, dead, unknown, or open, where unknown means “closed but not yet live or dead”, and open means “not closed and not live”. Note that a live state may be classified as neither open nor closed; this terminology keeps the classification disjoint. Pragmatically, for live states it does not matter whether they are classified as open or closed, since edges from those states no longer have any effect. However, all dead and unknown states are closed, and no states are both open and closed.

Given this classification, the intuition is that for each unknown state u, we need only one path from u to an open state to prove that it is not dead; we want to maintain one such path for all unknown states. To maintain all of these paths simultaneously, we maintain an acyclic directed forest structure on unknown and open states, where the roots are open states and every non-root state has a single edge to another state, called its successor. Edges other than successor edges can be temporarily ignored, except when marking live states; these are kept as reserve edges. Specifically, we add every edge (u, v) as a backward-edge from v (to allow propagating live states), but for edges not in the forest we keep (u, v) in a reserve list from u. We store all edges, including backward-edges, with their original endpoints (u, v). A reserve edge becomes relevant only when either (i) u is marked as closed, or (ii) u’s successor is marked as dead.

In order to deal with cycles, we need to maintain the forest of unknown states not on the original graph, but on a union-find condensed graph, similar to [63]. When we find a cycle of unknown states, we merge all states in the cycle by calling the union method of the union-find. We refer to a state as canonical if it is the canonical representative of its equivalence class in the union-find; the condensed graph is a forest on canonical states. We use x, y, z to denote canonical states (states in the condensed graph), and u, v, w to denote original states (not known to be canonical). Following [63], we maintain edges as linked lists rather than sets, stored under their original states instead of canonical states; this is important as it allows combining edge lists in O(1) time when merging states.

Algorithm 1.

3.1 First-Cut Algorithm

Algorithm 1 is a first cut based on these ideas. The procedures OnEdge and OnTerminal contain all the logic to identify live states, using \({ \texttt {bck}}\) to look up backward-edges; OnTerminal doubles as a “mark live” function when it is called by OnEdge. The procedure OnClosed tries to assign a successor edge to a newly closed state, to prove that it is not dead. In case we run out of reserve edges, the state is marked dead and we recursively call OnClosed along backward-edges, which will either set a new successor for each predecessor or mark it dead.

The union-find data structure \({ \texttt {UF}}\) provides \({ \texttt {UF.union}}(v_1, v_2)\), \({ \texttt {UF.find}}(v)\), and \({ \texttt {UF.iter}}(v)\): \({ \texttt {UF.union}}\) merges \(v_1\) and \(v_2\) to refer to the same canonical state, \({ \texttt {UF.find}}\) returns the canonical state for v, and \({ \texttt {UF.iter}}\) iterates over the states equivalent to v. These operations take amortized \(\alpha (n)\) time for n updates, where \(\alpha (n) \in o(\log n)\) is the inverse Ackermann function. We only merge states if they are bi-reachable from each other and both unknown; this implies that all states equivalent to a state x have the same status. Each edge (u, v) is always stored in the maps \({ \texttt {res}}\) and \({ \texttt {bck}}\) using its original states (i.e., edge labels are not updated when states are merged); but we can quickly obtain the corresponding edge on canonical states via \(({ \texttt {UF.find}}(u), { \texttt {UF.find}}(v))\). Once a state is marked Live or Dead, its edge maps are no longer used.
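For reference, a compact Union-Find with exactly these three operations can be sketched as follows (a standard design under our naming, not the paper's code). The circular next list is the classic trick that supports iter while keeping union constant-time beyond the find calls:

```rust
struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<u8>,
    next: Vec<usize>, // circular list threading each equivalence class
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect(), rank: vec![0; n], next: (0..n).collect() }
    }

    /// Canonical representative of v's class (with path compression).
    fn find(&mut self, v: usize) -> usize {
        if self.parent[v] != v {
            let root = self.find(self.parent[v]);
            self.parent[v] = root;
        }
        self.parent[v]
    }

    /// Merge the classes of a and b (union by rank; splice the circular lists).
    fn union(&mut self, a: usize, b: usize) {
        let (mut x, mut y) = (self.find(a), self.find(b));
        if x == y { return; }
        if self.rank[x] < self.rank[y] { std::mem::swap(&mut x, &mut y); }
        if self.rank[x] == self.rank[y] { self.rank[x] += 1; }
        self.parent[y] = x;
        self.next.swap(a, b); // O(1) merge of the two circular iteration lists
    }

    /// Iterate over all states equivalent to v.
    fn iter(&self, v: usize) -> impl Iterator<Item = usize> + '_ {
        std::iter::successors(Some(v), move |&w| {
            let nxt = self.next[w];
            (nxt != v).then_some(nxt)
        })
    }
}
```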

Invariants. Altogether, we maintain the following invariants. Successor edges and no cycles describe the forest structure, and edge representation ensures that every edge in the input GID is represented somewhere in the current graph.

  • Merge equivalence: For all states u and v, if \({ \texttt {UF.find}}(u) = { \texttt {UF.find}}(v)\), then u and v are bi-reachable and both closed. (This implies that u and v are both live, both dead, or both unknown.)

  • Status correctness: For all u, \({ \texttt {status}}({ \texttt {UF.find}}(u))\) equals the status of u.

  • Successor edges: If x is unknown, then \({ \texttt {succ}}(x)\) is defined and is an unknown or open state. If x is open, then \({ \texttt {succ}}(x)\) is not defined.

  • No cycles: There are no cycles among the set of edges \((x, { \texttt {UF.find}}({ \texttt {succ}}(x)))\), over all unknown and open canonical states x.

  • Edge representation: For all edges (u, v) in the input GID, at least one of the following holds: (i) \((u, v) \in { \texttt {res}}({ \texttt {UF.find}}(u))\); (ii) \(v = { \texttt {succ}}({ \texttt {UF.find}}(u))\); (iii) \({ \texttt {UF.find}}(u) = { \texttt {UF.find}}(v)\); (iv) u is live; or (v) v is dead.

Theorem 1

Algorithm 1 is correct.

Proof (Summary)

The full proof can be found in the arXiv version [60]. The status correctness invariant implies correct output at each step, so it suffices to argue that all of the invariants above are preserved. Upon receiving \({ \texttt {E}}(u, v)\) or \({ \texttt {T}}(u)\), some unknown or open states may become live, but this does not change the status of any other states. The main challenge of the proof is the recursive procedure OnClosed \({ \texttt {C}}(u)\). On recursive calls, some states are temporarily marked \({ \texttt {Open}}\), meaning they are roots in the forest structure. During recursive calls, we need a slightly generalized invariant: each forest root corresponds to a pending call to OnClosed \({ \texttt {C}}(u)\) (i.e., an element of ToRecurse for some call on the stack) and is a state that is dead iff the targets of all of its reserve edges are dead. After we prove this generalized invariant, when OnClosed \({ \texttt {C}}(u)\) terminates, we know that there are no more temporary open states, and the forest structure implies that all remaining closed states are correctly marked as unknown.

Complexity. The core inefficiency in Algorithm 1 — what we need to improve — lies in CheckCycle. The procedure repeatedly sets \(z \leftarrow { \texttt {succ}}(z)\) to find the tree root, which in general could be linear time in the number of edges. For example, this inefficiency results in \(O(m^2)\) work for a linear graph read in backwards order: E(2, 1), C(2), E(3, 2), C(3), ..., E(n, n-1), C(n).

All other procedures use amortized \(\alpha (m)\) time per update for m updates, using array lists to represent the maps fwd, bck, and succ for O(1) lookups. For the amortized analysis, the cost of each call to OnClosed can be charged either to the target of an edge being marked dead, or to an edge being merged as part of a cycle, and each of these events can happen only once per edge added to the GID. The OnTerminal calls and loop iterations run only once per edge in the graph, when the target of that edge is marked live or terminal.

Algorithm 2.

3.2 Logarithmic Algorithm

At its core, CheckCycle requires solving an undirected reachability problem on a graph that is restricted to a forest. However, the forest changes not just by edge additions, but by both edge additions and deletions. While reachability is difficult to maintain incrementally in both undirected and directed general graphs, reachability in dynamic forests can be solved in \(O(\log m)\) time per operation. This is the main intuition for our solution, shown in Algorithm 2, which uses the Euler Tour Trees data structure of Henzinger and King [35], denoted EF.
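Concretely, all we need from EF is undirected connectivity on a dynamically changing forest. Written as a Rust trait (the trait and its names are ours, not a fixed library API), the interface looks as follows; with it, CheckCycle for a candidate edge (y, z) reduces to a single connected query:

```rust
/// Undirected dynamic-forest connectivity, as provided by Euler Tour Trees [35].
/// Each operation runs in O(log n) time.
trait DynamicForest {
    /// Add the undirected edge {u, v}. Callers must ensure u and v are in
    /// different trees, so that the structure remains a forest.
    fn link(&mut self, u: usize, v: usize);
    /// Remove the (existing) undirected edge {u, v}.
    fn cut(&mut self, u: usize, v: usize);
    /// Are u and v currently in the same tree?
    fn connected(&mut self, u: usize, v: usize) -> bool;
}
```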

Unfortunately, this idea does not work straightforwardly – once again because of the presence of cycles in the original graph. We cannot simply store the forest as a condensed graph with edges on condensed states. As we saw in Algorithm 1, it was important to store successor edges as edges into the set of original states V, rather than into the set of canonical states X – this is the only way that we can merge states in O(1), without actually inspecting the edge lists. If we needed to update the forest edges to be in X, this could require O(m) work to merge two O(m)-sized edge lists, as each edge might need to be relabeled in the EF graph.

To solve this challenge, we instead store the EF data structure on the original states, rather than the condensed graph; but we ensure that each canonical state is represented by a tree of original states. When adding an edge between canonical states, we remember its original label (u, v), so that we can later remove it using the original label (this happens when its target becomes dead). When an edge would create a cycle, we simply omit it from the EF graph, because a line of connected trees already forms a tree.

Summary and Invariants. In summary, the algorithm reuses the data, procedures, and invariants from Algorithm 1, with the following important changes: (1) We maintain the EF data structure EF, a forest on V. (2) Successor edges are stored with their original edge labels (u, v), rather than just as a target state. (3) The procedure OnClosed is rewritten to maintain the graph EF. (4) The successor edges and no cycles invariants use the new succ representation: that is, they are constraints on the edges \((x, { \texttt {UF.find}}(v))\), where \({ \texttt {succ}}(x) = (u, v)\). (5) We add the following two constraints on edges in EF, depending on whether their endpoints are equivalent in the union-find structure.

  • EF inter-edges: For all inequivalent u and v, the edge (u, v) is in EF if and only if \((u, v) = { \texttt {succ}}({ \texttt {UF.find}}(u))\) or \((v, u) = { \texttt {succ}}({ \texttt {UF.find}}(v))\).

  • EF intra-edges: For all unknown canonical states x, the set of edges (u, v) in EF between states belonging to x forms a tree.

Theorem 2

Algorithm 2 is correct.

Proof

Observe that the EF inter-edges constraint implies that EF only contains edges between unknown and open states; all other states are isolated in EF. In the modified OnTerminal procedure, when marking states as live we remove inter-edges, so we preserve this invariant.

Next we argue that, given the invariants about EF, for an open state y the CheckCycle procedure returns true if and only if the edge (y, z) would create a directed cycle. If there is a cycle of canonical states, then because canonical states are connected trees in EF, the cycle can be lifted to a cycle on original states, so y and z must already be connected in EF without the edge (y, z). Conversely, if y and z are connected in EF, then there is a path from y to z, and this can be projected to a path on canonical states. Because y is open, it is a root in the successor forest, so any path starting at y travels only backwards along successor edges; hence y is reachable from z along successor edges, and thus (y, z) creates a directed cycle.

This leaves the OnClosed procedure. Other than the EF lines, the structure is the same as in Algorithm 1, so the previous invariants are still preserved, and it remains to check the EF invariants. When we delete the successor edge and temporarily mark \({ \texttt {status}}(x) = { \texttt {Open}}\) for recursive calls, we also remove it from EF, preserving the inter-edge invariant. Similarly, when we add a successor edge to x, we add it to EF, preserving the inter-edge invariant. So it remains to consider when the set of canonical states changes, which is when merging states in a cycle. Here, a line of canonical states is merged into a single state, and a line of connected trees is still a tree, so the intra-edge invariant still holds for the new canonical state, and we are done.    \(\square \)

Theorem 3

Algorithm 2 uses amortized logarithmic time per edge update.

Proof

By the analysis of Algorithm 1, each line of the algorithm is executed O(m) times and there are O(m) calls to CheckCycle. Each line of code is either constant-time, \(\alpha (m) = o(\log m)\) time for the UF calls, or \(O(\log m)\) time for the EF calls, so in total the algorithm takes \(O(m \log m)\) time total, or amortized \(O(\log m)\) time per edge.    \(\square \)

Algorithm 3.

3.3 Lazy Algorithm

While the asymptotic complexity of \(\log m\) could be the end of the story, in practice we found the cost of the EF calls to be a significant overhead. The technical details of Euler Tour Trees include building an AVL-tree cycle for each tree, where the cycle contains each state of the graph once and each edge in the graph twice. While this is elegant, it turns out that adding one edge to EF results in no fewer than seven modifications to the AVL tree: a split at the source, then a split at the target, then an edge addition in both directions (u, v) and (v, u) to the cycle, and finally the four resulting trees need to be glued together (using three merge operations). Each of these operations comes with a rebalancing step which could do \(\Omega (\log m)\) tree rotations and pointer dereferences to visit the nodes in the AVL tree. Some optimizations may be possible – including, e.g., combining rebalancing operations or considering variants of AVL trees with better cache locality. Nonetheless, these constant-factor overheads constitute a serious practical drawback for Algorithm 2.

To address this, in this section we investigate a simpler, lazy algorithm which avoids EF and directly optimizes Algorithm 1. One idea in the right direction is to store, for each state, a direct pointer to the root obtained by repeatedly calling \({ \texttt {succ}}\). But there are two issues with this. First, maintaining these pointers may be difficult: when the root changes, we would potentially need to update a linear number of root pointers. Second, the root may be marked dead, in which case we would have to re-compute all pointers to that root.

Instead, we introduce a jump list for each state: intuitively, it contains the states reached by calling successor once, twice, four times, eight times, and so on at powers of two; and it is updated lazily, at most once per visit to the state. When a jump becomes obsolete (its target dead), we just pop off the largest jump, so we do not lose all of our work in building the list. We maintain the following additional information: for each unknown canonical state x, a list of jumps \([v_0, v_1, v_2, \ldots , v_k]\), such that \(v_0 = { \texttt {succ}}(x)\), \(v_1\) is reachable from \(v_0\), \(v_2\) is reachable from \(v_1\), and so on.

The resulting algorithm is shown in Algorithm 3. The key procedure is GetRoot(z), which is called when adding a reserve edge (y, z) to the graph. In addition to all invariants from Algorithm 1, we maintain the following invariants for every unknown canonical state x, where \({ \texttt {jumps}}(x)\) is a list of states \(v_0, v_1, v_2, \ldots , v_k\). First jump: if the jump list is nonempty, then \(v_0 = { \texttt {succ}}(x)\). Reachability: \(v_{i+1}\) is reachable from \(v_i\) for all i. The jump list also satisfies the following powers of two invariant: on the path of canonical states from \(v_0\) to \(v_i\), the total number of states (including all states in each equivalence class) is at least \(2^i\). While this invariant is not necessary for correctness, it is the key to the algorithm’s practical efficiency: it follows that if the jump list is fully saturated for every state, a GetRoot(z) query takes only logarithmic time. However, since jump lists are updated lazily, a jump list may not be saturated, so this does not establish a true asymptotic bound for the algorithm.
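To make the walk concrete, here is a simplified Rust sketch of GetRoot (field names are ours, and the lazy step that appends new jumps to keep lists saturated is omitted): pop stale jumps whose targets have died, then take the deepest surviving jump, falling back to the plain successor edge:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Status { Live, Dead, Unknown, Open }

struct Lazy {
    status: Vec<Status>,       // per canonical state
    succ: Vec<Option<usize>>,  // successor edge target (an original state)
    jumps: Vec<Vec<usize>>,    // lazy jump lists, per canonical state
    uf: Vec<usize>,            // stand-in for UF.find: a state-to-canonical map
}

impl Lazy {
    fn find(&self, v: usize) -> usize { self.uf[v] }

    // Walk from v toward the root of its tree in the successor forest.
    fn get_root(&mut self, v: usize) -> usize {
        let mut x = self.find(v);
        while self.status[x] == Status::Unknown {
            // Drop jumps whose targets have since been marked dead;
            // only the largest jumps are lost.
            while let Some(&w) = self.jumps[x].last() {
                if self.status[self.find(w)] == Status::Dead {
                    self.jumps[x].pop();
                } else {
                    break;
                }
            }
            // Take the deepest surviving jump (2^k succ-steps at once),
            // or fall back to the plain successor edge.
            x = match self.jumps[x].last() {
                Some(&w) => self.find(w),
                None => self.find(self.succ[x].expect("unknown states have successors")),
            };
            // (The full algorithm also appends new jumps here, maintaining
            // the powers-of-two invariant described above.)
        }
        x // an open (or live) state: the end of the successor path
    }
}
```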

Theorem 4

Algorithm 3 is correct.

Proof

The first jump and reachability invariants imply that \(v_0, v_1, v_2, \ldots \) is a sublist of the states along the path from an unknown state to its root, potentially followed by some dead states. We need to argue that the subprocedure GetRoot (i) returns the same verdict as repeatedly calling \({ \texttt {succ}}\) to find a cycle in the first-cut algorithm and (ii) preserves both invariants. For first jump, if the jump list is empty, then GetRoot sets the first jump to the successor state. For reachability, popping dead states from the jump list clearly preserves the invariant, as does appending a state along the path to the root, which is done when \(k' \ge k\). Merging states preserves both invariants trivially because we throw the jump list away, and marking states live preserves both invariants trivially since the jump list is only maintained and used for unknown states.    \(\square \)

4 Experimental Evaluation

The primary goal of our evaluation has been to experimentally validate the performance of GIDs as a data structure in isolation, rather than their use in a particular application. Our evaluation seeks to answer the following questions:

  • Q1 How does our approach (Algorithms 2 and 3) compare to the state-of-the-art approach based on maintaining SCCs?

  • Q2 How does the performance of the studied algorithms vary when the class of input graphs changes (e.g., sparse vs. dense, structured vs. random)?

  • Q3 Finally, how do the studied algorithms perform on GIDs taken from the example application to regexes described in Sect. 5?

To answer Q1, we put substantial implementation effort into a common framework on which a fair comparison could be made between different approaches. To this end, we implemented GIDs as a data structure in Rust which includes a graph data structure on top of which all algorithms are built. In particular, this equalizes performance across algorithms for the following baseline operations: state and edge addition and retrieval, DFS and BFS search, edge iteration, and state merging. We chose Rust for our implementation for its performance, and because there does not appear to be an existing publicly available implementation of BFGT in any other language. The number of lines of code used to implement these various structures is summarized in Fig. 4. We implement Algorithms 2 and 3 and compare them with the following baselines:

  • BFGT The state-of-the-art approach based on SCC maintenance, using worst-case amortized \(O(\sqrt{m})\) time per update [10].

  • Simple A simpler version of BFGT that uses a forward-DFS to search for cycles. Like Algorithm 1, it can take \(\varTheta (m^2)\) total time in the worst case.

  • Naïve A baseline upper bound for all approaches, which re-computes the entire set of dead states using a linear-time DFS after each update.

Fig. 4. Left: Lines of code for each algorithm and other implementation components. Right: Benchmark GIDs used in our evaluation. Where present, the source column indicates the quantity prior to filtering out trivially small graphs.

Fig. 5. Evaluation results. Left: Cumulative plot showing the number of benchmarks solved in time t or less for basic GID classes (top), randomly generated GIDs (middle), and regex-derived GIDs (bottom). Top right: Scatter plot showing the size of each benchmark vs time to solve. Bottom right: Average time to solve benchmarks of size closest to s, where values of s are chosen in increments of 1/3 on a log scale.

To answer Q2, we first compiled a range of basic graph classes designed to expose edge-case behavior in the algorithms, as well as randomly generated graphs. We focus on graphs with no live states, as live states are treated similarly by all algorithms. Most of the generated graphs come in \(2 \times 2 = 4\) variants: (i) the states are read in either forwards or backwards order; and (ii) they are either dead graphs, where there are no open states at the end and so everything gets marked dead; or unknown graphs, where there is a single open state at the end, so most states are unknown. In the unknown case, one open state at the end suffices, as the case of many open states can be reduced to that of a single open state to which all of their edges point. We include GIDs from line graphs and cycle graphs (up to 100K states in multiples of 3); complete and complete acyclic graphs (up to 1K states); and bipartite graphs (up to 1K states). These are important cases, for example, because the reverse-order line and cycle graphs are a potential worst case for Simple and BFGT.
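As one concrete instance, the backwards line graph and its two variants can be generated as follows (a sketch using the Update type from Sect. 2; the exact sequences in our benchmark suite may differ in minor details):

```rust
#[derive(Debug, Clone, Copy)]
enum Update { Edge(u64, u64), Terminal(u64), Closed(u64) } // as in Sect. 2

/// Line graph n -> n-1 -> ... -> 1, read in backwards order:
/// E(2,1), C(2), E(3,2), C(3), ..., E(n,n-1), C(n).
/// In the dead variant, state 1 is closed at the end, so all states become
/// dead at once; otherwise state 1 stays open and every other state is unknown.
fn backward_line(n: u64, dead: bool) -> Vec<Update> {
    let mut updates = Vec::new();
    for i in 2..=n {
        updates.push(Update::Edge(i, i - 1));
        updates.push(Update::Closed(i));
    }
    if dead {
        updates.push(Update::Closed(1));
    }
    updates
}
```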

Second, to exhibit more dynamic behavior, we generated random graphs: sparse graphs with a fixed out-degree from each state, chosen from 1, 2, 3,  or 10 (up to 100K states); and dense graphs with a fixed probability of each edge, chosen from .01, .02,  or .03 (up to 10K states). Each case uses 10 different random seeds. As with the basic graphs, states are read in some order and marked closed.

To answer Q3, we wrote a backend to extract a GID at runtime from Z3’s regex solver [61]. While the backend of the solver is precisely a GID — and so could be passed to our GID implementation dynamically — this setup includes many extraneous overheads, including rewriting expressions and computing derivatives when adding nodes to the graph. While some of these overheads could likely be eliminated, and we are fairly confident that the GID data structure would be a bottleneck for sufficiently large input examples, the overheads make it difficult to isolate the performance impact of the GID data structure itself, which is the sole focus of this paper. We therefore instrumented the Z3 solver code to export the (incremental) sequence of graph updates that would be performed during a run of Z3 on existing regex benchmarks. For each benchmark, this instrumented code produces a faithful representation of the sequence of graph updates that actually occur in a run of the SMT solver on that benchmark. For each regex benchmark, we thus get a GID benchmark for the present paper. The benchmarks focus on extended regexes rather than plain classical regexes, as these are the ones for which dead state detection is relevant (see Sect. 5). We include GIDs for the RegExLib benchmarks [15] and the handcrafted Boolean benchmarks reported in [61]. We add to these 11 additional examples designed to be difficult GID cases. The collection of regex benchmarks we used (just described) is available on GitHub.

From both the Q2 and Q3 benchmarks, we filter out any benchmark which takes under 10 milliseconds for all of the algorithms to solve (including Naïve), and we use a 60 second timeout. The evaluation was run on a 2020 MacBook Air (MacOS Monterey) with an Apple M1 processor and 8GB of memory.

Correctness. To ensure that all of our implementations are correct, we invested time into unit testing and checked output correctness on all of our collected benchmarks, including several cases which exposed bugs in previous versions of one or more algorithms. In total, all algorithms are vetted against 25 unit tests from handwritten edge cases that exposed prior bugs, 373 unit tests from benchmarks, and 30 module-level unit tests for specific functions.

Results. Figure 5 shows the results. Algorithm 3 shows significant improvements over the state-of-the-art, solving more benchmarks in a smaller amount of time across basic GIDs, random GIDs, and regex GIDs. Algorithm 2 also shows state-of-the-art performance, similar to BFGT on basic and regex GIDs and significantly better on random GIDs. On the bottom right, since looking at average time is not meaningful for benchmarks of widely varying size, we stratify the size of benchmarks into buckets, and plot time-to-solve as a function of size. Both x-axis and y-axis are on a log scale. The plot shows that Algorithm 3 exhibits up to two orders of magnitude speedup over BFGT for larger GIDs – we see speedups of 110x to 530x for GIDs in the top five size buckets (GIDs of size nearest to 100K, \({\sim }200\)K, \({\sim }500\)K, 1M, and \({\sim }2\)M).

New Implementations of Existing Work. Our implementation contributes, to our knowledge, the first implementation of BFGT specifically for SCC maintenance. In addition, it is one of the first implementations of Euler Tour Trees (see [7] for another), including the AVL tree backing for tours, and likely the first implementation in Rust.

5 Application to Extended Regular Expressions

In this section, we explain precisely how the GID state classification problem arises in the context of derivative-based solvers [45, 61]. We first define extended regexes [31] (regexes extended with intersection \( \,\texttt { \& }\,\) and complement \(\texttt {~{}}\)) modulo a symbolic alphabet \(\mathcal {A}\) of predicates that represent sets of characters. We explain the main idea behind symbolic derivatives, as found in [61]; these generalize Brzozowski [18] and Antimirov derivatives [5] (see also [19, 42] for other proposals). Symbolic derivatives provide the foundation for incrementally creating a GID. Then we show, through an example, how a solver can incrementally expand derivatives to reduce the satisfiability problem to the GID state classification problem (Definition 2).

Define a regex by the following grammar, where \(\varphi \in \mathcal {A}\) denotes a predicate:

$$ \textit{RE}\,{::=}\, \; \varphi \;\mid \; \varepsilon \;\mid \; \textit{RE}_1\cdot \textit{RE}_2 \;\mid \; \textit{RE}^{\texttt {*}}\;\mid \; \textit{RE}_1\,\texttt {|}\,\textit{RE}_2\;\mid \; \textit{RE}_1\,\texttt { \& }\,\textit{RE}_2\;\mid \; \texttt {~{}}\textit{RE}$$

Let \(R^k\) represent the concatenation of R k times. The symbolic derivative of a regex R, denoted \(\delta _{}(R)\), is a regex which describes the set of suffixes of strings in R after the first character is removed. The formal definition can be found in [61] and in the arXiv version of the present paper [60].

To apply Definition 1 to regexes: states are regexes; edges are transitions from a regex to its derivatives; and terminal states are the so-called nullable regexes, where a regex is nullable if it matches the empty string. Nullability can be computed inductively over the structure of regexes: for example, \(\varepsilon \) and \(R^{\texttt {*}}\) are nullable, and \( R_1\,\texttt { \& }\,R_2\) is nullable iff both \(R_1\) and \(R_2\) are nullable. A live state here is thus a regex that reaches a nullable regex via 0 or more edges; this implies that there exists a concrete string matching it. Conversely, dead states are always empty, i.e. they match no strings, but they can reach other dead states, creating strongly connected components of closed states none of which is live. For example, the false predicate \(\bot \) of \(\mathcal {A}\) serves as the regex that matches nothing and is trivially a dead state, while \(\texttt {~{}}\bot \) is equivalent to \(\centerdot ^{\texttt {*}}\), where \(\centerdot \) is the true predicate, and is trivially a live state.
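Concretely, nullability is a single recursion over the syntax. The following Rust sketch (ours; the predicate payload of \(\varphi \) is elided, since it never affects nullability) covers the full grammar above:

```rust
/// Extended regexes over a symbolic alphabet; the predicate payload is elided.
enum Re {
    Pred,                     // φ: matches exactly one character satisfying φ
    Eps,                      // ε
    Concat(Box<Re>, Box<Re>), // RE1 · RE2
    Star(Box<Re>),            // RE*
    Or(Box<Re>, Box<Re>),     // RE1 | RE2
    And(Box<Re>, Box<Re>),    // RE1 & RE2
    Not(Box<Re>),             // ~RE
}

/// Does r match the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Pred => false, // a predicate matches only length-1 strings
        Re::Eps | Re::Star(_) => true,
        Re::Concat(a, b) | Re::And(a, b) => nullable(a) && nullable(b),
        Re::Or(a, b) => nullable(a) || nullable(b),
        Re::Not(a) => !nullable(a),
    }
}
```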

5.1 Reduction from Incremental Regex Emptiness to GIDs

For simplicity, suppose we want to determine the satisfiability of a single regex constraint \(s \in R\), where s is a string variable and R is a concrete regex. (This is not overly restrictive – any number of simultaneous regex constraints on a string s can be combined into a single regex constraint by using the Boolean operations of regexes.) For example, let \(L = \texttt {~{}}(\centerdot ^{\texttt {*}}\alpha \centerdot ^{100})\) and \( R= L \,\texttt { \& }\,(\centerdot \alpha )\), where \(\alpha \) is the “is digit” predicate that is true of characters that are digits (often denoted \(\backslash \texttt {d}\)). The solver manipulates regex membership constraints on strings by unfolding them [61]. The constraint \(s \in R\), which essentially tests nonemptiness of R with s as a witness, becomes

$$ (s = \epsilon \wedge \textit{Nullable}(R)) \vee (s \ne \epsilon \wedge s_{1..}\in \delta _{s_{0}}(R)) $$

where \(s\ne \epsilon \) holds since R is not nullable, \(s_{i..}\) is the suffix of s from index i, and

$$ \delta _{s_{0}}(R) \;=\; \textsf{if}\ s_{0}\in \alpha \ \textsf{then}\ L\,\texttt { \& }\,\texttt {~{}}(\centerdot ^{100})\,\texttt { \& }\,\alpha \ \textsf{else}\ L\,\texttt { \& }\,\alpha $$

Let \( R_1=L\,\texttt { \& }\,\texttt {~{}}(\centerdot ^{100})\,\texttt { \& }\,\alpha \) and \( R_2=L\,\texttt { \& }\,\alpha \). So R has two outgoing transitions \({R}{\xrightarrow {\alpha }}{R_1}\) and \({R}{\xrightarrow {\lnot {\alpha }}}{R_2}\) that contribute the edges \((R,R_1)\) and \((R,R_2)\) into the GID. Note that these edges depend only on R and not on \(s_{0}\).

We continue the search incrementally by checking the two branches of the if-then-else constraint, where \(R_1\) and \(R_2\) are again not nullable (so \(s_{1..} \ne \epsilon \)):

$$ \delta _{s_{1}}(R_1) \;=\; \delta _{s_{1}}(R_2) \;=\; \textsf{if}\ s_{1}\in \alpha \ \textsf{then}\ \varepsilon \ \textsf{else}\ \bot $$

It follows that \({R_1}{\xrightarrow {\alpha }}{\varepsilon }\) and \({R_2}{\xrightarrow {\alpha }}{\varepsilon }\), so the edges \((R_1,\varepsilon )\) and \((R_2,\varepsilon )\) are added to the GID, where \(\varepsilon \) is a trivial terminal state. In fact, after \(R_1\) the search already terminates, because we then have the path \((R,R_1)(R_1,\varepsilon )\), which implies that R is live. The associated constraints \(s_{0}\in \alpha \) and \(s_{1}\in \alpha \), together with the final constraint \(s_{2..}=\epsilon \), can be used to extract a concrete witness, e.g., \(s=\texttt {``42"}\).
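Schematically, the top-level loop has the following shape (a Rust sketch, ours, not Z3's code: derivatives and nullable stand for the symbolic-derivative expansion and nullability test of [61], with regexes represented by identifiers, and Gid standing for any implementation of Definition 2):

```rust
use std::collections::{HashSet, VecDeque};

enum Verdict { Live, Dead, Unknown }

/// Any online solution to the GID state classification problem (Definition 2).
trait Gid {
    fn on_edge(&mut self, u: u64, v: u64);
    fn on_terminal(&mut self, u: u64);
    fn on_closed(&mut self, u: u64);
    fn verdict(&self, u: u64) -> Verdict;
}

/// Explore the derivative closure of `root`, feeding updates to the GID and
/// stopping as soon as `root` is classified live (nonempty) or dead (empty).
fn is_nonempty(
    root: u64,
    gid: &mut impl Gid,
    derivatives: impl Fn(u64) -> Vec<u64>,
    nullable: impl Fn(u64) -> bool,
) -> bool {
    let mut queue = VecDeque::new();
    let mut seen = HashSet::new();
    queue.push_back(root);
    seen.insert(root);
    while let Some(r) = queue.pop_front() {
        if nullable(r) {
            gid.on_terminal(r);
        }
        for d in derivatives(r) {
            gid.on_edge(r, d);
            if seen.insert(d) {
                queue.push_back(d);
            }
        }
        gid.on_closed(r); // all of r's derivative edges are now present
        match gid.verdict(root) {
            Verdict::Live => return true,  // a nullable regex is reachable
            Verdict::Dead => return false, // root is empty; stop exploring
            Verdict::Unknown => {}
        }
    }
    false // full closure explored without reaching a nullable regex
}
```

Termination of this loop relies on the finiteness of the closure of symbolic derivatives [61, Theorem 7.1].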

Soundness of the algorithm follows from the fact that if R is nonempty (\(s\in R\) is satisfiable), then we eventually arrive at a nullable (terminal) regex, as in the example run above. To achieve completeness – and to eliminate dead states as early as possible – we incrementally construct a GID corresponding to the set of regexes seen so far (as above). After all the feasible transitions from R to its derivatives in \(\delta _{}(R)\) are added to the GID as edges (WLOG in one batch), the state R becomes closed. Crucially, due to the symbolic form of \(\delta _{}(R)\), no derivative is missing. Therefore R is known to be empty precisely as soon as R is detected as a dead state in the GID. An additional benefit is that the algorithm is independent of the size of the universe of \(\mathcal {A}\), which may be very large (e.g. the Unicode character set) or even infinite. We get the following theorem, which uses finiteness of the closure of symbolic derivatives [61, Theorem 7.1]:

Theorem 5

For any regex R, (1) if R is nonempty, then the decision procedure eventually marks R live; and (2) if R is empty, then the decision procedure marks R dead at the earliest stage at which it is known to be dead, and terminates.

6 Related Work

Online Graph Algorithms. Online graph algorithms are typically divided into problems over incremental graphs (where edges are added), decremental graphs (where edges are deleted), and dynamic graphs (where edges are both added and deleted), with core data structures discussed in [27, 49]. Important problems include transitive closure, cycle detection, topological ordering, and strongly connected component (SCC) maintenance.

For incremental topological ordering, [46] is an early work, and [33] presents two different algorithms, one for sparse graphs and one for dense graphs – the algorithms are also extended to work with SCCs. The sparse algorithm was subsequently simplified in [10] and is the basis of our implementation named BFGT in Sect. 4. A unified treatment of several algorithms based on [10] is presented in [21], using a notion of weak topological order and a labeling technique that estimates transitive closure size. Further extensions of [10] based on randomization are studied in [11, 14].

For dynamic directed graphs, a topological sorting algorithm that is experimentally preferable for sparse graphs is discussed in [56], and a related article [55] discusses maintenance of strongly connected components. Transitive closure for dynamic graphs is studied in [57], improving upon some algorithms presented earlier in [34]. One major application for these algorithms is in pointer analysis [54].

For undirected forests, fully dynamic reachability is solvable in amortized logarithmic time per edge via multiple possible approaches [3, 30, 35, 59, 64]; our implementation uses Euler Tour Trees [35].

Data Structures for SMT. Union-Find [63] is a foundational data structure used in SMT. E-graphs [23, 67] are used to maintain congruence, where two expressions are equivalent if their corresponding subexpressions are equivalent [25, 52]. In both Union-Find and e-graphs, the maintained relation is an equivalence relation. In contrast, maintaining live and dead states involves tracking reachability rather than equivalence. To the best of our knowledge, the specific formulation of incremental reachability we consider here is new.

Dead State Elimination in Automata. A DFA or NFA may be viewed as a GID, so state classification in GIDs solves dead state elimination in DFAs and NFAs, while additionally working in an incremental fashion. Dead state elimination is also known as trimming [37] and plays an important role in automata minimization [12, 38, 48]. The literature on minimization is vast, and goes back to the 1950s [16, 17, 39, 40, 41, 50, 53]; see [65] for a taxonomy, [2] for an experimental comparison, and [22] for the symbolic case. Watson et al. [66] propose an incremental minimization algorithm, in the sense that it can be halted at any point to produce a partially minimized, equivalent DFA; unlike in our setting, the DFA’s states and transitions are fixed and read in a predetermined order.