Fine-Grained Complexity of Safety Verification

We study the fine-grained complexity of Leader Contributor Reachability (LCR) and Bounded-Stage Reachability (BSR), two variants of the safety verification problem for shared memory concurrent programs. For both problems, the memory is a single variable over a finite data domain. Our contributions are new verification algorithms and lower bounds. The latter are based on the Exponential Time Hypothesis (ETH), the problem Set Cover, and cross-compositions. LCR is the question whether a designated leader thread can reach an unsafe state when interacting with a certain number of equal contributor threads. We suggest two parameterizations: (1) by the size of the data domain D and the size of the leader L, and (2) by the size of the contributors C. We present algorithms for both cases. The key techniques are compact witnesses and dynamic programming. The algorithms run in O*((L·(D+1))^{L·D} · D^D) and O*(2^C) time, showing that both parameterizations are fixed-parameter tractable. We complement the upper bounds by (matching) lower bounds based on ETH and Set Cover. Moreover, we prove the absence of polynomial kernels. For BSR, we consider programs involving t different threads. We restrict the analysis to computations where the write permission changes s times between the threads. BSR asks whether a given configuration is reachable via such an s-stage computation. When parameterized by P, the maximum size of a thread, and t, the interesting observation is that the problem has a large number of difficult instances. Formally, we show that there is no polynomial kernel, no compression algorithm that reduces the size of the data domain D or the number of stages s to a polynomial dependence on P and t. This indicates that symbolic methods may be harder to find for this problem.


Introduction
We study the fine-grained complexity of two safety verification problems [1,17,31] for shared memory concurrent programs. The motivation to reconsider these problems is recent developments in fine-grained complexity theory [11,38,7,34]. They suggest that classifications such as NP or even FPT are too coarse to explain the success of verification methods. Instead, it should be possible to identify the precise influence that parameters of the input have on the verification time. Our contribution confirms this idea. We give new verification algorithms for the two problems that, for the first time, can be proven optimal in the sense of fine-grained complexity theory. To state the results, we need some background. As we proceed, we explain the development of fine-grained complexity theory.
There is a well-known gap between the success that verification tools see in practice and the judgments about computational hardness that worst-case complexity is able to give. The applicability of verification tools steadily increases by tuning them towards industrial instances. The complexity estimation is stuck with considering the input size or at best assuming certain parameters to be constant. However, the latter approach is not very enlightening if the runtime is n^k, where n is the input size and k the parameter.
The observation of a gap between practical algorithms and complexity theory is not unique to verification but made in every field that has to solve hard computational problems. Complexity theory has taken up the challenge to close the gap. So-called fixed-parameter tractability (FPT) [12,14] proposes to identify parameters k so that the runtime is f(k) · poly(n), where f is a computable function and poly(n) denotes any polynomial in n. These parameters are powerful in the sense that they dominate the complexity.
For an FPT-result to be useful, the function f should only be mildly exponential, and of course k should be small in the instances of interest. Intuitively, they are what one needs to optimize. Fine-grained complexity is the study of upper and lower bounds on the function. Indeed, the fine-grained complexity of a problem is written as O*(f(k)), emphasizing f and k and suppressing the polynomial part. For upper bounds, the approach is still to come up with an algorithm.
For lower bounds, fine-grained complexity has taken a new and very pragmatic perspective. For n-variable 3-SAT the best known algorithm runs in O(2^n) time, and this bound has not been improved since 1970. The idea is to take improvements on this problem as unlikely, known as the exponential-time hypothesis (ETH) [34]. Formally, it asserts that there is no 2^{o(n)}-time algorithm for 3-SAT. ETH serves as a source of lower bounds that are transferred to other problems via reductions [38]. An even stronger assumption about SAT, called the strong exponential-time hypothesis (SETH) [34,7], and a similar one about Set Cover [11] allow for lower bounds like the absence of O*((2 − δ)^n)-time algorithms.
In this work, we contribute fine-grained complexity results for verification problems on concurrent programs. The first problem (LCR) is reachability for a leader thread that is interacting with an unbounded number of contributors [31,17]. We show that, assuming a parameterization by the size of the leader L and the size of the data domain D, the problem can be solved in O*((L·(D+1))^{L·D} · D^D) time. At the heart of the algorithm is a compression of computations into witnesses. To check reachability, our algorithm then iterates over candidates for witnesses and checks each of them for being a proper witness. Interestingly, we can formulate a variant of the algorithm that seems to be suited for large state spaces.
Using ETH, we show that the algorithm is (almost) optimal. Moreover, the problem is shown to have a large number of hard instances. Technically, there is no polynomial kernel [5,6]. Experience with kernel lower bounds is still limited. This notion of hardness seems to indicate that symbolic methods are hard to apply here. The lower bounds that we present share similarities with the reductions presented in [28,8,29].
If we consider the size C of the contributors as a parameter, we obtain an O*(2^C) upper bound. Our algorithm is based on dynamic programming. We use the technique to solve a reachability problem on a graph that is shown to be a compressed representation for LCR. The compression is based on a saturation argument which is inspired by thread-modular reasoning [22,23,30,33]. With the hardness assumption on Set Cover we show that the algorithm is indeed optimal. Moreover, we prove the absence of a polynomial kernel.
Parameterizations of LCR involving just a single parameter D or L are intractable. We show that these problems are W[1]-hard. This makes the existence of an FPT-algorithm for those parameterizations unlikely.
The second problem we study generalizes bounded context switching. Bounded stage reachability (BSR) asks whether a state is reachable if there is a bound s on the number of times the write permission is allowed to change between the threads [1]. Again, we show the new form of kernel lower bound. The result is tricky and highlights the power of the computation model.
The results are summarized in the table below; main findings are highlighted in gray. We present two new algorithms for LCR. Moreover, we suggest kernel lower bounds as hardness indicators for verification problems. The corresponding lower bound for BSR is particularly difficult to achieve.

The transition relation among configurations → ⊆ C × (Op(D) ∪ {ε}) × C is obtained by lifting the transition relations of the threads. To define it, let pc_1 = pc[i = q_i], meaning thread P_i is in state q_i and otherwise the program counter coincides with pc. Let pc_2 = pc[i = q′_i]. If thread P_i reads with the transition q_i −?a→ q′_i, then (pc_1, a) −?a→ (pc_2, a). Note that the memory is required to hold the desired value. If the thread writes with the transition q_i −!a→ q′_i, then (pc_1, b) −!a→ (pc_2, a) for any b ∈ D. The program's transition relation is generalized to words, c −w→ c′. We call such a sequence of consecutive labeled transitions a computation. To indicate that there is a word justifying a computation from c to c′, we write c →* c′. We may use an index, c −w→_i c′, to indicate that the computation was induced by P_i. Where appropriate, we use the program as an index, c →*_A c′.

Fixed-Parameter Tractability

We wish to study the fine-grained complexity of safety verification problems for the above programs. This means our goal is to identify parameters of these problems that satisfy two properties: first, in practical instances they are small; second, assuming that these parameters are small, efficient verification algorithms can be obtained. Parameterized complexity is a branch of complexity theory that makes precise the idea of being efficient relative to a parameter.
Fix a finite alphabet Σ. A parameterized problem L is a subset of Σ* × N. The problem is called fixed-parameter tractable if there is a deterministic algorithm that, given (x, k) ∈ Σ* × N, decides (x, k) ∈ L in time f(k) · |x|^{O(1)}. We use FPT for the class of all such problems and say a problem is FPT to mean it is in that class. Note that f is a computable function only depending on the parameter k. It is common to denote the runtime by O*(f(k)) and suppress the polynomial part. We will be interested in the precise dependence on the parameter, in upper and lower bounds on the function f. This study is often referred to as fine-grained complexity.
Lower bounds on f are usually obtained from assumptions about SAT. The most famous is the Exponential Time Hypothesis (ETH). It assumes that there is no algorithm solving n-variable 3-SAT in 2^{o(n)} time. Then, the reasoning is as follows: if f drops below a certain bound, ETH would fail. Other standard assumptions for lower bounds are the Strong Exponential Time Hypothesis (SETH) and the hardness assumption of Set Cover. We postpone the definition of the latter and focus on SETH. This assumption is more restrictive than ETH. It asserts that n-variable SAT cannot be solved in O*((2 − δ)^n) time for any δ > 0.
While many parameterizations of NP-hard problems were proven to be fixed-parameter tractable, there are problems that are unlikely to be FPT. Such problems are hard for the complexity class W[1]. For a theory of relative hardness, the appropriate notion of reduction is called parameterized reduction. Given parameterized problems L, L′ ⊆ Σ* × N, we say that L is reducible to L′ via a parameterized reduction if there is an algorithm that transforms an input (x, k) to an input (x′, k′) in time g(k) · |x|^{O(1)} such that (x, k) ∈ L if and only if (x′, k′) ∈ L′. Here, g is a computable function and k′ is computed by a function only dependent on k.

Leader Contributor Reachability
We consider the leader contributor reachability problem for shared memory programs. The problem was introduced in [31] and shown to be NP-complete in [17] for the finite state case. We contribute two new verification algorithms that target two parameterizations of the problem. In both cases, our algorithms establish fixed-parameter tractability. Moreover, with matching lower bounds we prove them to be optimal even in the fine-grained sense.
An instance of the leader contributor reachability problem is given by a shared memory program of the form A = (D, a_0, (P_L, (P_i)_{i∈[1..t]})). The program has a designated leader thread P_L and several contributor threads P_1, ..., P_t. In addition, we are given a set of unsafe states for the leader. The task is to check whether the leader can reach an unsafe state when interacting with a number of instances of the contributors. It is worth noting that the problem can be reduced to having a single contributor. Let the corresponding thread P_C be the union of P_1, ..., P_t (constructed using an initial ε-transition). We base our complexity analysis on this simplified formulation of the problem.
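Building P_C is a straightforward union. The following is a minimal sketch, under the assumption that a thread is encoded as (states, initial state, transitions) with transitions of the form (source, op, target); this encoding and all names are ours, for illustration only.

    def union_contributor(threads):
        """Build P_C as the disjoint union of P_1, ..., P_t, entered from a
        fresh initial state via epsilon-transitions."""
        states, transitions = {"initC"}, []
        for idx, (sts, init, trans) in enumerate(threads):
            rename = lambda q, i=idx: f"{i}:{q}"   # keep state sets disjoint
            states.update(rename(q) for q in sts)
            transitions.append(("initC", "eps", rename(init)))
            transitions.extend((rename(p), op, rename(q)) for p, op, q in trans)
        return states, "initC", transitions

    # Two toy contributors over the domain {a}.
    P1 = ({"p0", "p1"}, "p0", [("p0", "!a", "p1")])
    P2 = ({"r0", "r1"}, "r0", [("r0", "?a", "r1")])
    print(union_contributor([P1, P2])[2])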
For the definition, let A = (D, a_0, (P_L, P_C)) be a program with two threads. Let F_L ⊆ Q_L be a set of unsafe states of the leader. For any t ∈ N, define the program A^t = (D, a_0, (P_L, (P_C)_{i∈[1..t]})) to have exactly t copies of P_C. Further, let C_f be the set of configurations where the leader is in an unsafe state (from F_L). The problem of interest is as follows:

Leader Contributor Reachability (LCR)
Input: A program A = (D, a_0, (P_L, P_C)) and a set of states F_L ⊆ Q_L.
Question: Is there a t ∈ N such that c_0 →*_{A^t} c for some c ∈ C_f?
We consider two parameterizations of LCR. First, we parameterize by D, the size of the data domain, and L, the number of states of the leader P_L. We denote the parameterization by LCR(D, L). The second parameterization that we consider is LCR(C), a parameterization by the number of states of the contributor P_C. For both LCR(D, L) and LCR(C), we present fine-grained analyses that include FPT-algorithms as well as lower bounds for runtimes and kernels.
While for LCR(D, L) we obtain an FPT-algorithm, it is not likely that LCR(D) and LCR(L) admit the same. We prove that these problems are W[1]-hard.

Parameterization by Memory and Leader
We give an algorithm that solves LCR in time O*((L·(D+1))^{L·D} · D^D), which means LCR(D, L) is FPT. We then show how to modify the algorithm to solve instances of LCR as they are likely to occur in practice. Interestingly, the modified version of the algorithm lends itself to an efficient implementation based on off-the-shelf sequential model checkers. We conclude with lower bounds for LCR(D, L).
Upper Bound We give an algorithm for the parameterization LCR(D, L). The key idea is to compactly represent computations that may be present in an instance of the given program. To this end, we introduce a domain of so-called witness candidates. The main technical result, Lemma 6, links computations and witness candidates. It shows that an unsafe state is reachable in an instance of the program if and only if there is a witness candidate that is valid (in a precise sense). With this, our algorithm iterates over all witness candidates and checks each of them for being valid. To state the overall result, let Wit(L, D) = (L·(D+1))^{L·D} · D^D · L be the number of witness candidates and let Valid(L, D, C) = L^3 · D^2 · C^2 be the time it takes to check validity of a candidate. Note that the latter is polynomial.
Let A = (D, a_0, (P_L, P_C)) be the program of interest and F_L be the set of unsafe states of the leader. Assume we are given a computation ρ showing that P_L can reach a state in F_L when interacting with a number of contributors. We explain the main ideas behind an efficient representation for ρ that still allows for the reconstruction of a similar computation. To simplify the presentation, we assume the leader never writes a value !a and immediately reads the same value ?a back. If it does, the read can be replaced by ε.
In a first step, we delete most of the moves in ρ that were carried out by the contributors. We only keep first writes. For each value a, this is the write transition fw(a) = c −!a→ c′ where a is written by a contributor for the first time. The reason we can omit subsequent writes of a is the following: if fw(a) is carried out by contributor P_1, we can assume that there is an arbitrary number of other contributors that all mimicked the behavior of P_1. This means whenever P_1 did a transition, they copycatted it right away. Hence, there are arbitrarily many contributors pending to write a. Phrased differently, the symbol a is available for the leader whenever P_L needs to read it. The idea goes back to the Copycat Lemma stated in [17]. The reads of the contributors are omitted as well. We will make sure they can be served by the first writes and the moves done by P_L.
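As a small illustration of this deletion step, the following sketch extracts the first writes from a computation given as a list of steps; the (who, op) format is our own.

    def first_writes(computation):
        """Keep leader moves and, per value, only the first contributor write."""
        seen, kept = set(), []
        for who, op in computation:
            if who == "C":
                if op.startswith("!") and op[1:] not in seen:
                    seen.add(op[1:])            # first write of this value
                    kept.append((who, op))
                # all other contributor moves are dropped
            else:
                kept.append((who, op))          # leader moves are kept
        return kept

    steps = [("C", "!a"), ("L", "?a"), ("C", "!a"), ("C", "?b"), ("C", "!c")]
    print(first_writes(steps))  # keeps !a once, the leader read, and !c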
After the deletion, we are left with a shorter expression ρ′. We turn it into a word w over the alphabet Q_L ∪ D_⊥ ∪ D̄, where D_⊥ = D ∪ {⊥} and D̄ = {ā | a ∈ D} contains a marked copy of each memory value. Each transition c −!a/?a/ε→_L c′ in ρ′ that is due to the leader moving from q to q′ is mapped (i) to q.a.q′ if it is a write and (ii) to q.⊥.q′ otherwise. A first write fw(a) = c −!a→ c′ of a contributor is mapped to ā. We may assume that the resulting word w is of the form w = w_1.w_2 with w_1 ∈ ((Q_L.D_⊥)*.D̄)* and w_2 ∈ (Q_L.D_⊥)*.F_L. Note that w can still be of unbounded length.
In order to find a witness of bounded length, we compress w_1 and w_2 to w′_1 and w′_2. Between two first writes ā and b̄ in w_1, the leader can perform an unbounded number of transitions, represented by a word in (Q_L.D_⊥)*. Hence, there are states q ∈ Q_L repeating between ā and b̄. We contract the word between the first and the last occurrence of q into just a single state q. This state now represents a loop on P_L. Since there are L states in the leader, this bounds the number of contractions. Furthermore, we know that the number of first writes is bounded by D, as each symbol can be written for the first time at most once. Thus, the compressed string w′_1 is a word in the language ((Q_L.D_⊥)^{≤L}.D̄)^{≤D}. The word w_2 is of the form w_2 = q.u for a state q ∈ Q_L and a word u. We truncate the word u and only keep the state q. Then we know that there is a computation leading from q to a state in F_L where P_L can potentially write any symbol but read only those symbols which occurred as a first write in w′_1. Altogether, we are left with a word of bounded length.

Definition 2. The set of witness candidates is E = ((Q_L.D_⊥)^{≤L}.D̄)^{≤D}.Q_L.
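The algorithm then simply enumerates E. The following toy sketch shows the enumeration for a two-state leader and a singleton domain; the validity check of Lemma 6 is omitted, and the encoding of barred symbols is ours.

    from itertools import product, chain

    Q_L = ["q0", "q1"]       # leader states (L = 2)
    D = ["a"]                # data domain (D = 1)
    D_bot = D + ["bot"]      # D with the padding symbol

    def blocks():
        """All words in (Q_L . D_bot)^{<=L} . Dbar for the toy instance."""
        for n in range(len(Q_L) + 1):                    # up to L pairs
            for pairs in product(product(Q_L, D_bot), repeat=n):
                for a in D:                              # one first write
                    yield list(chain.from_iterable(pairs)) + [a + "_bar"]

    def candidates():
        """All words in E = (blocks)^{<=D} . Q_L."""
        bs = list(blocks())
        for k in range(len(D) + 1):                      # up to D blocks
            for chosen in product(bs, repeat=k):
                for q in Q_L:                            # final state
                    yield list(chain.from_iterable(chosen)) + [q]

    print(sum(1 for _ in candidates()))                  # 44 candidates here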
Before we elaborate on the precise relation between witness candidates and computations, we turn to an example. It shows how an actual computation is compressed to a witness candidate following the above steps.
Example 3. Consider the program A = (D, a_0, (P_L, P_C)) with domain D, leader thread P_L, and contributor thread P_C given in Figure 1. We follow a computation in A^2 that reaches the unsafe state q_4 of the leader. Note that the transitions are labeled by L and C, depending on whether the leader or a contributor moved.
We construct a witness candidate out of the computation. To this end, we only keep the first writes of the contributors. These are the write !a in the first transition and the write !c in the fifth transition. Both are marked red. They will be represented in the witness candidate by the symbols ā, c̄ ∈ D̄. Now we map the transitions of the leader to words. Writes are preserved, reads are mapped to ⊥. Then we obtain the witness candidate w = ā . q_0 . ⊥ . q_1 . b . c̄ . q_2. Note that we omit the last two transitions of the leader. The reason is as follows. After the first write c̄, the leader is in state q_2. From this state, the leader can reach q_4 while only reading from first writes that have already appeared in the candidate, namely a and c. Hence, we can truncate the witness candidate at that point and do not have to keep the remaining computation to q_4.
To characterize computations in terms of witness candidates, we define the notion of validity. This needs some notation. Consider a word w = w_1 ... w_ℓ over some alphabet Γ. For i ∈ [1..ℓ], we set w[i] = w_i and w[1..i] = w_1 ... w_i. For a subalphabet Γ′ ⊆ Γ, we write w ↓_{Γ′} for the projection of w to the letters in Γ′. Consider a witness candidate w ∈ E and let i ∈ [1..|w|]. We use D̄(w, i) for the set of all first writes that occur in w up to position i. Formally, we define it to be D̄(w, i) = {a | ā is a letter in w[1..i] ↓_{D̄}}. We abbreviate D̄(w, |w|) as D̄(w). Let q ∈ Q_L and S ⊆ D. Recall that a state occurring in a candidate represents a loop in P_L. The set of all letters written within a loop from q to q when reading only symbols from S is denoted by Loop(q, S). The definition of validity is given next; the three requirements are made precise in the text below.

(1) If w ↓_{D̄} = c̄_1 ... c̄_ℓ, then the c̄_i are pairwise different.

(2) The candidate encodes a run of the leader that ends in an unsafe state. Let w ↓_{Q_L ∪ D_⊥} = q_1.a_1.q_2.a_2 ... q_ℓ.a_ℓ.q_{ℓ+1}. For each i ∈ [1..ℓ], the leader has a fitting transition: if a_i ∈ D, there is a write q_i −!a_i→_L q_{i+1}; if a_i = ⊥, there is an ε-transition q_i −ε→_L q_{i+1}. Alternatively, there is a read q_i −?a→_L q_{i+1} of a symbol a ∈ D̄(w, pos(a_i)) that already occurred within a first write (the leader does not read its own writes). Here, we use pos(a_i) to access the position of a_i in w. State q_1 = q^0_L is initial. There is a run from q_{ℓ+1} to a state q_f ∈ F_L. During this run, reading is restricted to symbols that occurred as first writes in w. Formally, there is a word v ∈ (W(D) ∪ R(D̄(w)))* leading to an unsafe state q_f: we have q_{ℓ+1} −v→_L q_f.

(3) For each prefix v.ā of w with ā ∈ D̄ there is a computation q^0_C −u.!a→_C q on P_C so that the reads in u can be obtained from v. Formally, let u′ = u ↓_{R(D)}. Then there is an embedding of u′ into v, a monotone map µ : [1..|u′|] → [1..|v|] that satisfies the following. Let u′[i] = ?a with a ∈ D. The read is served in one of the following three ways. We may have v[µ(i)] = a, which corresponds to a write of a by P_L. Alternatively, v[µ(i)] = q ∈ Q_L and a ∈ Loop(q, D̄(w, µ(i))). This amounts to reading from a leader's write that was executed in a loop. Finally, we may have a ∈ D̄(w, µ(i)), corresponding to reading from another contributor.
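Deciding whether such an embedding exists amounts to a greedy left-to-right matching: whether a position of v serves a read is a local condition, so matching each read at the leftmost serving position is safe. A minimal sketch, assuming a strictly monotone embedding; the dictionary loop_writes is a hypothetical precomputed stand-in for the sets Loop(q, S).

    def embeds(u_reads, v, dbar_upto, loop_writes):
        """u_reads: the values read by the contributor, in order.
        v: the candidate prefix as a list of leader states and values.
        dbar_upto[j]: first-write values occurring in v up to position j.
        loop_writes[(q, S)]: values the leader can write looping at q reading S."""
        j = 0
        for a in u_reads:
            served = False
            while j < len(v) and not served:
                S = frozenset(dbar_upto[j])
                served = (v[j] == a                                  # leader writes a
                          or a in loop_writes.get((v[j], S), set())  # loop write
                          or a in dbar_upto[j])                      # first write
                j += 1
            if not served:
                return False                                         # read not served
        return True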
Our goal is to prove that a valid witness candidate exists if and only if there is a computation leading to an unsafe state. Before we state the corresponding lemma, we provide some intuition for the three requirements along an example.
Example 5. Reconsider the program A from Figure 1. We elaborate on why the three requirements for validity are essential. To this end, we present three witness candidates, each violating one of the requirements. Thus, these candidates cannot correspond to an actual computation of the program.
The witness candidate w_1 = ā . q_0 . ⊥ . q_1 . b . ā . q_2 clearly violates requirement (1) due to the repetition of ā. Since first writes are unique, there cannot exist a computation of program A following candidate w_1.
Requirement (2) asks for a proper run on the leader thread P_L. Hence, the witness candidate w_2 = ā . q_0 . a . q_1 . b . c̄ . q_2 violates the requirement although it satisfies (1). The subword q_0 . a . q_1 of w_2 encodes that the leader should take the transition q_0 −!a→_L q_1. But this transition does not exist in P_L. Consequently, there is no computation of A which corresponds to the witness candidate w_2.
For requirement (3), consider the candidate w_3 = ā . q_0 . ⊥ . q_1 . ⊥ . c̄ . q_2. It clearly satisfies (1). Requirement (2) is also fulfilled. In fact, the subwords encoding transitions of the leader are q_0 . ⊥ . q_1 and q_1 . ⊥ . q_2. The first subword corresponds to the transition q_0 −?a→_L q_1, which can be taken since a already appeared as a first write in w_3. The second subword refers to the transition q_1 −ε→_L q_2. To explain that w_3 does not satisfy requirement (3), we show that c cannot be provided as a first write. To this end, assume that w_3 satisfies (3). Then there is a computation q^0_C −u.!c→_C q on P_C whose reads can be obtained from the prefix v = ā . q_0 . ⊥ . q_1 . ⊥ preceding c̄. The reads in u are either first writes in v or writes provided by the leader (potentially in loops). Symbol b is not provided as such: it is neither a first write in v nor a symbol written by the leader (in a loop) along v. However, a computation u leading to state p_2 in P_C needs to read b once. Hence, such a computation does not exist and c cannot be provided as a first write.
The witness candidate w = ā . q_0 . ⊥ . q_1 . b . c̄ . q_2 from Example 3 satisfies all the requirements. In particular, (3) is fulfilled since b is written by the leader in the transition q_1 −!b→_L q_2. Hence, in this case, c can be provided as a first write.

Lemma 6. There is a t ∈ N so that c_0 →*_{A^t} c with c ∈ C_f if and only if there is a valid witness candidate w ∈ E.
Our algorithm iterates over all witness candidates w ∈ E and tests whether w is valid. The number of candidates Wit(L, D) is (L·(D+1))^{L·D} · D^D · L. This is due to the fact that we can force a witness candidate to have maximum length by inserting padding symbols. Hence, the number of candidates constitutes the first factor of the complexity estimation stated in Theorem 1. The polynomial factor Valid(L, D, C) is the time needed to check a single candidate for validity.
Practical Algorithm We improve the above algorithm so that it should work well on practical instances. The idea is to factorize the leader along its strongly connected components (SCCs), the number of which is assumed to be small in real programs. Technically, our improved algorithm works with valid SCC-witnesses. They symbolically represent SCCs rather than loops in the leader. To state the complexity, we first define the straight line depth, the number of SCCs the leader may visit during a computation. The definition needs a graph construction.
Let V ⊆ D̄^{≤D} contain only words that do not repeat letters. Take an element r = c̄_1 ... c̄_ℓ ∈ V and let i ∈ [0..ℓ]. By P_L ↓_i we denote the automaton obtained from P_L by removing all transitions that read a value outside {c_1, ..., c_i}. Let SCC(P_L ↓_i) denote the set of all SCCs in this automaton. We construct the directed graph G(P_L, r) as follows. The vertices are the SCCs of all P_L ↓_i where i ∈ [0..ℓ]. There is an edge between S, S′ ∈ SCC(P_L ↓_i) if there are states q ∈ S and q′ ∈ S′ with a transition from q to q′ in P_L ↓_i. Between S ∈ SCC(P_L ↓_{i−1}) and S′ ∈ SCC(P_L ↓_i), we only get an edge if we can get from S to S′ by reading c_i. Note that the resulting graph is acyclic. The depth d(r) of P_L relative to r is the length of the longest path in G(P_L, r). The straight line depth is d = max{d(r) | r ∈ V}. The number of SCCs s is the size of SCC(P_L ↓_0). With these values at hand, the number of SCC-witness candidates (the definition of which can be found in Appendix A) can be bounded by Wit_SCC(s, D, d) ≤ (s·(D+1))^d · D^D · 2^{D+d}. The time needed to test whether a candidate is valid is again polynomial. For this algorithm, what matters is whether the leader's state space is strongly connected; the number of states has limited impact on the runtime.
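For small instances, the straight line depth can be computed directly from this definition. The following brute-force sketch uses networkx; the transition encoding is ours, and the cross-level edge rule is interpreted as a single ?c_i transition between the two SCCs, which is our reading of the construction.

    import networkx as nx
    from itertools import permutations

    def restricted(states, transitions, allowed_reads):
        """P_L with reads outside allowed_reads removed, as a digraph."""
        g = nx.DiGraph()
        g.add_nodes_from(states)
        for p, op, q in transitions:
            if op.startswith("?") and op[1:] not in allowed_reads:
                continue
            g.add_edge(p, q)
        return g

    def depth(states, transitions, r):
        """Longest path (edge count) in G(P_L, r)."""
        G, prev = nx.DiGraph(), None
        for i in range(len(r) + 1):
            g = restricted(states, transitions, set(r[:i]))
            sccs = [frozenset(s) for s in nx.strongly_connected_components(g)]
            G.add_nodes_from((i, A) for A in sccs)
            for A in sccs:                 # condensation edges inside level i
                for B in sccs:
                    if A != B and any(g.has_edge(p, q) for p in A for q in B):
                        G.add_edge((i, A), (i, B))
            if prev is not None:           # edges from level i-1 via ?r[i-1]
                for A in prev:
                    for B in sccs:
                        if any(op == "?" + r[i - 1] and p in A and q in B
                               for p, op, q in transitions):
                            G.add_edge((i - 1, A), (i, B))
            prev = sccs
        return nx.dag_longest_path_length(G)

    def straight_line_depth(states, transitions, domain):
        return max(depth(states, transitions, r)
                   for l in range(len(domain) + 1)
                   for r in permutations(domain, l))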
Lower bound We prove that the algorithm from Theorem 1 is only a root factor away from being optimal: a 2^{o(√(L·D)·log(L·D))}-time algorithm for LCR would contradict ETH. We achieve the lower bound by a reduction from k × k Clique, the problem of finding a clique of size k in a graph whose vertices are elements of a k × k matrix. Moreover, the clique has to contain one vertex from each row. Unless ETH fails, the problem cannot be solved in time 2^{o(k·log(k))} [38].
Technically, we construct from an instance (G, k) of k × k Clique an instance (A = (D, a_0, (P_L, P_C)), F_L) of LCR such that D = O(k) and L = O(k). Furthermore, we show that G contains the desired clique of size k if and only if there is a t ∈ N such that c_0 →*_{A^t} c with c ∈ C_f. Suppose we had an algorithm for LCR running in time 2^{o(√(L·D)·log(L·D))}. Combined with the reduction, this would yield an algorithm for k × k Clique with runtime 2^{o(√(k²)·log(k²))} = 2^{o(k·log k)}. But unless the exponential time hypothesis fails, such an algorithm cannot exist. We assume that the vertices V of G are given by tuples (i, j) with i, j ∈ [1..k], where i denotes the row and j denotes the column in the matrix. In the reduction, we need the leader and the contributors to communicate on the vertices of G. However, we cannot store tuples (i, j) in the memory as this would cause a quadratic blow-up D = O(k²). Instead, we communicate a vertex (i, j) as a string row(i).col(j). We distinguish between row- and column-symbols to avoid stuttering, the repeated reading of the same symbol. With this, it cannot happen that a thread reads a row-symbol twice and takes it for a column.
The program starts its computation with each contributor choosing a vertex (i, j) to store. For simplicity, we denote a contributor storing the vertex (i, j) by P_{(i,j)}. Note that there can be copies of P_{(i,j)}.
Since there are arbitrarily many contributors, the chosen vertices are only a superset of the clique we want to find. To cut away the false vertices, the leader P_L guesses for each row the vertex belonging to the clique. Contributors storing other vertices than the guessed ones will be switched off bit by bit. To this end, the program performs for each i ∈ [1..k] the following steps: if (i, j_i) is the vertex of interest, P_L first writes row(i) to the memory. Each contributor that is still active reads the symbol and advances by one state. Then P_L communicates the column by writing col(j_i). Again, the active contributors P_{(i′,j′)} read.
Upon transmitting (i, j_i), the contributors react in one of the following three ways: (1) If i′ ≠ i, the contributor P_{(i′,j′)} stores a vertex of a different row. The computation in P_{(i′,j′)} can only go on if (i′, j′) is connected to (i, j_i) in G. Otherwise it will stop. (2) If i′ = i and j′ = j_i, then P_{(i′,j′)} stores exactly the vertex guessed by P_L. In this case, P_{(i′,j′)} can continue its computation. (3) If i′ = i and j′ ≠ j_i, thread P_{(i′,j′)} stores a different vertex from row i. The contributor has to stop.
After k such rounds, there are only contributors left that store vertices guessed by P_L. Furthermore, each two of these vertices are connected. Hence, they form a clique. To transmit this information to P_L, each P_{(i,j_i)} writes #_i to the memory, a special symbol for row i. After P_L has read the string #_1 ... #_k, it moves to its final state. A formal construction is given in Appendix A.
Note that the size O(k) of the data domain cannot be avoided, even if we encoded the row and column symbols in binary. The reason is that P_L needs a confirmation of k contributors that were not stopped during the guessing and terminated correctly. Since contributors do not have final states, we need to transmit this information in the form of k different memory symbols.
Absence of a Polynomial Kernel A kernelization of a parameterized problem is a compression algorithm. Given an instance, it returns an equivalent instance the size of which is bounded by a function only in the parameter. From an algorithmic perspective, kernels put a bound on the number of hard instances. Indeed, the search for small kernels is a key interest in algorithmics, similar to the search for FPT-algorithms. It can be shown that kernels exist if and only if a problem admits an FPT-algorithm [12].
Let Q be a parameterized problem. A kernelization of Q is an algorithm that, given an instance (B, k), runs in time polynomial in |B| and k, and outputs an equivalent instance (B′, k′) such that |B′| + k′ ≤ g(k). Here, g is a computable function. If g is a polynomial, we say that Q admits a polynomial kernel.
Unfortunately, for many problems the community failed to come up with polynomial kernels. This led to the contrary approach, namely disproving their existence [27,5,6]. The absence of a polynomial kernel constitutes an exponential lower bound on the number of hard instances. Like computational hardness results, such a bound is seen as an indication of the general hardness of the problem.
Technically, the existence of a polynomial kernel for the problem of interest is shown to imply NP ⊆ coNP/poly. However, the inclusion is considered unlikely as it would cause a collapse of the polynomial hierarchy to the third level [41].
In order to link the existence of a polynomial kernel for LCR(D, L) with the above inclusion, we follow the framework developed in [6]. Let Γ be an alphabet. A polynomial equivalence relation is an equivalence relation R on Γ* with the following properties: given x, y ∈ Γ*, it can be decided in time polynomial in |x| + |y| whether (x, y) ∈ R. Moreover, for n ∈ N there are at most polynomially many equivalence classes in R restricted to Γ^{≤n}.
The key tool for proving kernel lower bounds are cross-compositions. Let L ⊆ Γ* be a language and Q ⊆ Γ* × N be a parameterized language. We say that L cross-composes into Q if there exists a polynomial equivalence relation R and an algorithm C, together called the cross-composition, with the following properties: C takes as input ϕ_1, ..., ϕ_I ∈ Γ*, all equivalent under R. It computes in time polynomial in Σ_{ℓ=1}^{I} |ϕ_ℓ| a string (y, k) ∈ Γ* × N such that (y, k) ∈ Q if and only if there is an ℓ ∈ [1..I] with ϕ_ℓ ∈ L. Furthermore, parameter k is bounded by p(max_{ℓ∈[1..I]} |ϕ_ℓ| + log(I)), where p is a polynomial.
It was shown in [6] that a cross-composition of any NP-hard language into a parameterized language Q prohibits the existence of a polynomial kernel for Q unless NP ⊆ coNP/poly. In order to make use of this result, we show how to cross-compose 3-SAT into LCR(D, L). This yields the following: unless NP ⊆ coNP/poly, LCR(D, L) does not admit a polynomial kernel. The difficulty in coming up with a cross-composition is the restriction on the size of the parameters. In our case, this affects D and L: both parameters are not allowed to depend polynomially on I, the number of given 3-SAT-instances. We resolve the polynomial dependence by encoding the choice of such an instance into the contributors via a binary tree.
Proof (Idea). Assume some encoding of Boolean formulas as strings over a finite alphabet. We use the polynomial equivalence relation R defined as follows: Two strings ϕ and ψ are equivalent under R if both encode 3-SAT-instances, and the numbers of clauses and variables coincide.
Let the given 3-SAT-instances be ϕ_1, ..., ϕ_I. Every two of them are equivalent under R. This means all ϕ_ℓ have the same number of clauses m and use the same set of variables {x_1, ..., x_n}. We assume that ϕ_ℓ = C^ℓ_1 ∧ ··· ∧ C^ℓ_m. We construct a program proceeding in three phases. First, it chooses an instance ϕ_ℓ, then it guesses an evaluation of all variables, and in the third phase it verifies that the evaluation satisfies ϕ_ℓ. While the second and the third phase do not cause a dependence of the parameters on I, the first phase does. It is not possible to guess a number ℓ ∈ [1..I] and communicate it via the memory, as this would provoke a polynomial dependence of D on I.
To implement the first phase without a polynomial dependence, we transmit the indices of the 3-SAT-instances in binary. The leader guesses and writes tuples (u_1, 1), ..., (u_{log(I)}, log(I)) with u_i ∈ {0, 1} to the memory. This amounts to choosing an instance ϕ_ℓ with binary representation bin(ℓ) = u_1 ... u_{log(I)}.
It is the contributors' task to store this choice. Each time the leader writes a tuple (u_i, i), the contributors read and branch either to the left, if u_i = 0, or to the right, if u_i = 1. Hence, in the first phase, the contributors are binary trees with I leaves, each leaf storing the index of an instance ϕ_ℓ. Since we did not assume that I is a power of 2, there may be computations arriving at leaves that do not represent proper indices. In this case, the computation deadlocks.
The size of D and P_L in the first phase is O(log(I)). Note that this satisfies the size restrictions of a cross-composition.
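A small sketch of these branching trees in our own encoding: from the state for bit prefix s, reading the tuple (u, |s|+1) written by the leader moves the contributor to the state for prefix s.u; state names and the tuple format are illustrative.

    from math import ceil, log2

    def index_tree(I):
        """Transitions of the contributor's binary branching tree. Leaves
        whose value is >= I receive no further transitions, so a computation
        arriving there deadlocks."""
        bits = max(1, ceil(log2(I)))
        transitions = []
        for depth in range(bits):
            for prefix in range(2 ** depth):
                s = format(prefix, f"0{depth}b") if depth else ""
                for u in "01":
                    transitions.append((f"n_{s or 'root'}",
                                        f"?({u},{depth + 1})",
                                        f"n_{s + u}"))
        return transitions

    print(len(index_tree(5)))   # 2 + 4 + 8 = 14 transitions for I = 5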
For guessing the evaluation in the second phase, the program communicates on tuples (x_i, v) with i ∈ [1..n] and v ∈ {0, 1}. The leader guesses such a tuple for each variable and writes it to the memory. Any participating contributor is free to read one of these. After reading, it stores the variable and the evaluation.
In the third phase, the satisfiability check is performed as follows: each contributor that is still active has stored in its current state the chosen instance ϕ_ℓ, a variable x_i, and its evaluation v_i. Assume that x_i, when evaluated to v_i, satisfies C^ℓ_j, the j-th clause of ϕ_ℓ. Then the contributor loops in its current state while writing the symbol #_j. The leader waits to read the string #_1 ... #_m. If P_L succeeds, we are sure that the m clauses of ϕ_ℓ were satisfied by the chosen evaluation. Thus, ϕ_ℓ is satisfiable and P_L moves to its final state. For details of the construction and a proof of correctness, we refer to Appendix A.

Parameterization by Contributors
The size of the contributors C has substantial influence on the complexity of LCR.
We show that the problem can be solved in time O*(2^C) via dynamic programming. Moreover, we present a matching lower bound proving it unlikely that LCR can be solved in time O*((2 − δ)^C) for any δ > 0. The result is obtained by a reduction from Set Cover. Finally, we prove the absence of a polynomial kernel.
Upper Bound Our algorithm is based on dynamic programming. Intuitively, we cut a computation of the program along the states reached by the contributors. To this end, we keep a table with an entry for each subset of the contributors' states. The entry of a set S ⊆ Q_C contains those states of the leader that are reachable under a computation where the behavior of the contributors is limited to S. We fill the table by a dynamic programming procedure and check in the end whether a final state of the leader occurs in an entry. The result is the O*(2^C) bound of Theorem 11.
To define the table, we first need a compact way of representing computations that allows for fast iteration. The observation is that keeping one set of states for all contributors suffices. Let S ⊆ Q_C be the set of states reachable by the contributors in a given computation. By the Copycat Lemma [17], we can assume for each q ∈ S an arbitrary number of contributors that are currently in q. This means that we do not have to distinguish between different contributor instances.
Formally, we reduce the search space to V = Q_L × D × P(Q_C). Instead of explicit configurations, we consider tuples (q, a, S), where q ∈ Q_L, a ∈ D, and S ⊆ Q_C. Between these tuples, we define an edge relation E. Writes of the leader change the memory: if there is a transition q −!b→_L q′, we get an edge (q, a, S) →_E (q′, b, S) for any a ∈ D. Reads of the leader are similar. Contributors also change the memory but saturate the set S instead of changing the state: if there is a transition p −!b→_C p′ with p ∈ S, we get an edge (q, a, S) →_E (q, b, S ∪ {p′}). Reads are handled similarly.
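A compact sketch of this edge relation as a successor function; the transition encoding (p, op, q) is ours, and the sets S are kept as frozensets so nodes stay hashable.

    def successors(node, leader_trans, contrib_trans):
        """All E-successors of the node (q, a, S)."""
        q, a, S = node
        out = set()
        for p, op, p2 in leader_trans:
            if p != q:
                continue
            if op.startswith("!"):
                out.add((p2, op[1:], S))                     # leader write
            elif (op.startswith("?") and op[1:] == a) or op == "eps":
                out.add((p2, a, S))                          # leader read / eps
        for p, op, p2 in contrib_trans:
            if p not in S:
                continue
            if op.startswith("!"):
                out.add((q, op[1:], S | {p2}))               # contributor write
            elif (op.startswith("?") and op[1:] == a) or op == "eps":
                out.add((q, a, S | {p2}))                    # contributor read / eps
        return out

    # A contributor read p0 -?a-> p1 saturates {p0} to {p0, p1}.
    print(successors(("q1", "a", frozenset({"p0"})), [], [("p0", "?a", "p1")]))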
The set V together with the relation E forms a finite directed graph G = (V, E). We call the node v_0 = (q^0_L, a_0, {q^0_C}) the initial node. Computations are represented by paths in G starting in v_0. Hence, we have reduced LCR to the problem of checking whether a node whose leader component lies in F_L is reachable from v_0 in G; this is the content of Lemma 12.

Before we elaborate on solving reachability on G, we turn to an example. It shows how G is constructed from a program and illustrates Lemma 12.

Example 13. We consider the program A = (D, a_0, (P_L, P_C)) from Figure 2. The nodes of the corresponding graph G are given by V = Q_L × D × P({p_0, p_1, p_2}). Its edges E are constructed following the above rules. For instance, we get an edge (q_1, a, {p_0}) →_E (q_1, a, {p_0, p_1}) since P_C has a read transition p_0 −?a→ p_1. Intuitively, the edge describes that currently the leader is in state q_1, the memory holds a, and an arbitrary number of contributors is waiting in p_0. Then some of these read a and move to p_1. Hence, we may assume an arbitrary number of contributors in the states p_0 and p_1.
The complete graph G is presented in Figure 3. For the purpose of readability, we only show the nodes reachable from the initial node v_0 = (q_0, a_0, {p_0}). Moreover, we omit self-loops and we present the graph as a collection of subgraphs. The latter means that for each subset S ⊆ {p_0, p_1, p_2}, we consider the induced subgraph G[Q_L × D × {S}]. It contains the set of nodes Q_L × D × {S} and all edges that start and end in this set. Note that we omit the last component from a node (q, a, S) in G[Q_L × D × {S}] since it is clear from the context. The induced subgraphs are connected by edges that saturate S.
The red marked nodes are those which contain the unsafe state q_3 of the leader. Consider a path from v_0 to one of these nodes. It starts in the subgraph G[Q_L × D × {{p_0}}]. To reach one of the red nodes, the path has to traverse the subgraphs for larger sets of contributor states, via G[Q_L × D × {{p_0, p_1}}] to G[Q_L × D × {{p_0, p_1, p_2}}]. Phrased differently, the states of the contributors need to be saturated two times along the path. This means that in an actual computation, there must be contributors in p_0, p_1, and p_2. These can then provide the symbols b and c which are needed by the leader to reach the state q_3.

Constructing G for a program and solving reachability on it takes time O*(4^C) [10]. Hence, we have to solve reachability without constructing G explicitly. Our algorithm computes a table T which admits a recurrence relation that simplifies the reachability query: instead of solving reachability directly on G, we can restrict to so-called slices of G. These are subgraphs of polynomial size where reachability queries can be decided efficiently.
We define the table T. For each set S ⊆ Q_C, we have an entry T[S] given by the nodes of the form (q, a, S) that are reachable from the initial node v_0.
Assume we have already computed T. By Lemma 12, there is a t ∈ N with c_0 →*_{A^t} c for some c ∈ C_f if and only if some entry of T contains a node whose leader component lies in F_L. It remains to compute the table. Our goal is to employ a dynamic programming based on a recurrence relation over T. To formulate the relation, we need the notion of slices of G. Let W ⊆ Q_C be a subset, let p ∈ Q_C \ W be a state, and denote by S the union W ∪ {p}. The slice G_{W,S} consists of the two induced subgraphs G[Q_L × D × {W}] and G[Q_L × D × {S}] together with the edges leading from the former to the latter. We denote its set of edges by E_{W,S}.
The main idea of the recurrence relation is saturation. When traversing a path π in G, the set of contributor states gets saturated over time. Assume we cut π each time after a new state gets added. Then we obtain subpaths, each being a path in a slice: if p ∈ Q_C gets added to W ⊆ Q_C, the corresponding subpath is in G_{W,W∪{p}}. This means that for a set S ⊆ Q_C, the entry T[S] contains those nodes that are reachable from T[S \ {p}] in the slice G_{S\{p},S}, for some state p ∈ S.
Formally, we define the set R(W, S) for each W ⊆ Q_C, p ∈ Q_C \ W, and S = W ∪ {p}. These sets contain the nodes reachable from T[W] in G_{W,S}:

R(W, S) = {v ∈ Q_L × D × {S} | v is reachable in G_{W,S} from a node in T[W]}.

Lemma 14. Table T admits the recurrence relation T[S] = ⋃_{p∈S} R(S \ {p}, S).
We illustrate the lemma and the introduced notions on an example. Afterwards, we show how to compute the table T by exploiting the recurrence relation.

Example 15.
Reconsider the program given in Figure 2. The table T has eight entries, one for each subset of Q_C. The entries that are non-empty can be seen in the graph of Figure 3: each of the subgraphs contains exactly those nodes that are reachable from the initial node v_0. Let W = {p_0} and S = {p_0, p_1}. Then the slice G_{W,S} is shown in the figure as the blue highlighted area. Note that it also contains the edge from (q_1, a, W) to (q_1, a, S). The set R(W, S) contains those nodes in G[Q_L × D × {S}] that are reachable from T[W] in the slice G_{W,S}. According to the graph, these are (q_1, a, S) and (q_1, c, S), and hence we get T[S] = R(W, S).
In general, not all nodes in G[Q_L × D × {S}] will be contained in T[S]. We apply the recurrence relation in a bottom-up dynamic programming to fill the table T. Let S ⊆ Q_C be a subset and assume we already know T[S \ {p}] for each p ∈ S. Then, for a fixed p, we compute R(S \ {p}, S) by a fixed-point iteration on the slice G_{S\{p},S}. The number of nodes in the slice is O(L·D). Hence, the iteration takes time at most O(L² · D²). It is left to construct G_{S\{p},S}. We state the time needed in the following lemma. The proof is postponed so as to first finish the complexity estimation of Theorem 11.

Lemma 16. The slice G_{S\{p},S} can be constructed in time O(C³ · L² · D²).
Wrapping up, we need O(C³ · L² · D²) time for computing a set R(S \ {p}, S). Due to the recurrence relation of Lemma 14, we have to compute at most C sets R(S \ {p}, S) for a given S ⊆ Q_C. Hence, an entry T[S] can be computed in time O(C⁴ · L² · D²). The estimation also covers the base case T[{q^0_C}]. Since the table T has 2^C entries, the complexity estimation of Theorem 11 follows. It is left to prove Lemma 16.
Proof. The slice G_{S\{p},S} consists of the two subgraphs G_{S\{p}} = G[Q_L × D × {S \ {p}}] and G_S = G[Q_L × D × {S}], together with the edges leading from G_{S\{p}} to G_S. We elaborate on how to construct G_S; the construction of G_{S\{p}} is similar.

First, we write down the nodes of G_S. This can be done in time O(L · D). Edges in the graph are either induced by transitions of the leader or by the contributor. The former can be added in time O(|δ_L| · D) = O(L² · D²) since a single transition of P_L may lead to D edges. To add the latter edges, we browse the transitions of P_C between states in S; since each of them may lead to L · D edges, this takes time O(C² · L · D²). To complete the construction, we add the edges from G_{S\{p}} to G_S. These are induced by transitions r −?a/!a→ p ∈ δ_C with r ∈ S \ {p}. Since each of these may again lead to L · D different edges, adding all of them takes time O(C³ · L · D²). In total, we estimate the time for the construction by O(C³ · L² · D²).
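Putting the pieces together, here is a sketch of the bottom-up computation of T, reusing the successors function from the earlier sketch; the instance encoding is ours and the sets S are frozensets.

    from itertools import combinations

    def reach(frontier, W, S, lt, ct):
        """Fixed point inside the slice G_{W,S}: follow successors without
        leaving nodes whose third component is W or S; return those with S."""
        seen, stack = set(frontier), list(frontier)
        while stack:
            for u in successors(stack.pop(), lt, ct):
                if u[2] in (W, S) and u not in seen:
                    seen.add(u)
                    stack.append(u)
        return {u for u in seen if u[2] == S}

    def solve_lcr(lt, ct, q0L, a0, q0C, F_L, Q_C):
        base = frozenset({q0C})
        T = {base: reach({(q0L, a0, base)}, base, base, lt, ct)}
        for size in range(2, len(Q_C) + 1):              # bottom-up over |S|
            for S in map(frozenset, combinations(Q_C, size)):
                if q0C not in S:
                    continue                             # only such sets arise
                entry = set()
                for p in S - {q0C}:                      # Lemma 14
                    entry |= reach(T.get(S - {p}, set()), S - {p}, S, lt, ct)
                if entry:
                    T[S] = entry
        return any(q in F_L for nodes in T.values() for (q, _, _) in nodes)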

Lower bound
We prove it unlikely that LCR can be solved in O*((2 − δ)^C) time for any δ > 0. This shows that the algorithm from Section 3.2 has an optimal runtime. The lower bound is achieved by a reduction from Set Cover, one of the 21 original NP-complete problems by Karp [36]. We state its definition.

Set Cover
Input: A family of sets F ⊆ P(U) over a universe U, and r ∈ N.
Question: Are there sets S_1, ..., S_r ∈ F such that S_1 ∪ ··· ∪ S_r = U?
Besides its NP-completeness, it is known that Set Cover admits an O*(2^n)-time algorithm [24], where n is the size of the universe U. However, no algorithm solving Set Cover in time O*((2 − δ)^n) for a δ > 0 is known so far. Actually, it is conjectured in [11] that such an algorithm cannot exist unless SETH breaks.
While a proof for the conjecture in [11] is still missing, the authors provide evidence in the form of relative hardness. They obtain lower bounds for prominent problems by tracing back to the assumed lower bound of Set Cover. These bounds were not known before since SETH is hard to apply: No suitable reductions from SAT to these problems are known so far. Hence, Set Cover can be seen as an alternative source for lower bounds whenever SETH seems out of reach. This made the problem a standard assumption for hardness [11,4,9].
To obtain the desired lower bound for LCR, we establish a polynomial time reduction from Set Cover that strictly preserves the parameter n. Formally, if (F, U, r) is an instance of Set Cover, we construct (A = (D, a_0, (P_L, P_C)), F_L), an instance of LCR where C = n + c for a constant c. Note that even a linear dependence on n is not allowed. Moreover, the instance satisfies the equivalence: there is a set cover if and only if there is a t ∈ N such that c_0 →*_{A^t} c with c ∈ C_f. Consequently, an O*((2 − δ)^C)-time algorithm for LCR would yield an O*((2 − δ)^n)-time algorithm for Set Cover, breaking its hardness.
For the proof of the proposition, we elaborate on the aforementioned reduction. The main idea is the following: We let the leader guess r sets from F . The contributors store the elements that got covered by the chosen sets. In a final communication phase, the leader verifies that it has chosen a valid cover by querying whether all elements of U have been stored by the contributors.
Leader and contributors essentially communicate over the elements of U. For guessing r sets from F, the automaton P_L consists of r similar phases. Each phase starts with P_L choosing an internal transition to a set S ∈ F. Once S is chosen, the leader writes a sequence of all u ∈ S to the memory.
A contributor in the program consists of C = n + 1 states: an initial state and a state for each u ∈ U. When P_L writes an element u ∈ S to the memory, there is a contributor storing this element in its states by reading u. Hence, each element that got covered by S is recorded in one of the contributors.
After r rounds of guessing, the contributors hold those elements of U that are covered by the chosen sets. Now the leader verifies that it has really picked a cover of U. To this end, it needs to check whether all elements of U have been stored by the contributors. Formally, the leader can only proceed to its final state if it can read the symbols u^#, for each u ∈ U. A contributor can only write u^# to the memory if it stored the element u before. Hence, P_L reaches its final state if and only if a valid cover of U was chosen.
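A sketch of this reduction with our own state naming: the leader runs r phases, branching over the sets in F and writing each chosen set's elements in sequence, and a contributor stores one element u before emitting u# forever.

    def set_cover_to_lcr(F, U, r):
        """Return (leader transitions, contributor transitions, final state)."""
        leader, us = [], sorted(U)
        for ph in range(r):                      # phase ph: choose one set
            src, dst = f"L{ph}", f"L{ph + 1}"
            for si, S in enumerate(F):
                prev, elems = src, sorted(S)
                for k, u in enumerate(elems):    # write the elements of S
                    nxt = dst if k == len(elems) - 1 else f"L{ph}s{si}e{k}"
                    leader.append((prev, f"!{u}", nxt))
                    prev = nxt
        prev = f"L{r}"                           # verification: read every u#
        for k, u in enumerate(us):
            nxt = "Lf" if k == len(us) - 1 else f"V{k}"
            leader.append((prev, f"?{u}#", nxt))
            prev = nxt
        contrib = [("c0", f"?{u}", f"c{u}") for u in U]       # store u ...
        contrib += [(f"c{u}", f"!{u}#", f"c{u}") for u in U]  # ... then emit u#
        return leader, contrib, "Lf"

    # Toy instance: U = {1, 2, 3}, F = {{1, 2}, {2, 3}}, r = 2 has a cover.
    print(set_cover_to_lcr([{1, 2}, {2, 3}], {1, 2, 3}, 2)[0][:4])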
Absence of a Polynomial Kernel We prove that 3-SAT can be cross-composed into LCR(C). This shows the following: unless NP ⊆ coNP/poly, LCR(C) does not admit a polynomial kernel. For the cross-composition, let ϕ_1, ..., ϕ_I be the given 3-SAT-instances, each two equivalent under R, where R is the polynomial equivalence relation from Theorem 10. Then each formula has the same number of clauses m and variables x_1, ..., x_n. Let us fix the notation to be ϕ_ℓ = C^ℓ_1 ∧ ··· ∧ C^ℓ_m. The basic idea is the following. Leader P_L guesses the formula ϕ_ℓ and an evaluation of the variables. The contributors store the latter. At the end, leader and contributors verify that the chosen evaluation indeed satisfies formula ϕ_ℓ.
For guessing ϕ_ℓ, the leader has a branch for each instance. Note that we can afford the size of the leader to depend on I since the cross-composition only restricts parameter C. Hence, we do not face the problem we had in Theorem 10.
Guessing the evaluation of the variables is similar to Theorem 10: the leader writes tuples (x_i, v) with v ∈ {0, 1} to the memory. The contributors store the evaluation in their states. After this guessing phase, the contributors can write the symbols #^ℓ_j, depending on whether the currently stored variable with its evaluation satisfies clause C^ℓ_j. As soon as the leader has read the complete string #^ℓ_1 ... #^ℓ_m, it moves to its final state, showing that the evaluation satisfied all clauses of ϕ_ℓ. Note that parameter C is of size O(n) and does not depend on I at all. Hence, the size restrictions of a cross-composition are met.

Intractability
We show the W[1]-hardness of LCR(D) and LCR(L). Both proofs rely on a parameterized reduction from k-Clique, the problem of finding a clique of size k in a given graph. This problem is known to be W[1]-complete [14]. We state our result: LCR(D) and LCR(L) are W[1]-hard. We first reduce k-Clique to LCR(L). To this end, we construct from an instance (G, k) of k-Clique in polynomial time an instance (A = (D, a_0, (P_L, P_C)), F_L) of LCR with L = O(k). This meets the requirements of a parameterized reduction.
Program A operates in three phases. In the first phase, the leader chooses k vertices of the graph and writes them to the memory. Formally, it writes a sequence (v_1, 1).(v_2, 2) ... (v_k, k) where the v_i are vertices of G. During this selection, the contributors non-deterministically choose to store a suggested vertex (v_i, i) in their state space.
In the second phase, the leader again writes a sequence of vertices, this time using different symbols: (w_1^#, 1).(w_2^#, 2) ... (w_k^#, k). Note that the vertices w_i do not have to coincide with the vertices from the first phase. It is then the contributors' task to verify that the new sequence constitutes a clique. To this end, for each i, the program does the following: if a contributor storing (v_i, i) reads the value (w_i^#, i), the computation on the contributor can only continue if v_i = w_i. If a contributor storing (v_j, j) with j ≠ i reads the value, the computation can only continue if there is an edge between v_j and w_i.
Finally, in the third phase, we need to ensure that there was at least one contributor storing (v_i, i) and that the above checks were all positive. To this end, a contributor that has successfully gone through the second phase and stores (v_i, i) writes the symbol #_i to the memory. The leader waits to read the sequence of symbols #_1 ... #_k. This ensures the selection of k different vertices, each two of which are adjacent.
For proving W[1]-hardness of LCR(D), we reuse the above construction. However, the size of the data domain is |V| · k, where V is the set of vertices of G. Hence, it is not a parameterized reduction for the parameter D. The factor |V| appears since leader and contributors communicate on the vertices themselves. The main idea of the new reduction is to decrease the size of D by transmitting the vertices in binary. To this end, we add binary branching trees to the contributors that decode the binary encoding. We omit the details and refer to Appendix C.

Bounded-Stage Reachability
The bounded-stage reachability problem is a simultaneous reachability problem. It asks whether all threads of a program can reach an unsafe state when restricted to s-stage computations. These are computations where the write permission changes s times. The problem was first analyzed in [1] and shown to be NP-complete for finite-state programs. We give matching upper and lower bounds in terms of fine-grained complexity and prove the absence of a polynomial kernel. Let A be a program. A stage is a computation of A in which only one of the threads writes; the remaining threads are restricted to reading the memory. An s-stage computation is a computation that can be split into s parts, each of which forms a stage. We state the decision problem.

Bounded-Stage Reachability (BSR)
Input: A program A, a set of configurations C_f ⊆ C, and s ∈ N.
Question: Is there an s-stage computation c_0 →*_A c for some c ∈ C_f?
We focus on a parameterization of BSR by P, the maximum number of states of a thread, and t, the number of threads. Let it be denoted by BSR(P, t). We prove that the parameterization is FPT and present a matching lower bound. The main result in this section is the absence of a polynomial kernel for BSR(P, t). The result is technically involved and shows what makes the problem hard.
Parameterizations of BSR involving only D and s are intractable. We show that BSR remains NP-hard even if both D and s are constant. This renders the existence of an FPT-algorithm for these cases unlikely.

Parameterization by Number of States and Threads
We first give an algorithm for BSR, based on a product construction of automata. Then, we present a lower bound under ETH. Interestingly, the lower bound shows that we cannot avoid building the product. We conclude with proving the absence of a polynomial kernel. As before, we cross-compose from 3-SAT but now face the problem that two important parameters in the construction, P and t, are not allowed to depend polynomially on the number of 3-SAT-instances.
Upper Bound We show that BSR(P, t) is fixed-parameter tractable. The idea is to reduce to reachability on a product automaton. The automaton stores the configurations and the current writer, and it counts up to the number of stages s. Altogether, it has O*(P^t) many states. Details can be found in Appendix D.
Proposition 20. BSR can be solved in time O*(P^{2t}).
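To make the product construction concrete, the following Python sketch performs the reachability check directly on the product. The thread encoding (dicts with an 'init' state and a 'delta' map from state to (op, value, successor) triples) and the `final` predicate on thread-state tuples are our own assumptions, not the paper's formal model, and ε-moves are omitted.

from collections import deque

def bsr_reachable(threads, a0, final, s):
    """BFS over the product automaton behind Proposition 20 (sketch).
    A product state is (thread states, memory value, current writer, stage)."""
    start = (tuple(th["init"] for th in threads), a0, None, 1)
    seen, queue = {start}, deque([start])
    while queue:
        states, mem, writer, stage = queue.popleft()
        if final(states):
            return True
        for i, th in enumerate(threads):
            for op, val, succ in th["delta"].get(states[i], []):
                if op == "r" and val != mem:
                    continue                      # a read must match the memory
                new_mem, new_writer, new_stage = mem, writer, stage
                if op == "w":
                    new_mem = val
                    if writer is not None and writer != i:
                        new_stage += 1            # write permission changes
                        if new_stage > s:
                            continue              # would exceed s stages
                    new_writer = i
                nxt = (states[:i] + (succ,) + states[i + 1:],
                       new_mem, new_writer, new_stage)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False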
Lower Bound By a reduction from k × k Clique, we show that a 2^{o(t·log(P))}-time algorithm for BSR would contradict ETH. Hence, the above algorithm is optimal.
Proposition 21. BSR cannot be solved in time 2^{o(t·log(P))} unless ETH fails.
The reduction constructs from an instance of k × k Clique an equivalent instance of BSR. Moreover, it keeps the parameters small: we have P = O(k^2) and t = O(k). As a consequence, a 2^{o(t·log(P))}-time algorithm for BSR would yield an algorithm for k × k Clique running in time 2^{o(k·log(k^2))} = 2^{o(k·log(k))}. But this contradicts ETH.

Proof (Idea). For the reduction, let V = [1..k] × [1..k] be the vertices of G. We define D = V ∪ {a_0} to be the domain of the memory. We want the threads to communicate on the vertices of G. For each row, we introduce a reader thread P_i that is responsible for storing a particular vertex of the row. We also add one writer P_ch that is used to steer the communication between the P_i. Our program A is given by the tuple (D, a_0, ((P_i)_{i∈[1..k]}, P_ch)).
Intuitively, the program proceeds in two phases. In the first phase, each P_i non-deterministically chooses a vertex from the i-th row and stores it in its state space. This constitutes a clique candidate (1, j_1), ..., (k, j_k) ∈ V. In the second phase, thread P_ch starts to write an arbitrary vertex (1, j′_1) of the first row to the memory. The first thread P_1 reads (1, j′_1) from the memory and verifies that the read vertex is actually the one from the clique candidate. The computation in P_1 will deadlock if j′_1 ≠ j_1. The threads P_i with i ≠ 1 also read (1, j′_1) from the memory. They have to check whether there is an edge between the stored vertex (i, j_i) and (1, j′_1). If this fails in some P_i, the computation in that thread will also deadlock. After this procedure, the writer P_ch guesses a vertex (2, j′_2), writes it to the memory, and the verification steps repeat. In the end, after k repetitions of the procedure, we can ensure that the guessed clique candidate is indeed a clique. Formal construction and proof are given in Appendix D.
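The verification phase can be summarized by the following predicate, a sketch under our own encoding: `candidate` maps each row to the stored column, `edges` is the edge set of G, and the computation survives exactly if P_ch can replay the candidate and no reader deadlocks.

def candidate_survives(candidate, edges, k):
    """Returns True iff the clique candidate passes all checks of phase two."""
    for row in range(1, k + 1):
        written = (row, candidate[row])          # P_ch writes (row, j'_row)
        for i in range(1, k + 1):
            if i == row:
                continue                          # P_row checks equality only
            stored = (i, candidate[i])
            if (stored, written) not in edges and (written, stored) not in edges:
                return False                      # P_i deadlocks: missing edge
    return True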

Absence of a Polynomial Kernel
We show that BSR(P, t) does not admit a polynomial kernel. To this end, we cross-compose 3-SAT into BSR(P, t).
In the present setting, coming up with a cross-composition is non-trivial. Both parameters, P and t, are not allowed to depend polynomially on the number I of given 3-SAT-instances. Hence, we cannot construct an NFA that distinguishes the I instances by branching into I different directions. This would cause a polynomial dependence of P on I. Furthermore, it is not possible to construct an NFA for each instance as this would cause such a dependence of t on I. To circumvent the problems, some deeper understanding of the model is needed.
Proof (Idea). Let ϕ_1, ..., ϕ_I be the given 3-SAT-instances, each two of which are equivalent under R, the polynomial equivalence relation of Theorem 10. Then each ϕ_ℓ has m clauses and n variables {x_1, ..., x_n}. We assume ϕ_ℓ = C^ℓ_1 ∧ ··· ∧ C^ℓ_m. In the program that we construct, the communication is based on 4-tuples of the form (ℓ, j, i, v). Intuitively, such a tuple transports the following information: the j-th clause of instance ϕ_ℓ, namely C^ℓ_j, can be satisfied by variable x_i evaluated to v. For choosing and storing an evaluation of the x_i, we introduce so-called variable threads P_{x_1}, ..., P_{x_n}. In the beginning, each P_{x_i} non-deterministically chooses an evaluation for x_i and stores it in its state space.
We further introduce a writer P_w. During a computation, this thread guesses exactly m tuples (ℓ_1, 1, i_1, v_1), ..., (ℓ_m, m, i_m, v_m) in order to satisfy m clauses of potentially different instances. Each (ℓ_j, j, i_j, v_j) is written to the memory by P_w. All variable threads then start to read the tuple. If P_{x_i} with i ≠ i_j reads it, the thread just moves one state further since the suggested tuple does not affect the variable x_i. If P_{x_i} with i = i_j reads the tuple, the thread will only continue its computation if v_j coincides with the value that P_{x_i} guessed for x_i and, moreover, x_i with evaluation v_j satisfies clause C^{ℓ_j}_j. Now suppose the writer did exactly m steps while each variable thread did exactly m + 1 steps. This shows that the chosen evaluation satisfies m clauses. But these clauses can be part of different instances: it is not ensured that the clauses were chosen from one formula ϕ_ℓ. The major difficulty of the cross-composition lies in ensuring exactly this.
We overcome the difficulty by introducing so-called bit checkers P_b, where b ∈ [1..log(I)]. Each P_b is responsible for the b-th bit of bin(ℓ), the binary representation of ℓ, where ϕ_ℓ is the instance we want to satisfy. When P_w writes a tuple (ℓ_1, 1, i_1, v_1) for the first time, each P_b reads it and stores either 0 or 1, according to the b-th bit of bin(ℓ_1). After P_w has written a second tuple (ℓ_2, 2, i_2, v_2), the bit checker P_b tests whether the b-th bits of bin(ℓ_1) and bin(ℓ_2) coincide; otherwise, it will deadlock. This is repeated every time P_w writes a new tuple to the memory.
Assume the computation does not deadlock in any of the P_b. Then we can be sure that the b-th bit of bin(ℓ_j), j ∈ [1..m], never changed during the computation. This means that bin(ℓ_1) = ··· = bin(ℓ_m). Hence, the writer P_w has chosen clauses of just one instance ϕ_ℓ. Moreover, the current evaluation satisfies the formula. Since the parameters are bounded, P ∈ O(m) and t ∈ O(n + log(I)), the construction constitutes a proper cross-composition. For a formal construction and proof, we refer to Appendix D. ⊓⊔

Variable threads and the writer thread are needed for testing the satisfiability of clauses. The need for bit checkers comes from ensuring that all clauses stem from the same formula. We illustrate the notion with an example.
Example 23. Let four formulas ϕ 1 , ϕ 2 , ϕ 3 , ϕ 4 with two clauses each be given. We show how the bit checkers are constructed. To this end, we first encode the index of the instances as binary numbers using two bits. The encoding is shown in Figure 4 on the right hand side. Note the offset by one in the encoding.
We focus on the bit checker P b1 responsible for the first bit. It is illustrated in Figure 4 on the left hand side. Note that the label ℓ = 1, ℓ = 3 refers to transitions of the form ?(ℓ, j, i, v) with ℓ either 1 or 3 and arbitrary values for i,j, and v. On reading the first of these tuples, P b1 stores the first bit of ℓ in its state space. The blue marked states store that b 1 = 0, the red states store b 1 = 1. Then, the bit checker can only continue on reading tuples (ℓ, j, i, v) where the first bit of ℓ matches the stored bit. In the case of b 1 = 0, this means that P b1 can only read tuples (ℓ, j, i, v) with ℓ either 1 or 3.
Assume the writer thread has output two tuples (ℓ_1, 1, i_1, v_1) and (ℓ_2, 2, i_2, v_2), and the bit checker P_{b_1} has reached a final state. Since the computation did not deadlock on P_{b_1}, we know that the first bits of ℓ_1 and ℓ_2 coincide. If the bit checker for the second bit does not deadlock either, we get that ℓ_1 = ℓ_2. Hence, the writer has chosen two clauses from one instance ϕ_{ℓ_1}.
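The behavior of a bit checker can be replayed in a few lines of Python (a sketch; a deadlock is modeled by returning False, and, as in the example, bin(ℓ) encodes ℓ − 1, i.e., the encoding is offset by one).

def bit_checker(b, tuples):
    """Simulates P_b on the tuples (l, j, i, v) written by P_w."""
    stored = None
    for (l, j, i, v) in tuples:
        bit = ((l - 1) >> (b - 1)) & 1   # b-th bit of bin(l)
        if stored is None:
            stored = bit                  # first tuple: store the bit
        elif bit != stored:
            return False                  # bits disagree: P_b deadlocks
    return True

# All checkers P_1, ..., P_{log I} accept iff l_1 = ... = l_m.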

Intractability
We show that parameterizations of BSR involving only s and D are intractable.
To this end, we prove that BSR remains NP-hard even if both parameters are constant. This is surprising, as the number of stages s seems to be a powerful parameter: introducing such a bound in simultaneous reachability lets the complexity drop from PSPACE to NP. But it is not enough to guarantee an FPT-algorithm.

Proof (Idea). We give a reduction from 3-SAT to BSR that keeps both parameters constant. Let ϕ be a 3-SAT-instance with m clauses and variables x_1, ..., x_n. We construct a program A = (D, a_0, P_1, ..., P_n, P_v) with D = 4 different memory symbols that can only run 1-stage computations.
The program cannot communicate on literals directly, as this would cause a blow-up in parameter D. Instead, variables and evaluations are encoded in binary in the following way. Let ℓ be a literal in ϕ. It consists of a variable x_i and an evaluation v ∈ {0, 1}. The padded binary encoding bin_#(i) ∈ ({0, 1}.#)^{log(n)+1} of i is the usual binary encoding where each bit is followed by a #. The string Enc(ℓ) = v.#.bin_#(i) encodes that variable x_i has evaluation v. We need the padding symbol # to prevent the threads in A from reading the same symbol more than once. Program A communicates by passing messages of the form Enc(ℓ). To this end, we need the data domain D = {a_0, #, 0, 1}.
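For illustration, here is a small Python helper producing Enc(ℓ) under the definition above; the function name and input format are our own.

from math import ceil, log2

def enc(literal, n):
    """Enc(l) = v # bin_#(i) for a literal (i, v) over n variables."""
    i, v = literal
    width = ceil(log2(n)) + 1                   # log(n)+1 bits for the index
    bits = format(i, "0{}b".format(width))      # binary encoding of i
    return str(v) + "#" + "#".join(bits) + "#"  # every bit followed by #

# enc((5, 1), n=8) == "1#0#1#0#1#": the evaluation bit, then bin(5) padded.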
The program contains threads P_i, i ∈ [1..n], called variable threads. Initially, these threads choose an evaluation for the variables and store it: each P_i can branch on reading a_0 and choose whether it assigns 0 or 1 to x_i. Then, a verifier thread P_v starts to iterate over the clauses. For each clause C, it picks a literal ℓ ∈ C that should evaluate to true and writes its encoding Enc(ℓ) to the memory. Each of the P_i reads Enc(ℓ). Note that reading and writing Enc(ℓ) requires a sequence of transitions; in the construction, we ensure that all the needed states and transitions are provided. It is the task of each P_i to check whether the chosen literal ℓ conforms to the chosen evaluation of x_i. We distinguish two cases.
(1) If ℓ involves a variable x_j with j ≠ i, variable thread P_i just continues its computation by reading the whole string Enc(ℓ).
(2) If ℓ involves x i , P i has to ensure that the stored evaluation coincides with the one sent by the verifier. To this end, P i can only continue its computation if the first bit in Enc(ℓ) shows the correct evaluation. Formally, there is only an outgoing path of transitions on Enc(x i ) if P i stored 1 as evaluation and on Enc(¬x i ) if it stored 0.
Note that each time P_v picks a literal ℓ, all P_i read Enc(ℓ), even if the literal involves a different variable. This means that the P_i count how many literals have been seen already. This is important for correctness: the threads will only terminate if they have read a word of fixed length and did not miss a single symbol. There is no loss in the communication between P_v and the P_i. Now assume P_v iterated through all m clauses and none of the variable threads got stuck. Then each of them read exactly m encodings without running into a deadlock. Hence, the picked literals all conformed to the evaluation chosen by the P_i. This means that a satisfying assignment for ϕ has been found.
During a computation of A, the verifier P v is the only thread that has write permission. Hence, each computation of A consists of a single stage. For a formal construction, we refer to Appendix E.

Conclusion
We have studied several parameterizations of LCR and BSR, two safety verification problems for shared memory concurrent programs. In LCR, a designated leader thread interacts with a number of equal contributor threads. The task is to decide whether the leader can reach an unsafe state. The problem BSR is a generalization of bounded context switching. A computation gets split into stages, periods where writing is restricted to one thread. Then, BSR asks whether all threads can reach a final state simultaneously during an s-stage computation.
For LCR, we identified the size of the data domain D, the size of the leader L and the size of the contributors C as parameters. Our first algorithm showed that LCR(D, L) is FPT. Then we modified the algorithm to obtain a verification procedure valuable for practical instances. The main insight was that due to a factorization along strongly connected components, the impact of L can be reduced to a polynomial factor in the time complexity. We also proved the absence of a polynomial kernel for LCR(D, L) and presented an ETH-based lower bound which shows that the upper bound is a root-factor away from being optimal.
For LCR(C), we presented a dynamic programming algorithm running in O*(2^C) time. The algorithm is based on slice-wise reachability. This reduces a reachability problem on a large graph to reachability problems on subgraphs (slices) that are solvable in polynomial time. Moreover, we gave a tight lower bound based on Set Cover and proved the absence of a polynomial kernel.
Parameterizations different from LCR(D, L) and LCR(C) were shown to be intractable. We gave reductions from k-Clique and proved W[1]-hardness.
The parameters of interest for BSR are the maximum size of a thread P and the number of threads t. We have shown that a parameterization by both parameters is FPT and gave a matching lower bound. The main contribution was to prove it unlikely that a polynomial kernel exists for BSR(P, t). The proof relies on a technically involved cross-composition that avoids a polynomial dependence of the parameters on the number of given 3-SAT-instances.
Parameterizations involving other parameters like s or D were proven to be intractable for BSR. We gave an NP-hardness proof where s and D are constant.
Extension of the Model In this work, the considered model for programs allows the memory to consist of a single cell. We discuss whether the presented results carry over when the number of memory cells increases. Having multiple memory cells is referred to as supporting global variables. Extending the definition of programs in Section 2 to global variables is straightforward.
For the problem LCR, allowing global variables is a rather powerful mechanism. Let LCR_Var denote the problem LCR where the input is a program featuring global variables. The interesting parameters for the problem are D, L, C, and v, the number of global variables. It turns out that LCR_Var is PSPACE-hard, even when C is constant. One can reduce the intersection emptiness problem for finite automata to LCR_Var. The reduction makes use only of the leader; contributors are not needed.
A program A with global variables can always be reduced to a program A′ with a single memory cell [25]. Roughly, the reduction constructs the leader of A′ in such a way that it can store the memory contents of A and manage contributor accesses to the memory. This means the new leader needs exponentially many states, since there are D^v many possible memory valuations. The domain and the contributor of A′ are of polynomial size. In fact, we can then apply the algorithm from Section 3.1 to the program A′. The runtime depends exponentially only on the parameters D, L, and v. This shows that LCR_Var(D, L, v) is fixed-parameter tractable. It is an interesting question whether this algorithm can be improved. Moreover, it is open whether there are other parameterizations of LCR_Var that admit an FPT-algorithm. A closer investigation is left to future work.
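The exponential blow-up can be seen directly in the state space of the folded leader; a minimal sketch, with our own encoding of states as pairs:

from itertools import product

def folded_leader_states(leader_states, domain, v):
    """States of the single-cell leader simulating v global variables:
    it carries a full valuation, giving |Q_L| * D**v states."""
    return [(q, val) for q in leader_states
                     for val in product(domain, repeat=v)]

# len(folded_leader_states(range(3), "ab", 2)) == 3 * 2**2 == 12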
For BSR, allowing global variables also leads to PSPACE-hardness. The problem BSR Var , defined similarly to LCR Var , is PSPACE-hard already for a constant number of threads. In fact, the proof is similar to the hardness of LCR Var where only one thread is needed. To obtain an algorithm for the problem, we modify the construction from Proposition 20. The resulting product automaton then also maintains the values of the global variables. This shows membership in PSPACE. But the size of the product now also depends exponentially on D and v. The interesting question is whether we can find an algorithm that avoids an exponential dependence on one of the parameters P, t, D or v. It is a matter of future work to examine the precise complexity of the different parameterizations.

A Proofs for Section 3.1
We give the missing constructions and proofs for Section 3.1.

Proof of Lemma 6
Proof. We will first show that for each computation leading to an unsafe state, there is a corresponding valid witness candidate. To this end, assume there is a t ∈ N and a computation π = c 0 → * A t c with c ∈ C f . The computation π acts on configurations but we want to work with transitions of the leader and contributor instead. To this end, let σ be the sequence of transitions appearing in π. Without loss of generality, we assume that the last transition in σ is due to the leader. In the following we show how to construct a valid witness candidate out of the sequence σ. It is useful to assume that each transition in σ is uniquely identifiable. We use Pos(τ ) to access the position of a certain transition τ in σ. Hence, we have σ[Pos(τ )] = τ .
The first step in constructing the witness candidate is to collect the first writes from σ. Identifying these is simple: one only needs to iterate over σ and mark those write transitions of the contributors that write a symbol for the first time. Then, the transitions of the contributors that are not marked are removed from σ. Moreover, each marked transition is replaced by the symbol that it writes. Formally, if a marked transition is of the form (q, !a, q′), it is replaced by ā ∈ D̄. The resulting sequence is of the form σ_1.c̄_1.σ_2.c̄_2 ... σ_n.c̄_n.σ_{n+1}, where the c̄_i are the symbols written by the first writes and the σ_i are the sequences of transitions performed by the leader in between. Note that we have c̄_i ≠ c̄_j for i ≠ j and n ≤ |D| since first writes can only occur once and there are at most |D| many of them.
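A minimal sketch of this marking step, assuming transitions are triples (who, op, value) with who ∈ {'L', 'C'}:

def split_at_first_writes(sigma):
    """Turns sigma into the shape sigma_1 c1 sigma_2 c2 ... : leader
    transitions are kept, each first write of a contributor becomes its
    barred symbol, and all other contributor transitions are dropped."""
    seen, out = set(), []
    for tr in sigma:
        who, op, value = tr
        if who == "L":
            out.append(tr)                 # leader transitions survive
        elif op == "w" and value not in seen:
            seen.add(value)
            out.append(("bar", value))     # marked first write
    return out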
In order to define a witness candidate in E = ((Q_L.D_⊥)^{≤L}.D̄)^{≤D}.Q_L, we need to cut out the loops in the σ_i and map the resulting sequences to a word. We define a procedure Shrinkmap that performs these two operations. As input, it takes a tuple (α, c), where α is a sequence of transitions of the leader and c is a natural number. The procedure computes a tuple (v, ϕ), where the word v ∈ (Q_L.D_⊥)^{≤L} is obtained by cutting out the loops in α and mapping writes of a symbol a to a and reads of any symbol to ⊥. The function ϕ maps the transitions of the given sequence α into the word v; it is used to recover the sequence α from v. Moreover, the constant c is needed to right-shift the map ϕ. This becomes important when we concatenate different words obtained from applying Shrinkmap. The procedure is given in Algorithm 1.
We consecutively apply Shrinkmap. We begin with the input (σ_1, 0) and obtain the tuple (w_1, ϕ_1). In the i-th step, we run the procedure on the input (σ_i, Σ_{j∈[1..i−1]} |w_j|) and get the output (w_i, ϕ_i). We do not apply Shrinkmap to the last segment σ_{n+1}.

Algorithm 1 Shrinkmap
Input: (α = τ1 . . . τ k , c) where α is a sequence of leader transitions and c is a constant.
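Since only the interface of the procedure is stated here, the following Python sketch shows one way to realize it, following the description above. The data layout (transitions as tuples (src, op, val, dst), the placeholder '_' for ⊥) is our own, and we assume the transitions of α to be pairwise distinct, as in the proof.

def shrinkmap(alpha, c):
    """Cuts loops out of the leader sequence alpha, builds the word v
    (states alternating with symbols; writes keep their symbol, reads and
    eps-moves become '_'), and records in phi where each transition lands
    in v, right-shifted by c."""
    v, phi, first_pos = [], {}, {}
    for tr in alpha:
        src, op, val, dst = tr
        if src in first_pos:                    # src closes a loop: cut it out
            start = first_pos[src]
            v = v[:start]
            first_pos = {q: p for q, p in first_pos.items() if p < start}
            for t, p in phi.items():
                if p >= c + start:
                    phi[t] = c + start          # redirect to the loop's start
        first_pos[src] = len(v)
        v.append(src)                           # the state the transition leaves
        v.append(val if op == "w" else "_")     # its symbol, or bottom
        phi[tr] = c + len(v) - 1                # position of the added symbol
    return v, phi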
Moreover, we define the map ϕ to be the concatenation of the ϕ_i. Formally, ϕ(τ) = ϕ_i(τ) for each transition τ in σ_i. The witness candidate is w = w_1.c̄_1 ... w_n.c̄_n.q, where q is the state reached after σ_n.

We show that the witness w is valid. Requirement (1) is clearly satisfied since the symbols c̄_i written by the first writes are pairwise different. The second requirement is also fulfilled since we started with a proper run of the leader leading to an unsafe state q_f ∈ F_L. Formally, let w↓_{Q_L ∪ D_⊥} = q_0 a_0 q_1 a_1 ... q_m a_m q. Since σ_1 ... σ_n is a run of the leader starting in q_0 and ending in q, we get that q_0 is indeed the initial state of P_L. Moreover, the transition sequence σ_{n+1} leads from q to the state q_f, and reading in this sequence is restricted to the symbols c̄_i that were provided by the first writes. This means there is a run from q to q_f whose reads are contained in D̄(w, |w|). Now let i ∈ [0..m] and consider a_i. If a_i ∈ D, we know that there is a transition (q_i, !a_i, q_{i+1}). This follows from the application of Shrinkmap. Similarly, if a_i = ⊥, we get a transition of the form (q_i, ε, q_{i+1}) or (q_i, ?a, q_{i+1}). In the latter case, the read symbol a is provided by an earlier first write. This is due to the fact that the read transition appears in the computation σ_1 ... σ_n of the leader. Formally, a ∈ D̄(w, pos(a_i)).

It is left to show that Requirement (3) is satisfied. We show that the reads of contributors that are responsible for first writes can be embedded into the witness candidate w. To this end, consider the i-th first write c̄_i and the corresponding prefix v.c̄_i of w. Since π is a computation of the system, we know there is a contributor providing c̄_i. Formally, there is a computation ρ on this contributor of the form ρ = q_0^C --u.!c_i-->_C p. Let u′ = u↓_{R(D)} be the reads of u and let τ^C_1 ... τ^C_z be the read transitions along ρ. Note that |u′| = z. Our goal is to define a monotonic function µ : [1..z] → [1..|v|] that maps the reads of ρ into v.
We first identify those transitions among the τ^C_i that read a value written by the leader. Let these be τ^C_{i_1}, ..., τ^C_{i_s}. Then there are writes of the leader in π that serve these reads. Let τ^L_{i_j} denote the transition of the leader that writes the symbol read in τ^C_{i_j}; this is the leader transition (writing the correct symbol) that immediately precedes τ^C_{i_j}. We set µ(i_j) = ϕ(τ^L_{i_j}). Note that this already covers two cases of Requirement (3): if the read is served by a write of the leader that appears in w, the map µ directly points to that write; if the corresponding write stems from a loop, the map µ points to the state in w where the loop starts. This is due to the application of Shrinkmap: when loops are cut out, the procedure changes ϕ accordingly.
Let τ^C_{j_1}, ..., τ^C_{j_r} be the read transitions among the τ^C_i that read symbols not provided by the leader. Consider τ^C_{j_i} and let a be the symbol it reads. Then we need to ensure that µ maps j_i to a position in w such that a ∈ D̄(w, µ(j_i)). Moreover, we need to keep µ monotonic. The idea is to map j_i either to the position of the write transition of the leader preceding τ^C_{j_i} or to the position of the last first write before τ^C_{j_i}, depending on which of the two positions is larger. Let τ^L be the write transition of the leader that precedes τ^C_{j_i} in π. Moreover, let c̄_h be the last first write before τ^C_{j_i}. If Pos(τ^L) > |w_1.c̄_1 ... w_h.c̄_h|, we set µ(j_i) = ϕ(τ^L). Otherwise, we set µ(j_i) = |w_1.c̄_1 ... w_h.c̄_h|. The resulting map µ is indeed monotonic and satisfies Requirement (3).
For the other direction, we assume the existence of a valid witness w = w_1.c̄_1.w_2.c̄_2 ... w_n.c̄_n.q ∈ E.
Our goal is to show that there is a t ∈ N and a computation c_0 →*_{A^t} c leading to a configuration c ∈ C_f. Since w is valid according to Definition 4, we get by the first requirement that the c̄_i are pairwise different. This shows that the c̄_i are unique and are thus candidates for a sequence of first writes. By Requirement (2), we obtain a computation of the leader from the w_i. Formally, there are transition sequences γ_1, ..., γ_n and γ_{n+1} such that q_0 --γ_1...γ_n--> q is a run of the leader along w and γ_{n+1} leads from q to an unsafe state, reading only from earlier first writes. Before we construct the computation c_0 →*_{A^t} c, we need to determine the number t of contributors that are involved. Consider a first write c̄_i. Each time c_i is read, we need a contributor to provide it. Hence, we first give a bound t(i) on how often c_i needs to be provided; summing up all the t(i) then bounds the number of involved contributors. Let t(i) = |w| + |γ_{n+1}| + L · Σ_{j∈[1..n]} |α_j|, where α_j.!c_j is the transition sequence of a contributor leading to the first write of c_j. Intuitively, the leader P_L can read c_i at most |w| + |γ_{n+1}| many times when executing a loop-free computation along the witness. The loops are taken care of separately. During a simple loop, the leader can read c_i at most L many times. Moreover, loops appear at most |α_j| many times for each j. The latter is true since a contributor currently performing the computation q_0^C --α_j.!c_j--> p_j, for some j, may need the leader to run a complete loop for each step in α_j. We set the total number of involved contributors to t = Σ_{i∈[1..n]} t(i).
We introduce some notions needed for the construction of the computation. For each i ∈ [1..n], variable x_i is used to point to a position of the word u_i. Moreover, variable x points to a position in the witness w. Initially, these variables are set to zero. The computation will involve t contributors, captured in the set S. Each of these provides a certain symbol c_i. We partition S into sets S(1), ..., S(n), where the contributors in S(i) are responsible for providing c_i. We construct the computation inductively, in such a way that it maintains the following invariants. Roughly, these describe that there are always enough contributors in the set S(i) and that those can execute the computation q_0^C --α_i--> p′_i to reach the write transition of c_i.
(1) If w[x] = c̄_i, we need that all contributors in S(i) have already executed q_0^C --α_i--> p′_i so that they can provide c_i whenever it is needed. To this end, all pending reads in α_i need to be served during the computation. Let u_i denote the reads in α_i and let µ_i be the embedding map provided by Requirement (3). We can ensure the above by the invariant: if x_i ≠ |u_i|, then µ_i(x_i) ≥ x. It means that whenever there are still pending reads in α_i, the currently pending read at position x_i can still be served since the current position in the witness is not beyond µ_i(x_i).
(2) The number of contributors in S(i) needs to be large enough to provide c_i during the ongoing computation. This is ensured by the invariant |S(i)| ≥ (|w| − x) + |γ_{n+1}| + L · Σ_{j∈[1..n]} (|α_j| − x_j).
(3) We synchronize the contributors in the sets S(i), i ∈ [1..n]. To this end, we demand that after each step of the construction, all contributors from a particular set S(i) are currently in the same state.
We elaborate on the inductive construction of the computation. To this end, assume that a computation ρ = c_0 →*_{A^t} c was already constructed and that the variables x, x_1, ..., x_n admit values such that the Invariants (1), (2), and (3) hold. We show how to extend ρ to a computation ρ′ = c_0 →*_{A^t} c →*_{A^t} c′. Moreover, ρ′ satisfies (1), (2), and (3) along new values x′, x′_1, ..., x′_n for the variables, with x′ = x + 1 and x′_i ≥ x_i for all i ∈ [1..n]. Note that the induction basis is simple: the computation c_0 along with x = x_1 = ··· = x_n = 0 already satisfies the Invariants (1), (2), and (3). We perform the induction step by distinguishing the following four cases. Case 1: w[x] = ⊥. Then the corresponding transition τ in the computation γ of the leader is either an ε-transition or a read of an earlier first write. In the former case, we extend the computation ρ by the ε-transition τ and increment x by 1. Then, clearly, Invariants (2) and (3) still hold, and Invariant (1) is preserved as well. If transition τ reads the symbol c_i of an earlier first write, we add two transitions to ρ. First, we add a transition in which a contributor from S(i) writes c_i to the memory. Note that we have a contributor in S(i) that can perform the transition due to Invariants (1) and (2). Then, we add the read τ of the leader and increment x by 1. Invariant (1) is still satisfied. Invariant (2) also holds since one contributor is removed from S(i) and x is incremented by 1. And since no other contributor moved, Invariant (3) is preserved as well. Case 2: w[x] = a ∈ D. This means that the corresponding transition τ in the computation γ of the leader writes a to the shared memory. In this case, we first append τ to ρ. Now we serve all contributors that need to read the value a in order to reach their first write. Let i ∈ [1..n] and let P be a contributor in S(i) with x_i ≠ |u_i| and µ_i(x_i) = x. This means that P needs to read a in order to go on with its computation. Hence, we extend the computation by the read ĉ --?a-->_{S(i)} ĉ′. This ensures that all contributors in S(i) perform the required transition and read a. Note that since Invariant (3) holds, all these contributors are in the required state to perform the transition. After that, we increment x_i by 1. Once we have appended the required transitions for each i ∈ [1..n], we increment x by 1.
Invariant (1) is satisfied by the new values since the maps µ i are monotonic. Invariant (2) is preserved since we did not remove any of the contributors from the sets S(i). And Invariant (3) also holds since all the contributors in S(i) do the same transition or do not move at all.
Case 3: w[x] = c̄_i. We have x_i = |u_i| since µ_i maps the positions of u_i to positions of w that occur before c̄_i. Hence, by Invariants (1) and (3), we get that all contributors in S(i) are in the same state, from which they can write c_i. The transitions that we need to add to ρ in this case stem from those contributors in S(j) with j > i that need to read c_i in order to reach their first write c_j. We serve these reads with a single contributor from S(i). Hence, we first add the corresponding write transition of this contributor. Then, if x_j ≠ |u_j| and µ_j(x_j) = x, we add the read transitions ĉ --?c_i-->_{S(j)} ĉ′ to the computation and increase x_j by 1. After adding the transitions for each j, we increment x by 1. Now, Invariant (1) is satisfied since the µ_j are monotonic. Invariant (2) holds since we remove only one contributor from S(i) and increase x (and potentially some x_j) by 1. Invariant (3) is fulfilled since the contributors from S(j) that move all perform the same transition, and the one contributor from S(i) is removed after moving to a different state.
Case 4: w[x] = q for a state q ∈ Q_L. Hence, the contributors in S(i) with µ_i(x_i) = x need to read either a first write c_j ∈ D̄(w, x) that appeared before x, or a value that is written in a simple loop of the leader at q whose reads are restricted to earlier first writes (D̄(w, x)). In the former case, we can append the same transitions to ρ as in Case 3 above. We focus on the latter.
Assume that u_i[x_i] = a is the value that the contributors in S(i) need to read. Moreover, according to Requirement (3) of validity, the symbol is written in a loop q --β.!a.β′-->_L q of the leader. The reads in β and β′ are only from earlier first writes in D̄(w, x). Since the loop is simple, the leader can read at most L first writes along it. Let c_{j_1} ... c_{j_ℓ} be the sequence of first writes the leader reads along the loop. Note that there might be repetitions among the c_{j_k}. Since x_i < |u_i| ≤ |α_i|, there are at least L many contributors in each S(j_k) according to Invariant (2). Hence, we can provide enough contributors to execute the loop.
We add the following transitions to ρ. The leader executes along β until it needs a first write c_{j_k}. Then, we let a contributor from S(j_k) perform the write of c_{j_k}, followed by the leader reading c_{j_k}. When the leader reaches the transition for writing a, it performs the transition, followed by the contributors in S(i) reading the value via ĉ --?a-->_{S(i)} ĉ′. In the same manner, β′ is processed. After adding the loop and contributor transitions to ρ, we increment x_i by 1 since we have served the read request of S(i). Once we have done this for each such i, we increment x by 1.
Invariant (1) is satisfied since we increased the corresponding x i by 1 and µ i is monotonic. Invariant (2) holds since we use at most L many contributors from S(j) in a loop that writes a symbol required by S(i). After the loop, x i is increased by 1, preserving the inequality for S(j) from Invariant (2). Moreover, Invariant (3) also holds since the contributors that move, all perform the same transition and contributors writing first writes are removed from S(j).
From the induction we get a computation π′ : c_0 →*_{A^t} c′ which satisfies the invariants and in which the leader arrives in state q. Now we add the transitions of q --γ_{n+1}--> q_f to π′. Reading in γ_{n+1} is restricted to first writes. For each first write c̄_i, by Invariant (2), we have that |S(i)| ≥ |γ_{n+1}| since x ≤ |w|. This means that each time c_i is required, we can add a contributor transition writing c_i, followed by the corresponding read of the leader. This way, we construct a computation π : c_0 →*_{A^t} c with c ∈ C_f. ⊓⊔

Proof of Lemma 7
Proof. Note that our complexity estimates are conservative. We do not assume that the states of an automaton are stored in special lists that allow for faster iteration.
It is clear that Property (1) can be checked in time O(L · D). We claim that Property (2) can be tested in time O(L^3 · D^2). To this end, we check for every adjacent pair of states q, q′ with letter a ∈ D between them whether there is a transition of the form (q, !a, q′) ∈ δ_L. Similarly, if ⊥ is the symbol between q and q′, we look for a transition (q, ε, q′) or (q, ?c_i, q′) for some i. Each such transition can be found in time |δ_L| ≤ L^2 · D. Finally, checking whether there is a run from the last state of w to a state in F_L can be done in time O(L^2), as it is just a reachability query on an NFA.
Property (3) can be checked in time O(C^2 · L^3 · D^2). We reduce it to reachability in a finite state automaton N, constructed as follows. The states of N are the pairs (q, i) with q ∈ Q_C and i a position in w, together with a distinguished final state f. The initial state is (q_0^C, 1). We set up the transition relation as follows. For all (q, !a, q′) ∈ δ_C, we add the transitions (q, i) →_N (q′, i). For each read transition (q, ?a, q′) ∈ δ_C and state (q, i), we add the transition (q, i) →_N (q′, i) if one of the following three options holds: w[i] = a, or a ∈ Loop(w[i], D̄(w, i)), or a ∈ D̄(w, i). Further, we add (q, i) →_N (q, i + 1). Finally, for all states (q, i) in N such that (q, !c_j, q′) ∈ δ_C, we add the transition (q, i) →_N f. This ends the computation since c_j was written. Now we have that Property (3) is satisfied if and only if there is a computation (q_0^C, 1) →*_N f. The construction of N is the dominant factor and takes time O(C^2 · L^3 · D^2).
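A reachability check along these lines can be sketched as follows. The helper `readable(w, i)`, which returns the symbols readable at position i (covering the three options above), is passed in as a parameter and is our abstraction of the sets used in the proof.

from collections import deque

def check_property3(w, delta_c, q0c, bars, readable):
    """Reachability in the automaton N (sketch): states pair a contributor
    state with a position in w; reaching a first write accepts."""
    start = (q0c, 1)
    seen, queue = {start}, deque([start])
    while queue:
        q, i = queue.popleft()
        for (src, op, val, dst) in delta_c:
            if src != q:
                continue
            if op == "w" and val in bars:
                return True                       # first write reached
            if op == "w" or (op == "r" and val in readable(w, i)):
                if (dst, i) not in seen:
                    seen.add((dst, i))
                    queue.append((dst, i))
        if i < len(w) and (q, i + 1) not in seen:
            seen.add((q, i + 1))                  # advance the position pointer
            queue.append((q, i + 1))
    return False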
If we now combine the three results, we get that validity can be tested in time O(L^3 · D^2 · C^2).

Formal Definition of SCC-witness Candidates and Validity
We give a formal definition of SCC-witness candidates. Let r = c̄_1 ... c̄_ℓ ∈ V. We denote by C(r) the set of all SCC-witness candidates whose sequence of first writes is r. The set of SCC-witness candidates is the union C = ∪_{r∈V} C(r). An SCC-witness candidate w ∈ C is called valid if it satisfies the following properties:

1. The sequence in w without the barred symbols induces a valid run in the leader. For this, we need to find appropriate entry and exit states in each SCC such that the exit state of each SCC is connected to its adjacent SCC through a transition. Let v = scc^1_1 a^1_1 ... scc^1_{k_1} a^1_{k_1} scc^2_1 a^2_1 ... scc^2_{k_2} a^2_{k_2} ... be the sequence obtained by projecting out the barred symbols; it ends in the final SCC. Further, for any symbol α appearing in v, let pos(α) denote the position of α in v. For any i ∈ [1..|v|], we use D̄(v, i) to refer to the set of all barred symbols appearing before position i: D̄(v, i) = {a | ā appears in v[1..i]}. Now, corresponding to any subsequence of the form scc_1 a scc_2 that appears in v, one of the following is true.
- If a ∈ D, then we can find states q ∈ scc_1, q′ ∈ scc_2 such that there is a transition of the form q --!a--> q′ in P_L.
- If a = ⊥, then we can find states q ∈ scc_1, q′ ∈ scc_2 such that there is a transition of the form q --ε/?c--> q′ in P_L for some c ∈ D̄(v, pos(a)).
We also require that q_0^L ∈ scc^1_1. Finally, we require that from any state q in the final SCC, there is a run to the final state involving only writes, internal transitions, or reads of the barred symbols occurring in w. That is, a run of the form q --σ--> q_f with σ ∈ (W(D) ∪ R(D̄(v, |v|)))*.
2. We can construct supportive computations on the contributors. For each prefix of w ending in a barred symbol ā ∈ D̄, there is a computation q_0^C --u.!a-->_C q on P_C to some q ∈ Q_C such that the reads within u can be obtained from w. Formally, there is a monotonic map µ from the positions of u′ = u↓_{R(D)} into the positions of v. Hence, u′ is only mapped to elements from ∪_{i∈[0..ℓ]} SCC(P_L↓_i) ∪ D. Let u′[i] = ?a with a ∈ D. Then either v[µ(i)] = a, which corresponds to a write of a by P_L, or v[µ(i)] = scc ∈ SCC(P_L↓_i) for some i ∈ [0..ℓ]. In the latter case, we have a ∈ D(scc); this corresponds to the letter being a write by the leader within the SCC, or a write of a value by a contributor that was already seen.
3. Let w be of the form v_1.c̄_1.v_2.c̄_2 ... v_ℓ.c̄_ℓ followed by the final SCC. Then the SCCs do not repeat within any of v_1, v_2, ..., v_ℓ.
We refer to the above Properties as (SCC) validity. Instead of stating a characterization of computations in terms of SCC-witnesses directly, we relate the SCC-witnesses to the witnesses as defined in Section 3.1.

Lemma 25. There is a valid SCC-witness candidate in C if and only if there is a valid witness candidate in E.
Proof. We first prove the (⇒) direction. For this, we assume a valid witness string w ∈ E. Further, let w = v_0.c̄_1.v_1.c̄_2 ... v_{k−1}.c̄_k.q. Now consider the decomposition v′_i = scc^i_1 ... scc^i_{k_i} of each v_i according to the SCCs in P_L↓_i. We obtain it by taking maximal subsequences of states lying in the same SCC and replacing each of them with the SCC to which it belongs. We can be sure that none of the SCCs in this sequence repeats (otherwise, they would already form a bigger SCC). Secondly, notice that such an SCC sequence forms a path in the SCC graph G(P_L, c̄_1 c̄_2 ... c̄_k). This implies that the decomposition thus obtained is an SCC-witness candidate. However, we need it to be a valid one. That it satisfies Property (1) of SCC-validity follows from the fact that we started with a valid witness satisfying Property (2) of validity; this already provides us with the required states in adjacent SCCs and the transitions between them.
Proving Property (2) of SCC-validity is slightly more complicated. For this, we need to construct a run in the contributor for each prefix that ends in a barred symbol and show that there is an appropriate mapping from this run into the SCC-witness string. But notice that since we started with a valid witness string, we are guaranteed a computation in the contributor and a mapping from such a run into the witness string. Let us fix one such prefix to be α′.c̄_j and let the corresponding prefix of w be α.c̄_j. Let q_0^C --u.!c_j--> q be the guaranteed run and µ : [1..|u′|] → [1..|α|] (where u′ is obtained from u by deleting all values that do not correspond to a read transition) the corresponding mapping. From µ, we construct another mapping µ′ : [1..|u′|] → [1..|α′|], where the v′_i are the corresponding decompositions of the v_i into SCCs. The required mapping µ′ is constructed as follows. Suppose µ mapped a position of u′ to a state or letter that got replaced by an SCC; then we let µ′ map this position to the SCC that replaced it. If µ mapped the position to a letter that survived, then we let µ′ do the same. That is, suppose for some v_i = q_1 a_1 ... q_k a_k with corresponding decomposition scc_1 a_{i_1} ... scc_j a_k, we have µ(i) = p, where p is a position in the string v_i. If p points to one of the letters a_{i_1}, ..., a_k that survived the decomposition, then we let µ′(i) = p′, where p′ is the position of this letter in α′. If p points to a state or a letter in D that was absorbed into scc_i, then µ′(i) is the position of scc_i in α′. The correctness of this mapping follows from the following reasoning. Clearly, µ′ thus constructed is monotonic since µ already was. Suppose µ′ maps a position in u′ to a letter; this letter is guaranteed to be the correct one since it was so for µ. Suppose µ mapped a position in u′ to a letter a that got absorbed into scc_i; then clearly a ∈ D(scc_i). Suppose µ mapped a position i of u′ to a state q that got absorbed into scc_i; we observe that for any a ∈ Loop(q, D̄(w, µ(i))), we also have a ∈ D(scc_i). With this, we get one direction of the claim.
For the (⇐) direction, we assume a valid SCC-witness w and show how to construct a valid witness from it. Let w be of the form v_1.c̄_1 ... v_ℓ.c̄_ℓ followed by the final SCC. From Property (1) of SCC-validity, we have q_0^L ∈ scc^1_1. Further, for every subsequence of the form scc_1 a scc_2, we have states q_1 ∈ scc_1 and q_2 ∈ scc_2 such that there is a transition of the form q_1 --b--> q_2, where b is a write, an ε-move, or a read, depending on the type of a. We call the state q_1 the exit state of scc_1 and q_2 the entry state of scc_2. Hence, corresponding to each scc^i_j, there are entry and exit states en^i_j, ex^i_j. In each such SCC, there is also a shortest path between its entry and exit states. To get the valid witness string, we replace each SCC in the valid SCC-witness by this shortest path, writing out the states along the path together with the symbols of their transitions, starting from en^i_j. There is one exception to this replacement: the final SCC occurring in the SCC-witness is simply replaced by its entry state. Let w′ = v″_1.c̄_1 ... v″_ℓ.c̄_ℓ.q be the string thus obtained, where q is the entry state of the final SCC. Notice that in each v″_i, none of the states repeat; this is because none of the SCCs repeat in w and the SCCs partition the state space. We still need to show that the string thus obtained satisfies the validity properties.
Property (1) of validity is satisfied since the barred symbols in the valid SCC-witness do not repeat. Property (2) is satisfied because between any adjacent SCCs in w there is a valid move by definition, and because we replaced each SCC in the SCC-witness by a valid shortest path within that SCC. Finally, by definition, there is also a run to a final state from any state in the final SCC.
Finally, we need to prove that Property (3) of validity holds. For this, we need to show the existence of an appropriate computation in the contributor, along with a mapping into the witness string, for every prefix of w′ that ends in a barred symbol. Let us fix one such prefix σ of w′ ending in c̄_{j+1}, and let σ′ be the corresponding prefix of w. From the SCC-validity property, we already have a computation of the form q_0^C --u.!c_{j+1}--> q and a mapping µ from the positions of u′ (the projection of u onto read operations) to the positions in σ′. We provide a mapping µ′ from the positions in u′ to the positions in σ as follows. If µ maps a position to a symbol in D, then µ′ maps this position to the corresponding position of the same letter in σ. Otherwise, the position is mapped to the first state of the shortest path of the corresponding SCC. If a letter is in D(scc), then one can always find a loop at this state that visits the letter. Hence, the mapping satisfies the validity property.

⊓ ⊔
As above, the algorithm iterates over all SCC-witness candidates, and each is tested for validity. The validity check can be performed in polynomial time, as the following lemma shows.
Lemma 26. Validity of w ∈ C can be checked in time O(L^2 · D · C^2 · d^2).
Proof. Whether an SCC-witness candidate w satisfies Property (1) can be tested in time O(d · L^2 · D). For this, between any two adjacent SCCs in w of the form scc_1 a scc_2, we need to check whether there are states q_1 ∈ scc_1 and q_2 ∈ scc_2 such that q_1 → q_2. This can easily be done in time O(|δ_L|) ≤ O(L^2 · D). We need to perform this operation between every pair of adjacent SCCs, and there are at most d many such adjacent pairs. Hence, overall, the time required to check whether there are proper transitions between the SCCs is O(d · L^2 · D). Checking whether there is a path from the final SCC to the final state reduces to reachability in an automaton and can easily be done in time O(L^2).
Property (2) can be tested in time O(d^2 · L^2 · D · C^2). This can be achieved by reducing the problem to reachability in an NFA. The idea is similar to the one in the proof of Lemma 7: the automaton we construct has as states the states of the contributor paired with an index into the given witness string.
For constructing this automaton, we need to check, for any given SCC and letter a ∈ D, whether a ∈ D(scc). This can easily be achieved in time O(L^2) by reducing it to a reachability problem in a graph. We add a transition of the form (c_1, i) → (c_2, i) if there is a transition c_1 --?a-->_{P_C} c_2, position i of w holds an SCC scc, and a ∈ D(scc). So for each i, we go through each transition in δ_C and perform the SCC-check. This takes time O(d · C^2 · D · L^2).
The size of the automaton is O(C · d), and we need to perform a reachability check, which is quadratic and hence takes time O(C^2 · d^2). From this, we get the required complexity.
Finally, it is obvious that Property (3) can be checked in time O(d). This completes the proof.

Comparison between introduced Methods
Compared to the witnesses introduced before, there are fewer candidates to test. We have Wit_SCC(s, D, d) ≤ Wit(L, D).
Proof. We show how to construct an injective function from the strings in Wit_SCC(s, D, d) to the strings in Wit(L, D); this already gives the result. The idea is to represent each SCC by a unique state present in that SCC. For this, we assume an arbitrary ordering on the states of the leader Q_L. Recall that the SCCs partition the state space of the leader; hence, for any two distinct scc_1, scc_2 ∈ SCC(P_L↓_i) for some i, we have scc_1 ∩ scc_2 = ∅. Now, given any string w = scc^1_1 a^1_1 ... scc^1_{k_1} a^1_{k_1} c̄_1 scc^2_1 a^2_1 ... scc^2_{k_2} a^2_{k_2} c̄_2 ... in Wit_SCC(s, D, d), let f(w) = v be obtained by replacing each scc^i_j by the minimum state q^i_j in that SCC. Since no two SCCs share a state, clearly v ∈ Wit(L, D). The mapping is injective since each replacing state uniquely represents its SCC.
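In code, the injection is a one-liner (a sketch where an SCC is modeled as a frozenset of states and all other symbols of the witness are kept):

def inject(w):
    """f from the proof: replace every SCC by its minimum state."""
    return [min(x) if isinstance(x, frozenset) else x for x in w]

# Distinct SCCs are disjoint, so their minima differ: f is injective.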

A Bound on the Number of SCC-witness Candidates
We show here that Wit_SCC(s, D, d) ≤ (s · (D + 1))^d · D^D · 2^{D+d}. For this, we note that for any w ∈ Wit_SCC(s, D, d), the size of such a string is bounded by 2d + D.

Formal Construction and Proof of Proposition 9
The domain of the memory is D = {row(i), col(i), #_i | i ∈ [1..k]} ∪ {a_0}. The contributor threads are defined by P_C = (Op(D), Q_C, q_0^C, δ_C), where Q_C contains the initial state q_0^C, a final state q_f^C, and the states q^{(r,ℓ)}_{(i,j)} and q^{(c,ℓ)}_{(i,j)} for vertices (i, j) and counters ℓ ∈ [1..k]. Intuitively, we use the states q^{(r,ℓ)}_{(i,j)} and q^{(c,ℓ′)}_{(i,j)} to indicate that the contributor has chosen to store (i, j) and to count the number of vertices that the contributor has read so far. More precisely, the state q^{(r,ℓ)}_{(i,j)} reflects that the last symbol read was row(ℓ), the row of the ℓ-th vertex. The state q^{(c,ℓ)}_{(i,j)} indicates that the last read symbol was the column symbol belonging to the ℓ-th vertex. Note that this can be any column and thus different from col(ℓ).
The transition relation δ_C is defined by the following rules. We have a rule to choose a vertex: from q_0^C, the contributor branches to a state storing some vertex (i, j). To read the ℓ-th row symbol, we have a transition on ?row(ℓ) into q^{(r,ℓ)}_{(i,j)}. For reading the ℓ-th column symbol col(j′), we get a transition from q^{(r,ℓ)}_{(i,j)} to q^{(c,ℓ)}_{(i,j)}, but only if one of the following is satisfied: (1) we have i ≠ ℓ and there is an edge between (ℓ, j′) and (i, j) in G. Intuitively, the contributor stores a vertex (i, j) from a row different from ℓ, but then it can only continue its computation if (i, j) and (ℓ, j′) share an edge. (2) We have i = ℓ and j′ = j. This means that the contributor stores the vertex that it has just read. Note that with this, we rule out all contributors storing other vertices from the i-th row.
To end the computation in a contributor, we get a rule that lets a contributor storing (i, j) write #_i and move to q_f^C after all k vertices have been read. We further set F_L = {q_L^{(#,k)}}, the state the leader reaches after receiving all #-symbols. Then our program is defined to be A = (D, a_0, (P_L, P_C)). Note that the parameters are D = L = 3k + 1. The correctness of the construction is proven in the following lemma.
Lemma 28. There is a t ∈ N such that c_0 →*_{A^t} c with c ∈ C_f if and only if there is a clique of size k in G with one vertex from each row.
Proof. First assume that G contains the desired clique and let (1, j 1 ), . . . , (k, j k ) be its vertices. For t = k we construct a computation from c 0 to a configuration in C f . We have k copies of the contributor P C in the system A t . We shall denote them by P 1 , . . . , P k .
The computation starts with each P_i choosing the vertex (i, j_i) to store by performing the initial branching step of δ_C. Then P_L writes the symbol row(1) and each contributor reads it. After transmitting the first row to all contributors, P_L communicates the first column by writing col(j_1) to the memory. Again, each contributor reads it. Note that P_1 can read col(j_1) since it stores exactly the vertex (1, j_1). A contributor P_i with i ≠ 1 can also read col(j_1) since P_i stores (i, j_i), a vertex which shares an edge with (1, j_1) due to the clique assumption. Hence, we reach a configuration c_1 in which each contributor P_i is in state q^{(c,1)}_{(i,j_i)} and the memory holds col(j_1). Similarly, we can construct a computation leading to a configuration c_2. By iterating this process, we reach a configuration in which each P_i is in state q^{(c,k)}_{(i,j_i)} and the memory holds col(j_k). Then each contributor P_i can write the symbol #_i. This is done in ascending order: first, P_1 writes #_1 and P_L reads it. Then it is P_2's turn, and it writes #_2. Again, the leader reads the symbol. After k rounds, we reach the configuration (q_L^{(#,k)}, q_f^{C_1}, ..., q_f^{C_k}, #_k), which lies in C_f. Now let t ∈ N together with a computation ρ = c_0 →*_{A^t} c with c ∈ C_f be given. Let ρ_L be the subcomputation of ρ carried out by P_L, technically the projection of ρ to P_L. Then ρ_L has the form ρ_L = ρ^1_L.ρ^2_L, where in ρ^1_L the leader writes the sequence row(1).col(j_1) ... row(k).col(j_k) and in ρ^2_L it reads the symbols #_1 ... #_k. We show that the vertices (1, j_1), ..., (k, j_k) form a clique in G.
Since in ρ^2_L the leader is able to read the symbols #_1 up to #_k, there must be at least k contributors writing them. Due to the structure of P_C, a single contributor cannot write two different #_i symbols. Hence, we get one contributor per symbol and thus t ≥ k.
Let P_Ci be a contributor writing #_i. We claim that P_Ci stores the vertex (i, j_i). Assume P_Ci performs the initial move to store the vertex (i′, j′). Since the thread writes #_i in the end, we get i′ = i due to the structure of P_C.
During the computation ρ, the thread performs the step reading col(j_i), since P_L writes the symbol col(j_i) to the memory and the computation of P_Ci does not deadlock. Note that we use the following: the leader writes row(i) before col(j_i). This ensures that col(j_i) is indeed the column of the i-th transmitted vertex and the above transition is correct. However, the contributor P_Ci can only perform the transition if j′ = j_i. Thus, we get (i′, j′) = (i, j_i).
Let P_{(i,j_i)} denote a contributor that writes #_i during ρ. Since the contributor P_{(i,j_i)} stores the vertex (i, j_i), the leader P_L has written row(i) and col(j_i) to the memory. Now let P_{(i′,j_{i′})} be another contributor with i′ ≠ i. Then also this thread needs to perform the step reading col(j_i), since the computation does not end on P_{(i′,j_{i′})} at this point. But by definition, the transition can only be carried out if there is an edge between (i, j_i) and (i′, j_{i′}). Hence, each two vertices of (1, j_1), ..., (k, j_k) share an edge.

Formal Construction and Proof of Theorem 10
We define the polynomial equivalence relation R in more detail. Assume some encoding of Boolean formulas over a finite alphabet Γ. Let F ⊆ Γ* be the encodings of proper 3-SAT-instances. We say that two encodings ϕ, ψ ∈ Γ* are equivalent under R if either ϕ, ψ ∈ F and the formulas have the same number of clauses and variables, or if ϕ, ψ are both not in F. Then R is a polynomial equivalence relation. We define D to be the union of the sets {(u, k) | u ∈ {0, 1}, k ∈ [1..log(I)]}, {(x_i, v) | i ∈ [1..n], v ∈ {0, 1}}, {#_1, ..., #_m}, and {a_0}. Intuitively, the states q_w with w ∈ {0, 1}^{≤log(I)} form the nodes of the tree of the first phase. The remaining states are needed to store the chosen instance, a variable, and its evaluation.
The transition relation δ_C contains rules for the three phases. In the first phase, P_C reads the bits chosen by the leader. According to the value of the bit, it branches to the next state: q_w --?(u, |w|+1)-->_C q_{w.u} for u ∈ {0, 1} and w ∈ {0, 1}^{≤log(I)−1}. Then we get an ε-transition from those leaves of the tree that encode a proper index lying in [1..I]. We have the rule q_w --ε-->_C q_{(ch,ℓ)} if w = bin(ℓ) and ℓ ∈ [1..I]. For the second phase, we need transitions to store a variable and its evaluation: q_{(ch,ℓ)} --?(x_i, v)-->_C q^ℓ_{(x_i,v)} for i ∈ [1..n] and v ∈ {0, 1}. In the third phase, the contributor loops in its current state and writes #-symbols: we have q^ℓ_{(x_i,v)} --!#_j-->_C q^ℓ_{(x_i,v)} whenever x_i evaluated to v satisfies clause C^ℓ_j. The program A is defined to be A = (D, a_0, (P_L, P_C)), and the LCR-instance of interest is thus (A, F_L). Hence, all requirements of a cross-composition are met. The correctness of the construction is shown in the following lemma.
Proof. We first assume that there is an ℓ ∈ [1..I] such that ϕ_ℓ is satisfiable. Let v_1, ..., v_n be the evaluation of the variables x_1, ..., x_n that satisfies ϕ_ℓ. Further, we let bin(ℓ) = u_1 ... u_{log(I)} be the binary representation of ℓ. Set t = n. Then the program A^t has n copies of the contributor. Denote them by P_1, ..., P_n. Intuitively, P_i is responsible for variable x_i.
We construct a computation of A^t from c_0 to a configuration in C_f. We proceed along the aforementioned phases. In the first phase, P_L starts by writing the first bit of bin(ℓ); this is read by all contributors. We get a computation c_0 →*_{A^t} (q_L^{(b,1)}, q_ε, ..., q_ε, (u_1, 1)) →*_{A^t} (q_L^{(b,1)}, q_{u_1}, ..., q_{u_1}, (u_1, 1)). To extend the computation, we let P_L guess the remaining bits while the contributors read and store them. Hence, we get a computation to c^{(b,log(I))} = (q_L^{(b,log(I))}, q_{bin(ℓ)}, ..., q_{bin(ℓ)}, (u_{log(I)}, log(I))).
Before the second phase starts, the contributors perform an ε-transition to the state q (ch,ℓ) . This is possible since bin(ℓ) encodes a proper index in [1..I].
We get c^{(b,log(I))} →*_{A^t} (q_L^{(b,log(I))}, q_{(ch,ℓ)}, ..., q_{(ch,ℓ)}, (u_{log(I)}, log(I))) = c^{(x,0)}. In the second phase, P_L chooses the correct evaluation for the variables: it writes (x_i, v_i) for each variable x_i. Contributor P_i reads (x_i, v_i) and stores it.
Hence, we get a computation after which each contributor P_i is in the state q^ℓ_{(x_i,v_i)}.
In the last phase, the contributors write the symbols #_j. Since ϕ_ℓ is satisfied by the evaluation v_1, ..., v_n, there is a variable index i_1 ∈ [1..n] such that x_{i_1} evaluated to v_{i_1} satisfies clause C^ℓ_1. Hence, due to the transition relation δ_C, we can let P_{i_1} write the symbol #_1. After that, the leader reads it and moves to the next state. Continuing like this for the clauses C^ℓ_2, ..., C^ℓ_m, we can extend the computation to reach a configuration which lies in C_f. This proves the first direction.

Now we assume the existence of a t ∈ N such that there is a computation ρ from c_0 to a configuration in C_f. Let ρ_L be the subcomputation of ρ carried out by the leader P_L. Then ρ_L can be split into ρ_L = ρ^1_L.ρ^2_L.ρ^3_L such that in ρ^1_L the leader writes the bits (u_1, 1), ..., (u_{log(I)}, log(I)), in ρ^2_L it writes the tuples of the second phase, and in ρ^3_L it reads the symbols #_1, ..., #_m. Let ℓ be the natural number such that bin(ℓ) = u_1 ... u_{log(I)}. We show that ℓ ∈ [1..I] and that ϕ_ℓ is satisfied by evaluating the variables x_i to v_i.

In ρ^3_L, the leader can read the symbols #_1, ..., #_m. This means that there is at least one contributor writing them. Let P_C be a contributor writing such a symbol. Then, after P_L has finished ρ^1_L, the contributor P_C is still active and performs the step q_{bin(ℓ)} --ε-->_C q_{(ch,ℓ)}. This is true since P_C did not miss a bit transmitted by P_L and P_C has to reach a state where it can write the #-symbols. Thus, we get that ℓ ∈ [1..I] and P_C stores ℓ in its state space.
We denote the number of contributors writing a $\#$-symbol in $\rho$ by $t' \geq 1$. Each of these contributors gets labeled by $C(j) = \{\#_{j_1}, \ldots, \#_{j_{k_j}}\}$, the set of $\#$-symbols it writes during the computation $\rho$. Hence, we have the contributors $P_{C(1)}, \ldots, P_{C(t')}$ and, since each symbol in $\{\#_1, \ldots, \#_m\}$ is written at least once, we have:
$\{\#_1, \ldots, \#_m\} = \bigcup_{j \in [1..t']} C(j)$. (1)
Now we show that each $P_{C(j)}$ with $C(j) = \{\#_{j_1}, \ldots, \#_{j_{k_j}}\}$ stores a tuple $(x_i, v_i)$ such that $x_i$ evaluated to $v_i$ satisfies the clauses $C^{\ell}_{j_1}, \ldots, C^{\ell}_{j_{k_j}}$. We already know that $P_{C(j)}$ is in state $q_{(ch,\ell)}$ after $P_L$ has executed $\rho^1_L$. While $P_L$ executes $\rho^2_L$, the contributor $P_{C(j)}$ has to read a tuple $(x_i, v_i)$ since it has to reach a state where it can write the $\#$-symbols. More precisely, $P_{C(j)}$ has to perform a transition $q_{(ch,\ell)} \xrightarrow{?(x_i,v_i)}_C q^{\ell}_{(x_i,v_i)}$ for some $i$. Then the contributor writes the symbols $\#_{j_1}, \ldots, \#_{j_{k_j}}$ while looping in the current state. But by the definition of the transition relation for the contributors, this means that $x_i$ evaluated to $v_i$ satisfies the clauses $C^{\ell}_{j_1}, \ldots, C^{\ell}_{j_{k_j}}$. By Equation (1) we can deduce that every clause in $\varphi_\ell$ is satisfied by the chosen evaluation. Hence, $\varphi_\ell$ is satisfiable. ⊓⊔
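For readers who prefer an operational view, the following sketch enumerates the contributor rules of the construction above. It is a minimal illustration only: the rule encoding, the state names, and the clause check satisfies are ours, and satisfies(i, v, clause) is an assumed helper deciding whether x_i evaluated to v satisfies the clause.

```python
from itertools import product

def contributor_rules(n, phis, num_bits, satisfies):
    """Sketch of delta_C for the cross-composition contributor.

    phis[l-1] is the clause list of phi_l; num_bits = log(I).
    Rule format (state, action, state') is a hypothetical encoding.
    """
    rules = []
    # Phase 1: read the index bits, branching like a binary tree.
    for depth in range(num_bits):
        for w in product("01", repeat=depth):
            for u in "01":
                rules.append((("q", w), ("read", (u, depth + 1)), ("q", w + (u,))))
    # Epsilon-moves from leaves encoding a proper index l in [1..I].
    for l in range(1, len(phis) + 1):
        w = tuple(format(l, f"0{num_bits}b"))
        rules.append((("q", w), ("eps",), ("choice", l)))
    # Phase 2: store one variable together with its evaluation.
    for l in range(1, len(phis) + 1):
        for i, v in product(range(1, n + 1), (0, 1)):
            rules.append((("choice", l), ("read", ("x", i, v)), ("stored", l, i, v)))
    # Phase 3: loop, writing #_j whenever the stored literal satisfies C^l_j.
    for l, clauses in enumerate(phis, start=1):
        for i, v in product(range(1, n + 1), (0, 1)):
            for j, clause in enumerate(clauses, start=1):
                if satisfies(i, v, clause):
                    rules.append((("stored", l, i, v), ("write", ("#", j)), ("stored", l, i, v)))
    return rules
```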

B Proofs for Section 3.2
We give the missing constructions and proofs for Section 3.2.

Proof of Lemma 12
Before we elaborate on the proof, we introduce a few notations. Let $t \in \mathbb{N}$ and let $c$ be a configuration of $A_t$. To access the components of $c$ we use the following projections: $\pi_L(c)$ returns the state of the leader in $c$, and $\pi_D(c)$ the value of the shared memory. For $p \in Q_C$, we denote by $\#_C(c, p)$ the number of contributors that are currently in state $p$. Finally, we use $\pi_C(c)$ for the set of contributor states that appear in $c$. Formally, $\pi_C(c) = \{p \in Q_C \mid \#_C(c, p) > 0\}$. The proof of Lemma 12 is a consequence of the following stronger lemma. It states that for any reachable configuration of the program, there is a node in $G$ reachable from $v_0$ such that: (1) the state of the leader and the memory value are preserved, and (2) the set of possible contributor states can only increase.
Lemma. There is a $t \in \mathbb{N}$ and a computation $c_0 \to^*_{A_t} c$ if and only if there is a path $v_0 \to^*_E (q, a, S)$ in $G$, where $\pi_L(c) = q$, $\pi_D(c) = a$, and $\pi_C(c) \subseteq S$.

Proof. First assume that a computation $c_0 \to^*_{A_t} c$ for a $t \in \mathbb{N}$ is given. We proceed by induction on the length of the computation. In the base case, the length is 0. This means that $c = c_0$ is the initial configuration. But then $\pi_L(c) = q^0_L$, $\pi_D(c) = a_0$, and $\pi_C(c) = \{q^0_C\}$. This characterizes the initial node of $G$, and there is a path $v_0 \to^*_E v_0$ of length 0, which proves the base case. Suppose the statement holds for all computations of length at most $\ell$. Let $c_0 \to^*_{A_t} c$ be a computation of length $\ell + 1$. Then it can be split into $c_0 \to^*_{A_t} c' \to_{A_t} c$, where the prefix has length $\ell$. By the induction hypothesis, there is a path $v_0 \to^*_E (q', a', S')$ with $\pi_L(c') = q'$, $\pi_D(c') = a'$, and $\pi_C(c') \subseteq S'$. Now we distinguish two cases: (1) If $c' \to_{A_t} c$ is induced by a transition of the leader, the leader's state and the memory value get updated, but the contributor states do not. We have that $\pi_C(c) = \pi_C(c') \subseteq S'$. Now we set $q = \pi_L(c)$, $a = \pi_D(c)$, and $S = S'$. Then, on $G$ we have an edge $(q', a', S') \to_E (q, a, S)$.
(2) If $c' \to_{A_t} c$ is induced by a transition of a contributor, we immediately get that $\pi_L(c) = \pi_L(c') = q'$. Let the transition of the contributor be a read of the form $p' \xrightarrow{?a'}_C p$. Note that it can happen that $p'$ is not an element of $\pi_C(c)$, since there might be just one contributor in state $p'$, which switches to state $p$. We set $q = q'$, $a = a'$, and $S = S' \cup \{p\}$. Then we have an edge $(q', a', S') \to_E (q, a, S)$ induced by the transition. Writes of the contributors are handled similarly.
This shows the first direction of the lemma. For the other direction, we apply induction to prove a slightly stronger statement: for each path $v_0 \to^*_E (q, a, S)$, there is a $t \in \mathbb{N}$ and a computation $c_0 \to^*_{A_t} c$ such that $\pi_L(c) = q$, $\pi_D(c) = a$, and $\pi_C(c) = S$. In the proof we rely on the Copycat Lemma from [17]. Roughly, it states that for a computation in which a state $p \in Q_C$ is reached by one of the contributors, there is a similar computation in which $p$ is reached by an arbitrary number of contributors. We restate the lemma in our setting.
Lemma 31 (Copycat Lemma [17]). Let $t \in \mathbb{N}$ and let $c_0 \to^*_{A_t} c$ be a computation. Moreover, let $p \in Q_C$ be such that $\#_C(c, p) > 0$. Then for all $k \in \mathbb{N}$, we have a computation of the form $c_0 \to^*_{A_{t+k}} d$, where configuration $d$ satisfies the following: $\pi_L(d) = \pi_L(c)$, $\pi_D(d) = \pi_D(c)$, $\#_C(d, p) = \#_C(c, p) + k$, and for all $p' \neq p$ we have $\#_C(d, p') = \#_C(c, p')$.
We turn back to the induction on the length of the given path. In the base case, the length is 0. Then we have $(q, a, S) = v_0$. This means $q = q^0_L$, $a = a_0$, and $S = \{q^0_C\}$. Considering the initial configuration $c_0$ for an arbitrary $t \in \mathbb{N}$, we get the computation $c_0 \to^*_{A_t} c_0$ of length 0, with $\pi_L(c_0) = q$, $\pi_D(c_0) = a$, and $\pi_C(c_0) = S$.
Assume the statement holds true for all paths of length at most $\ell$. Let $v_0 \to^*_E (q, a, S)$ be a path of length $\ell + 1$. We split the path into a subpath $v_0 \to^*_E (q', a', S')$ of length $\ell$ and an edge $(q', a', S') \to_E (q, a, S)$. Invoking the induction hypothesis, we get a $t \in \mathbb{N}$ and a computation $c_0 \to^*_{A_t} c'$ such that $\pi_L(c') = q'$, $\pi_D(c') = a'$, and $\pi_C(c') = S'$. We distinguish two cases: (1) The edge $(q', a', S') \to_E (q, a, S)$ was induced by a transition of the leader. Since $\pi_L(c') = q'$ and $\pi_D(c') = a'$, the same transition also induces a step $c' \to_{A_t} c$ with $\pi_L(c) = q$, $\pi_D(c) = a$, and $\pi_C(c) = S = S'$. (2) The edge was induced by a transition $p' \to_C p$ of a contributor with $p' \in S'$. By the Copycat Lemma, there is a computation $c_0 \to^*_{A_{t+1}} d$, where $d$ coincides with $c'$ up to one additional contributor in state $p'$. Letting this contributor perform the transition yields a step $d \to_{A_{t+1}} c$ with $\pi_L(c) = q$, $\pi_D(c) = a$, and $\pi_C(c) = S' \cup \{p\} = S$. ⊓⊔
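The lemma suggests a direct algorithm: explore the node graph $G$ over triples $(q, a, S)$ by breadth-first search, which is the source of the $O^*(2^C)$ bound. The following is a minimal sketch under an assumed rule encoding (tuples (state, op, value, state') with op in {"read", "write"}; $\varepsilon$-moves are omitted); it is an illustration, not the paper's formal algorithm.

```python
from collections import deque

def reachable_nodes(q0, a0, p0, leader_rules, contrib_rules):
    """BFS over the node graph G. Nodes are (q, a, S): leader state,
    memory value, and the accumulated set of contributor states."""
    start = (q0, a0, frozenset({p0}))
    seen, queue = {start}, deque([start])
    while queue:
        q, a, S = queue.popleft()
        succs = []
        for (q1, op, x, q2) in leader_rules:
            if q1 != q:
                continue
            if op == "read" and x == a:
                succs.append((q2, a, S))       # leader reads the memory
            elif op == "write":
                succs.append((q2, x, S))       # leader overwrites the memory
        for (p1, op, x, p2) in contrib_rules:
            if p1 not in S:
                continue
            if op == "read" and x == a:
                succs.append((q, a, S | {p2})) # contributor set only grows
            elif op == "write":
                succs.append((q, x, S | {p2}))
        for node in succs:
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return seen
```

Since $S$ only grows along edges, the number of nodes is bounded by $|Q_L| \cdot |D| \cdot 2^{|Q_C|}$, matching the stated parameterization.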

Formal Construction and Proof of Proposition 17
The memory domain is defined by $D = U \cup \{u^\# \mid u \in U\} \cup \{a_0\}$.

Lemma 32. There is a $t \in \mathbb{N}$ so that $c_0 \to^*_{A_t} c$ with $c \in C_f$ if and only if there are sets $S_1, \ldots, S_r \in F$ such that $U = \bigcup_{i \in [1..r]} S_i$.
Proof. Let $S_1, \ldots, S_r \in F$ be a cover of $U$. We can construct a computation with $t = n$ contributors. The leader first guesses the set $S_1$. It writes all elements $u \in S_1$ to the memory, and for each element there is one contributor storing it in its states by reading the corresponding $u$. Then, the leader decides for $S_2$ and writes the elements of that set to the memory. Now, only the new elements get stored by contributors; elements that were seen already are ignored. We proceed like this for $r$ phases. Then, the contributors store exactly those elements that got covered by $S_1, \ldots, S_r$. Since these cover $U$, the contributors can write all symbols $u^\#$ to the memory in any order. The leader $P_L$ can thus read the required string and reach its final state.

Now assume there is a $t \in \mathbb{N}$ together with a computation $\rho$ on $A_t$ from $c_0$ to a configuration $c \in C_f$. Consider $\rho_L$, the projection of $\rho$ to the leader $P_L$. Then, the computation $\rho_L$ is of the form $\rho_L = \rho^1_L \ldots \rho^r_L.\rho^f_L$ where, in phase $\rho^i_L$, the leader writes the elements of a set $S_i = \{u^{S_i}_1, \ldots, u^{S_i}_{n_i}\}$ in $F$, and in $\rho^f_L$ it reads the symbols $u^\#$ for all $u \in U$. The candidate for the cover of $U$ is $S_1, \ldots, S_r \in F$. These are the sets selected by $P_L$ during its $r$ initial phases. A contributor can only read, and thus move in its state space, while the leader is in a phase $\rho^i_L$. This means that contributors can only store symbols that got covered by the chosen sets $S_i$. Moreover, they can only write what they have stored. Since $\rho^f_L$ can be carried out by the leader, the contributors can write all elements $u \in U$ to the memory. Phrased differently, all elements $u \in U$ were stored by contributors and hence covered by $S_1, \ldots, S_r$. ⊓⊔
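To make the shape of this reduction concrete, here is a hypothetical generator for the leader and contributor rules from a Set Cover instance $(U, F, r)$. All state names and the rule encoding are ours, chosen for readability; this is a sketch of the construction described above, not its formal definition.

```python
def lcr_from_set_cover(universe, family, r):
    """Sketch: leader runs r guessing phases, each writing the elements of
    one chosen set, then reads u# for every u; a contributor stores one
    element and can write its #-copy forever after."""
    leader, contrib = [], []
    for i in range(r):
        for S in family:
            elems = sorted(S)
            # nondeterministically pick S and write its elements one by one
            for j, u in enumerate(elems):
                src = ("phase", i) if j == 0 else ("in", i, tuple(elems), j)
                dst = ("phase", i + 1) if j == len(elems) - 1 else ("in", i, tuple(elems), j + 1)
                leader.append((src, "write", u, dst))
    final = sorted(universe)
    for j, u in enumerate(final):
        # final phase: read the #-copy of every element of the universe
        src = ("phase", r) if j == 0 else ("check", j)
        dst = "accept" if j == len(final) - 1 else ("check", j + 1)
        leader.append((src, "read", (u, "#"), dst))
    for u in universe:
        contrib.append(("init", "read", u, ("got", u)))
        contrib.append((("got", u), "write", (u, "#"), ("got", u)))
    return leader, contrib
```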

Formal Construction and Proof of Proposition 18
The construction of Proposition 18 is similar to the construction in the following statement, which presents a lower bound for LCR based on ETH and shows that the algorithm of Section 3.2 has an optimal exponent. For the reduction, let $\varphi$ be a given 3-SAT-instance. We assume $\varphi$ to have the variables $x_1, \ldots, x_n$ and clauses $C_1, \ldots, C_m$. The construction of an LCR-instance relies on the following idea, which is similar to Proposition 18. The leader $P_L$ guesses an evaluation for each variable, starting with $x_1$. To this end, it writes a tuple of the form $(x_1, v_1)$, with $v_1 \in \{0,1\}$, to the memory. A contributor reads the tuple and stores it in its state space. This is repeated for each variable. After the guessing phase, the contributors can write the symbols $\#_j$, depending on whether the currently stored variable with its evaluation satisfies clause $C_j$. As soon as the leader has read the complete string $\#_1 \ldots \#_m$, it moves to its final state, showing that the guessed evaluation satisfies all the clauses.
For the formal construction, let $D = \{(x_i, v) \mid i \in [1..n], v \in \{0,1\}\} \cup \{\#_j \mid j \in [1..m]\} \cup \{a_0\}$. We define the leader to be the tuple $P_L = (Op(D), Q_L, q^{(x,0)}_L, \delta_L)$, where the states are given by $Q_L = \{q^{(x,i)}_L \mid i \in [0..n]\} \cup \{q^{(\#,j)}_L \mid j \in [1..m]\}$. The transition relation $\delta_L$ is defined as follows. We have rules for guessing the evaluation: $q^{(x,i-1)}_L \xrightarrow{!(x_i,v)}_L q^{(x,i)}_L$ for $i \in [1..n]$ and $v \in \{0,1\}$, and rules for reading the string $\#_1 \ldots \#_m$: $q^{(\#,j-1)}_L \xrightarrow{?\#_j}_L q^{(\#,j)}_L$ for $j \in [1..m]$, where we identify $q^{(\#,0)}_L$ with $q^{(x,n)}_L$. The contributor $P_C$ has transitions $q^0_C \xrightarrow{?(x_i,v)}_C q_{(x_i,v)}$ for storing a tuple, and looping transitions $q_{(x_i,v)} \xrightarrow{!\#_j}_C q_{(x_i,v)}$ if variable $x_i$ evaluated to $v$ satisfies clause $C_j$. The program $A$ is defined as the tuple $A = (D, a_0, (P_L, P_C))$ and the LCR-instance is $(A, F_L)$ with $F_L = \{q^{(\#,m)}_L\}$. The correctness of the construction is proven in the following lemma.
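As with the Set Cover reduction, the rules of this construction can be enumerated mechanically. The sketch below uses the same hypothetical rule encoding; clauses are assumed to be given as sets of literal pairs $(i, v)$, meaning $x_i = v$, which is our input format, not the paper's.

```python
def lcr_from_3sat(n, clauses):
    """Sketch of the LCR instance for a 3-SAT formula with n variables.

    Leader: writes a guessed evaluation (x_i, v_i) for i = 1..n, then
    reads the string #_1 ... #_m. Contributor: stores one tuple (x_i, v),
    then loops, writing #_j whenever (x_i, v) satisfies clause C_j."""
    leader, contrib = [], []
    for i in range(1, n + 1):
        for v in (0, 1):
            leader.append((("x", i - 1), "write", ("x", i, v), ("x", i)))
    m = len(clauses)
    for j in range(1, m + 1):
        src = ("x", n) if j == 1 else ("#", j - 1)
        leader.append((src, "read", ("#", j), ("#", j)))  # final state: ("#", m)
    for i in range(1, n + 1):
        for v in (0, 1):
            contrib.append(("init", "read", ("x", i, v), ("got", i, v)))
            for j, clause in enumerate(clauses, start=1):
                if (i, v) in clause:  # x_i evaluated to v satisfies C_j
                    contrib.append((("got", i, v), "write", ("#", j), ("got", i, v)))
    return leader, contrib
```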
Lemma 34. There is a $t \in \mathbb{N}$ so that $c_0 \to^*_{A_t} c$ with $c \in C_f$ if and only if $\varphi$ is satisfiable.
Proof. Let $\varphi$ be satisfiable by the evaluation $v_1, \ldots, v_n$. We show how to construct the desired computation. First set $t = n$, so we have $n$ copies of $P_C$. Let these be denoted by $P_1, \ldots, P_n$.
The leader $P_L$ first guesses the correct evaluation of the variables, and each $P_i$ stores the evaluation for variable $x_i$. This yields the computation $c_0 \to^*_{A_t} (q^{(x,n)}_L, q_{(x_1,v_1)}, \ldots, q_{(x_n,v_n)}, (x_n, v_n)) = c^{(x,n)}$.
Since $v_1, \ldots, v_n$ is a satisfying assignment, there is an index $i_1 \in [1..n]$ such that $x_{i_1}$ evaluated to $v_{i_1}$ satisfies clause $C_1$. Hence, the corresponding contributor $P_{i_1}$ can write the symbol $\#_1$, which is then read by the leader. The process gets repeated for $\#_2, \ldots, \#_m$. Hence, we get a computation $c^{(x,n)} \to^*_{A_t} c^{(\#,m)}$ with $c^{(\#,m)} \in C_f$.

For the other direction, let a $t \in \mathbb{N}$ and a computation $\rho$ from $c_0$ to a configuration in $C_f$ be given. Let $\rho_L$ denote the subcomputation of $\rho$ carried out by the leader $P_L$. Then $\rho_L$ has the form $\rho_L = \rho^1_L.\rho^2_L$, where in $\rho^1_L$ the leader writes the tuples $(x_1, v_1), \ldots, (x_n, v_n)$ and in $\rho^2_L$ it reads the string $\#_1 \ldots \#_m$. We show that $v_1, \ldots, v_n$ is a satisfying assignment for $\varphi$.
Since $P_L$ can read the symbol $\#_1$ during $\rho^2_L$, there is a contributor $P_\ell$ writing the symbol. But this can only happen if $P_\ell$ has stored a tuple $(x_i, v_i)$, written by $P_L$ during $\rho^1_L$, and if $x_i$ evaluated to $v_i$ satisfies clause $C_1$. Since all symbols $\#_1, \ldots, \#_m$ are read by $P_L$, we get that each clause of $\varphi$ is satisfied by the evaluation chosen by $P_L$ during $\rho^1_L$. Hence, $\varphi$ is satisfiable.

⊓ ⊔
To prove Proposition 18, we change the above construction slightly. Let $\varphi_1, \ldots, \varphi_I$ be the given 3-SAT-instances, each pair equivalent under $R$, where $R$ is the polynomial equivalence relation from Theorem 10. Then each formula has the same number of clauses $m$ and uses the same set of variables $\{x_1, \ldots, x_n\}$. We assume $\varphi_\ell = C^\ell_1 \wedge \cdots \wedge C^\ell_m$. First, we let the leader choose an evaluation of the variables $x_1, \ldots, x_n$ as above. The contributors are used to store it. Then, instead of writing just $\#_j$, the contributors can write the symbols $\#^\ell_j$ to indicate that the currently stored variable with its evaluation satisfies clause $C^\ell_j$. The leader can now branch into one of the $I$ instances: it waits to read a string $\#^\ell_1 \ldots \#^\ell_m$ for a certain $\ell \in [1..I]$. If it succeeds, it moves to its final state.
To realize the construction, we need to slightly change the structure of the leader, extend the data domain, and add more transitions to the contributors. The parameter C does not change in this construction; it is still $O(n)$. Hence, the size restrictions of a cross-composition are met. The correctness argument is similar to Lemma 34; the only difference is that $P_L$ also chooses the instance $\varphi_\ell$ that should be satisfied.

C Proofs for Section 3.3
We give the missing constructions and proofs for Section 3.3.

Formal Construction and Proof of Proposition 19
We first give the construction and proof for the W[1]-hardness of LCR(L).

Lemma 35. There is a $t \in \mathbb{N}$ so that $c_0 \to^*_{A_t} c$ with $c \in C_f$ if and only if there is a clique of size $k$ in $G$.
Proof. We first assume that $G$ contains a clique of size $k$, say on the vertices $v_1, \ldots, v_k$. We construct a computation on $A_t$ with $t = k$ that leads from $c_0$ to a configuration $c$ in $C_f$. The program contains $k$ contributors, denoted by $P_1, \ldots, P_k$. We proceed in three phases, as described above.
Note that $P_1$ can read $(v^\#_1, 1)$ and move since it stores exactly $(v_1, 1)$. Any $P_i$ with $i \neq 1$ can read $(v^\#_1, 1)$ and continue its computation since $v_i \neq v_1$ and the two vertices share an edge. Similarly, one can continue the computation: $c_1 \to^*_{A_t} c_k = (q^k_{V^\#}, q^k_{(v_1,1)}, \ldots, q^k_{(v_k,k)}, (v^\#_k, k))$. In the third phase, contributor $P_i$ writes the symbol $\#_i$ to the memory, and the leader waits to read the complete string $\#_1 \ldots \#_k$. This yields the computation $c_k \xrightarrow{!\#_1} \cdot \xrightarrow{?\#_1}_L (q^1_\#, q^f_C, q^k_{(v_2,2)}, \ldots, q^k_{(v_k,k)}, \#_1) \to^*_{A_t} \cdots$, ending in a configuration in $C_f$.
For the other direction, let a $t \in \mathbb{N}$ and a computation $\rho = c_0 \to^*_{A_t} c$ with $c \in C_f$ be given. We denote by $\rho_L$ the part of the computation that is carried out by the leader $P_L$. Then we can factor $\rho_L$ into $\rho_L = \rho^1.\rho^2.\rho^3$, where in $\rho^1$ the leader writes the tuples $(v_1, 1), \ldots, (v_k, k)$, in $\rho^2$ it writes the tuples $(w^\#_1, 1), \ldots, (w^\#_k, k)$, and in $\rho^3$ it reads the string $\#_1 \ldots \#_k$. We show that $w_i = v_i$ for each $i \in [1..k]$ and that $v_i \neq v_j$ for $i \neq j$. Furthermore, we prove that every two vertices $v_i, v_j$ share an edge. Hence, $v_1, \ldots, v_k$ form a clique of size $k$ in $G$.
Since $P_L$ is able to read the symbols $\#_1, \ldots, \#_k$ in $\rho^3$, there are at least $k$ contributors writing them. But a contributor can only write $\#_i$ in its computation if it reads (and stores) the symbol $(v_i, i)$ from $\rho^1$. Hence, there is at least one contributor storing $(v_i, i)$. We denote it by $P_{v_i}$.
The computation $\rho^2$ starts by writing $(w^\#_1, 1)$ to the memory. The contributors $P_{v_i}$ have to read it in order to reach a state where they can write the symbol $\#_i$. Hence, $P_{v_1}$ reads $(w^\#_1, 1)$. By the definition of the transition relation of $P_{v_1}$, this means that $w_1 = v_1$. Now let $P_{v_i}$ with $i \neq 1$. This contributor also reads $(w^\#_1, 1) = (v^\#_1, 1)$. By definition, this implies that $v_i \neq v_1$ and the two vertices share an edge.
By induction, we get that $w_i = v_i$ for all $i$, that the $v_i$ are distinct, and that every two of the $v_i$ share an edge.
⊓⊔

To prove the W[1]-hardness of LCR(D), we go back to our idea of transmitting vertices in binary. Let $t = \log(|V|)$ and let $bin: V \to \{0,1\}^t$ be a binary encoding of the vertices. Instead of a single symbol $(v, i)$ with $v \in V$ and $i \in [1..k]$, the leader will write a string $\#.(\alpha_1, i).\#.(\alpha_2, i).\# \ldots (\alpha_t, i).\#$ to the memory, where $\alpha_1.\alpha_2 \ldots \alpha_t = bin(v)$ and $\#$ is a special padding symbol. We need the padding in order to prevent the contributors from reading a symbol $(\alpha_j, i)$ multiple times. Note that the new data domain contains only $O(k)$ many symbols.
The idea of the program over the changed data domain is similar to the idea above: it proceeds in three phases. In the first phase, the leader chooses the vertices of a clique candidate. This is done by repeatedly writing a string $\#.(\alpha_1, i).\#.(\alpha_2, i).\# \ldots (\alpha_t, i).\#$ to the memory, for each $i \in [1..k]$. Like above, the contributors non-deterministically decide to store a written vertex. To this end, a contributor that wants to store the $i$-th suggested vertex has a binary tree branching on the symbols $(0, i)$ and $(1, i)$. Leaves of the tree correspond to binary encodings of vertices. Hence, a particular vertex can be stored in the contributor's states. Note that, as we did not assume $|V|$ to be a power of 2, there might be leaves of the tree that do not correspond to encodings of vertices. If a computation reaches such a leaf, it will deadlock.
In the second phase, the leader again writes the binary encoding of $k$ vertices to the memory. But this time, it uses a different set of symbols: instead of 0 and 1, the leader uses $0^\#$ and $1^\#$ to separate Phase two from Phase one. The contributors need to compare the suggested vertices as in the above construction. To this end, a contributor storing the vertex $(v, i)$ proceeds in $k$ stages. In stage $j \neq i$ it can only read the encodings of those vertices which are connected to $v$. Hence, if the leader suggests a wrong vertex, the computation will deadlock. In stage $i$, the contributor can only read the encoding of the stored vertex $v$. This allows a verification of the clique as above.
The last phase is identical to the last phase of the above construction: the contributors write the symbols $\#_i$, while the leader waits to read the string $\#_1 \ldots \#_k$. This verifies a proper clique. We omit the formal construction and proof as they are quite similar to the above case.
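Only the padded transmission format matters for the $O(k)$ domain bound. A small hypothetical helper (function name and encoding are ours) makes the format concrete:

```python
import math

def padded_encoding(v_index, i, num_vertices):
    """Return the write sequence #.(a1,i).#...(at,i).# for the vertex with
    number v_index in [0..num_vertices-1], following the format in the text."""
    t = max(1, math.ceil(math.log2(num_vertices)))
    bits = format(v_index, f"0{t}b")
    seq = ["#"]
    for a in bits:
        seq.append((int(a), i))
        seq.append("#")  # padding keeps contributors from re-reading a bit
    return seq
```

For instance, with $|V| = 8$ and $i = 2$, padded_encoding(5, 2, 8) yields ['#', (1, 2), '#', (0, 2), '#', (1, 2), '#']; the alphabet used across all $k$ slots is $\{0, 1\} \times [1..k]$ plus the padding symbol, i.e. $O(k)$ symbols.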

D Proofs for Section 4.1
We give the missing constructions and proofs for Section 4.1.