Fast and parallel decomposition of constraint satisfaction problems

Constraint Satisfaction Problems (CSP) are notoriously hard. Consequently, powerful decomposition methods have been developed to overcome this complexity. However, this poses the challenge of actually computing such a decomposition for a given CSP instance, and previous algorithms have shown their limitations in doing so. In this paper, we present a number of key algorithmic improvements and parallelisation techniques to compute so-called Generalized Hypertree Decompositions (GHDs) faster. We thus advance the ability to compute optimal (i.e., minimal-width) GHDs for a significantly wider range of CSP instances on modern machines. This lays the foundation for more systems and applications in evaluating CSPs and related problems (such as Conjunctive Query answering) based on their structural properties.


Introduction
Many real-life tasks can be effectively modelled as CSPs, giving them a vital importance in many areas of Computer Science. As solving CSPs is a classical NP-complete problem, there is a large body of research to find tractable fragments. One such line of research focuses on the underlying hypergraph structure of a CSP instance. A key result in this area is that CSP instances whose underlying hypergraph is acyclic can be solved in polynomial time [41]. Several generalisations of acyclicity have been identified by defining various forms of hypergraph decompositions, each associated with a specific notion of width [8,18]. Intuitively, the width measures how far away a hypergraph is from being acyclic, with a width of 1 describing the acyclic hypergraphs.
In this work, we focus on Generalized Hypertree Decompositions (GHDs) [20] and generalized hypertree width (ghw). Formally, we look at the problem of checking, for a given hypergraph H and fixed k, whether ghw(H) ≤ k holds and, in the positive case, of computing a GHD of width ≤ k. The computation of GHDs is itself intractable in the general case, already for width = 2 [15]. However, for (hypergraphs of) CSPs with realistic restrictions, this problem becomes tractable for a fixed parameter k. One such restriction is the bounded intersection property (BIP), which requires that any two constraints in a CSP only share a bounded number of variables [15]. Indeed, by examining a large number of CSPs from various benchmarks and real-life applications, it has been verified that this intersection of variables tends to be small in practice [14]. In that work, over 3,000 instances of hypergraphs of CSPs and also of Conjunctive Queries (CQs) were examined and made publicly available in the HyperBench benchmark at http://hyperbench.dbai.tuwien.ac.at.
The use of such decompositions can speed up the solving of CSPs and also the answering of CQs significantly. In fact, in [1] a speed-up of up to a factor of 2,500 was reported for the CQs studied there. Structural decompositions are therefore already being used in commercial products and research prototypes, both in the CSP area as well as in database systems [1,4,5,27,33]. However, previous decomposition algorithms are limited in that they fail to find optimal decompositions (i.e., decompositions of minimal width) even for low widths. This is also the case for various GHD computation methods proposed in [14,31,38]. The overall aim of our work is therefore to advance the state of the art of computing hypergraph decompositions and to make the use of GHDs for solving CSPs applicable to a significantly wider range of CSP instances than previous methods. More specifically, we derive the following research goals: Main Goal: Provide major improvements for computing hypergraph decompositions.
As part of this main goal, we define in particular: Sub-goal 1: Design novel parallel algorithms for structural decompositions, in particular GHDs, and Sub-goal 2: Put all this to work, by implementing and extensively evaluating these improvements.
Note that, apart from GHDs, there are also other types of hypergraph decompositions (see [18] for a comparison), notably tree decompositions (TDs) [37], hypertree decompositions (HDs) [19], and fractional hypertree decompositions (FHDs) [25]. TDs are the oldest and most intensively studied form of these decomposition methods, both in terms of their efficient computation (see e.g. [7,29]), the potential for parallel algorithms (see e.g. [32]) and their application to a wide range of problems, including constraint solving [36]. However, compared with the other types of decompositions mentioned above, TDs have a serious drawback in the context of CSP evaluation and CQ answering: CSP and CQ algorithms based on any of these decompositions essentially run in time O(n^k), where n is the size of the problem instance and k is the width of the decomposition used. However, if we have relations of arity α, then the treewidth may be up to a factor α worse than the width notions based on the other decomposition methods. This is why HDs, GHDs, and FHDs are the better choice for CSPs and CQs.
There exist systems for computing HDs [14,23,38], GHDs [14,31,38], and FHDs [13]. However, to the best of our knowledge, none of them makes use of parallelism. In theory, HD computation is the easiest among the three. Indeed, the check problem (i.e., the problem of checking if a decomposition of some fixed width exists and, in the positive case, computing such a decomposition) is tractable for HDs [19] but intractable for GHDs and FHDs even for width = 2 [15,20]. Nevertheless, despite this tractability result, HD computation has turned out to be computationally expensive in practice. And parallelisation of the computation is tricky in this case since, in contrast to GHDs and FHDs, HDs are based on a rooted tree. This makes it impossible to reuse the ideas of the parallel GHD computation applied here, which recursively splits the task of computing the tree underlying a GHD into subtrees which then have to be re-rooted appropriately when they are stitched together. FHDs are computationally yet more expensive than GHDs. So GHDs are a good middle ground among these three types of decompositions.
As a first step in pursuing the first goal, we aim at generally applicable simplifications of hypergraphs to speed up the decomposition of hypergraphs. Here, "general applicability" means that these simplifications can be incorporated into any decomposition algorithms such as the ones presented in [13,14] and also earlier work such as [23]. Moreover, we aim at heuristics for guiding the decomposition algorithms to explore more promising parts of the big search space first.
However, it will turn out that these simplifications and heuristics are not sufficient to overcome a principal shortcoming of existing decomposition algorithms, namely their sequential nature. Modern computing devices consist of multi-core architectures, and we can observe that single-core performance has mostly stagnated since the mid-2000s. So to produce programs which run optimally on modern machines, one must find a way of designing them to run efficiently in parallel. However, utilising multi-core systems is a non-trivial task, which poses several challenges. In our design of parallel GHD-algorithms, we focus on three key issues: (i) minimising synchronisation delay as much as possible, (ii) finding a way to partition the search space equally among CPUs, thus utilising the resources optimally, and (iii) supporting efficient backtracking, a key element of all structural decomposition algorithms presented so far.
In order to evaluate our algorithmic improvements and our new parallel GHD-algorithms, we have implemented them and tested them on the publicly available HyperBench benchmark mentioned above. For our implementation, we decided to use the programming language Go developed at Google [10], which is based on the classical Communicating Sequential Processes model of [28], since it reduces the need for explicit synchronisation.
To summarise, the main results of this work are as follows:
• We have developed three parallel algorithms for computing GHDs, where the first two are loosely based on the balanced separator method from [3,14]. As has been mentioned above, none of the previous systems for computing HDs, GHDs, or FHDs makes use of parallelism. Our parallel approach has opened the way for a hybrid approach, which combines the strengths of parallel and sequential algorithms. This hybrid approach ultimately proved to be the best.
• In addition to designing parallel algorithms, we propose several algorithmic improvements such as applying multiple pre-processing steps on the input hypergraphs and using various heuristics to guide the search for a decomposition. While most of the pre-processing steps have already been used before, their combination and, in particular, a proof that their exhaustive application yields a unique normal form (up to isomorphism) is new. Moreover, for the hybrid approach, we have explored when to best switch from one approach to the other.
• We have implemented the parallel algorithms together with all algorithmic improvements and heuristics presented here. The source code of the program is available at https://github.com/cem-okulmus/BalancedGo. With our new algorithms and their implementation, dramatically more instances from HyperBench could be solved compared with previous algorithms. More specifically, we could extend the number of hypergraphs with exact ghw known by over 50%. In total, this means that for over 75% of all instances of HyperBench, the exact ghw is now known. If we leave aside the randomly generated CSPs and focus on those from real-world applications, we can show an increase of close to 100%, thus almost doubling the number of instances solved.
Our work therefore makes it possible to compute GHDs efficiently on modern machines for a wide range of CSPs. It enables the fast recognition of low widths for many instances encountered in practice (as represented by HyperBench) and thus lays the foundation for more systems and applications in evaluating CSPs and CQs based on their structural properties.
The remainder of this paper is structured as follows: In Section 2, we provide the needed terminology and recall previous approaches. In Section 3, we present our general algorithmic improvements. This is followed by a description of our parallelisation strategy in Section 4. Experimental evaluations are presented in Section 5. In Section 6, we summarise our main results and highlight directions for future work. This paper is an enhanced and extended version of work presented at IJCAI-PRICAI 2020 [21].

Preliminaries
CSPs & hypergraphs A constraint satisfaction problem (CSP) P is a set of constraints (S_i, R_i), where each S_i = {s_1, … , s_n} is a set of variables and R_i a constraint relation which contains tuples of arity n using values from a domain D. A solution to P is a mapping of variables to values from the domain D, such that for each constraint we map the variables to some tuple in its constraint relation. A hypergraph H is a tuple (V(H), E(H)), consisting of a set of vertices V(H) and a set of hyperedges (synonymously, simply referred to as "edges") E(H) ⊆ 2^V(H), where the notation 2^V(H) signifies the power set of V(H). To get the hypergraph of a CSP P, we consider V(H) to be the set of all variables in P, to be precise ⋃_i S_i, and each S_i to be one hyperedge. Here, we disregard the constraint relations, as they contain no additional structural information.
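To make this correspondence concrete, the following is a minimal sketch in Go (the language of our implementation) of how the hypergraph of a CSP could be extracted from its constraint scopes; the type and function names are illustrative and not taken from our code base.

```go
package main

import "fmt"

// Constraint pairs a scope (set of variables) with a relation;
// the relation carries no structural information and is omitted here.
type Constraint struct {
	Scope []string
}

// Hypergraph stores vertices and hyperedges, each edge being a set of vertices.
type Hypergraph struct {
	Vertices map[string]bool
	Edges    [][]string
}

// hypergraphOfCSP builds the hypergraph of a CSP: the variables become the
// vertices and each constraint scope S_i becomes one hyperedge.
func hypergraphOfCSP(constraints []Constraint) Hypergraph {
	h := Hypergraph{Vertices: make(map[string]bool)}
	for _, c := range constraints {
		for _, v := range c.Scope {
			h.Vertices[v] = true
		}
		h.Edges = append(h.Edges, c.Scope)
	}
	return h
}

func main() {
	csp := []Constraint{{Scope: []string{"x", "y"}}, {Scope: []string{"y", "z"}}}
	h := hypergraphOfCSP(csp)
	fmt.Println(len(h.Vertices), "vertices,", len(h.Edges), "edges")
}
```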
Recall that solving a CSP corresponds to model checking a first-order formula Φ (representing the constraints S i ) over a finite structure (made up by the relations R i ) such that the only connectives allowed in Φ are ∃ and ∧, whereas ∀,∨, and ¬ are disallowed. Hence, formally, CSP solving is equivalent to answering conjunctive queries (CQs) in the database world [30,35]. In the sequel, we will mainly concentrate on CSPs with the understanding that all our results equally apply to CQs.
The intersection size of a hypergraph H is defined as the minimum integer c such that for any two edges e_1, e_2 ∈ E(H), |e_1 ∩ e_2| ≤ c. A class C of hypergraphs has the bounded intersection property (BIP) if there exists a constant c such that every hypergraph H ∈ C has intersection size ≤ c.
We are frequently dealing with sets of sets of vertices (e.g., sets of edges). For S ⊆ 2^V(H), we write ⋃S and ⋂S as a short-hand for taking the union or intersection, respectively, of this set of sets of vertices, i.e., for S = {s_1, … , s_n}, we have ⋃S = s_1 ∪ … ∪ s_n and ⋂S = s_1 ∩ … ∩ s_n. For a set S of edges, we will alternatively also write V(S) to denote the vertices contained in any of the edges in S. That is, we have V(S) = ⋃S.
A generalized hypertree decomposition (GHD) of a hypergraph H is a tuple ⟨T, χ, λ⟩, where T = (N, E(T)) is a tree, and χ and λ are labelling functions, which map to each node n ∈ N two sets, χ(n) ⊆ V(H) and λ(n) ⊆ E(H). For a node n we call χ(n) the bag, and λ(n) the edge cover of n. We denote with B(λ(n)) the set {v ∈ V(H) | v ∈ e for some e ∈ λ(n)}, i.e., the set of vertices "covered" by λ(n). The functions χ and λ have to satisfy the following conditions:
1. For each e ∈ E(H), there is a node n ∈ N s.t. e ⊆ χ(n).
2. For each vertex v ∈ V(H), the set {n ∈ N | v ∈ χ(n)} induces a connected subtree of T.
3. For each node n ∈ N, we have that χ(n) ⊆ B(λ(n)).
The second condition is also referred to as the connectedness condition. The width of a GHD is defined as max{|λ(n)| : n ∈ N}. The generalized hypertree width (ghw) of a hypergraph is the smallest width of any of its GHDs. Deciding if ghw(H) ≤ k for a hypergraph H and fixed k is NP-complete, as one needs to consider exponentially many possible choices for the bag χ(n) for a given edge cover λ(n). It was shown in [14] that for any class of hypergraphs enjoying the BIP, one only needs to consider a polynomial set of subsets of hyperedges (called subedges) to compute their ghw. This fact will be explained in more detail in Section 2.2. Figure 1 shows an example hypergraph, as well as a GHD of this hypergraph. We can see that no λ-label uses more than two hyperedges, and thus this GHD has width 2, and the ghw of the hypergraph is also ≤ 2. In fact, the hypergraph contains alpha cycles [12], e.g., {e_2, e_3, e_4, e_5}. Hence, we also know its ghw must be > 1. Taken together, its ghw is therefore exactly 2. It was shown in [2] that, for every GHD ⟨T, χ, λ⟩ of a hypergraph H, there exists a node n ∈ N such that λ(n) is a balanced separator of H, i.e., after removing the vertices covered by λ(n), every remaining component of H contains at most half of its edges. This property can be used when searching for a GHD of width k of H, as we shall recall in Section 2.2 below.
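The node labels and the width can be expressed very directly in code. The following Go sketch (with illustrative names, not our actual data structures) checks condition 3 for a single node and computes the width of a set of nodes.

```go
package main

import "fmt"

// Edge is a hyperedge, given as a set of vertices.
type Edge map[string]bool

// GHDNode carries the two labels of a decomposition node:
// Bag corresponds to χ(n) and Cover to λ(n).
type GHDNode struct {
	Bag   map[string]bool
	Cover []Edge
}

// coveredVertices computes B(λ(n)), the set of vertices covered by the edge cover.
func coveredVertices(cover []Edge) map[string]bool {
	b := make(map[string]bool)
	for _, e := range cover {
		for v := range e {
			b[v] = true
		}
	}
	return b
}

// checkCoverCondition verifies condition 3 for a single node: χ(n) ⊆ B(λ(n)).
func checkCoverCondition(n GHDNode) bool {
	b := coveredVertices(n.Cover)
	for v := range n.Bag {
		if !b[v] {
			return false
		}
	}
	return true
}

// width returns max |λ(n)| over the given nodes.
func width(nodes []GHDNode) int {
	w := 0
	for _, n := range nodes {
		if len(n.Cover) > w {
			w = len(n.Cover)
		}
	}
	return w
}

func main() {
	e1 := Edge{"a": true, "b": true}
	e2 := Edge{"b": true, "c": true}
	n := GHDNode{Bag: map[string]bool{"a": true, "c": true}, Cover: []Edge{e1, e2}}
	fmt.Println(checkCoverCondition(n), width([]GHDNode{n})) // true 2
}
```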

Computing hypertree decompositions (HDs)
We briefly recall the basic principles of the det-k-decomp program from [23] for computing Hypertree Decompositions (HDs), which was the first implementation of the original HD algorithm from [19]. HDs are GHDs with an additional condition to make their computation tractable in a way explained next. For fixed k ≥ 1, det-k-decomp tries to construct an HD of a hypergraph H in a top-down manner. It thus maintains a set C of edges, which is initialised to C := E(H). For a node n in the HD (initially, this is the root of the HD), it "guesses" an edge cover λ(n), i.e., λ(n) ⊆ E(H) and |λ(n)| ≤ k. For fixed k, there are only polynomially many possible values λ(n). det-k-decomp then proceeds by determining all [λ(n)]-components C_i with C_i ⊆ C. The additional condition imposed on HDs (compared with GHDs) restricts the possible choices for χ(n) and thus guarantees that the [λ(n)]-components inside C and the [χ(n)]-components inside C coincide. This is the crucial property for ensuring polynomial time complexity of HD computation, at the price of possibly missing GHDs with a lower width. Now let C_1, … , C_ℓ denote the [λ(n)]-components with C_i ⊆ C. By the maximality of components, these sets C_i ⊆ E(H) are pairwise disjoint. Moreover, it was shown in [19] that if H has an HD of width ≤ k, then it also has an HD of width ≤ k such that the edges in each C_i are "covered" in different subtrees below n. More precisely, this means that n has ℓ child nodes n_1, … , n_ℓ, such that for every i and every e ∈ C_i, there exists a node n_e in the subtree rooted at n_i with e ⊆ χ(n_e). Hence, det-k-decomp recursively searches for an HD of the hypergraphs H_i with E(H_i) = C_i and V(H_i) = ⋃ C_i, with the slight extra feature that also edges from E(H) \ C_i are allowed to be used in the λ-labels of these HDs.
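The repeated splitting into [λ(n)]-components is a central subtask of det-k-decomp and of the algorithms discussed below. The following Go sketch computes such components under the standard reading that two edges belong to the same component if they are connected via edges sharing at least one vertex not covered by the separator; names are illustrative and the actual implementation is more optimised.

```go
package main

import "fmt"

// components partitions the given edges into components with respect to the
// separator sep (a set of vertices): two edges are in the same component if
// they can be linked by edges sharing at least one vertex outside sep.
func components(edges [][]string, sep map[string]bool) [][][]string {
	n := len(edges)
	visited := make([]bool, n)
	// connected reports whether two edges share a vertex outside sep.
	connected := func(a, b []string) bool {
		inA := make(map[string]bool)
		for _, v := range a {
			if !sep[v] {
				inA[v] = true
			}
		}
		for _, v := range b {
			if inA[v] {
				return true
			}
		}
		return false
	}
	var result [][][]string
	for i := 0; i < n; i++ {
		if visited[i] {
			continue
		}
		// Edges fully covered by the separator do not start a component.
		covered := true
		for _, v := range edges[i] {
			if !sep[v] {
				covered = false
				break
			}
		}
		if covered {
			visited[i] = true
			continue
		}
		// Grow one component via a simple BFS over the edges.
		comp := [][]string{edges[i]}
		visited[i] = true
		queue := []int{i}
		for len(queue) > 0 {
			cur := queue[0]
			queue = queue[1:]
			for j := 0; j < n; j++ {
				if !visited[j] && connected(edges[cur], edges[j]) {
					visited[j] = true
					comp = append(comp, edges[j])
					queue = append(queue, j)
				}
			}
		}
		result = append(result, comp)
	}
	return result
}

func main() {
	edges := [][]string{{"a", "b"}, {"b", "c"}, {"c", "d"}, {"e", "f"}}
	sep := map[string]bool{"c": true}
	fmt.Println(len(components(edges, sep))) // 3
}
```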

Computing GHDs
It was shown in [15] that, even for fixed k = 2, deciding if ghw(H) ≤ k holds for a hypergraph H is NP-complete. However, it was also shown there that if a class of hypergraphs satisfies the BIP, then the problem becomes tractable. The main reason for the NP-completeness in the general case is that, for a given edge cover λ(n), there can be exponentially many bags χ(n) satisfying condition 3 of GHDs, i.e., χ(n) ⊆ B(λ(n)). In other words, to have a sound and complete procedure to check if a given hypergraph has ghw at most k, one would need to check exponentially many possible bags for any given edge cover.
Note that if, for a node n with λ(n) = {e_{i_1}, … , e_{i_ℓ}}, we replace each edge e_{i_j} by the subedge e′_{i_j} = e_{i_j} ∩ χ(n), thus obtaining λ′(n) = {e′_{i_1}, … , e′_{i_ℓ}}, then we get χ(n) = B(λ′(n)). The key to the tractability shown in [15] in case of the BIP (i.e., the intersection of any two distinct edges is bounded by a constant b) is twofold: first, it is easy to see that, w.l.o.g., we may restrict the search for a GHD of desired width k to so-called "bag-maximal" GHDs. That is, for any node n, it is impossible to add another vertex to χ(n) without violating a condition from the definition of GHDs. And second, it is then shown in [15] for bag-maximal GHDs that each e′_{i_j} is either equal to e_{i_j} or a subset of e_{i_j} with |e′_{i_j}| ≤ k · b. Hence, there are only polynomially many choices of subedges e′_{i_j} and also of χ(n). More precisely, for a given edge e, the set of subedges to consider forms a polynomially bounded set, which we denote by f_e(H,k). In [14], this property was used to design a program for GHD computation as a straightforward extension of det-k-decomp by adding the polynomially many subedges f_e(H,k) for all e ∈ E(H) to E(H). In the hypergraph extended in this way, we can thus be sure that λ(n) can always be replaced by λ′(n) with χ(n) = B(λ′(n)).
In [14], yet another GHD algorithm was presented. It is based on the use of balanced separators and extends ideas from [3]. The motivation of this approach comes from the observation that there is no useful upper bound on the size of the subproblems that have to be solved by the recursive calls of the det-k-decomp algorithm. In fact, for some node n with corresponding component C, let C_1, … , C_ℓ denote the [λ(n)]-components with C_i ⊆ C. Then there may exist an i such that C_i is "almost" as big as C. In other words, in the worst case, the recursion depth of det-k-decomp may be linear in the number of edges.
The Balanced Separator approach from [14] uses the fact that every GHD must contain a node whose λ-label is a balanced separator. Hence, in each recursive decomposition step for some subset E′ of the edges of H, the algorithm "guesses" a node n′ such that λ(n′) is a balanced separator of the hypergraph with edges E′. Of course, this node n′ is not necessarily a child node n_i of the current node n but may lie somewhere inside the subtree T_i below n. However, since GHDs can be arbitrarily rooted, one may first compute this subtree T_i with n′ as the root and with n_i as a leaf node. This subtree is then (when returning from the recursion) connected to node n by re-rooting T_i at n_i and turning n_i into a child node of n. The definition of balanced separators guarantees that the recursion depth is logarithmically bounded. This makes the Balanced Separator algorithm a good starting point for our parallel algorithm to be presented in Section 4.
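As a concrete illustration of the balancedness property exploited here, the following Go sketch checks whether a candidate set of edges is a balanced separator of a given edge set, i.e., whether every remaining component contains at most half of the edges. The names and the exact treatment of fully covered edges are simplifications and not taken from our actual implementation.

```go
package main

import "fmt"

// find with path halving for a simple union-find over edge indices.
func find(parent []int, x int) int {
	for parent[x] != x {
		parent[x] = parent[parent[x]]
		x = parent[x]
	}
	return x
}

// isBalancedSeparator checks the property used by the Balanced Separator
// approach: after removing the vertices covered by sep, every remaining
// component contains at most half of the edges.
func isBalancedSeparator(edges [][]string, sep [][]string) bool {
	covered := make(map[string]bool)
	for _, e := range sep {
		for _, v := range e {
			covered[v] = true
		}
	}
	parent := make([]int, len(edges))
	for i := range parent {
		parent[i] = i
	}
	// Union edges that share a vertex not covered by the separator.
	owner := make(map[string]int) // first edge seen for each uncovered vertex
	for i, e := range edges {
		for _, v := range e {
			if covered[v] {
				continue
			}
			if j, ok := owner[v]; ok {
				parent[find(parent, i)] = find(parent, j)
			} else {
				owner[v] = i
			}
		}
	}
	// Count component sizes and compare against half the number of edges.
	size := make(map[int]int)
	max := 0
	for i := range edges {
		r := find(parent, i)
		size[r]++
		if size[r] > max {
			max = size[r]
		}
	}
	return max <= len(edges)/2
}

func main() {
	edges := [][]string{{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}}
	fmt.Println(isBalancedSeparator(edges, [][]string{{"c", "d"}})) // true
}
```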

Algorithmic improvements
In this section, we present several algorithmic improvements of decomposition algorithms. We start with some simplifications of hypergraphs, which can be applied as a preprocessing step for any hypergraph decomposition algorithm, i.e., they are not restricted to the GHD algorithms discussed here. We shall then also mention further algorithmic improvements which are specific to the GHD algorithms presented in this paper. We note that, while the GHD-specific algorithmic improvements are new, the simplifications mentioned below have already been used before and/or are quite straightforward. We prove that their exhaustive application to an arbitrary hypergraph yields a unique normal form up to isomorphism. For the sake of completeness, we also prove the correctness and polynomial time complexity of their application.

Hypergraph preprocessing
An important step to speed up decomposition algorithms is the simplification of the input hypergraph. Before we formally present such simplifications, we observe that we may restrict ourselves to connected hypergraphs, formally those having only a single [∅]-component, since a GHD of a hypergraph consisting of several connected components can be obtained by combining the GHDs of each connected component in an "arbitrary" way, e.g., appending the root of one GHD as a child of an arbitrarily chosen node of another GHD. This can never violate the connectedness condition, since the GHDs of different components have no vertices in common. It is easy to verify that the simplifications proposed below never make a connected hypergraph disconnected. Hence, splitting a hypergraph into its connected components can be done upfront, once and for all. After that, we are exclusively concerned with connected hypergraphs. Given a (connected) hypergraph H = (V(H), E(H)), we thus propose the exhaustive application of the following reduction rules in a don't-care non-deterministic fashion:
Rule 1. If H contains a vertex v that occurs in only a single edge e, then we may delete v from e and from V(H).
Rule 2. If H contains two edges e_1, e_2 with e_1 ⊆ e_2, then we may delete e_1 from E(H).
The exhaustive application of Rules 1 and 2 constitutes the so-called GYO reduction, which was introduced in [24,42]. The next reduction of hypergraphs makes use of the notion of types of vertices.
Here the type of a vertex v is defined as the set of edges e which contain v. We thus define Rule 3 as follows: Rule 3. Suppose that H contains vertices v 1 ,v 2 of the same type. Then we may delete v 2 from V (H) and thus from all edges containing v 2 .
The next reduction rule considered here uses the notion of hinges. In [26], hinge decompositions were introduced to help split CSPs into smaller subproblems. In [17], the combination of hinge decompositions and hypertree decompositions was studied. We also make use of hinge decompositions as part of our preprocessing. More specifically, we define the following reduction rule:
Rule 4. Suppose that H contains an edge e such that H has ℓ ≥ 2 [e]-components C_1, … , C_ℓ. Then H may be replaced by the hypergraphs H_1, … , H_ℓ with E(H_i) = C_i ∪ {e} and V(H_i) = ⋃ E(H_i).
The above simplifications (above all the splitting into smaller hypergraphs via Rule 4) may produce a hypergraph that is so small that the construction of a GHD of width ≤ k for given k ≥ 1 becomes trivial. The following rule allows us to eliminate such trivial cases:
Rule 5. If |E(H)| ≤ k, then H may be deleted. It has a trivial GHD consisting of a single node n with λ(n) = E(H) and χ(n) = ⋃ E(H).
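As a small illustration of the first two rules (the GYO reduction), the following Go sketch applies Rules 1 and 2 exhaustively; it is a simplified stand-in for our preprocessing (which also implements Rules 3 - 5), and all names are illustrative.

```go
package main

import "fmt"

// gyoReduce exhaustively applies Rule 1 (delete vertices occurring in only one
// edge) and Rule 2 (delete edges contained in another edge) and returns the
// reduced edge set.
func gyoReduce(edges []map[string]bool) []map[string]bool {
	changed := true
	for changed {
		changed = false
		// Rule 1: delete vertices occurring in a single edge.
		count := make(map[string]int)
		for _, e := range edges {
			for v := range e {
				count[v]++
			}
		}
		for _, e := range edges {
			for v := range e {
				if count[v] == 1 {
					delete(e, v)
					changed = true
				}
			}
		}
		// Rule 2: delete edges contained in another edge (one of two equal
		// edges is kept, the other removed).
		var kept []map[string]bool
		for i, e := range edges {
			contained := false
			for j, f := range edges {
				if i == j || len(e) > len(f) {
					continue
				}
				subset := true
				for v := range e {
					if !f[v] {
						subset = false
						break
					}
				}
				if subset && (len(e) < len(f) || i > j) {
					contained = true
					break
				}
			}
			if contained {
				changed = true
			} else {
				kept = append(kept, e)
			}
		}
		edges = kept
	}
	return edges
}

func main() {
	edges := []map[string]bool{
		{"a": true, "b": true, "c": true},
		{"b": true, "c": true},
		{"c": true, "d": true},
	}
	fmt.Println(len(gyoReduce(edges))) // 1
}
```

On an acyclic hypergraph, this reduction shrinks the whole edge set down to a single (eventually empty) edge, which is the essence of the classical GYO acyclicity test.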
In Theorems 1 and 2 below, we state several crucial properties of the reductions. Most importantly, these reductions neither add nor lose solutions. Moreover, preprocessing a hypergraph with these rules can be done in polynomial time.
Note that, even though all Rules 1 - 5 are applied to a single hypergraph, the result in case of Rule 4 is a set of hypergraphs. Hence, strictly speaking, these rules form a rewrite system that transforms a set of hypergraphs into another set of hypergraphs, where the starting point is a singleton consisting of the initial hypergraph only. However, to keep the notation simple, we will concentrate on the effect of these rules on a single hypergraph, with the understanding that application of one of these rules comes down to selecting an element from a set of hypergraphs and replacing this element by the hypergraph(s) according to the above definition of the rules.
Proof We split the proof into two main parts: first, we consider the complexity of exhaustive application of Rules 1 - 5 and then we prove the soundness of the rules. The polynomial-time complexity of constructing a GHD of H from GHDs of the resulting hypergraphs {H_1, … , H_m} will be part of the correctness proof.

Complexity of exhaustive rule application
Rules 1 - 3 have the effect that the size of H is strictly decreased by either deleting vertices or edges. Hence, there can be only linearly many applications of Rules 1 - 3 and each of these rule applications is clearly feasible in polynomial time. Likewise, Rule 5, which allows us to delete a non-empty hypergraph, can only be applied linearly often and any application of this rule is clearly feasible in polynomial time. Checking if Rule 4 is applicable and actually applying Rule 4 is also feasible in polynomial time. Hence, it only remains to show that the total number of applications of Rule 4 is polynomially bounded. To see this, we first of all make the following observation on the number of edges in each H_i: consider a single application of Rule 4 and suppose that, for some edge e, there are ℓ [e]-components C_1, … , C_ℓ. These [e]-components are pairwise disjoint and we have C_i ⊆ E(H) \ {e} for each i. Hence, if |E(H)| = n and |C_i| = n_i with n_i ≥ 1, then n_1 + … + n_ℓ ≤ n − 1 holds. Moreover, |E(H_i)| = n_i + 1, since we add e to each component. We claim that, in total, when applying Rules 1 - 4 exhaustively to a hypergraph H with n ≥ 3 edges, there can be at most 2n − 3 applications of Rule 4. Note that for n = 1 or n = 2, Rule 4 is not applicable at all.
We prove this claim by induction on n: if H has 3 edges, then an application of Rule 4 is only possible if we find an edge e such that there are two [e]-components C_1, C_2, each consisting of a single edge. Hence, such an application of Rule 4 produces two hypergraphs H_1, H_2 with 2 edges each, to which no further application of Rule 4 is possible. Hence, the total number of applications of Rule 4 is bounded by 1 and, for n = 3, we indeed have 1 ≤ 2n − 3 = 3.
For the induction step, suppose that the claim holds for any hypergraph with ≤ n − 1 edges and suppose that H has n edges. Moreover, suppose that an application of Rule 4 for some edge e is possible with ℓ ≥ 2 [e]-components C_1, … , C_ℓ and let |C_i| = n_i. Then H is split into ℓ hypergraphs H_1, … , H_ℓ with |E(H_i)| = n_i + 1. Note that applications of any of the Rules 1 - 3 to the hypergraphs H_i can never increase the number of edges. These rules may thus be ignored and we may apply the induction hypothesis to each H_i. Hence, for every i, there are at most 2(n_i + 1) − 3 = 2n_i − 1 applications of Rule 4 in total possible for H_i. Taking all the resulting hypergraphs H_1, … , H_ℓ together, the total number of applications of Rule 4 is therefore ≤ (2n_1 + … + 2n_ℓ) − ℓ. Together with the inequalities n_1 + … + n_ℓ ≤ n − 1 and ℓ ≥ 2, and adding the initial application of Rule 4, we thus have, in total, ≤ 2(n − 1) − ℓ + 1 = 2n − 2 − ℓ + 1 ≤ 2n − 2 − 2 + 1 = 2n − 3 applications of Rule 4.
Soundness For the soundness of our reduction system, we have to prove the soundness of each single rule application. Likewise, for the polynomial-time complexity of constructing a GHD of H from the GHDs of the final hypergraph set {H_1, … , H_m}, it suffices to show that one can efficiently construct a GHD of the original hypergraph from the GHD(s) of the hypergraph(s) resulting from a single rule application. This is due to the fact that we have already shown above that the total number of rule applications is polynomially bounded. It thus suffices to prove the following claim (Claim A, stated below). Note that we have omitted Rule 5 in this claim, since both the soundness and the polynomial-time construction of a GHD of width ≤ k are trivial. The proof of Claim A is straightforward but lengthy due to the case distinction over the 4 remaining rules. It is therefore deferred to Section 3.3.

Claim A Let H be a hypergraph and suppose that H′ is the result of a single application of one of the Rules 1 - 3 to H, or that {H_1, … , H_ℓ} is the result of a single application of Rule 4 to H. Then ghw(H) ≤ k if and only if ghw(H′) ≤ k (respectively, ghw(H_j) ≤ k for every j ∈ {1, … , ℓ}). Moreover, from a GHD of width ≤ k of H′ (respectively, GHDs of width ≤ k of H_1, … , H_ℓ), a GHD of width ≤ k of H can be constructed in polynomial time.
Note that the application of one rule may enable the application of another rule; so their combination may lead to a greater simplification compared to just any one rule alone. Now the question naturally arises if the order in which we apply the rules has an impact on the final result. We next show that exhaustive application of Rules 1 - 5 leads to a unique (up to isomorphism) result, even if they are applied in a don't-care non-deterministic fashion. Proof Recall that, in Theorem 1, we have already shown that the rewrite system is terminating (actually, we have even shown that there are at most polynomially many rule applications). In order to show that the rewrite system guarantees a unique normal form (up to isomorphism), it is therefore sufficient to show that it is locally confluent [6]. That is, we have to prove the following property: Let H be a set of hypergraphs and suppose that there are two possible ways of applying Rules 1 - 5 to (an element H of) H, so that H can be transformed to either H_1 or H_2. Then there exists a set of hypergraphs H′, such that both H_1 and H_2 can be transformed into H′ by a sequence of applications of Rules 1 - 5. In the notation of [6], this property is succinctly presented as follows: H_1 ← H → H_2 implies H_1 →* H′ *← H_2 for some H′. To prove this property, we have to consider all possible pairs (i,j) of applicable Rules i and j.
This case distinction is rather tedious (especially the cases where Rule 4 is involved) but not difficult. We thus defer the details to Section 3.4.

Finding balanced separators fast
It has already been observed in [23] that the ordering in which edges are considered is vital for finding an appropriate edge cover λ(n) for the current node n in the decomposition fast. However, the ordering used in [23] for det-k-decomp (which was called MCSO, i.e., maximal cardinality search ordering) turned out to be a poor fit for finding balanced separators. A natural alternative was to consider, for each edge e, all possible paths between vertices in the hypergraph H, and how much the length of these paths increases after removal of e. This provides a weight for each edge, based on which we can define the maximal separator ordering. In our tests, this proved to be a very effective heuristic. Unfortunately, computing the maximal separator ordering requires solving the all-pairs shortest path problem. Using the well-known Floyd-Warshall algorithm [16,40] as a subroutine, this leads to a fairly high complexity (see Table 1), which proved to be prohibitively expensive for practical instances. We thus explored two other, computationally simpler, heuristics, which order the edges in descending order of the following measures:
• The vertex degree of an edge e is defined as the sum of deg(v) over all vertices v ∈ e, where deg(v) denotes the degree of a vertex v, i.e., the number of edges containing v.
• The edge degree of an edge e is |{f ∈ E(H) : e ∩ f ≠ ∅}|, i.e., the number of edges e has a non-empty intersection with.
In our empirical evaluation, we found both of these to be useful compromises between speeding up the search for balanced separators and the complexity of computing the ordering itself, with the vertex degree ordering yielding the best results, i.e., we compute λ(n) by first trying to select edges with higher vertex degree.

Finding the next balanced separator Finding a balanced separator fast is important for the performance of our GHD algorithm, but it is not enough: if the balanced separator thus found does not lead to a successful GHD computation, we have to try another one. Hence, it is important to find the next balanced separator fast and to avoid trying the same balanced separator multiple times. The GHD algorithm based on balanced separators presented in [14] searches through all ℓ-tuples of edges (with ℓ ≤ k) to find the next balanced separator. The number of edge-combinations to check therefore depends directly on the number of edges. Note that this number of edges is actually higher than in the input hypergraph due to the subedges that have to be added for the tractability of GHD computation (see Section 2.2). Before we explain our improvement, let us formally explain how subedges factor into the search. Let us assume that we are given an edge cover (e_1, … , e_k), consisting of exactly k edges. Using the function f_e(H,k), which generates the set of subedges to consider for any given edge e (defined in Section 2.2), we obtain a considerably enlarged set of edge combinations when factoring in all the relevant subedges. We note a significant source of redundancy in this set: if one only focuses on the combination of ℓ ≤ k edges to intersect with e, it is possible that the same bags (when taking the union of their vertices) are generated multiple times. We can address this by shifting our focus to the actual bags χ(n) generated from each λ(n) thus computed. Therefore, we initially only look for balanced separators of size k consisting of full edges. Only if such an edge cover λ(n) does not lead to a successful recursive call of the decomposition procedure do we also inspect subsets of χ(n), strictly avoiding the computation of the same subset of χ(n) several times by inspecting different subedges of the original edge cover λ(n). We thus also do not add subedges to the hypergraph upfront but only as they are needed as part of the backtracking when the original edge cover λ(n) did not succeed. Separators consisting of fewer edges are implicitly considered by allowing also the empty set as a possible subedge.
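To illustrate the vertex degree ordering heuristic described at the beginning of this subsection, the following is a minimal Go sketch; the names are illustrative and not taken from our implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// orderByVertexDegree sorts edges in descending order of their vertex degree,
// i.e., the sum over all vertices of the edge of the number of edges
// containing that vertex.
func orderByVertexDegree(edges [][]string) [][]string {
	deg := make(map[string]int)
	for _, e := range edges {
		for _, v := range e {
			deg[v]++
		}
	}
	weight := func(e []string) int {
		w := 0
		for _, v := range e {
			w += deg[v]
		}
		return w
	}
	sorted := append([][]string{}, edges...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return weight(sorted[i]) > weight(sorted[j])
	})
	return sorted
}

func main() {
	edges := [][]string{{"a", "b"}, {"b", "c", "d"}, {"d", "e"}}
	fmt.Println(orderByVertexDegree(edges)) // [[b c d] [a b] [d e]]
}
```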
Summary Our initial focus was to speed up existing decomposition algorithms via improvements as described above. However, even though these algorithmic improvements showed some positive effect, it turned out that a more fundamental change is needed. We have thus turned our attention to parallelisation, which will be the topic of Section 4. But first we present the missing parts of the proofs of Theorems 1 and 2 in Sections 3.3 and 3.4, respectively.

Completion of the proof of Theorem 1
It remains to prove Claim A from the proof in Section 3.1.

Proof of Claim A
We prove the claim for each rule separately. It is convenient to treat E(H) as a multiset, i.e., E(H) may contain several "copies" of an edge. This simplifies the argumentation below, when the deletion of vertices may possibly make two edges identical. Note that, if at all, this only happens in intermediate steps, since Rule 2 above will later lead to the deletion of such copies anyway. □

Rule 1 Suppose that H = (V(H), E(H)) contains a vertex v that only occurs in a single edge e and we delete v from e and from V(H) altogether. Let e′ = e \ {v}, i.e., E(H′) = (E(H) \ {e}) ∪ {e′} and V(H′) = V(H) \ {v}.
⇒: Let D = ⟨T, χ, λ⟩ be a GHD of H of width ≤ k. We construct a GHD D′ = ⟨T′, χ′, λ′⟩ as follows: the tree structure T remains unchanged, i.e., we set T′ = T. For every node n in the tree T′, we define χ′(n) and λ′(n) as follows:
• For every node n with e ∈ λ(n), we set λ′(n) = (λ(n) \ {e}) ∪ {e′} and χ′(n) = χ(n) \ {v}.
• For all other nodes n in T′, we set χ′(n) = χ(n) and λ′(n) = λ(n).
It is easy to verify that D′ is a GHD of H′. Moreover, the width clearly does not increase by this transformation.
⇐: Let D′ = ⟨T′, χ′, λ′⟩ be a GHD of H′ of width ≤ k. By the definition of GHDs, T′ must contain at least one node n, such that e′ ⊆ χ′(n). We arbitrarily choose one such node n with e′ ⊆ χ′(n). Then we construct GHD D = ⟨T, χ, λ⟩ as follows:
• T contains all nodes and edges from T′ plus one additional leaf node n′, which we append as a child node of n.
• For n′, we set λ(n′) = {e} and χ(n′) = e.
• Let n be a node in T′ with e′ ∈ λ′(n). Then we set λ(n) = (λ′(n) \ {e′}) ∪ {e} and we leave χ′ unchanged, i.e., χ(n) = χ′(n).
• For all other nodes n in T, we set λ(n) = λ′(n) and χ(n) = χ′(n).
Clearly, D can be constructed from D′ in polynomial time. Moreover, it is easy to verify that D is a GHD of H. In particular, the connectedness condition is not violated by the introduction of the new node n′ into the tree, since vertex v ∈ χ(n′) occurs nowhere else in D and all other vertices in χ(n′) are also contained in χ(n) for the parent node n of n′. Moreover, the width clearly does not increase by this transformation since the new node n′ has |λ(n′)| = 1 and, for all other λ-labels, the cardinality has been left unchanged.

Rule 2
Suppose that H = (V(H), E(H)) contains two edges e_1, e_2 such that e_1 ⊆ e_2 and we delete e_1 from E(H), i.e., E(H′) = E(H) \ {e_1} and V(H′) = V(H).
⇒: Let D = ⟨T, χ, λ⟩ be a GHD of H of width ≤ k. We construct a GHD D′ = ⟨T′, χ′, λ′⟩ as follows: the tree structure T remains unchanged, i.e., we set T′ = T. For every node n in the tree T′, we define χ′(n) and λ′(n) as follows:
• For every node n with e_1 ∈ λ(n), we set λ′(n) = (λ(n) \ {e_1}) ∪ {e_2}.
• For all other nodes n in T′, we set λ′(n) = λ(n).
• For all nodes n in T′, we set χ′(n) = χ(n).
It is easy to verify that D′ is a GHD of H′ and that the width does not increase by this transformation.
⇐: Let D′ = ⟨T′, χ′, λ′⟩ be a GHD of H′ of width ≤ k. It is easy to verify that then D′ is also a GHD of H. Indeed, we only need to verify that T′ contains a node n with e_1 ⊆ χ′(n). By the definition of GHDs, there exists a node n in T′ with e_2 ⊆ χ′(n). Hence, since we have e_1 ⊆ e_2, also e_1 ⊆ χ′(n) holds.

Rule 3 Suppose that H = (V(H), E(H)) contains two vertices v_1, v_2 which occur in precisely the same edges and we delete v_2 from all edges and thus from V(H) altogether, i.e., V(H′) = V(H) \ {v_2} and E(H′) = {e \ {v_2} : e ∈ E(H)}.
It is convenient to introduce the following notation: for an edge e ∈ E(H), we write e′ to denote e \ {v_2}.
⇒: Let D = ⟨T, χ, λ⟩ be a GHD of H of width ≤ k. We construct a GHD D′ = ⟨T′, χ′, λ′⟩ as follows: the tree structure T remains unchanged, i.e., we set T′ = T. For every node n in the tree T′, we set χ′(n) = χ(n) \ {v_2} and λ′(n) = {e′ : e ∈ λ(n)}. It is easy to verify that D′ is a GHD of H′ and that the width does not increase by this transformation.
⇐: Let D′ = ⟨T′, χ′, λ′⟩ be a GHD of H′ of width ≤ k. Then we construct a GHD D = ⟨T, χ, λ⟩ as follows: the tree structure T′ remains unchanged, i.e., we set T = T′. For every node n in the tree T, we define λ(n) and χ(n) as follows: for each e′ ∈ λ′(n), we put a corresponding edge e ∈ E(H) with e′ = e \ {v_2} into λ(n); moreover, we set χ(n) = χ′(n) ∪ {v_2} if v_1 ∈ χ′(n) and χ(n) = χ′(n) otherwise. Clearly this transformation is feasible in polynomial time and it does not increase the width. In order to show that D is indeed a GHD of H, there are two non-trivial parts, namely: (1) for every e ∈ E(H), there exists a node n in T with e ⊆ χ(n), and (2) χ(n) ⊆ B(λ(n)) holds for every node n even if we add vertex v_2 to the χ-label. These are the two places where we make use of the fact that v_1 and v_2 occur in precisely the same edges in E(H).
For part (1), note that there exists a node n in T′ (and hence in T), such that e′ ⊆ χ′(n). If v_1 ∉ χ′(n), then v_1 ∉ e′ and, therefore, v_1 ∉ e. Hence (since v_1 and v_2 have the same type), also v_2 ∉ e. We thus have e = e′ and e ⊆ χ(n) = χ′(n). On the other hand, if v_1 ∈ χ′(n), then v_2 ∈ χ(n) by the above construction of D. Hence, e ⊆ χ(n) again holds, since e ⊆ e′ ∪ {v_2}.
For part (2), consider an arbitrary vertex v ∈ χ(n). We have to show that v ∈ B(λ(n)). First, suppose that v ≠ v_2. Then we have v ∈ χ′(n) ⊆ B(λ′(n)) ⊆ B(λ(n)). It remains to consider the case v = v_2. Then, by the above construction of D, we have v_1 ∈ χ′(n). We observe the following chain of implications: from v_1 ∈ χ′(n) ⊆ B(λ′(n)), there exists an edge e′ ∈ λ′(n) with v_1 ∈ e′ and hence v_1 ∈ e for the corresponding edge e ∈ λ(n); since v_1 and v_2 have the same type, also v_2 ∈ e and, therefore, v_2 ∈ B(λ(n)).

Rule 4
Suppose that H = (V(H), E(H)) contains an edge e with [e]-components C_1, … , C_ℓ with ℓ ≥ 2. Further, suppose that we apply Rule 4 to replace H by the hypergraphs H_1, … , H_ℓ with E(H_j) = C_j ∪ {e}.
⇒: Let D = ⟨T, χ, λ⟩ be a GHD of H of width ≤ k. We construct GHDs D_j = ⟨T_j, χ_j, λ_j⟩ of each H_j as follows: by the definition of GHDs, there must be a node n in T such that e ⊆ χ(n) holds. We choose such a node n and, w.l.o.g., we may assume that n is the root of D. Let {D_1, … , D_m} denote the [χ(n)]-components of H and let D_0 denote the set of edges f ∈ E(H) with f ⊆ χ(n). It was shown in [19] that D can be transformed into a GHD D′ = ⟨T′, χ′, λ′⟩, such that the root node n is left unchanged (i.e., in particular, we have χ(n) = χ′(n) and λ(n) = λ′(n)) and n has m child nodes n_1, … , n_m, such that there is a one-to-one correspondence between these child nodes and the [χ′(n)]-components D_1, … , D_m in the following sense: for every edge e_i ∈ D_i, there exists a node n′_i in the subtree rooted at n_i in T′ such that e_i ⊆ χ′(n′_i). Intuitively, this means that the subtrees rooted at each of the child nodes of n "cover" precisely one [χ′(n)]-component. We make the following crucial observations: first, every [χ′(n)]-component D_i with i ≥ 1 is fully contained in one of the [e]-components C_j; and second, every edge of E(H) belongs to D_i for some i ≥ 0. Then, for j ∈ {1, … , ℓ}, we define a GHD D_j = ⟨T_j, χ_j, λ_j⟩ of H_j as follows:
• T_j is the subtree of T′ consisting of the following nodes:
  - the root node n is contained in T_j;
  - for every i ∈ {1, … , m}, if D_i ⊆ C_j, then all nodes in the subtree rooted at n_i are contained in T_j;
  - no further nodes are contained in T_j.
• For every node n in T_j, we set χ_j(n) = χ′(n) ∩ V(H_j).
• For every node n in T_j, we distinguish two cases for defining λ_j(n): if λ′(n) ⊆ E(H_j), then we set λ_j(n) = λ′(n); otherwise, we set λ_j(n) = (λ′(n) \ δ) ∪ {e}, where δ = λ′(n) \ E(H_j) denotes the set of edges of λ′(n) that do not belong to E(H_j).
It remains to verify that D_j is indeed a GHD of width ≤ k of H_j.

1. Consider an arbitrary f ∈ E(H_j). We have to show that there exists a node n in T_j with f ⊆ χ_j(n). By the second observation above, we know that f ∈ D_i for some i ≥ 0. If f ∈ D_0, then f ⊆ χ_j(n) for the root node n holds and we are done.
On the other hand, if f ∈ D_i for some i ≥ 1, then there exists a node n in the subtree of T′ rooted at n_i with f ⊆ χ′(n). Moreover, since D_i ∩ D_0 = ∅, we know that f ≠ e and, therefore, f ∈ C_j holds. By f ∈ C_j and f ∈ D_i, we have D_i ⊆ C_j. Hence, by our construction of T_j, the subtree rooted at n_i is contained in T_j and, since f ⊆ V(H_j), we have f ⊆ χ′(n) ∩ V(H_j) = χ_j(n).
2. Consider an arbitrary vertex v ∈ V(H_j). We have to show that {n ∈ N_j | v ∈ χ_j(n)} induces a connected subtree of T_j, where N_j denotes the node set of T_j. Let n_1 and n_2 be two nodes in N_j with v ∈ χ_j(n_1) and v ∈ χ_j(n_2). Then also v ∈ χ′(n_1) and v ∈ χ′(n_2) hold. Hence, in the GHD D′, for every node n on the path between n_1 and n_2, we have v ∈ χ′(n). Hence, every such node n also satisfies v ∈ χ_j(n) by the definition χ_j(n) = χ′(n) ∩ V(H_j).

3. Consider an arbitrary node n in T_j. We have to show that χ_j(n) ⊆ B(λ_j(n)) holds. We distinguish the two cases from the definition of λ_j(n):
• If λ′(n) ⊆ E(H_j) holds, then we have λ_j(n) = λ′(n). Hence, from the property χ′(n) ⊆ B(λ′(n)) for the GHD D′ and χ_j(n) ⊆ χ′(n), it follows immediately that χ_j(n) ⊆ B(λ_j(n)).
• Otherwise, we have λ_j(n) = (λ′(n) \ δ) ∪ {e} with δ = λ′(n) \ E(H_j), and we have to show that no vertex of χ_j(n) is lost by this replacement. To this end, it actually suffices to show that every f′ ∈ δ has the property f′ ∩ V(H_j) ⊆ e: indeed, an edge f′ ∉ E(H_j) can intersect V(H_j) = V(C_j) ∪ e only inside e, since different [e]-components share no vertices outside e. Hence, χ_j(n) ⊆ B(λ_j(n)) also holds in this case.
4. Finally, the width of D_j is clearly ≤ k, since λ_j(n) is either equal to λ′(n) or we add e but only after subtracting a non-empty set δ from λ′(n).
⇐: For every j ∈ {1, … , ℓ}, let D_j = ⟨T_j, χ_j, λ_j⟩ be a GHD of H_j of width ≤ k. By the definition of GHDs and by the fact that e ∈ E(H_j) holds for every j, there exists a node n_j in T_j with e ⊆ χ_j(n_j). W.l.o.g., we may assume that n_j is the root of T_j. Then we construct a GHD D = ⟨T, χ, λ⟩ as follows:
• The tree structure T is obtained by introducing a new node n as the root of T, whose child nodes are n_1, … , n_ℓ, and each tree T_j becomes the subtree of T rooted at n_j.
• For the root node n, we set χ(n) = e and λ(n) = {e}.
• For any other node n of T, we have that n comes from exactly one of the trees T_j. We thus set χ(n) = χ_j(n) and λ(n) = λ_j(n).
Clearly, D can be constructed in polynomial time from the GHDs D_1, … , D_ℓ. Moreover, the width of D is obviously bounded by the maximum width over the GHDs D_j. It remains to verify that D is indeed a GHD of H.

1. Consider an arbitrary f ∈ E(H). We have to show that there is a node n in T s.t. f ⊆ χ(n). By the definition of [e]-components, we either have f ∈ C_i for some i or f ⊆ e. If f ∈ C_i, then there exists a node n in the subtree rooted at n_i with χ(n) = χ_i(n) ⊇ f. If f ⊆ e, then we have f ⊆ χ(n) for the root node n.
2. Consider an arbitrary vertex v ∈ V(H). We have to show that {n ∈ N | v ∈ χ(n)} is a connected subtree of T, where N denotes the node set of T. Let v ∈ χ(n_1) and v ∈ χ(n_2) for two nodes n_1 and n_2 in N and let n be on the path between n_1 and n_2. If both nodes are in some subtree T_i of T, then the connectedness condition carries over from D_i to D. If one of the nodes n_1 and n_2 is the root n of T, say n = n_1, then v ∈ e. Moreover, we have e ⊆ χ(n_i) by our construction of D. Hence, we may again use the connectedness condition on D_i to conclude that v ∈ χ(n) for every node n along the path between n_1 and n_2. Finally, suppose that n_1 and n_2 are in different subtrees T_i and T_j. Then v ∈ V(H_i) ∩ V(H_j) holds and, therefore, v ∈ e by the construction of H_i and H_j via different [e]-components. Hence, we are essentially back to the previous case. That is, we have v ∈ χ(n) for every node n along the path from the root to n_1 and for every node n along the path from the root to n_2. Together with v ∈ χ(n) for the root node n, we may thus conclude that v ∈ χ(n) indeed holds for every node n along the path between n_1 and n_2.
3. Consider an arbitrary node n in T. We have to show that χ(n) ⊆ B(λ(n)). Clearly, all nodes in a subtree T_i inherit this property from the GHD D_i and also the root node n satisfies this condition by our definition of χ(n) and λ(n).

Completion of the proof of Theorem 2
We now make a case distinction over all possible pairs (i,j) of applicable Rules i and j.
"(2,2)": Suppose that two different applications of Rule 2 to the same hypergraph H ∈ H are possible, i.e., there are edges e_1 ⊆ e′_1 and e_2 ⊆ e′_2. Then Rule 2 is applicable to e_1, e′_1 (thus allowing us to delete e_1) and also to e_2, e′_2 (thus allowing us to delete e_2). Hence, again, no matter whether we first delete e_1 or e_2, we are afterwards allowed to delete also the other edge via Rule 2.
"(2,3)": Suppose that an application of Rule 2 and an application of Rule 3 to the same hypergraph H ∈ H are possible. That is, H contains edges e 1 ,e 2 , such that e 1 ⊆ e 2 and vertices v 1 ,v 2 of the same type. Hence, on one hand, we may delete e 1 from H by Rule 2 and, on the other hand, we may delete v 2 from H by Rule 3.
"(3,3)": Suppose that two different applications of Rule 3 to the same hypergraph H ∈ H are possible, i.e., there are vertices v_1 and v′_2 as well as v_2 and v′_2 that have the same type. Hence, Rule 3 is applicable to v_1, v′_2 (thus allowing us to delete v_1) and also to v_2, v′_2 (thus allowing us to delete v_2). Hence, again, no matter whether we first delete v_1 or v_2, we are afterwards allowed to delete also the other vertex via Rule 3.
" (3,4)": Suppose that an application of Rule 3 and an application of Rule 4 to the same hypergraph H ∈ H are possible. That is, H contains vertices v 1 ,v 2 of the same type and an edge e such that H has [e]-components C = {C 1 , … , C } with ℓ ≥ 2. For any edge f, we write f ′ to denote f � = f ⧵ {v 2 }.Case 1. Suppose that v 2 ∉e. Then v 2 is contained in V (C i ) for precisely one [e]-component C i . Moreover, since v 1 has the same type as v 2 , also the set C ′ i obtained from C i by deleting v 2 from all edges remains [e]-connected. This is because that all paths that use the vertex v 2 may also use the vertex v 1 instead. Hence, after Note that here we do not even make use of the fact that v 1  Suppose that e 1 ⊆ e 2 or e 2 ⊆ e 1 holds. The cases are symmetric, so we only need to consider e 1 ⊆ e 2 . This case is very similar to "(2,4)", Case 2, where e 1 now plays the role of e from "(2,4)". Indeed, w.l.o.g., we again assume e 2 ∈ C ℓ . If Rule 4 is applied to the [e 1 ]-components first, then we end up in precisely the same situation as with "(2,4)". On the other hand, if Rule 4 is applied to the [e 2 ]-components first, then all subedges of e 2 are actually deleted -including e 1 . Hence, we again end up in precisely the same situation as with "(2,4)". Let H ′ be the set of hypergraphs obtained from H by replacing H in H by the following set of hypergraphs:  On the other hand, if e 1 is part of this path, then it must be either on the path f 1 -g or f 2 -g but not both. By symmetry, we may assume w.l.o.g., that e 1 is on the path f 1 -g. Then the path f 2 -g is an [e 1 ]-path. Again, this contradicts our assumption that g and f 2 Let H ′ be the set of hypergraphs obtained from H by replacing H in H by the following set of hypergraphs: We are assuming that {e 1 , e 2 } ⊆ R n . Hence, e 1 is in precisely one [e 2 ]-component T j inside R n and e 2 is in precisely one [e 1 ]-component S i inside R n . W.l.o.g., we may assume that e 1 ∈ T β and e 2 ∈ S α . Analogously to the Case 2.1, we claim that then all of T 1 ∪ … ∪ T −1 ∪ {e 2 } is contained in S α . This can be seen as follows: we are assuming that H is connected. Then also T 1 ∪ … ∪ T −1 ∪ {e 2 } is connected and even [e 1 ]-connected, since e 1 ∈ T β . Hence, T 1 ∪ … ∪ T −1 ∪ {e 2 } lies in a single [e 1 ]-component, namely S α . By symmetry, also S 1 ∪ … ∪ S −1 ∪ {e 1 } is contained in a single [e 2 ]-component, namely T β .
We now define the set H′ of hypergraphs by combining the ideas of the Cases 2.1 and 2.2.1. Let H′ be the set of hypergraphs obtained from H by replacing H in H by the following set of hypergraphs: In other words, the vertices in e_1 \ d only occur in a single edge of F′_i with i ≤ n − 1, namely in the edge e_1. We may therefore apply Rule 1 multiple times to each of the hypergraphs F′_i with i ≤ n − 1. In this way, we replace e_1 in each of these hypergraphs by d and we indeed transform F′_i into F_i for every i ≤ n − 1.
Now consider the hypergraph G_α with E(G_α) = S_α ∪ {e_1}. We apply Rule 4 to the [e_2]-components of G_α. As was observed above, the [e_2]-components T_1, … , T_{β−1} of H are fully contained in S_α and, hence, in E(G_α). Considering T_1, … , T_{β−1} as [e_2]-components of G_α, the application of Rule 4 gives rise to the hypergraphs with edge sets T_1 ∪ {e_2}, … , T_{β−1} ∪ {e_2}. It remains to consider the remaining [e_2]-component T_β of H, but now restricted to the hypergraph G_α with E(G_α) = S_α ∪ {e_1}. It suffices to show that (S_α ∩ T_β) ∪ {e_1} is [e_2]-connected because, in this case, we would indeed get (S_α ∩ T_β) ∪ {e_1} as the remaining [e_2]-component when applying Rule 4 to G_α, and K with E(K) = (S_α ∩ T_β) ∪ {e_1, e_2} would be the remaining hypergraph produced by this application of Rule 4. The proof follows the same line of argumentation as Case 2.1. More specifically, assume to the contrary that there are two edges f_1, f_2 in (S_α ∩ T_β) ∪ {e_1} that are not [e_2]-connected.

Parallelisation strategy
As described in more detail below, we use a divide and conquer method, based on the balanced separator approach. This method divides a hypergraph into smaller hypergraphs, called subcomponents. Our method proceeds to work on these subcomponents in parallel, with each round reducing the size of the hypergraphs (i.e., the number of edges in each subcomponent) to at most half. Thus, after logarithmically many rounds, the method will have decomposed the entire hypergraph, if a decomposition of width k exists. For the computation we use the modern programming language Go [10], whose model of concurrency is based on Communicating Sequential Processes [28].
In Go, a goroutine is a sequential process. Multiple goroutines may run concurrently. In the pseudocode provided, these are spawned using the keyword go, as can be seen in Algorithm 1, line 16. They communicate over channels. Using a channel ch is indicated by ← ch for receiving from a channel, and by ch ← for sending to ch.
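The following minimal Go example shows this pattern in isolation: a goroutine is spawned with go, and its result is exchanged over a channel.

```go
package main

import "fmt"

// A minimal illustration of the goroutine/channel pattern used in the
// pseudocode: a worker is spawned with "go", performs some work, and sends
// its result back over a channel, where the parent receives it with "<-ch".
func main() {
	ch := make(chan int)

	// Spawn a goroutine that computes something and sends the result to ch.
	go func() {
		result := 6 * 7
		ch <- result // send to the channel
	}()

	answer := <-ch // receive from the channel; blocks until the worker sends
	fmt.Println("worker returned", answer)
}
```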

Overview
Algorithm 1 contains the full decomposition procedure, whereas Function FindBalSep details the parallel search for separators, as it is a key subtask for parallelisation. To emphasise the core ideas of our parallel algorithm, we present it as a decision procedure, which takes as input a hypergraph H and a parameter k, and returns as output either Accept if ghw(H) ≤ k or Reject otherwise. Please note, however, that our actual implementation also produces a GHD of width ≤ k in case of an accepting run.
For the GHD computation, we may assume w.l.o.g. that the input hypergraph has no isolated vertices (i.e., vertices that do not occur in any edge). Hence, we may identify H with its set of edges E(H) with the understanding that V(H) = ⋃ E(H) holds. Likewise, we may consider a subhypergraph H′ of H as a subset H′ ⊆ H where, strictly speaking, we mean H′ ⊆ E(H) with V(H′) = ⋃ H′. Our parallel Balanced Separator algorithm begins with an initial call to the procedure Decomp, as seen in line 2 of Algorithm 1. The procedure Decomp takes two arguments, a subhypergraph H′ of H for the current subcomponent considered, and a set Sp of "special edges". These special edges indicate the balanced separators encountered so far, as can be seen in line 16, where the current separator subSep is added to the argument on the recursive call, combining all its vertices into a new special edge. The special edges are needed to ensure that the decompositions of subcomponents can be combined to an overall decomposition, and are thus considered as additional edges. The goal of procedure Decomp is to find a GHD D of H′ ∪ Sp in such a way that each special edge s ∈ Sp must be "covered" by a leaf node n_s of D with the properties λ(n_s) = {s} and χ(n_s) = s, and s may not occur in the λ-label of any other node in the GHD, i.e., we may only use edges from H for these λ-labels. Thus the set Sp imposes additional conditions on the GHD. In the sequel, we shall refer to a pair (H′, Sp) consisting of a subhypergraph H′ of H and a set of special edges Sp as an "extended subhypergraph" of H. Clearly, also H itself together with the empty set of special edges is an extended subhypergraph of itself and a GHD of H also satisfies the additional conditions of a GHD of the extended subhypergraph (H, ∅). Hence, the central procedure Decomp in Algorithm 1, when initially called on line 2, checks if there exists a GHD of desired width of the extended subhypergraph (H, ∅), that is, a GHD of hypergraph H itself.
The recursive procedure Decomp has its base case in lines 4 to 5, when the size of H ′ and Sp together is less than or equal to 2. The remainder of Decomp consists of two loops, the Separator Loop, from lines 7 to 24, which iterates over all balanced separators, and within it the SubEdge Loop, running from lines 12 to 23, which iterates over all subedge variants of any balanced separator. New balanced separators are produced with the subprocedure FindBalSep, used in line 8 of Algorithm 1, and detailed in Function FindBalSep. After a separator is found, Decomp computes the new subcomponents in line 13. Then goroutines are started using recursive calls of Decomp for all found subcomponents. If any of these calls returns Reject, seen in line 19, then the procedure starts checking for subedges. If they have been exhausted, the procedure checks for another separator. If all balanced separators have been tried without success, then Decomp rejects in line 25.
We proceed to detail the parallelisation strategy of the two key subtasks: the search for new separators and the recursive calls over the subcomponents created from a chosen separator.

Parallel search for balanced separators
Before describing our implementation, we define some needed notions. For the search for balanced separators within an extended subhypergraph H′ ∪ Sp, we can determine the set of relevant edges from the hypergraph, defined as E* = {e ∈ E(H) | e ∩ ⋃(H′ ∪ Sp) ≠ ∅}. We assume for this purpose that the edges in E* are ordered and carry indices in {1, … , |E*|} according to this ordering. We can then define the following notions. A k-combination is a tuple (i_1, … , i_k) of indices with 1 ≤ i_1 < … < i_k ≤ |E*|, representing the choice of the corresponding k edges of E*. For two k-combinations a, b, we say b is one step ahead of a, denoted as a <_1 b, if, w.r.t. the lexicographical ordering <_lex on the tuples, we have a <_lex b and there exists no other k-combination c s.t. a <_lex c <_lex b. To generalise, we say c is i steps ahead of a with i > 1, if there exists some b s.t. a <_{i−1} b <_1 c.

Example 3 Consider the hypergraph H from Fig. 3 (an example hypergraph where the vertices are represented by letters, with explicit edge names). Assume that we are currently investigating the extended subhypergraph with H′ = {e_3, e_4, e_5} and Sp = {{a,b,e,f}}. By the definition above, this gives us the following set of relevant edges: E* = {e_2, e_3, e_4, e_5, e_6}. We assume the ordering to be simply the order in which the edges are written here, i.e., index 1 refers to edge e_2, 2 refers to e_3, etc.
Let us assume that we are looking for separators of size 3, so k = 3. We would then start the search with the 3-combination (1,2,3), which represents the choice of e_2, e_3, e_4. If we move one step ahead, we next get the 3-combination (1,2,4), which represents the choice of e_2, e_3, e_5. Moving further 3 steps ahead, we produce the 3-combination (1,3,5), representing the choice of e_2, e_4, e_6.
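A possible way to implement this stepping through k-combinations is sketched below in Go (with illustrative names, not our actual code); the example in main reproduces the steps of Example 3.

```go
package main

import "fmt"

// nextCombination advances the k-combination c (indices 1..n in increasing
// order) by one step in lexicographic order. It returns false when c is
// already the last combination. This mirrors the "one step ahead" relation.
func nextCombination(c []int, n int) bool {
	k := len(c)
	for i := k - 1; i >= 0; i-- {
		if c[i] < n-(k-1-i) {
			c[i]++
			for j := i + 1; j < k; j++ {
				c[j] = c[j-1] + 1
			}
			return true
		}
	}
	return false
}

// stepsAhead advances c by the given number of steps ("i steps ahead").
func stepsAhead(c []int, n, steps int) bool {
	for s := 0; s < steps; s++ {
		if !nextCombination(c, n) {
			return false
		}
	}
	return true
}

func main() {
	c := []int{1, 2, 3} // the choice of the edges with indices 1, 2, 3 out of n = 5
	nextCombination(c, 5)
	fmt.Println(c) // [1 2 4]
	stepsAhead(c, 5, 3)
	fmt.Println(c) // [1 3 5]
}
```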
In our parallel implementation, after testing out a number of configurations, we ultimately settled on the following scenario, shown in Function FindBalSep: we first create w worker goroutines, seen in lines 3 to 4, where w stands for the number of CPUs we have available. This corresponds to splitting the workspace into w parts and assigning each of them to one worker. Each worker is passed two arguments. First, the workers are passed a channel ch, which they will use to send back any balanced separators they find. The worker procedure iterates over all candidate separators in its assigned search space and sends back the first balanced separator it finds over the channel.
Secondly, to coordinate the search, each worker is passed a k-combination, where the needed ordering is on the relevant edges defined earlier. Furthermore, each worker starts with a distinct offset of j steps ahead, where 0 ≤ j ≤ w − 1, and will only check k-combinations that are w steps apart each. This ensures that no worker will redo the work of another one, and that together they still cover the entire search space. An illustration for this can be seen in Fig. 4, which shows this splitting of the workspace with 3 workers, k = 3 and |E*| = 5.
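The following Go sketch illustrates this splitting of the search space: worker j starts j steps ahead and always advances w steps, and the first successful worker reports back over a channel. The predicate standing in for the balancedness check, as well as all names, are placeholders rather than our actual code; a real implementation would also advance the stored combination in place instead of recomputing it.

```go
package main

import (
	"fmt"
	"sync"
)

// combinationAt returns the k-combination that is "steps" steps ahead of the
// first combination (1, 2, ..., k), or nil if the search space is exhausted.
func combinationAt(k, n, steps int) []int {
	c := make([]int, k)
	for i := range c {
		c[i] = i + 1
	}
	for s := 0; s < steps; s++ {
		if !next(c, n) {
			return nil
		}
	}
	return c
}

func next(c []int, n int) bool {
	k := len(c)
	for i := k - 1; i >= 0; i-- {
		if c[i] < n-(k-1-i) {
			c[i]++
			for j := i + 1; j < k; j++ {
				c[j] = c[j-1] + 1
			}
			return true
		}
	}
	return false
}

// findSeparator splits the space of k-combinations over w workers: worker j
// starts j steps ahead and always advances w steps, so the workers jointly
// cover every combination exactly once. The first combination accepted by the
// placeholder predicate isGood is sent back over the channel.
func findSeparator(k, n, w int, isGood func([]int) bool) []int {
	ch := make(chan []int, w)
	var wg sync.WaitGroup
	for j := 0; j < w; j++ {
		wg.Add(1)
		go func(offset int) {
			defer wg.Done()
			for step := offset; ; step += w {
				c := combinationAt(k, n, step)
				if c == nil {
					return // this worker's share of the search space is exhausted
				}
				if isGood(c) {
					ch <- c
					return
				}
			}
		}(j)
	}
	go func() { wg.Wait(); close(ch) }()
	return <-ch // nil if the channel was closed without any result
}

func main() {
	// Placeholder predicate standing in for the balancedness check.
	isGood := func(c []int) bool { return c[0]+c[1]+c[2] == 9 }
	fmt.Println(findSeparator(3, 5, 3, isGood))
}
```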
Having started the workers, FindBalSep then waits for one of two conditions (whichever happens first): either one of the workers finds a balanced separator, lines 6 to 7, or none of them does and they all terminate on their own once they have exhausted the search space. Then the empty set is returned, as seen in line 8, indicating that no further balanced separators from edges in E * exist. We note that balanced separators composed from subedges are taken care of in Algorithm 1 on lines 19 to 21, and are therefore not relevant for the search inside the Function FindBalSep.
We proceed to explain how this design addresses the three challenges for a parallel implementation we outlined in the introduction.
(i) This design reduces the need for synchronisation: each worker is responsible for a share of the search space, and a worker is only ever stopped when either it has found a balanced separator or another worker has done so.

(ii) The number of worker goroutines scales with the number of available processors. This allows us to make full use of the available hardware when searching for balanced separators, and the design above makes it easy to support an arbitrary number of processors without a large increase in synchronisation overhead.

(iii) Finally, our design addresses backtracking as follows: as explained, the workers employ a set of k-combinations, called M in Function FindBalSep, to store their current progress, allowing them to generate the next separator to consider. Crucially, this data structure is stored in Decomp (seen in line 6 of Algorithm 1) even after the search is over. Therefore, in case we need to backtrack, the algorithm can quickly continue the search exactly where it left off, without losing any work. If multiple workers find a balanced separator, one of them arbitrarily "wins", and during backtracking the other workers can immediately send their found separators to FindBalSep again.
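A sketch of what this resumable search state could look like in Go is given below; the type and field names are our own illustration, not the actual BalancedGo data structures.

// searchState stores, per worker, the k-combination it will examine
// next, plus any separators that were already found by "losing"
// workers but not yet consumed. Keeping this state alive inside the
// Decomp call lets backtracking resume the separator search exactly
// where it left off instead of starting over.
type searchState struct {
    M     [][]int // current k-combination of each worker
    found [][]int // separators found but not yet used
}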

Parallel recursive calls
For the recursive calls on the produced subcomponents, we create a separate goroutine for each such call, as explained in the overview. This can be seen in Algorithm 1, line 15, where the output is then sent back via the channel ch. Each call receives as arguments its own extended subhypergraph, as well as an additional special edge. The output is received at line 18, where the algorithm waits for all recursive calls to finish before it either returns Accept, or rejects the current separator in case any recursive call returns Reject. The fact that all recursive calls can be worked on concurrently is in itself a major performance boost: in the sequential case, all recursive calls are executed in a loop, whereas the parallel algorithm (lines 14 to 15 of Algorithm 1) executes them simultaneously. Thus, if one parallel call rejects, we can stop all the other calls early and potentially save a lot of time. It is easy to imagine a case where, in a sequential execution, a rejecting call is encountered only near the end of the loop.
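The essence of this pattern can be sketched in Go as follows, using the context package for early cancellation; Subhypergraph and decompose are illustrative placeholders for the actual data structures and the recursive Decomp call of Algorithm 1.

import "context"

// Subhypergraph is a placeholder for an extended subhypergraph (H', Sp).
type Subhypergraph struct{ /* edges, special edges, ... */ }

// decompose stands in for the recursive Decomp call on one subcomponent;
// a real implementation would periodically check ctx.Done() to stop early.
func decompose(ctx context.Context, c Subhypergraph) bool { return true }

// decompParallel runs one goroutine per subcomponent and waits for all
// of them, rejecting early as soon as any subcomponent rejects.
func decompParallel(ctx context.Context, comps []Subhypergraph) bool {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // signals any still-running calls once we return

    ch := make(chan bool, len(comps)) // buffered: no goroutine leaks
    for _, c := range comps {
        go func(c Subhypergraph) { ch <- decompose(ctx, c) }(c)
    }
    for range comps {
        if !<-ch {
            return false // one rejection suffices to reject the separator
        }
    }
    return true // all subcomponents accepted
}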
We now state how we addressed the challenges of parallelisation in this area:

(i) Making use of goroutines and channels makes it easy to avoid any interference between the recursive calls, and the design allows each recursive call to run and return its results fully independently. Thus, when running the recursive calls concurrently, we do not have to make use of synchronisation at all.

(ii) The second challenge, scaling with the number of CPUs, is initially limited by the number of recursive calls, which in turn depends on the number of connected components. We can ensure, however, that there are generally at least two, unless we manage to cover half the graph with just k edges. While this might look problematic at first, each of these recursive calls will either hit a base case or once more start a search for a balanced separator, which, as outlined earlier, can always make use of all CPU cores. This construction is aided by the fact that Go can easily manage a very large number of goroutines, scheduling them to make optimal use of the available resources. Thus our second challenge has also been addressed.

(iii) The third challenge, regarding backtracking, was formulated with the search for a balanced separator in mind and is thus not directly applicable to the calls of the procedure Decomp. To speed up backtracking in this case as well, we initially considered the use of caching, which was used to great effect in det-k-decomp [23]. The algorithm presented here, however, differs significantly from det-k-decomp through the introduction of special edges. This makes cache hits very unlikely, since both the subhypergraph H′ and the set of special edges Sp must coincide between two calls of Decomp in order to reuse a previously computed result from the cache. Hence, caching turned out not to be effective here.
Another important topic concerns the scheduling of goroutines. This is relevant for us since, during every recursive call, we start as many goroutines as there are CPUs. Fortunately, Go implements a so-called "work stealing" scheduler, which allows idle CPUs to take over parts of the work of other CPUs. Since goroutines have far less overhead than ordinary threads, we can be confident that our algorithm maximally utilises the given CPU resources without creating excessive overhead. For more information about the scheduling of goroutines, we refer to the book by Cox-Buday [9].
To summarise, two of the three challenges were addressed and solved, while the third, which mainly targeted the search for balanced separators, is not applicable here. The parallelisation of the recursive calls therefore gives a decent speed-up, as will be illustrated by the experimental results in Section 5.

Correctness of the parallel algorithm
It is important to note that this parallel algorithm is a correct decomposition procedure. More formally, we state the following property:

Theorem 3
The algorithm for checking the ghw of a hypergraph given in Algorithm 1 is sound and complete. More specifically, Algorithm 1 with input H and parameter k will accept if and only if there exists a GHD of H with width ≤ k. Moreover, by materialising the decompositions implicitly constructed in the recursive calls of the Decomp function, a GHD of width ≤ k can be constructed efficiently in case the algorithm returns Accept.
Proof A sequential algorithm for GHD computation based on balanced separators was first presented in [14]. Let us refer to it as SequentialBalSep. A detailed proof of the soundness and completeness of SequentialBalSep is given in [14]. For convenience of the reader, we recall the pseudo-code description of SequentialBalSep from [14] in Appendix A. In order to prove the soundness and completeness of our parallel algorithm for GHD computation, it thus suffices to show that, for every hypergraph H and integer k ≥ 1, our algorithm returns Accept if and only if SequentialBalSep returns a GHD of H of width ≤ k. Hence, since both algorithms operate on the same notion of extended subhypergraphs and their GHDs, we have to show that, for every k ≥ 1 and every input (H′, Sp), the Decomp function of our algorithm returns Accept if and only if the Decompose function of the SequentialBalSep algorithm returns a GHD of H′ ∪ Sp of width ≤ k.
To prove this equivalence between our new parallel algorithm and the previous SequentialBalSep algorithm from [14], we inspect the main differences between the two algorithms and argue that they do not affect the equivalence:

1. Decision problem vs. search problem. While SequentialBalSep outputs a concrete GHD of the desired width if one exists, we have presented our algorithm as a pure decision procedure, which outputs Accept or Reject. Note that this was done only to simplify the notation. It is easy to verify that the construction of a GHD in the SequentialBalSep algorithm on lines 5-12 (for the base case) and on line 27 (for the inductive case) can be taken over literally for our parallel algorithm.

2. Parallelisation. The most important difference between the previous sequential algorithm and the new parallel algorithm is the parallelisation. As mentioned before, parallelisation is applied on two levels: splitting the search for the next balanced separator into parallel subtasks via Function FindBalSep, and processing recursive calls of the function Decomp in parallel. The parallelisation via Function FindBalSep is discussed separately below; we concentrate on the recursive calls of the function Decomp first. On lines 13-22 of our parallel algorithm, the function Decomp is called recursively for all components of a given balanced separator, and Accept is returned on line 22 if and only if all these recursive calls are successful. Otherwise, the next balanced separator is searched for. The analogous work is carried out on lines 18-27 of the SequentialBalSep algorithm: the function Decompose is called recursively for all components of a given balanced separator and (by combining the GHDs returned from these recursive calls) a GHD of the given extended subhypergraph is returned on line 27 if and only if all these recursive calls are successful. Otherwise, the next balanced separator is searched for.

3. Search for balanced separators. As detailed in Section 4.2, our Function FindBalSep splits the search space for a balanced separator into w pieces (where w denotes the number of available workers) and searches for a balanced separator in parallel. In principle, this function has the same functionality as the iterator BalSepIt in the SequentialBalSep algorithm: the set of balanced separators of size k found for an extended subhypergraph H′ ∪ Sp is the same whether one calls the Function FindBalSep until it returns the empty set, or calls the iterator BalSepIt until it has no more elements to return. However, the calls of Function FindBalSep implement one of the algorithmic improvements presented in Section 3.2: note that the SequentialBalSep algorithm assumes that all required subedges of edges from E(H) have been added to E(H) before executing the algorithm. It may thus happen that, by considering different subedges of a given k-tuple of edges, the same separator (i.e., the same set of vertices) is obtained several times. As explained in Section 3.2, we avoid this repetition of work by concentrating on the set of vertices of a given balanced separator (i.e., sep returned on line 8 and used to initialise subSep on line 11) and iterating through all balanced separators obtained as "legal" subsets by calling the NextSubedgeSep function on line 20. This means that we ultimately get precisely the same balanced separators (considered as sets of vertices) as in the SequentialBalSep algorithm.

Hybrid approach - best of both worlds
Based on this parallelisation scheme, we produced a parallel implementation of the Balanced Separator algorithm, including the improvements described in Section 3. While we already saw some promising results, we noticed that for many instances this purely parallel approach was not fast enough. We therefore explored a more nuanced approach, mixing parallel and sequential algorithms. We now present this novel combination of parallel and sequential decomposition algorithms: it contains all the improvements mentioned in Section 3 and combines the Balanced Separator algorithm from Sections 4.1-4.3 with det-k-decomp as recalled in Section 2.1.
This combination is motivated by two observations. The Balanced Separator algorithm is very effective at splitting large hypergraphs into smaller ones, and in negative cases it can quickly stop the computation if no balanced separator for a given subcomponent exists. It is slower for smaller instances, however, where the computational overhead of finding balanced separators at every step slows things down; furthermore, for technical reasons, it is also less effective at making use of caching. det-k-decomp, on the other hand, with proper heuristics, is very efficient for small instances and allows for effective caching, thus avoiding repetition of work.
The Hybrid approach proceeds as follows: for a fixed number m of rounds, the algorithm tries to find balanced separators. Each such round is guaranteed to halve the number of hyperedges considered. Hence, after those m rounds, the number of hyperedges in each remaining subcomponent is reduced to at most |E(H)|/2^m; for example, after m = 4 rounds, a hypergraph with 1000 edges leaves subcomponents with at most ⌊1000/2^4⌋ = 62 edges each. The Hybrid algorithm then proceeds to finish the remaining subcomponents using the det-k-decomp algorithm.
This required quite extensive changes to det-k-decomp, since it must now be able to deal with special edges. Each individual call of det-k-decomp runs sequentially. However, since the m rounds can produce a number of components, many calls of det-k-decomp can run in parallel. In other words, our Hybrid approach also brings a certain level of parallelism to det-k-decomp.
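A highly simplified sketch of this switch-over logic in Go is given below, reusing the Subhypergraph placeholder from the earlier sketch; findBalancedSeparator, split and detKDecomp are illustrative stubs, and both the backtracking over alternative separators and the parallel execution of the recursive calls are omitted for brevity.

// Placeholder signatures for the routines combined by the Hybrid approach:
func findBalancedSeparator(c Subhypergraph, k int) (sep []int, ok bool) { return nil, false }
func split(c Subhypergraph, sep []int) []Subhypergraph                  { return nil }
func detKDecomp(c Subhypergraph, k int) bool                            { return true }

// decompHybrid tries balanced separators for the first m recursion
// levels and hands the (by then small) subcomponents to det-k-decomp.
func decompHybrid(comp Subhypergraph, k, depth, m int) bool {
    if depth >= m {
        // After m rounds each component has at most |E(H)|/2^m edges,
        // a size at which det-k-decomp with caching shines.
        return detKDecomp(comp, k)
    }
    sep, ok := findBalancedSeparator(comp, k)
    if !ok {
        return false // no balanced separator exists: reject this branch
    }
    for _, sub := range split(comp, sep) {
        // In the real implementation these calls run concurrently, and
        // further separators are tried upon rejection (backtracking).
        if !decompHybrid(sub, k, depth+1, m) {
            return false
        }
    }
    return true
}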

Experimental evaluation and results
We performed our experiments on the HyperBench benchmark from [14] with the goal of determining the exact generalized hypertree width of significantly more instances. We evaluated how our approach compares with existing attempts to compute the ghw, and we investigated which heuristics and parameters prove beneficial for which instances. The detailed results of our experiments, as well as the source code of our Go programs, are publicly available. Together with the benchmark instances, which are detailed below and also publicly available, this ensures the reproducibility of our experiments.

Benchmark instances and setting
HyperBench The instances used in our experiments are taken from the benchmark HyperBench, collected from various sources in industry and the literature, which was released in [14] and made publicly available at http://hyperbench.dbai.tuwien.ac.at. It consists of 3648 hypergraphs from CQs and CSPs, where for many CSP instances the exact ghw was still undetermined. In this extended evaluation, we used a larger set of instances than in the original paper [21], to reflect the newest version of the benchmark published in [14]. We provide a more detailed overview of the various instances, grouped by their origin, in Table 2. The first two columns under "Avg sizes" refer to the sizes of instances within each group, and the final column "Size" refers to the cardinality of the group, i.e., how many instances it includes. The two "Arity" columns refer to the maximum and average edge sizes of the hypergraphs in each group.
Hardware and software We used Go 1.2 for our implementation, which we refer to as BalancedGo. Our experiments ran on a cluster of 12 nodes, each running Ubuntu 16.04.1 LTS, with a 24-core Intel Xeon E5-2650v4 CPU clocked at 2.20 GHz and 256 GB of RAM per node. We disabled Hyper-Threading for the experiments.

Setup and limits
For the experiments, we set a number of limits to test the efficiency of our solution. For each run, consisting of an input (i.e., a hypergraph H and an integer k) and a configuration of the decomposer, we set a one-hour (3600 seconds) timeout and limited the available RAM to 1 GB. These limits are justified by the fact that they are the same limits as were used in [14], thus ensuring the direct comparability of our test results. To enforce these limits and run the experiments, we used the HTCondor software [39], originally named just Condor. Note that for the test runs of HtdSMT, we set the available RAM to 24 GB, as that particular solver had a much higher memory consumption during our tests.

Empirical results
The key results from our experiments are summarised in Table 4, with Table 3 acting as a comparison point. Under "Decomposition Methods" we use "ensemble" to indicate that results from multiple algorithms are pooled, i.e., results from the Hybrid algorithm, the parallel Balanced Separator algorithm and det-k-decomp. To also assess the performance of one of the individual approaches introduced in Section 4, the results of the Hybrid approach (from Section 4.5) are shown separately in their own part of the table. As reference points, we considered on the one hand the NewDetKDecomp library from [14] and on the other hand the SAT Modulo Theory based solver HtdSMT from [38]. For each of these, we also list the average and maximal time to compute a GHD of optimal width for each group of instances of HyperBench, as well as the standard deviation. The minimal times are left out for brevity, since they are always near or equal to 0. Note that the HyperBench instance groups "CSP Application" and "CQ Application" listed in Tables 3 and 4 are hypergraphs of CSP and CQ instances, respectively, from real-world applications.

Table 3 Overview of previous results: number of instances solved and running times (in seconds) for producing optimal-width GHDs in [14] and [38]

In the left part of Table 4, we report the results obtained with our Hybrid approach described in Section 4.5, while the right part of that table shows the results for the "BalancedGo ensemble". Recall that by "ensemble" we mean the combination of the information gained from runs of all our decomposition algorithms. For a hypergraph H and a width k, an accepting run gives us an upper bound (since the optimal ghw(H) is then clearly ≤ k), and a rejecting run gives us a lower bound (since then we know that ghw(H) > k). By pooling multiple algorithms, we can combine the produced upper and lower bounds and determine the optimal width (when both bounds meet) for more instances than any single algorithm could on its own. We note that the results for NewDetKDecomp from Fischl et al. [14] are also such an "ensemble", combining the results of three different GHD algorithms presented in [14].

Across all experiments, out of the 3648 instances in HyperBench, we have thus managed to solve over 2900, where by "solved" we mean that the precise ghw could be determined. It is interesting to note that the Hybrid algorithm on its own is almost as good as the "ensemble": the number of 2924 instances solved by the "ensemble" only mildly exceeds the 2850 instances solved by the implementation of our Hybrid algorithm alone. The strength of the Hybrid algorithm stems from the fact that it combines the ability of the parallel Balanced Separator approach to quickly derive lower bounds (i.e., detect "Reject" cases) with the ability of det-k-decomp to more quickly derive upper bounds (i.e., detect "Accept" cases).

Figure 5 shows the runtimes of all positive runs of the Hybrid algorithm over all instances of HyperBench with an increasing number of CPUs, where the width parameter ranges from 2 to 5. The blue dots signify the median times in milliseconds, and the orange bars show the number of instances which timed out. We can see that increasing the number of CPUs either reduces the median (solving the same instances faster) or reduces the number of instances which timed out. In fact, reducing the number of timeouts is potentially a much greater speed-up than merely reducing the median, and also of higher practical interest, as it allows us to decompose more instances in realistic time. It should be noted that the increase of the median time when going from 8 to 12 CPUs does not mean that performance degrades with additional CPUs: the additional time consumption is solely due to the increased number of solved instances, which are typically hard ones, and the computation time needed to solve them enters the statistics only if the computation does not time out (Table 5).
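To illustrate how this pooling of bounds works, consider the following minimal Go sketch; the Run type and the function exactGhw are our own illustrations, not part of BalancedGo.

// Run records the outcome of one algorithm run: testing width k on a
// hypergraph either accepts (ghw ≤ k) or rejects (ghw > k).
type Run struct {
    Width    int
    Accepted bool
}

// exactGhw pools runs from several algorithms and reports the exact
// ghw if the best upper and lower bounds meet.
func exactGhw(runs []Run) (ghw int, solved bool) {
    lower, upper := 1, int(^uint(0)>>1) // ghw ≥ 1; upper starts at MaxInt
    for _, r := range runs {
        if r.Accepted && r.Width < upper {
            upper = r.Width // accepting run: ghw ≤ Width
        }
        if !r.Accepted && r.Width+1 > lower {
            lower = r.Width + 1 // rejecting run: ghw ≥ Width+1
        }
    }
    return upper, lower == upper
}

For example, an accepting run with k = 4 from one algorithm and a rejecting run with k = 3 from another together pin the ghw down to exactly 4, even if neither algorithm alone could decide both widths.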
In order to fully compare the strengths and weaknesses of each of the discussed decomposition methods, we also investigated the number of instances that could only be solved via a specific approach. This can be seen in Table 6. We see that while our approach clearly dominates this metric, there are still many cases where other methods were more effective.
In Fig. 6 we see an overview of the distribution of the ghw over all instances solved by our approach, and, as a comparison, how many instances of each width could be determined by NewDetKDecomp.
For the computationally most challenging instances of HyperBench, those of ghw ≥ 3, our results signify an increase of over 70% in solved instances compared with [14]. In addition, considering the CSP instances from real-world applications, we managed to solve 763 instances, almost doubling the number solved by NewDetKDecomp. In total, we now know the exact ghw of around 70% of all hypergraphs from CSP instances and of around 75% of all hypergraphs of HyperBench.
Another aspect of our solver we wanted to explore was memory usage, and in particular whether lifting the restriction to merely 1 GB of RAM makes a difference in the number of GHDs that can be found. We therefore looked at all test runs of lower width (≤ 5) where our solver timed out; there were 91 such instances. This restriction seems justified since the width parameter affects the complexity of determining the ghw exponentially; thus, it is only for lower widths that one would expect memory, rather than time, to become the limiting factor. We reran these 91 test instances using 24 GB of RAM. The increase in available memory made no difference, however: all 91 tests still timed out. In other words, the limiting factor in computing hypergraph decompositions is time, not space.

We stress that, in the first place, our data in Table 4 is not about time, but rather about the number of instances solved within the given time limit of 1 hour. Here we provide an improvement of nearly 100% over the current state of the art for these practical CSP instances; no comparable improvements have been achieved by other techniques recently. It is also noteworthy that the Hybrid algorithm alone solved 2850 instances in total, thus beating the total for NewDetKDecomp in [14], which, as mentioned, combines the results of three different GHD algorithms, and also beating the total for HtdSMT [38].

Comparison with PACE 2019 Challenge In addition to the experiments on HyperBench, we also compared our implementation with various solvers presented at the PACE 2019 Challenge [11], where one track consisted of computing the exact hypertree width. We took the 100 public and 100 private instances from the challenge (themselves a subset of HyperBench) and tried to compute the exact ghw of these instances within 1800 seconds, using at most 8 GB of RAM. Since our test machine differs from the one used during the PACE 2019 Challenge, we took the implementations of the winner and the runner-up, HtdSMT and TULongo [34], and reran them under the same time and memory constraints. The results can be seen in Table 5. BalancedGo managed to solve 86 out of the 100 private instances, improving slightly on HtdSMT. It is noteworthy that this was accomplished while computing GHDs, instead of the simpler HDs which were asked for during the challenge.

Conclusion and outlook
We have presented several generally applicable algorithmic improvements for hypergraph decomposition algorithms and a novel parallel approach to computing GHDs. We have thus advanced the ability to compute GHDs for a significantly larger set of CSPs than previous GHD algorithms, paving the way for more applications to use them to speed up the evaluation of CSP instances. For future work, we envisage several lines of research. First, we want to further speed up the search for a first balanced separator, as well as the search for the next balanced separator in case the first one does not lead to a successful decomposition. Note that, for computing any λ-label of a node in a GHD of width ≤ k, in principle O(n^(k+1)) combinations of edges have to be investigated for |E(H)| = n. However, only a small fraction of these combinations are actually balanced separators, leaving a lot of potential for speeding up the search for balanced separators. Apart from this important practical aspect, it would also be an interesting theoretical challenge to prove a useful upper bound on the ratio of balanced separators to the total number of possible combinations of up to k edges.
So far, we have focused on the efficient and parallel computation of GHDs via balanced separators. It would be interesting to explore a similar approach for the computation of hypertree decompositions (HDs) [19]. The big advantage of HDs over GHDs is that their computation is tractable (for fixed k) even without adding certain subedges. On the other hand, HDs require the root of the decomposition to be fixed, whereas a GHD can be rooted at any node; the GHD algorithm via balanced separators crucially depends on this possibility of re-rooting subtrees of a decomposition. Significantly new ideas are required to avoid such re-rooting in the case of HDs. First preliminary steps in this direction have already been made in [3], but many further steps remain.
In this work, we have looked at the decomposition of (the hypergraphs underlying) CSPs. The natural next step is to apply these decompositions to actually solving CSPs. Hence, another interesting goal for future research is to harness Go's excellent cloud computing capabilities and extend our results beyond the computation of GHDs to evaluating large real-life CSPs in the cloud.