A Faster Tree-Decomposition Based Algorithm for Counting Linear Extensions

We investigate the problem of computing the number of linear extensions of a given n-element poset whose cover graph has treewidth t. We present an algorithm that runs in time $\tilde{O}(n^{t+3})$ for any constant t; the notation $\tilde{O}$ hides polylogarithmic factors. Our algorithm applies dynamic programming along a tree decomposition of the cover graph; the join nodes of the tree decomposition are handled by fast multiplication of multivariate polynomials. We also investigate the algorithm from a practical point of view. We observe that the running time is not well characterized by the parameters n and t alone: fixing these parameters leaves large variance in running times due to uncontrolled features of the selected optimal-width tree decomposition. We compare two approaches to select an efficient tree decomposition: one is to include additional features of the tree decomposition to build a more accurate, heuristic cost function; the other approach is to fit a statistical regression model to collected running time data. Both approaches are shown to yield a tree decomposition that typically is significantly more efficient than a random optimal-width tree decomposition.


Introduction
The concept of a partially ordered set, or poset for short, formalizes the idea that an element of a ground set may precede, or "be smaller than," some other element, while allowing some pairs of elements to be incomparable. The concept plays a fundamental role in various areas of mathematics, with applications in theoretical and applied computer science. In this paper, we consider a particular computational problem associated with partial orders, namely the problem of counting the so-called linear extensions of a given poset. The problem has applications in numerous areas, for example, in sequence analysis [24], sorting [29], preference reasoning [23], convex rank tests [26], partial order plans [27], and learning graphical models [28,34].
To formulate the problem more formally, consider a poset (V , ≺) formed by an n-element set V and an irreflexive and transitive binary relation ≺ on V , called a partial order. Another partial order < on V is a linear extension of ≺ if it contains ≺ and for any distinct elements x, y ∈ V either x < y or y < x. The problem of counting linear extensions (#LE) asks for the number of linear extensions of a given poset; #LE is equivalent to the problem of counting the topological sorts of a given directed (acyclic) graph.
Finding an efficient algorithm for #LE seems unlikely, as the problem is #P-complete [11]. However, #LE admits a fully polynomial randomized approximation scheme [13]. Currently, the best known asymptotic bounds for the expected running time are $O(\varepsilon^{-2} n^3 \log^2 \ell \log n)$ [6] and $O(\varepsilon^{-2} n^5 \log^2 n)$ [32], where $\ell$ is the number of linear extensions and $\varepsilon$ the allowed relative error (with any constant success probability). These schemes, while polynomial in $n$ and $\varepsilon^{-1}$, become prohibitively slow in practice if, say, one requires an accuracy of $\varepsilon = 0.01$ and n is around one hundred.
Exact and parameterized algorithms offer an alternative paradigm to design practical algorithms for the problem. If we measure the complexity of an algorithm by the required number of arithmetic operations, which is a common practice in the literature, the best known worst-case bound is $O(2^n n)$, achieved by a simple dynamic programming algorithm. For several special instance classes better bounds are known: $O(n^w w)$ for width-w posets, $O(n^2)$ for series-parallel posets [25], and $O(n^2)$ also for posets whose cover graph is a forest [5]; the cover graph of a poset (V, ≺) is the directed graph (V, E) where the edge set E is the transitive reduction of ≺ (a.k.a. Hasse diagram). If parameterized by the treewidth of the cover graph, t, the problem can be solved with $O(n^{t+3})$ arithmetic operations by an inclusion-exclusion algorithm [18]. On the other hand, the problem parameterized by t is W[1]-hard [14], and so it seems difficult to find an algorithm that would run in time $O(f(t) n^d)$ for some computable function f and constant d.
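As a point of reference, the simple dynamic programming algorithm behind the $O(2^n n)$ bound can be sketched as follows. This is a minimal illustration, not code from any of the cited works; the function name, the 0-based element labels, and the bitmask encoding of subsets are assumptions made for the example. The table entry for a subset S counts the ways to arrange the elements of S as a prefix of a linear extension.

```python
def count_linear_extensions(n, relations):
    """Count linear extensions of a poset on elements 0..n-1.

    relations is a list of pairs (u, v) meaning "u precedes v"; it may be
    the full partial order or only its cover relations.
    """
    # pred[v] is a bitmask of the elements listed as predecessors of v.
    pred = [0] * n
    for u, v in relations:
        pred[v] |= 1 << u

    # counts[S] = number of orderings of the elements of S in which every
    # element appears after all of its predecessors.
    counts = [0] * (1 << n)
    counts[0] = 1
    for S in range(1 << n):
        if counts[S] == 0:
            continue
        for v in range(n):
            not_in_S = not (S >> v) & 1
            preds_in_S = (pred[v] & ~S) == 0
            if not_in_S and preds_in_S:
                counts[S | (1 << v)] += counts[S]
    return counts[(1 << n) - 1]
```

For example, the four-element "diamond" poset with relations 0 ≺ 1, 0 ≺ 2, 1 ≺ 3, 2 ≺ 3 has two linear extensions, and an antichain of three elements has 3! = 6. The table has $2^n$ entries, which is what makes the approach impractical beyond a few dozen elements.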
In this article, we investigate whether there exist faster exact algorithms for #LE, parameterized by the treewidth of the cover graph. We are interested in the worst-case asymptotic time complexity as well as the running time on moderate-size instances in practice. This article extends a preliminary version of the work published in a conference proceedings [19]; in the next two subsections we introduce our main contributions and highlight the features that are new in the present article.

Theoretical Contributions
Throughout this paper, we write $\tilde{O}(f)$ and $\tilde{\Omega}(f)$ as shorthands for $f \cdot \log^{O(1)} f$ and $f / \log^{O(1)} f$, respectively; in other words, the notations hide factors that are polylogarithmic in their argument. Our main result is the following:

Theorem 1 The linear extensions of a given n-element poset can be counted with $\tilde{O}(t!\, n^{t+3})$ bit-operations, where t is the treewidth of the cover graph of the poset.
We state the bound in terms of bit-complexity to emphasize the fact that the dominating computations deal with addition and multiplication of large integers. Indeed, the bounds stated in the previous paragraphs refer to the number of arithmetic operations with $O(n \log n)$-bit integers. Thus, in particular, for any constant t our bound improves the previous bound of Kangas et al. [18] by a factor of n, up to polylogarithmic factors. For large n and small t the improvement is relatively significant; for instance, for t = 2 the bound is reduced from $\tilde{O}(n^6)$ to $\tilde{O}(n^5)$. For large t, however, the present bound is inferior due to the factor t!. It turns out that we can, in fact, get rid of this factor under certain assumptions that are relatively mild in practice, while too complicated theoretically to be included in a succinct theorem statement.
Perhaps more interestingly, our algorithm is radically different from the inclusion-exclusion algorithm. In the latter, the idea is to view a linear extension as a bijective mapping and then remove the global bijectivity constraint by inclusion-exclusion, similarly to previous applications to matrix permanent [31], Hamiltonian path [20], and set partitioning [21], but incurring only a polynomial overhead. Once the bijectivity constraint is removed, what remains is a collection of simpler subproblems with local constraints. The subproblems can be handled by standard routines that exploit small treewidth [7,12]; see Sect. 2.2 for some additional details.
The present algorithm, in contrast, takes care of the bijectivity constraint within dynamic programming along a tree decomposition and is, in this respect, similar to a folklore $t^{O(t)} n$-time algorithm for the Hamiltonian path problem. However, since #LE is W[1]-hard, one may expect it to require a significantly larger dynamic programming table. We give a formulation where each node of a tree decomposition is associated with $O(n^{t+1})$ counts. This formulation leads to a challenge: a step in the dynamic program that combines two (or more) arrays of such counts appears to require, in the worst case, a quadratic number of arithmetic operations, $\Theta(n^{2t+2})$, if implemented in a straightforward manner. Fortunately, we discover that the key ingredient of the step takes the form of a multidimensional convolution, which we can compute efficiently using known results for fast multiplication of multivariate polynomials.
Concerning space complexity we only make a couple of observations here: Both our algorithm and the inclusion-exclusion algorithm by Kangas et al. [18] require $\tilde{O}(n^{t+2})$ bits of space. One could reduce this by a factor about linear in n by carrying out the computations modulo several small, pairwise relatively prime numbers and constructing the final output using the Chinese remainder theorem. For comparison, the simple dynamic programming algorithm also requires lots of space, $\tilde{\Omega}(2^n)$ bits.
In addition to some minor changes in exposition, the present article gives a more detailed treatment of the needed result for multiplication of multivariate polynomials. Specifically, we include the calculations omitted in the preliminary version [19] and also correct a minor error in the statement of the needed result (Fact 2, Sect. 2).

Empirical Contributions
We also address the practical value of the algorithm. Given that the present algorithm is technically more involved than the inclusion-exclusion algorithm, it is natural to ask whether the obtained improvement in the asymptotic worst-case time requirement is reflected as a significant speedup in practice. Our interest is particularly in instances where t is small (at most four) and n ranges up to a few hundred.
A well known challenge in practical implementation of tree-decomposition based algorithms is that finding an optimal-width tree decomposition may be insufficient for minimizing the computational cost: the running time of the dynamic programming algorithm can be sensitive to the shape of the tree decomposition. Bodlaender and Fomin [9] addressed this issue from a theoretical viewpoint by studying the complexity of finding a tree decomposition that minimizes a sum of costs associated with each node of a tree decomposition. In their f -cost framework the cost of a node is allowed to depend only on the width of the node (i.e., the size of the associated bag; see Sect. 2). Recently, Abseher et al. [1,2] presented a more general and more practical heuristic approach. Their htd library [1] allows a user to generate a variety of optimal-width tree decompositions and also (locally) optimize a given cost function. Moreover, they proposed and evaluated [2] a method to learn an appropriate cost function, or regression model, from empirical running time data on a collection of "training" instances. The method can be viewed as an instantiation of the method of empirical hardness models [22] for the algorithm selection problem [30].
Following these ideas we have implemented and tested our algorithm for #LE using a collection of synthetically generated instances (posets) together with a variety of tree decompositions generated by htd for each instance. We will report on and discuss our observations, which suggest that selecting the tree decomposition using a learned regression model can make a difference, at least for the smallest treewidth (t = 2): compared to the median running time over generated tree decompositions, the selected one typically yields almost an order-of-magnitude speedup.
The present work extends the preliminary study [19] with two additions. First, we include a direct comparison to an implementation of the inclusion-exclusion algorithm, VEIE [18]. Second, we also include in the experiments a new heuristic cost function, and show that its performance is competitive with that of the learned model.

Organization
The rest of this article is organized as follows. Some basic terminology, notation, and facts are given in Sect. 2. Section 3 is devoted to proving Theorem 1. In Sect. 4 we describe some implementation details and report on empirical results. Section 5 concludes by summarizing and discussing the main observations.

Preliminaries
We denote by $\mathbb{N}$ the set of natural numbers {0, 1, 2, ...}. For two sets S and U we write $S^U$ for the set of functions from U to S. If $m \in \mathbb{N}$ we write [m] for the set {1, ..., m}. By $S^m$ we denote the set of m-tuples $a = (a_i)_{i=1}^{m}$ with $a_i \in S$. The restriction of a function $\alpha : U \to S$ to a subset $A \subset U$ is defined in the standard manner and denoted by $\alpha|_A$; conversely, we say that $\alpha$ is an extension of $\alpha|_A$. We also denote by $\alpha_{v \mapsto i}$ the extension of $\alpha$ that we obtain by mapping an added element $v \notin U$ to i. We use Iverson's bracket notation: for a proposition P, the expression [P] evaluates to 1 if P is true, and to 0 otherwise.

Fact 1 If $a = (a_1, \ldots, a_m)$ is a tuple of nonnegative integers, then $a! := a_1! \cdots a_m!$ divides $|a|!$, where $|a| := a_1 + \cdots + a_m$.
Another fact we need concerns the complexity of multiplying two multivariate polynomials. We assume that a polynomial is represented by a list of its coefficients. It is well known that the multiplication takes nearly linear time when parameterized by the sum of the maximum degrees. In the Appendix, we derive the following formulation of the result from a more general bound by van der Hoeven and Lecerf [33, Cor. 1].

Fact 2 Two k-variate polynomials whose maximum degrees sum up to n ≥ k and whose coefficients are $\tilde{O}(n)$-bit integers can be multiplied with $\tilde{O}(n^{k+1})$ bit-operations.
While the above result will suffice for proving Theorem 1, it does not exploit the property that the multivariate polynomials we encounter are, in fact, sparse in the sense that they have a total degree at most n. Also in this case, multiplication only requires nearly linear time, now amounting to $\tilde{O}\bigl(\binom{n+k}{k}\, n k\bigr)$ bit-operations, assuming a sufficiently large prime p (exponential in n) and a primitive element of the finite field $\mathbb{F}_p$ are given for free [33, Cor. 4]. This assumption is, however, relatively strong: we do not know whether it can be satisfied within the time complexity bound, or even in time polynomial in n, without some number theoretic hypotheses. On the other hand, the needed numbers only depend on n, and finding them can be considered as a precomputation that is efficient in practice; for further details, see the discussion by van der Hoeven and Lecerf [33, Sect. 5.3]. The improved bound for multiplying sparse multivariate polynomials saves a factor of about k!, which suffices for reducing the factor t! in the bound of Theorem 1, as mentioned in the previous section.

Tree Decomposition
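A tree decomposition of a graph G = (V, E) is a pair (T, B), where T is a tree with node set I and $B = \{B_x : x \in I\}$ is a collection of subsets of V, called bags, such that (i) every vertex of G belongs to some bag, (ii) for every edge of G some bag contains both of its endpoints, and (iii) for every vertex v of G the nodes whose bags contain v form a connected subtree of T. The width of the tree decomposition is the largest bag size minus one, $\max_{x \in I} |B_x| - 1$, and the treewidth of a graph is the minimum width over all its tree decompositions.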
For a graph of treewidth t, we can find a tree decomposition of width t in time $t^{O(t^3)} n$ using Bodlaender's algorithm [8] or in time $\tilde{O}(n^{t+2})$ using the approach of Arnborg et al. [4].
A tree decomposition is rooted if the edges of the tree are directed so that there is a unique node, the root, that has no parent. Clearly, one obtains a rooted tree decomposition by simply choosing one node as the root and directing the edges accordingly.
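A rooted tree decomposition is nice if each node x is of one of the following types: (leaf) x has no children and $|B_x| = 1$; (introduce) x has a unique child y and $B_x = B_y \cup \{v\}$ for some $v \notin B_y$; (forget) x has a unique child y and $B_x = B_y \setminus \{v\}$ for some $v \in B_y$; (join) x has exactly two children y, z and $B_x = B_y = B_z$.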
We can convert a given tree decomposition of width t and O(n) nodes into a nice tree decomposition of width t and O(tn) nodes in time $t^{O(1)} n$ [10].
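To make the notions concrete, the following sketch checks whether a candidate pair of a tree and a family of bags is a tree decomposition in the standard sense (every vertex and every edge is covered by a bag, and the nodes containing any fixed vertex are connected in the tree) and computes its width. The data layout (a dictionary from tree nodes to bags and a list of tree edges) and the function names are assumptions made for this illustration; they are unrelated to the interface of any library mentioned in this article.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three tree decomposition conditions.

    bags maps each tree node to a set of graph vertices; tree_edges lists
    the (undirected) edges of the tree on those nodes.
    """
    vertices = set(vertices)
    adj = {x: set() for x in bags}
    for x, y in tree_edges:
        adj[x].add(y)
        adj[y].add(x)

    def connected(nodes):
        # Depth-first search in the tree, restricted to the given node set.
        if not nodes:
            return False
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in nodes and y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen == nodes

    # The nodes together with tree_edges must form a tree.
    if len(tree_edges) != len(bags) - 1 or not connected(set(bags)):
        return False
    # Every vertex lies in some bag (and bags contain only graph vertices).
    if set().union(*bags.values()) != vertices:
        return False
    # Every edge has both endpoints in a common bag.
    if any(not any(u in B and v in B for B in bags.values()) for u, v in edges):
        return False
    # The nodes whose bags contain v induce a connected subtree.
    return all(connected({x for x, B in bags.items() if v in B}) for v in vertices)


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(B) for B in bags.values()) - 1
```

For the 4-cycle on vertices 1, 2, 3, 4, the two bags {1, 2, 3} and {1, 3, 4} joined by a single tree edge form a tree decomposition of width 2.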

Counting Linear Extensions via Inclusion-Exclusion
Kangas et al. [18] showed that the number of linear extensions of a poset (V, ≺) whose cover graph is G = (V, E) is given by the formula
$$\sum_{k=0}^{n} (-1)^{n-k} \binom{n}{k} \sum_{\tau} \prod_{uv \in E} [\tau(u) < \tau(v)],$$
where τ runs over all functions from V to [k]. Supposing G has treewidth t, it is well known that there is an elimination ordering $v_1, \ldots, v_n$ of the vertices, such that when removing the vertices from the graph in this order and always connecting the neighbors of the removed vertex, the size of the largest clique in each obtained graph is at most t + 1. The n-dimensional inner summation over the variables $\tau(v_i)$, for i ∈ [n], can be processed iteratively along such an ordering, the ith one-dimensional summation over $\tau(v_i)$ requiring $O(n_i k^{t+1})$ additions and multiplications of $O(n \log n)$-bit numbers, for some $n_1, \ldots, n_n$ that sum up to O(n). In total, the evaluation of the inclusion-exclusion formula thus requires $\tilde{O}(n^{t+4})$ bit-operations. We omit a more detailed treatment of the algorithm, as the method applied for computing the inner summation is standard. (The original analysis of Kangas et al. [18] uses a looser bound of $n_i = O(n)$ for each i, arriving at a bound that is larger by a factor of n.)
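The inclusion-exclusion principle can be checked by brute force on tiny posets. The sketch below assumes the formula takes the standard form for counting surjections, that is, the sum over k = 0, ..., n of $(-1)^{n-k} \binom{n}{k}$ times the number of functions τ from V to [k] with τ(u) < τ(v) for every edge uv; this reading, the function names, and the 0-based labels are assumptions of the example. The inner sum is evaluated naively over all $k^n$ functions, so the code only illustrates the identity and does not exploit small treewidth.

```python
from itertools import permutations, product
from math import comb


def count_by_inclusion_exclusion(n, edges):
    """Evaluate the inclusion-exclusion sum for a poset on 0..n-1."""
    total = 0
    for k in range(n + 1):
        # Number of functions tau: V -> {0..k-1} respecting every edge.
        inner = sum(
            all(tau[u] < tau[v] for u, v in edges)
            for tau in product(range(k), repeat=n)
        )
        total += (-1) ** (n - k) * comb(n, k) * inner
    return total


def count_by_enumeration(n, edges):
    """Count bijections directly; pos[u] is the position of element u."""
    return sum(
        all(pos[u] < pos[v] for u, v in edges)
        for pos in permutations(range(n))
    )
```

On the diamond poset with edges (0, 1), (0, 2), (1, 3), (2, 3) both functions return 2.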

The Algorithm: Proof of Theorem 1
We implement a standard recipe of tree-decomposition based algorithms. The outline of the algorithm is as follows.
A1 Compute the cover graph of the input poset.
A2 Find a minimum-width nice tree decomposition of the cover graph.
A3 Run dynamic programming over the nice tree decomposition.
We will next consider each step in detail. We will see that the last step dominates our asymptotic running time bounds.

Computing the Cover Graph
The cover graph G = (V , E) is obtained by computing the transitive reduction of the input poset (V , ≺). The transitive reduction can be computed in time

Finding a Minimum-Width Nice Tree Decomposition
As mentioned in Sect. 2, if the cover graph has treewidth t, then a width-t nice tree decomposition of the cover graph can be found in $\tilde{O}(n^{t+2})$ time.

Dynamic Programming
Suppose now that a width-t nice tree decomposition (T , B) of the cover graph G is available. Our idea will be to associate each node of T with an array of numbers such that (i) the numbers at the root node are sufficient for computing the number of linear extensions and (ii) the array of a node can be computed from the arrays of its child nodes.
The following notation will be useful. Denote by $V_x$ the set of vertices covered by the subtree of T rooted at x, that is, $V_x$ is the union of the bags $B_y$ of nodes y to which there is a directed path from x. Write $n_x$ for the size $|V_x|$ and $E_x$ for the set of edges in the induced graph $G[V_x]$. Now, for each node x ∈ T and injection $\alpha \in [n_x]^{B_x}$, define $\ell_x(\alpha)$ as the number of bijections $\pi \in [n_x]^{V_x}$ such that π(v) = α(v) for all $v \in B_x$, and π(u) < π(v) whenever $uv \in E_x$. In other words, $\ell_x(\alpha)$ is the number of ways to extend α to a linear extension of the induced poset $(V_x, \prec \cap\, (V_x \times V_x))$, where we view a linear extension as a bijection from $V_x$ to $[n_x]$ that satisfies the ordering constraints.
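The definition can be turned into a brute-force reference routine, which is useful for testing a dynamic programming implementation on small inputs. The sketch below is such a routine under our own conventions: the injection α is passed as a dictionary from bag vertices to positions in {1, ..., n_x}, and the function name is hypothetical.

```python
from itertools import permutations


def count_extensions_of(V_x, E_x, alpha):
    """Number of bijections pi: V_x -> {1..n_x} that agree with alpha on
    its domain and satisfy pi(u) < pi(v) for every edge (u, v) in E_x."""
    V_x = list(V_x)
    n_x = len(V_x)
    used = set(alpha.values())
    free_vertices = [v for v in V_x if v not in alpha]
    free_positions = [i for i in range(1, n_x + 1) if i not in used]
    count = 0
    for positions in permutations(free_positions):
        pi = dict(alpha)
        pi.update(zip(free_vertices, positions))
        if all(pi[u] < pi[v] for u, v in E_x):
            count += 1
    return count
```

For the diamond poset on vertices a, b, c, d with edges ab, ac, bd, cd and the bag {b, c}, the injection mapping b to 2 and c to 3 has exactly one extension, whereas mapping b to 1 leaves no room for a and gives zero. When the subtree of a node covers all of V, summing the counts over all injections α of its bag gives the number of linear extensions, here 2.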
We begin by showing that the values $\ell_x(\alpha)$ are sufficient for computing the number of linear extensions of the poset, that is, they satisfy the listed conditions (i) and (ii). After that we consider the time requirement of computing the values $\ell_x(\alpha)$ for each node of the nice tree decomposition.
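Consider first the root node.

Lemma 2 (Root) If r is the root node, then the number of linear extensions of (V, ≺) equals $\sum_{\alpha} \ell_r(\alpha)$, where α runs over all injections in $[n]^{B_r}$.
Proof Since $V_r = V$, $E_r = E$, and $n_r = n$, we have that
$$\sum_{\alpha} \ell_r(\alpha) = \sum_{\pi} \prod_{uv \in E} [\pi(u) < \pi(v)],$$
where α and π run over all injections in $[n]^{B_r}$ and $[n]^{V}$, respectively.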
Next we will show separately for each node type of the nice tree decomposition how the values $\ell_x(\alpha)$ are determined by the corresponding values for the child node or child nodes of x. For all but join nodes the results are immediate, and we omit the proofs.
For an introduce node, we simply restrict the injection α to the bag of its child and check that the ordering constraint holds.

Lemma 3 (Introduce) If x is an introduce node with child y and B
For a forget node, we extend the injection α to the bag of its child by mapping the new vertex to some value that is not in the image of α.

Lemma 4 (Forget) If x is a forget node with child y and B
To handle a join node, we introduce some convenient notation. Let α be an injection from a k-element set S to a range of integers [m]. Label the elements of S so that $\alpha(v_1) < \cdots < \alpha(v_k)$. For i = 1, ..., k − 1, denote by $\alpha_i$ the number of integers between $\alpha(v_i)$ and $\alpha(v_{i+1})$, that is, $\alpha_i := \alpha(v_{i+1}) - \alpha(v_i) - 1$. We extend the definition to i = 0 and i = k by the conventions $\alpha(v_0) := 0$ and $\alpha(v_{k+1}) := m + 1$. Furthermore, if β is another injection from S′ to [m′], write β ∼ α if β and α specify the same linear order on S ∩ S′, that is, β(u) < β(v) if and only if α(u) < α(v) for all u, v ∈ S ∩ S′.

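Lemma 5 (Join) If x is a join node with children y and z, then
$$\ell_x(\alpha) = \sum_{\beta \sim \alpha} \sum_{\gamma \sim \alpha} \ell_y(\beta)\, \ell_z(\gamma) \prod_{i} [\alpha_i = \beta_i + \gamma_i] \binom{\alpha_i}{\beta_i},$$
where β and γ run over all injections in $[n_y]^{B_y}$ and $[n_z]^{B_z}$, respectively.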
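Proof By definition,
$$\ell_x(\alpha) = \sum_{\pi} \prod_{uv \in E_x} [\pi(u) < \pi(v)],$$
where π runs over all bijections from $V_x$ to $[n_x]$ that extend α. Observe that by the tree decomposition properties, the sets $V_y \setminus B_x$ and $V_z \setminus B_x$ are disjoint and their union is $V_x \setminus B_x$. Thus we may represent any bijection $\pi : V_x \to [n_x]$ that extends α uniquely by a pair of injections $\beta'' : V_y \to [n_x]$ and $\gamma'' : V_z \to [n_x]$ whose restrictions to $B_x$ are equal to α and whose images $\beta''(V_y)$ and $\gamma''(V_z)$ cover $[n_x]$. We get that
$$\ell_x(\alpha) = \sum_{\beta''} \sum_{\gamma''} \bigl[\beta''(V_y) \cup \gamma''(V_z) = [n_x]\bigr] \prod_{uv \in E_y} [\beta''(u) < \beta''(v)] \prod_{uv \in E_z} [\gamma''(u) < \gamma''(v)],$$
where β″ and γ″ run over all injections that extend α in $[n_x]^{V_y}$ and $[n_x]^{V_z}$, respectively. Consider then a mapping that "compresses" any such injection β″ into a bijection $\beta' : V_y \to [n_y]$ by letting $\beta'(v) := |\{u \in V_y : \beta''(u) \le \beta''(v)\}|$; let γ′ denote the bijection obtained similarly from an injection γ″. Let β denote the restriction of β′ to $B_x$ and γ the restriction of γ′ to $B_x$. We have that β″ and γ″ extend α if and only if β ∼ α and γ ∼ α. Thus we get that
$$\ell_x(\alpha) = \sum_{\beta'} \sum_{\gamma'} [\beta \sim \alpha]\, [\gamma \sim \alpha] \prod_{i} [\alpha_i = \beta_i + \gamma_i] \binom{\alpha_i}{\beta_i} \prod_{uv \in E_y} [\beta'(u) < \beta'(v)] \prod_{uv \in E_z} [\gamma'(u) < \gamma'(v)],$$
where β′ and γ′ run over all bijections in $[n_y]^{V_y}$ and $[n_z]^{V_z}$, respectively. The product of the binomial coefficients $\binom{\alpha_i}{\beta_i}$ is the number of pairs (β″, γ″) that map to the same pair (β′, γ′), that is, the number of interleavings of the $\beta_i + \gamma_i$ elements to the range of $\alpha_i$ elements.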
To complete the proof, it suffices to write the summation over β′ as a double summation: the outer summation being over all injections $\beta : B_y \to [n_y]$ and the inner summation being over all bijections $\beta' : V_y \to [n_y]$ that extend β; similarly for the summation over γ′.

We next turn to the time requirement of a join node x whose bag has size k. It is convenient to represent an injection α by a pair (σ, a), where $a = (a_i)_{i=1}^{k+1}$ with $a_i = \alpha_{i-1}$ and σ is a bijection from $B_x$ to [k] that captures the specified linear order, that is, σ(u) < σ(v) if and only if α(u) < α(v) for all $u, v \in B_x$. Clearly, the mapping α ↦ (σ, a) is a bijection when we require that $a_i \in \mathbb{N}$ and $|a| = n_x - k$. Using this representation and Lemma 5, write
$$\ell_x(\sigma, a) = \sum_{b} \sum_{c} [a = b + c]\, \frac{a!}{b!\, c!}\, \ell_y(\sigma, b)\, \ell_z(\sigma, c),$$
where b and c run over $\mathbb{N}^{k+1}$. By writing $\hat{\ell}_x(\sigma, a) := \ell_x(\sigma, a)/a!$, we get the convolution form
$$\hat{\ell}_x(\sigma, a) = \sum_{b + c = a} \hat{\ell}_y(\sigma, b)\, \hat{\ell}_z(\sigma, c).$$
To treat this as a multiplication of multivariate polynomials, consider a fixed bijection σ and let $P_x(r_1, \ldots, r_k)$ be the k-variate polynomial where the coefficient of $r_1^{a_1} \cdots r_k^{a_k}$ equals $n! \cdot \hat{\ell}_x(\sigma, a)$; we define $P_y$ and $P_z$ similarly. Note that k variables suffice, since $a_{k+1}$ is determined by the fixed |a|. Here we multiplied by the factorial n! to get integer coefficients (by Fact 1, since n ≥ |a|, |b|, |c|). We have that $n! \cdot P_x = P_y P_z$, that is, we obtain $P_x$ by multiplying $P_y$ and $P_z$ and dividing each coefficient of the resulting polynomial by n!.
We bound the bit-complexity of the polynomial multiplication using Fact 2. The total degrees of the polynomials are at most n − k. Each coefficient of the polynomials is an $\tilde{O}(n)$-bit integer. Thus the multiplication takes $\tilde{O}(n^{k+1})$ bit-operations.
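The passage from the join step to a polynomial product can be checked on small tables. The sketch below assumes that, for a fixed ordering σ, the counts of the two children are stored as dictionaries mapping gap tuples b and c to counts, and that the join weights each pair (b, c) with b + c = a by the product of the binomial coefficients $\binom{a_i}{b_i}$. The first routine applies this rule directly; the second divides each count by the product of factorials of its gap tuple, forms the plain convolution, and rescales. All names are hypothetical, and the double loop in the second routine merely stands in for a fast multivariate polynomial multiplication.

```python
from fractions import Fraction
from math import comb, factorial, prod


def tuple_factorial(a):
    """a! for a tuple a, i.e., the product of the factorials of its entries."""
    return prod(factorial(ai) for ai in a)


def join_direct(table_y, table_z):
    """Join two child tables using the binomial-weighted sum."""
    out = {}
    for b, count_y in table_y.items():
        for c, count_z in table_z.items():
            a = tuple(bi + ci for bi, ci in zip(b, c))
            ways = prod(comb(ai, bi) for ai, bi in zip(a, b))
            out[a] = out.get(a, 0) + ways * count_y * count_z
    return out


def join_by_convolution(table_y, table_z):
    """Same join, written as a plain convolution of factorial-scaled tables."""
    scaled_y = {b: Fraction(v, tuple_factorial(b)) for b, v in table_y.items()}
    scaled_z = {c: Fraction(v, tuple_factorial(c)) for c, v in table_z.items()}
    conv = {}
    for b, fy in scaled_y.items():
        for c, fz in scaled_z.items():
            a = tuple(bi + ci for bi, ci in zip(b, c))
            conv[a] = conv.get(a, 0) + fy * fz
    # Undo the scaling; the results are integers.
    return {a: int(f * tuple_factorial(a)) for a, f in conv.items()}
```

As a small check, take a bag {v} and two children that each contain v and one further vertex with no ordering constraints, so that each child table is {(1, 0): 1, (0, 1): 1}. Both routines return {(2, 0): 2, (1, 1): 2, (0, 2): 2}, which sums to 3! = 6, the number of orderings of the three unconstrained vertices.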
Multiplying the obtained bound by the number of bijections σ , we get that all

Experiments
In this section we present an empirical study. We first describe our implementation of the algorithm and the test instances used in the experiments. Then we show how the performance of the algorithm depends on the way we choose the tree decomposition.

Implementation
We have implemented the algorithm of Sect. 3 in a C++ program Countle. For multiplication of polynomials we used the C library FLINT [17], which offers fast multiplication of univariate polynomials; to this end, we transformed the multivariate polynomials into univariate polynomials using an appropriate Kronecker substitution, as detailed in the next paragraph. We decided not to employ the asymptotically faster algorithm mentioned after Fact 2, because it has been observed to run faster only when the total degree (which is less than n in our case) is large, say, several hundreds [33, Table 7]. For finding an optimal tree decomposition we used the C++ library htd [1]. We ran all experiments on machines with Intel Xeon E5540 CPUs.
Two implementation details are worth mentioning. First, we transformed the multivariate polynomials into univariate polynomials using the following Kronecker substitution. Separately for each node x of the tree decomposition and the considered vertex ordering (bijection) σ, we encoded a k-variate polynomial in variables $r_1, \ldots, r_k$ as a univariate polynomial in variable s by substituting $r_j := s^{(d_1+1) \cdots (d_{j-1}+1)}$, where each $d_i$ is an upper bound for the degree of $r_i$ in the polynomial. Using knowledge associated with the node x and ordering σ, we aimed at determining a value $d_i$ that is smaller than the trivial upper bound $n_x - k$. To this end, we set $d_i$ to the sum of the largest realized exponents of $r_i$ in the already computed polynomials for the two child nodes of x.
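The Kronecker substitution can be illustrated with a short, self-contained sketch. It is not the Countle code: polynomials are represented here as dictionaries from exponent tuples to integer coefficients, the function name is hypothetical, and a schoolbook univariate product takes the place of the fast multiplication provided by FLINT. The degree bounds $d_j$ are chosen as in the text, as the sum of the largest exponents of the jth variable in the two factors, so that no carries occur between the "digits" of the mixed-radix encoding.

```python
def kronecker_multiply(p, q):
    """Multiply two k-variate integer polynomials via a univariate product.

    p and q map exponent tuples (e_1, ..., e_k) to nonzero coefficients and
    are assumed to be nonempty and to have the same number of variables.
    """
    k = len(next(iter(p)))
    # d[j] bounds the degree of variable j in the product.
    d = [max(e[j] for e in p) + max(e[j] for e in q) for j in range(k)]
    # weights[j] = (d[0] + 1) * ... * (d[j-1] + 1), the exponent of s
    # substituted for variable j.
    weights, w = [], 1
    for dj in d:
        weights.append(w)
        w *= dj + 1
    size = w  # number of coefficients of the univariate encoding

    def encode(poly):
        dense = [0] * size
        for e, c in poly.items():
            dense[sum(ej * wj for ej, wj in zip(e, weights))] += c
        return dense

    P, Q = encode(p), encode(q)
    R = [0] * size
    for i, pc in enumerate(P):
        if pc == 0:
            continue
        for j, qc in enumerate(Q):
            if qc != 0:
                R[i + j] += pc * qc

    # Decode each univariate exponent back into an exponent tuple.
    result = {}
    for index, c in enumerate(R):
        if c != 0:
            e = []
            for dj in d:
                e.append(index % (dj + 1))
                index //= dj + 1
            result[tuple(e)] = c
    return result
```

For instance, multiplying $1 + r_1$ by $1 + r_2$ yields the four monomials $1$, $r_1$, $r_2$, and $r_1 r_2$, each with coefficient one. Tighter bounds $d_j$ shorten the univariate polynomials, which is the motivation for deriving the bounds from the child polynomials rather than using a trivial bound.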
Second, we wish to ignore any impossible ordering σ at a node x of the tree decomposition, and so save both time and space. The key observation is that, even if the value $\ell_x(\sigma, a)$ is nonzero, we can ignore it if σ assigns some two vertices in the bag $B_x$ an order that violates the partial order ≺, that is, for some $u, v \in B_x$ we have u ≺ v and σ(u) > σ(v).

Instances
We generated random instances (posets) of different sizes for small values of treewidth t. We varied the number of elements n from 10 to 199 (t = 2), 109 (t = 3), and 59 (t = 4). For each pair (t, n) we generated 5 posets; following Kangas et al. [18], we let each poset be a "grid tree," constructed by randomly joining t-by-t grids along the (boundary) edges, orienting the edges so that no directed cycles are introduced, and finally taking the transitive closure.

Growth and Variation of Running Times
The running time of Countle may be sensitive to the particular (nice) tree decomposition selected. Therefore, we ran the program on 50 optimal-width tree decompositions, which we generated using htd; we checked that for every encountered instance, htd indeed generated tree decompositions of optimal width. We allowed each individual run to take up to 20 min of CPU time and 30 GB of memory.
We examined the scaling of Countle in terms of the number of elements n and treewidth t. We observed that, while the growth of the running time follows the rate suggested by the worst case bound, there is significant variance in the running times for any fixed (n, t), due to differences in the five posets and the 50 tree decompositions per poset (compare median to best in Fig. 2). Compared to an implementation of the inclusion-exclusion algorithm by Kangas et al. [18], VEIE, we find that Countle is an order of magnitude faster. For example, Countle can solve a typical (median) poset with a typical (median) tree decomposition in about 20 s if n = 100 and t = 2, or if n = 50 and t = 3, while VEIE requires about 200 s on such instances. The same pattern also holds for t = 4; we note, however, that such posets can actually be handled faster by another, exponential time algorithm of Kangas et al. [18, Fig. 8]. VEIE does not optimize the shape of the tree decomposition beyond its width and, in fact, the analysis of Sect. 2.2 suggests that the algorithm is not very sensitive to the shape of the tree decomposition.

Selecting Among Optimal-Width Tree Decompositions
Then we investigated whether one can efficiently select a near-optimal tree decomposition from a collection of generated candidates. The observed variance in running times suggests that, if successful, this could speed up Countle significantly, by up to one order of magnitude. For constructing a "selector" we implemented and compared two approaches.
One is the machine learning method of Abseher et al. [2]. We applied it in a straightforward manner, as follows:
C1 We collected a data set of measured running times for multiple pairs of posets and tree decompositions. We used the procedure described in the previous section, except that we used a single poset (instead of five) for each combination of n and t. If a run was not completed within the 20-min time limit, we simply discarded the instance (and thus introduced some bias).
C2 We computed for each tree decomposition the values of several features, such as statistics of bag sizes (by node type), node depths (by node type), and distances between join nodes; for a full feature list, see Abseher et al. [2].
C3 We fitted a multivariate linear regression model, separately for each t = 2, 3, 4, with the features as the predictor variables and the logarithm of the running time as the response variable. We used the machine learning software WEKA 3.6.13 [16] with default options.
To select a tree decomposition for a given new poset, we first generated 50 candidate tree decompositions for the poset, and then selected the one for which the model predicted the shortest running time.
The other approach is to "handcraft" a heuristic cost function based on a smaller set of features of the tree decomposition. Our cost function, which we call the memory heuristic, is simply a sum of node-wise costs. Specifically, for each node x, we define its cost as $p_x \binom{n_x}{|B_x|}\, n_x \log_2 n_x$, where $p_x$ is the number of orderings σ of $B_x$ compatible with the poset; this cost is an upper bound (up to a constant factor) of the memory requirement of the corresponding dynamic programming step, obtained from the algorithm in a straightforward manner. The rationale is that for each node the memory requirement is well aligned with the time requirement and that the node-wise bounds are more accurate than the worst-case bound over all nodes.
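A selector based on the memory heuristic can be sketched in a few lines. The sketch reads the node cost as $p_x \binom{n_x}{|B_x|} n_x \log_2 n_x$, which is our interpretation of the formula, and assumes that each candidate tree decomposition has already been summarized as a list of triples $(p_x, n_x, |B_x|)$, one per node; the function names and this data layout are hypothetical and do not reflect the interface of htd or Countle.

```python
from math import comb, log2


def node_cost(p_x, n_x, bag_size):
    """Heuristic memory cost of one node (assumes n_x >= 1)."""
    return p_x * comb(n_x, bag_size) * n_x * log2(n_x)


def memory_heuristic(nodes):
    """Cost of a tree decomposition given as (p_x, n_x, |B_x|) triples."""
    return sum(node_cost(p_x, n_x, bag_size) for p_x, n_x, bag_size in nodes)


def select_decomposition(candidates):
    """Index of the candidate decomposition with the smallest total cost."""
    return min(range(len(candidates)), key=lambda i: memory_heuristic(candidates[i]))
```

For example, a node with $p_x = 2$, $n_x = 4$, and a bag of size 2 has cost 2 · 6 · 4 · 2 = 96. Since the cost is additive over nodes, it can in principle also be supplied as a fitness function to a tool that locally optimizes tree decompositions, although we make no claim here about how such an integration would look.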
We observe (Fig. 2) that, for t = 2, the learned model almost always selects a top-3 tree decomposition, which yields a running time nearly as short as the best among the 50 tree decompositions. For t = 3 the performance degrades: the model is usually able to select a top-10 tree decomposition, which yields a running time that is systematically better than for a typical (median) tree decomposition, yet not quite achieving the performance of the best among the generated candidates. For t = 4 the performance degrades further, yet remains better than that of selecting a random tree decomposition. In more quantitative terms, the proportions of tree decompositions (among 50) better than the selected one were 2.5%, 12%, 28% for t = 2, 3, 4, respectively; these numbers are medians of averages over 5 test posets (one per fixed n and t).
The memory heuristic is seen to achieve almost as good performance as the machine learning method for t = 2 and t = 3 (Fig. 2). For t = 4, however, the memory heuristic is inferior and even yields tree decompositions that are worse than a random (median) tree decomposition.

Concluding Remarks
We have presented a new tree-decomposition based algorithm for counting linear extensions. The algorithm relies on fast multiplication of multivariate polynomials, thus differing radically from the inclusion-exclusion approach of Kangas et al. [18]. For any constant treewidth t the obtained asymptotic speedup is about linear in the number of elements n.
A question not settled here is whether one could save another factor of n, that is, solve the problem in time $\tilde{O}(n^{t+2})$. The present authors find this question particularly intriguing for two reasons: one is that for finding an optimal-width tree decomposition, the best known time complexity bound is $\tilde{O}(n^{t+2})$, assuming we let t grow at least logarithmically in n. The other reason is that for posets whose cover graph is a tree (t = 1), Atkinson's [5] algorithm takes, at least seemingly, a different approach and runs in time $\tilde{O}(n^3)$. Furthermore, Atkinson's algorithm is monotonic in the sense that all arithmetic operations are carried out with nonnegative numbers. This is in sharp contrast to both the present algorithm and the inclusion-exclusion algorithm, which crucially rely on a richer algebraic structure.
Our empirical study confirmed that the improvement in the asymptotic bound consistently transfers to the running times measured in practice. That said, the observed speedup, for n around one hundred, was by one order of magnitude rather than two. This "leak" of efficiency can, at least in part, be explained by the present algorithm's higher sensitivity to the shape of the selected tree decomposition. Indeed, we observed that the best of 50 generated tree decompositions typically yields a 5- to 10-fold speedup in relation to an average (i.e., median) tree decomposition.
We also showed that there is an efficient way to select the best or close-to-best tree decomposition using a linear regression model that was fitted to a collected data set of instances along with the measured running times, following the machine learning method of Abseher et al. [2]. However, we observed that the performance of the regression method rapidly degraded as the treewidth $t$ increased. This suggests that the general-purpose method may not be well suited to the problem of counting linear extensions. A potential reason for the suboptimal performance is that the default set of features [2] does not include what is perhaps the most informative quantity associated with a node $x$ in a tree decomposition, namely the term $n_x^k$ (or some variant of it), which combines the size $k$ of the bag of $x$ with the number $n_x$ of vertices in the subtree rooted at $x$. This issue could be addressed by extending the feature set accordingly, or, potentially, by using some nonlinear regression model. On the other hand, we did inspect how well a single feature can predict a well-performing tree decomposition. We observed that for $t = 2$ and $t = 3$ the average depth of join nodes alone yielded predictions that were almost as good as the predictions by the full regression model with all the features.
We also experimented with a simpler solution for selecting a good tree decomposition. We manually constructed a heuristic function that adds up the estimated contributions of each tree decomposition node. For each node $x$, our estimate relied on three parameters: the above-mentioned $k$ and $n_x$, and the number of possible permutations of the elements in the bag of the node, $p_x \le k!$. This cost function goes beyond the f-cost framework [9], in which the contribution of each node can only depend on the size of the bag. We observed that for $t = 2$ and $t = 3$ this heuristic yielded tree decompositions almost as good as those of the machine learning method. For $t = 4$, however, this method was not able to find a tree decomposition substantially better than an average one either.
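The idea of a per-node additive cost can be sketched as follows. This is a minimal illustration only: the section does not give the exact functional form of the heuristic, so the product $p_x \cdot n_x^k$ used below, as well as the names `Node`, `node_cost`, `decomposition_cost`, and `select_decomposition`, are assumptions made for the example and not the authors' cost function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Summary of one tree decomposition node x (hypothetical representation)."""
    bag_size: int          # k: number of elements in the bag of x
    subtree_vertices: int  # n_x: number of vertices in the subtree rooted at x
    bag_orderings: int     # p_x <= k!: feasible permutations of the bag


def node_cost(node: Node) -> int:
    # ASSUMPTION: placeholder estimate p_x * n_x^k; the paper's actual
    # per-node estimate is not specified in this section.
    return node.bag_orderings * node.subtree_vertices ** node.bag_size


def decomposition_cost(nodes: List[Node]) -> int:
    # The heuristic adds up the estimated contributions of all nodes.
    return sum(node_cost(x) for x in nodes)


def select_decomposition(candidates: List[List[Node]]) -> List[Node]:
    # Pick the candidate decomposition with the smallest estimated cost.
    return min(candidates, key=decomposition_cost)


if __name__ == "__main__":
    a = [Node(2, 3, 2), Node(3, 5, 6)]
    b = [Node(2, 5, 2), Node(2, 4, 1)]
    print(decomposition_cost(a), decomposition_cost(b))
    print(select_decomposition([a, b]) is b)
```

Unlike an f-cost, the estimate for a node here depends on $n_x$ and $p_x$ and not only on the bag size, which is the distinction drawn in the text.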
Since the best tree decomposition was seen to yield an order-of-magnitude shorter running time than the median tree decomposition, an obvious challenge for future research is to construct or learn a more accurate cost function. That being said, it has to be admitted that the presented tree-decomposition based algorithm is practical only for very small treewidth, $t \le 3$: already for $t = 4$, a quite different, worst-case exponential-time algorithm is expected to be faster in practice [18, Fig. 8]; that algorithm generally becomes faster as the poset (i.e., its cover graph) becomes denser.

Appendix: Proof of Fact 2
We will need the following notation. We consider polynomials in variables $z_1, \ldots, z_k$ with integer coefficients. We index a monomial by the tuple of its degrees $e = (e_1, \ldots, e_k)$. The support of a polynomial $P$ is the set of tuples $e$ for which the coefficient $P_e$ of the corresponding monomial in $P$ is non-zero. We denote by $d_P$ the size of the smallest Cartesian product of the form $\times_{i=1}^{k} \{0, 1, \ldots, n_i\}$ that contains the support of $P$. Furthermore, we denote $h_P := \max_e l_{P_e}$, where for any integer $i$ we write $l_i := \log_2(|i| + 1) = \tilde{O}(\log |i|)$ for its bit-size.
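To make the notation concrete, the following sketch computes $d_P$ and $h_P$ for a small polynomial. The dictionary representation (degree tuple mapped to coefficient) and the function names are assumptions chosen for illustration; the bit-size follows the formula $l_i = \log_2(|i| + 1)$ as stated above, without rounding.

```python
from math import log2
from typing import Dict, Tuple

Poly = Dict[Tuple[int, ...], int]  # degree tuple e -> integer coefficient P_e


def support(poly: Poly):
    """Degree tuples whose coefficient is non-zero."""
    return [e for e, c in poly.items() if c != 0]


def box_size(poly: Poly, k: int) -> int:
    """d_P: size of the smallest product of ranges {0..n_i} containing the support.

    Assumes the polynomial is non-zero.
    """
    size = 1
    for i in range(k):
        n_i = max(e[i] for e in support(poly))  # largest degree in variable z_{i+1}
        size *= n_i + 1
    return size


def bit_size(i: int) -> float:
    """l_i = log2(|i| + 1)."""
    return log2(abs(i) + 1)


def max_coeff_bit_size(poly: Poly) -> float:
    """h_P: the largest bit-size among the non-zero coefficients."""
    return max(bit_size(poly[e]) for e in support(poly))


if __name__ == "__main__":
    # P = 3 z1^2 z2 - 7 z2^3 + 1
    P = {(2, 1): 3, (0, 3): -7, (0, 0): 1}
    print(box_size(P, 2))         # box {0,1,2} x {0,1,2,3}
    print(max_coeff_bit_size(P))  # attained by the coefficient -7
```

For this example the support fits in the box $\{0,1,2\} \times \{0,1,2,3\}$, so $d_P = 12$, and the largest coefficient in absolute value is $7$, giving $h_P = \log_2 8 = 3$.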
Finally, we let $\mu(l)$ denote the number of bit-operations needed to multiply two integers of bit-size at most $l$; we have that $\mu(l) = O(l (\log l)\, 2^{\log^* l})$ [15], which is $\tilde{O}(l)$.

Proposition 1 (Cor. 1 of van der Hoeven and Lecerf [33]) Given $P, Q \in \mathbb{Z}[z_1, \ldots, z_k]$, we can compute $R = PQ$ with [...] bit-operations, where $h := h_P + h_Q + l_{\min\{d_P, d_Q\}}$.
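We apply this result to two polynomials $P, Q \in \mathbb{Z}[z_1, \ldots, z_k]$ with $\tilde{O}(n)$-bit integer coefficients, $n \ge k$, and maximum degrees at most $n_P$ and $n_Q$ such that $n_P + n_Q \le n$. Clearly the maximum degree of $R = PQ$ is at most $n$. Thus we have that $d_P \le (n_P + 1)^k$, $d_Q \le (n_Q + 1)^k$, and $d_R \le (n + 1)^k$. Moreover, $h_P$ and $h_Q$ are $\tilde{O}(n)$ by assumption, and $l_{\min\{d_P, d_Q\}} \le \log_2((n + 1)^k + 1) = O(k \log n) = \tilde{O}(n)$ because $k \le n$. Consequently, $h = \tilde{O}(n)$.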
By Proposition 1 we get that $R$ can be computed with $O\bigl(h (n + 1)^k + k^2 \log(n + 1) + (n + 1)^k\bigr) = \tilde{O}(n^{k+1})$ bit-operations. This completes the proof of Fact 2.
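The role of the bound $d_R \le (n + 1)^k$ can be illustrated by packing a $k$-variate polynomial into a dense array of length $(n + 1)^k$ and multiplying the arrays as univariate polynomials. The sketch below is not the algorithm of [33]: it uses a schoolbook convolution, which a fast method would replace, and the function name `packed_multiply` and the dictionary representation are assumptions for the example. It requires that every variable has degree at most $n$ in the product.

```python
from typing import Dict, Tuple

Poly = Dict[Tuple[int, ...], int]  # degree tuple e -> integer coefficient


def packed_multiply(P: Poly, Q: Poly, k: int, n: int) -> Poly:
    """Multiply P and Q, assuming each variable has degree <= n in P * Q."""
    base = n + 1
    size = base ** k  # upper bound (n + 1)^k on d_R

    def pack(e: Tuple[int, ...]) -> int:
        # Map (e_1, ..., e_k) to the integer with base-(n+1) digits e_1, ..., e_k.
        return sum(e_i * base ** i for i, e_i in enumerate(e))

    def unpack(index: int) -> Tuple[int, ...]:
        digits = []
        for _ in range(k):
            index, digit = divmod(index, base)
            digits.append(digit)
        return tuple(digits)

    p = [0] * size
    q = [0] * size
    for e, c in P.items():
        p[pack(e)] += c
    for e, c in Q.items():
        q[pack(e)] += c

    # Schoolbook convolution; no carries occur between digits because the
    # per-variable degrees of the product stay below the base.
    r = [0] * size
    for i, a in enumerate(p):
        if a == 0:
            continue
        for j, b in enumerate(q):
            if b != 0:
                r[i + j] += a * b

    return {unpack(i): c for i, c in enumerate(r) if c != 0}


if __name__ == "__main__":
    P = {(1, 0): 1, (0, 1): 1}    # z1 + z2
    Q = {(1, 0): 1, (0, 1): -1}   # z1 - z2
    print(packed_multiply(P, Q, k=2, n=2))  # z1^2 - z2^2
```

The packing is injective on the product precisely because each per-variable degree of $R$ is at most $n$, which is the same fact that yields $d_R \le (n + 1)^k$ in the proof.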