# Constrained Multilinear Detection and Generalized Graph Motifs

## Abstract

We introduce a new algebraic sieving technique to detect constrained multilinear monomials in multivariate polynomial generating functions given by an evaluation oracle. The polynomials are assumed to have coefficients from a field of characteristic two. As an application of the technique, we give an \(O^*(2^k)\)-time polynomial-space algorithm for the \(k\)-sized Graph Motif problem. We also introduce a new optimization variant of the problem, called Closest Graph Motif, and solve it within the same time bound. The Closest Graph Motif problem encompasses several previously studied optimization variants, such as Maximum Graph Motif, Min-Substitute Graph Motif, and Min-Add Graph Motif. Finally, we provide evidence that our result might be essentially tight: the existence of an \(O^*((2-\epsilon )^k)\)-time algorithm for the Graph Motif problem implies an \(O((2-\epsilon ')^n)\)-time algorithm for Set Cover.

### Keywords

Constrained multilinear detection · Graph motif · FPT algorithm

## 1 Introduction

Many hard combinatorial problems can be reduced to the framework of detecting whether a multivariate polynomial \(P(\mathbf {x})=P(x_1,x_2,\ldots ,x_n)\) has a monomial with specific properties of interest. In such a setup, \(P(\mathbf {x})\) is not available in explicit symbolic form but is implicitly defined by the problem instance at hand, and our access to \(P(\mathbf {x})\) is restricted to having an efficient algorithm for computing values of \(P(\mathbf {x})\) at points of our choosing. This framework was pioneered by Koutis [14], Williams [21], and Koutis and Williams [17] for use in the domain of parameterized subgraph containment problems, and it currently underlies the fastest known parameterized algorithms for many basic tasks such as path and packing problems [4].

The present paper is motivated by recent works of Guillemot and Sikora [13] and Koutis [16], who observed that functional motif discovery problems in bioinformatics are also amenable to efficient parameterized solution in the polynomial framework. Following Koutis [16], applications in this domain require one to detect monomials in \(P(\mathbf {x})\) that are both *multilinear* and further *constrained* by means of colors assigned to variables \(\mathbf {x}\), so that the combined degree of variables of each color in the monomial may not exceed a given maximum multiplicity for that color. Our objectives in this paper are to (1) present an improved algebraic technique for constrained multilinear detection in polynomials over fields of characteristic 2, (2) generalize the technique to allow for approximate matching at cost, and (3) derive improved algorithms for graph motif problems, together with evidence that our algorithms may be optimal in the exponential part of their running time. We also introduce a new common generalization—the *closest graph motif problem*—that tracks the weighted edit distance between the target motif and each candidate pattern; this in particular generalizes both the minimum substitution and minimum addition variants of the graph motif problem introduced by Dondi et al. [10].

Let us now describe our main results in more detail, starting with algebraic contributions and then proceeding to applications in graph motifs. All the algebraic contributions rely essentially on what can be called the “substitution-sieving” method in characteristic 2 [2, 4].

### 1.1 Multilinearity

To ease the exposition and the subsequent proofs, it will be convenient to start with a known, non-constrained version of the substitution sieve that exposes multilinear monomials.

Let \(P(\mathbf {x})=P(x_1,x_2,\ldots ,x_n)\) be a multivariate polynomial over a field of characteristic 2 such that every monomial \(x_1^{d_1}x_2^{d_2}\cdots x_n^{d_n}\) has total degree \(d_1+d_2+\cdots +d_n=k\). A monomial is *multilinear* if \(d_1,d_2,\ldots ,d_n\in \{0,1\}\).

For an integer \(n\), let us write \([n]=\{1,2,\ldots ,n\}\). Let \(L\) be a set of \(k\) labels. For each index \(i\in [n]\) and label \(j\in L\), introduce a new variable \(z_{i,j}\). Denote by \(\mathbf {z}\) the vector of all variables \(z_{i,j}\).

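**Lemma 1** (**Non-constrained multilinear detection** [2, 4]) The polynomial \(P(\mathbf {x})\) has at least one multilinear monomial if and only if the polynomial

\[Q(\mathbf {z})=\sum _{A\subseteq L}P\bigl (z_1^A,z_2^A,\ldots ,z_n^A\bigr )\qquad (1)\]

is not identically zero, where \(z_i^A=\sum _{j\in A}z_{i,j}\) for all \(i\in [n]\) and \(A\subseteq L\).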

*Remark* We can now observe the basic structure of the sieve (1): by making \(2^k\) substitutions of the new variables \(\mathbf {z}\) into \(P(\mathbf {x})\), we reduce the question of existence of a multilinear monomial in \(P(\mathbf {x})\) into the question whether the polynomial \(Q(\mathbf {z})\) is not identically zero. The latter can be tested probabilistically by one *evaluation* of \(Q(\mathbf {z})\) at a random point, which reduces via (1) into *evaluations* of \(P(\mathbf {x})\) at \(2^k\) points. This will be the basic structure in all our subsequent algorithm designs.
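The structure described in the remark can be sketched in a few lines of Python. The sketch assumes that the sieve (1) sums \(P\) over the substitutions \(x_i\mapsto \sum _{j\in A}z_{i,j}\) for all \(A\subseteq L\). The field \(\mathbb {F}_{2^8}\), the helper names `gf_mul` and `sieve`, and the two toy polynomials are illustrative choices and not part of the paper.

```python
import random

MOD = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^8); addition in this field is XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:      # reduce as soon as the degree reaches 8
            a ^= MOD
        b >>= 1
    return r

def sieve(P, n, k, rng):
    """Evaluate the assumed sieve sum_{A subseteq L} P(z_1^A, ..., z_n^A)
    at a uniformly random point z, using 2^k oracle calls to P."""
    z = [[rng.randrange(256) for _ in range(k)] for _ in range(n)]
    total = 0
    for mask in range(1 << k):          # mask encodes the subset A of labels
        point = []
        for i in range(n):
            s = 0
            for j in range(k):
                if mask >> j & 1:
                    s ^= z[i][j]        # z_i^A = sum of z_{i,j} over j in A
            point.append(s)
        total ^= P(point)
    return total

# Toy oracles of degree k = 2 on n = 3 variables (illustrative only).
P_yes = lambda x: gf_mul(x[0], x[1]) ^ gf_mul(x[2], x[2])  # x1*x2 + x3^2
P_no = lambda x: gf_mul(x[0], x[0]) ^ gf_mul(x[2], x[2])   # x1^2 + x3^2
```

For `P_no` every run returns zero, because squares cancel in characteristic 2. For `P_yes` a single run can fail only when the random point happens to be a root of a nonzero polynomial of degree 2, which over \(\mathbb {F}_{2^8}\) has probability at most \(2/256\).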

### 1.2 Constrained Multilinearity

We are now ready to state our main algebraic contribution. Let \(C\) be a set of at most \(n\) *colors* such that each color \(q\in C\) has a *maximum multiplicity* \(m(q)\in \{0,1,\ldots ,n\}\). Associate with each index \(i\in [n]\) a color \(c(i)\in C\). Let us say that a monomial \(x_1^{d_1}x_2^{d_2}\cdots x_n^{d_n}\) is *properly colored* if the number of occurrences of each color is at most its maximum multiplicity, or equivalently, for all \(q\in C\) it holds that \(\sum _{i\in c^{-1}(q)}d_i\le m(q)\).

Associate with each color \(q\in C\) a set \(S_q\) of \(m(q)\) *shades* of the color \(q\), such that \(S_q\) and \(S_{q'}\) are disjoint whenever \(q\ne q'\). Let \(S=\cup _{q\in C}S_q\).

For each index \(i\in [n]\) and each shade \(d\in S_{c(i)}\), introduce a new variable \(v_{i,d}\). For each shade \(d\in S\) and each label \(j\in L\), introduce a new variable \(w_{d,j}\).

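**Lemma 2** (**Constrained multilinear detection**) The polynomial \(P(\mathbf {x})\) has at least one monomial that is both multilinear and properly colored if and only if the polynomial

\[Q(\mathbf {v},\mathbf {w})=\sum _{A\subseteq L}P\bigl (u_1^A,u_2^A,\ldots ,u_n^A\bigr )\qquad (2)\]

is not identically zero, where \(u_i^A=\sum _{j\in A}u_{i,j}\) and \(u_{i,j}=\sum _{d\in S_{c(i)}}v_{i,d}w_{d,j}\) for all \(i\in [n]\), \(j\in L\), and \(A\subseteq L\).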

*Remark*

This lemma enables us to (probabilistically) detect a constrained multilinear monomial of degree \(k\) using \(2^k\) evaluations of \(P(\mathbf {x})\), assuming that we are working over a sufficiently large field of characteristic 2. This solves an open problem posed by Koutis at a Dagstuhl seminar in 2010 [15], and forms the core of our algorithm in Theorem 4.
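A two-variable example may help to see the cancellation at work. It assumes that the sieve of Lemma 2 substitutes \(x_i\mapsto u_i^A=\sum _{j\in A}u_{i,j}\) with \(u_{i,j}=\sum _{d\in S_{c(i)}}v_{i,d}w_{d,j}\), the form that matches the inner sum quoted in Sect. 3.7. Take \(k=2\), \(L=\{1,2\}\), and \(P=x_1x_2\). Summing over the four subsets \(A\subseteq L\) in characteristic 2 leaves

\[\sum _{A\subseteq L}u_1^Au_2^A=u_{1,1}u_{2,2}+u_{1,2}u_{2,1}.\]

If both indices have the same color \(q\) with \(m(q)=1\) and single shade \(d\), the monomial is not properly colored and the sum vanishes:

\[u_{1,1}u_{2,2}+u_{1,2}u_{2,1}=v_{1,d}v_{2,d}\,\bigl (w_{d,1}w_{d,2}+w_{d,2}w_{d,1}\bigr )=0.\]

If the two indices have different colors with shades \(d\ne d'\), the sum equals \(v_{1,d}v_{2,d'}\,(w_{d,1}w_{d',2}+w_{d,2}w_{d',1})\), which is not identically zero.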

### 1.3 Cost-Constrained Multilinearity

The previous setting admits a generalization where we associate *costs* to decisions to arrive at a proper coloring. Accordingly, we assume that no coloring \(c:[n]\rightarrow C\) has been fixed *a priori*, but instead associate with each index \(i\in [n]\) and each color \(q\in C\) a nonnegative integer \(\kappa _i(q)\), the *cost* of assigning the color \(q\) to \(i\).

Once a coloring \(c:[n]\rightarrow C\) has been assigned, the *cost* of a monomial \(x_1^{d_1}x_2^{d_2}\cdots x_n^{d_n}\) in the assigned coloring is \(\sum _{i\in [n]} d_i\kappa _i(c(i))\). The objective now becomes to detect a multilinear monomial that has the minimum cost under a proper coloring.

For each index \(i\in [n]\) and each shade \(d\in S\), introduce a new variable \(v_{i,d}\). For each shade \(d\in S\) and each label \(j\in L\), introduce a new variable \(w_{d,j}\). Introduce a new variable \(\eta \).

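**Lemma 3** (**Cost-constrained multilinear detection**) The polynomial \(P(\mathbf {x})\) has at least one monomial that is both multilinear and admits a proper coloring with cost \(\sigma \) if and only if the polynomial

\[Q(\mathbf {v},\mathbf {w},\eta )=\sum _{A\subseteq L}P\bigl (u_1^A,u_2^A,\ldots ,u_n^A\bigr )\]

has at least one monomial whose degree in the variable \(\eta \) is \(\sigma \), where \(u_i^A=\sum _{j\in A}u_{i,j}\) and \(u_{i,j}=\sum _{q\in C}\eta ^{\kappa _i(q)}\sum _{d\in S_q}v_{i,d}w_{d,j}\).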

*Remark*

The previous lemma may be extended to track multiple cost parameters \(\eta _1,\eta _2,\ldots \) simultaneously. In fact, this will be convenient in our algorithm underlying Theorem 5. We also observe that in applications one typically works with a (random) evaluation in the variables \(\mathbf {v}\) and \(\mathbf {w}\), but seeks to recover an explicit polynomial in \(\eta \) as the output of the sieve, typically by a sequence of evaluations at distinct points, followed by interpolation to recover the polynomial in \(\eta \).
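The following Python sketch illustrates how the degree in \(\eta \) tracks cost. It assumes the substitution \(u_{i,j}=\sum _{q\in C}\eta ^{\kappa _i(q)}\sum _{d\in S_q}v_{i,d}w_{d,j}\), works over \(\mathbb {F}_{2^8}\), and for brevity keeps \(\eta \) symbolic as a coefficient list instead of interpolating from evaluations. All function names and the toy instance are hypothetical.

```python
import random

MOD = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply in GF(2^8); addition is XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return r

def poly_add(p, q):
    """Add polynomials in eta (coefficient lists, lowest degree first)."""
    n = max(len(p), len(q))
    p, q = p + [0] * (n - len(p)), q + [0] * (n - len(q))
    return [a ^ b for a, b in zip(p, q)]

def poly_mul(p, q):
    """Multiply polynomials in eta with coefficients in GF(2^8)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out

def cost_sieve(P, n, k, shades, kappa, rng):
    """Coefficients in eta of the assumed sieve at a random point (v, w).

    shades[q] lists the shades of color q; kappa[i][q] is the cost of
    assigning color q to index i. P receives n polynomials in eta.
    """
    all_shades = [d for q in shades for d in shades[q]]
    v = {(i, d): rng.randrange(256) for i in range(n) for d in all_shades}
    w = {(d, j): rng.randrange(256) for d in all_shades for j in range(k)}
    u = [[[0] for _ in range(k)] for _ in range(n)]
    for i in range(n):
        for j in range(k):
            for q in shades:
                inner = 0
                for d in shades[q]:
                    inner ^= gf_mul(v[i, d], w[d, j])
                # inner sum multiplied by eta^{kappa_i(q)}
                u[i][j] = poly_add(u[i][j], [0] * kappa[i][q] + [inner])
    total = [0]
    for mask in range(1 << k):          # mask encodes A subseteq L
        point = []
        for i in range(n):
            s = [0]
            for j in range(k):
                if mask >> j & 1:
                    s = poly_add(s, u[i][j])
            point.append(s)
        total = poly_add(total, P(point))
    return total

# Toy instance: P = x1*x2, two colors with one shade each.
shades = {"a": ["a1"], "b": ["b1"]}
kappa = [{"a": 0, "b": 1}, {"a": 0, "b": 2}]
P = lambda x: poly_mul(x[0], x[1])
```

In the toy instance the only proper colorings give the two indices different colors, with cost \(0+2=2\) or \(1+0=1\). Accordingly, the coefficients of \(\eta ^0\) and \(\eta ^3\) in the output are always zero, while those of \(\eta ^1\) and \(\eta ^2\) are nonzero for most random points.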

### 1.4 Graph Motif Problems

*Background.* Graph motif problems were introduced by Lacroix et al. [18] and motivated by applications in bioinformatics, specifically in metabolic network analysis. The Maximum Graph Motif problem was introduced by Dondi et al. [9]. It is known to be NP-hard even when the given graph is a tree of maximum degree 3 and each color may occur at most once [12]. However, in practice the parameter \(k\) is expected to be small, which motivates research on so-called FPT algorithms parameterized by \(k\), that is, algorithms with running times bounded from above by a function \(f(k)\) times a polynomial in the input size, commonly abbreviated as \(O^*(f(k))\). Indeed, Fellows et al. [11] discovered that such an algorithm exists, which was followed by a rapid series of improvements to \(f(k)\) [1, 11, 13], culminating in the \(O^*(2.54^k)\)-time algorithm of Koutis [16] (see Table 1).

**Table 1** Progress on FPT algorithms for the \(k\)-sized graph motif problem

From a high-level perspective, the two key ideas underlying our main theorem in this section are (1) an observation of Guillemot and Sikora [13] that *branching walks* [19] yield an efficient polynomial generating function for connected sets, and (2) Lemma 2, which builds on work by Koutis [16]. More precisely, our approach is inspired by Koutis’s beautiful idea of assigning random subspaces of dimension equal to the prescribed multiplicities of the colors. Koutis used group algebras \(\mathbb {F}_2[\mathbb {Z}_2^k]\) for his construction, whereas ours appears to require an extension to \(\mathbb {F}_{2^\beta }[\mathbb {Z}_2^k]\) for \(\beta = \varOmega (\log k)\). Rather than proving the result in terms of a group algebra as Koutis suggests, we provide a construction based on the inclusion–exclusion principle and labelled indeterminates. As in [4], a paper using the technique co-authored by a subset of the present authors, we find it more convenient to work in this alternative setting.

*Our results.* The coefficient \(\mu =O(\log k\log \log k\log \log \log k)\) in the following theorem reflects the time complexity of basic arithmetic (addition, multiplication) in a finite field of size \(O(k)\) and characteristic 2 [6].

**Theorem 4**

There exists a Monte Carlo algorithm for Maximum Graph Motif that runs in \(O(2^kk^2 e\mu )\) time and in polynomial space, with the following guarantees: (i) the algorithm always returns NO when given a NO-instance as input, (ii) the algorithm returns YES with probability at least \(1/2\) when given a YES-instance as input.

*Remark*

We observe that the algorithm in Theorem 4 runs in time linear in the number of edges \(e\) of the host graph \(H\). Furthermore, the exponential part \(2^k\) of the running time is caused by the sieve (2), implying that the algorithm can be executed in parallel on up to \(2^k\) processors with essentially linear speedup. A caveat is that the algorithm solves only the YES/NO decision problem; however, it can be extended to extract a solution set \(K\) at an additional multiplicative cost of \(k\) in the running time. This extension will be pursued elsewhere.

### 1.5 Weighted Edit Distance and the Closest Motif Problem

Koutis [16] gives an \(O^*(2.54^k)\)-time algorithm for Min-Add Graph Motif and an \(O^*(5.08^k)\)-time algorithm for Min-Substitute Graph Motif.

Our objective here is to generalize the graph motif framework to weighted edit distance between \(M\) and \(c(K)\) by introducing a common generalization, the Closest Graph Motif problem. We then use Lemma 3 to obtain an \(O^*(2^k)\)-time algorithm for the problem.

- (S) substitute one occurrence of a color \(q\in M\) with a color \(q'\in C_0\),

- (I) insert one occurrence of a color \(q\in C_0\) to \(M\), and

- (D) delete one occurrence of a color \(q\in M\) from \(M\).

The basic operations have associated *costs* \(\sigma _{\mathrm {S}}\), \(\sigma _{\mathrm {I}}\), \(\sigma _{\mathrm {D}}\), respectively. The *cost* (or *weighted edit distance*) to match \(M\) with \(N\) is the minimum cost of a sequence of basic operations that transforms \(M\) to \(N\), where the cost of the sequence is the sum of costs of the basic operations in the sequence.

*Our results.* Our main result in this section is as follows.

**Theorem 5**

There exists a Monte Carlo algorithm for Closest Graph Motif that runs in \(O((2^kk^4+|C_0|k^3)e\mu )\) time and in polynomial space, with the following guarantees: (i) the algorithm always returns NO when given a NO-instance as input, (ii) the algorithm returns YES with probability at least \(1/2\) when given a YES-instance as input.

### 1.6 A Lower Bound

We show that for any \(\epsilon >0\) the existence of an \(O^*((2-\epsilon )^k)\)-time algorithm for Maximum Graph Motif implies an \(O((2-\epsilon ')^n)\)-time algorithm for Set Cover, for some \(\epsilon '>0\). Thus, instead of trying to improve our algorithm, one should rather attack Set Cover directly, for which all attempts to obtain an \(O((2-\epsilon )^n)\)-time algorithm have failed, despite extensive effort. Indeed, the nonexistence of such an algorithm is already used as a basis for hardness results [7]. Furthermore, it is conjectured [7] that an \(O((2-\epsilon )^n)\)-time algorithm for Set Cover contradicts the Strong Exponential Time Hypothesis (SETH), which states that if \(k\)-CNF SAT can be solved in \(O^*(c_k^n)\) time, then \(\lim _{k\rightarrow \infty } c_k=2\). This conjecture is further supported by the fact that the number of solutions to an instance of Set Cover cannot be computed in \(O((2-\epsilon )^n)\) time for any \(\epsilon >0\) unless SETH fails [7]. A further consequence of such a counting algorithm would be the existence of an \(O((2-\epsilon ')^n)\)-time algorithm to compute the permanent of an \(n\times n\) integer matrix [3].

**Theorem 6**

1. each color may occur at most once, or
2. there are exactly two colors.

### 1.7 Organization

Our two main lemmas, Lemma 2 and Lemma 3, are proved in Sect. 2. Theorem 4 is proved in Sect. 3. Theorem 5 is proved in Sect. 4. Theorem 6 is proved in Sect. 5.

## 2 Algebraic Tools

This section proves Lemma 2 and Lemma 3. We start with a proof of Lemma 1 that will act as a building block of both proofs.

### 2.1 Proof of Lemma 1

*image* of \(f\) by

*surjective* if \(I(f)=L\). Since all but surjective \(f\) cancel, from (5) and the previous analysis we thus have

So suppose there exists at least one bad index \(b\in [n]\) with \(d_b\ge 2\). Let us fix \(b\) to be the minimum such index. Consider an arbitrary surjective \(n\)-tuple \(f=(f_1,f_2,\ldots ,f_n)\). Since \(|L|=k=d_1+d_2+\cdots +d_n\) and \(f\) is surjective, we must have that for every \(i\in [n]\) the function \(f_i\) is bijective; in particular, \(f_b(1)\ne f_b(2)\).

*mate* \(f'\) of \(f\) by setting \(f_i'=f_i\) for all \(i\in [n]\setminus \{b\}\) and

The lemma now follows by linearity. Indeed, an arbitrary multivariate polynomial \(P(x_1,x_2,\ldots ,x_n)\) is a sum of monomials of the form \(\alpha \, x_1^{d_1}x_2^{d_2}\cdots x_n^{d_n}\), where \(\alpha \) is a coefficient from the considered field of characteristic two. \(\square \)

### 2.2 Proof of Lemma 2

Let us say that a function \(h:K\rightarrow S\) that associates a shade \(h(i)\in S\) to each \(i\in K\) is *valid* if it holds that \(h(i)\in S_{c(i)}\) for all \(i\in K\). Observe in particular that an *injective* valid \(h:K\rightarrow S\) exists if and only if \(\prod _{i\in K} x_i\) is properly colored.

Now, let us fix an arbitrary valid \(h:K\rightarrow S\). We will show that the inner sum in (9) evaluates to zero in characteristic 2 unless \(h\) is injective.

*mate* \(g'\) of \(g\) by setting

So suppose that \(h\) is injective. (Recall that such an \(h\) exists if and only if \(K\) defines a properly colored multilinear monomial.) Let us study the inner sum in (9). Fix an arbitrary bijective \(g:K\rightarrow L\) and study the inner monomial \(\prod _{i\in K}v_{i,h(i)}w_{h(i),g(i)}\). From the variables \(v_{i,d}\) in the monomial we can reconstruct the set \(K\) and the mapping \(h\). Because \(h\) is injective, we can reconstruct the mapping \(g\) from the variables \(w_{d,j}\) in the monomial by setting \(g(h^{-1}(d))=j\) for each relevant pair \((d,j)\). Since the three-tuple \((K,h,g)\) can be reconstructed from the inner monomial, no cancellation happens in characteristic 2.

The lemma follows again by linearity. \(\square \)

### 2.3 Proof of Lemma 3

So suppose that \(h\) is injective. Observe that we can reconstruct the three-tuple \((K,h,g)\) from the corresponding monomial in (10) exactly as in the proof of Lemma 2, and thus no further cancellation happens in characteristic 2. The degree of \(\eta \) is clearly the cost of the monomial \(\prod _{i\in K}x_i\) in its coloring \(c=\pi h\). In particular, we have that \(\prod _{i\in K}x_i\) is properly colored in \(c\) since \(h\) is injective.

The lemma follows again by linearity. \(\square \)

### 2.4 Remarks

It is immediate from the proofs that the polynomial \(P(\mathbf {x})\) may have additional variables \(P(\mathbf {x},\mathbf {y})\) without changing the conclusion as regards multilinearity and proper coloring of the monomials when restricted to the variables \(\mathbf {x}\). Furthermore, any monomial that has total degree less than \(k\) in the variables \(\mathbf {x}\) will cancel.

We observe that Lemma 3 subsumes Lemma 2. Indeed, given a coloring \(c:[n]\rightarrow C\) we can set the costs for Lemma 3 so that \(\kappa _i(q)=0\) if \(c(i)=q\) and \(\kappa _i(q)=1\) otherwise. Then, \(P(\mathbf {x})\) has at least one monomial that is both multilinear and properly colored if and only if \(Q(\mathbf {v},\mathbf {w},\eta )\) has at least one monomial whose degree in the variable \(\eta \) is \(\sigma =0\).
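The reduction can be written out explicitly under the assumption that the substitution in Lemma 3 has the form \(u_{i,j}=\sum _{q\in C}\eta ^{\kappa _i(q)}\sum _{d\in S_q}v_{i,d}w_{d,j}\). With the 0/1 costs above, the substitution splits as

\[u_{i,j}=\sum _{d\in S_{c(i)}}v_{i,d}w_{d,j}\;+\;\eta \sum _{q\ne c(i)}\ \sum _{d\in S_q}v_{i,d}w_{d,j}.\]

Setting \(\eta =0\) leaves exactly the inner sum of Lemma 2, so the part of \(Q(\mathbf {v},\mathbf {w},\eta )\) that has degree 0 in \(\eta \) coincides with the polynomial \(Q(\mathbf {v},\mathbf {w})\) of Lemma 2.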

## 3 An Algorithm for the Maximum Graph Motif Problem

This section illustrates the use of Lemma 2 in a concrete algorithm design for Maximum Graph Motif. In particular, we proceed to give a proof of Theorem 4.

Consider an instance \((H,M,C,c,k)\) of Maximum Graph Motif. Let us write \(m(q)\) for the number of occurrences of color \(q\in C\) in the multiset \(M\). Also recall that we assume that the host graph \(H\) is connected with \(n\) vertices and \(e\) edges; in particular, \(e\ge n-1\). By preprocessing we may assume that \(m(q)\le k\) for each \(q\in C\).

Our first objective is to arrive at a generating polynomial \(P_k(\mathbf {x},\mathbf {y})\) that we can use with Lemma 2. There are two key aspects to this quest: (i) the multilinear monomials need to reflect the connected vertex sets of size \(k\) in \(H\), and (ii) we must have a fast algorithm for evaluating the polynomial at specific points.

### 3.1 Branching Walks

The concept of branching walks was first introduced by Nederlof [19] to sieve for Steiner trees, followed by Guillemot and Sikora [13] who observed that branching walks can also be employed to span connected vertex sets of size \(k\) in the host graph \(H\). Our approach here is to capitalize on this observation and span connected sets via branching walks.

Let us write \(V=V(H)=\{1,2,\ldots ,n\}\) for the vertex set and \(E=E(H)\) for the edge set of the host graph \(H\). A mapping \(\varphi :V(T)\rightarrow V(H)\) is a *homomorphism* from a graph \(T\) to the host \(H\) if for all \(\{a,b\}\in E(T)\) it holds that \(\{\varphi (a),\varphi (b)\}\in E(H)\). We adopt the convention of calling the elements of \(V(T)\) *nodes* and the elements of \(V(H)\) *vertices*.

A *branching walk* in \(H\) is a pair \(W=(T,\varphi )\) where \(T\) is an ordered rooted tree with node set \(V(T)=\{1,2,\ldots ,|V(T)|\}\) such that every node \(a\in V(T)\) coincides with its rank in the preorder traversal of \(T\), and \(\varphi :V(T)\rightarrow V(H)\) is a homomorphism from \(T\) to \(H\).

Let \(W=(T,\varphi )\) be a branching walk in \(H\). The walk *starts* from the vertex \(\varphi (1)\) in \(H\). The walk *spans* the vertices \(\varphi (V(T))\) in \(H\). The *size* of the walk is \(|V(T)|\). The walk is *simple* if \(\varphi \) is injective. Finally, the walk is *properly ordered* if any two sibling nodes \(a<b\) in \(T\) satisfy \(\varphi (a)<\varphi (b)\) in \(H\).
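The definitions above translate directly into code. The sketch below represents the tree \(T\) by a hypothetical `parent` dictionary (node 1 is the root), the map \(\varphi \) by a dictionary `phi`, and the host graph by adjacency sets `adj`; these representations and all function names are assumptions made for illustration.

```python
def preorder_labelled(parent):
    """Check that nodes 1..size coincide with their preorder ranks.

    parent maps each non-root node a >= 2 to its parent (keys are assumed
    to be exactly 2..size); children are ordered by label.
    """
    size = len(parent) + 1
    children = {a: [] for a in range(1, size + 1)}
    for a in range(2, size + 1):
        children[parent[a]].append(a)
    order, stack = [], [1]
    while stack:
        a = stack.pop()
        order.append(a)
        stack.extend(reversed(children[a]))  # visit smallest child first
    return order == list(range(1, size + 1))

def is_branching_walk(parent, phi, adj):
    """T must be preorder-labelled and phi must map tree edges to edges of H."""
    return preorder_labelled(parent) and all(
        phi[a] in adj[phi[p]] for a, p in parent.items())

def spans(phi):
    """The set of vertices of H spanned by the walk."""
    return set(phi.values())

def is_simple(phi):
    """The walk is simple if phi is injective."""
    return len(set(phi.values())) == len(phi)

def is_properly_ordered(parent, phi):
    """Any two sibling nodes a < b must satisfy phi(a) < phi(b)."""
    by_parent = {}
    for a in sorted(parent):
        by_parent.setdefault(parent[a], []).append(a)
    return all(phi[a] < phi[b]
               for sibs in by_parent.values()
               for a, b in zip(sibs, sibs[1:]))

# Example: H is the star with center 2 and leaves 1, 3, 4.
adj = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
parent = {2: 1, 3: 1}          # root 1 with children 2 and 3
phi = {1: 2, 2: 1, 3: 3}       # the walk starts from vertex 2
```

In the example the walk has size 3, spans \(\{1,2,3\}\), and is both simple and properly ordered; swapping the images of the two children keeps it simple but breaks the proper ordering.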

### 3.2 A Generating Polynomial for Branching Walks

We now define a generating polynomial for properly ordered branching walks of size \(k\) in \(H\). Introduce a variable \(x_u\) for each vertex \(u\in V(H)\) and two variables \(y_{(u,v)}\) and \(y_{(v,u)}\) for each edge \(\{u,v\}\in E(H)\).

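The *monomial fingerprint* of a branching walk \(W=(T,\varphi )\) is the monomial

\[F(W)=\prod _{\{a,b\}\in E(T),\ a<b}y_{(\varphi (a),\varphi (b))}\,x_{\varphi (b)}.\]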

Define the generating polynomial \(P_{k,s}(\mathbf {x},\mathbf {y})\) as the sum of the monomial fingerprints of the properly ordered branching walks that start from \(s\) and have size \(k\). Let \(P_k(\mathbf {x},\mathbf {y})=\sum _{s\in V(H)} x_{s} P_{k,s}(\mathbf {x},\mathbf {y})\). Observe that all monomials in \(P_k(\mathbf {x},\mathbf {y})\) have total degree \(2k-1\).

**Lemma 7**

A monomial in \(P_k(\mathbf {x},\mathbf {y})\) is multilinear in the variables \(\mathbf {x}\) if and only if it originates from a monomial fingerprint of a simple branching walk. Moreover, such a simple branching walk can be reconstructed from its monomial fingerprint.

*Proof*

For the first claim it suffices to consider an arbitrary monomial of \(P_k(\mathbf {x},\mathbf {y})\) and observe that the degree of the variable \(x_u\) indicates how many times \(u\in V(H)\) occurs in the image of \(\varphi \). In particular, \(\varphi \) is injective if and only if the monomial is multilinear in the variables \(\mathbf {x}\).

For the second claim, let \(W=(T,\varphi )\) be a simple and properly ordered branching walk that starts from \(s\). We must reconstruct \(W\) from its monomial fingerprint that has been multiplied by \(x_s\). Since \(\varphi \) is injective, we can immediately reconstruct (up to labels of the vertices) the rooted tree structure of \(T\) because the degrees of the variables \(y_{(u,v)}\) in the monomial (if any) reveal both the edges and the orientation of each edge in \(T\). Since \(W\) is properly ordered, we can reconstruct (up to labels of the vertices) the ordering of \(T\). Finally, we can reconstruct the vertex labels of \(T\) by carrying out a preorder traversal of \(T\).

An immediate corollary of Lemma 7 is that \((H,M,C,c,k)\) is a YES-instance of Maximum Graph Motif if and only if \(P_k(\mathbf {x},\mathbf {y})\) has a monomial that is both properly colored and multilinear in the sense of Lemma 2. Indeed, a multilinear monomial corresponds to a simple branching walk, which by definition spans a connected set of vertices. Conversely, every connected set of vertices admits at least one simple branching walk. Thus, to complete the proof of Theorem 4 it remains to derive a fast way to evaluate the polynomial \(P_k(\mathbf {x},\mathbf {y})\) and then apply Lemma 2 to obtain an algorithm design.

### 3.3 Evaluating the Generating Polynomial

This section develops a dynamic programming recurrence to evaluate the polynomial \(P_{k}(\mathbf {x},\mathbf {y})\) at a given assignment of values to the variables \(\mathbf {x},\mathbf {y}\).

For a vertex \(u\in V(H)\), denote the ordered sequence of neighbors of \(u\) in \(H\) by \(u_1<u_2<\cdots <u_{\deg _H(u)}\).
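One recurrence that fits this setup is sketched next. It is an illustration under the assumption that the monomial fingerprint contributes a factor \(y_{(u,v)}x_v\) for each tree edge mapped from \(u\) to \(v\); the auxiliary polynomials \(B_{\ell ,u,i}\) are introduced here for the sketch and need not coincide in form with the recurrences (11), (12), and (13). Let \(B_{\ell ,u,i}\) be the sum of the fingerprints of the properly ordered branching walks of size \(\ell \) that start from \(u\) and whose root has all its children mapped into \(\{u_i,u_{i+1},\ldots ,u_{\deg _H(u)}\}\). Then \(B_{1,u,i}=1\), \(B_{\ell ,u,\deg _H(u)+1}=0\) for \(\ell \ge 2\), and for \(\ell \ge 2\) and \(1\le i\le \deg _H(u)\),

\[B_{\ell ,u,i}=B_{\ell ,u,i+1}+y_{(u,u_i)}\,x_{u_i}\sum _{\ell _1=1}^{\ell -1}B_{\ell _1,u_i,1}\,B_{\ell -\ell _1,u,i+1},\]

with \(P_{k,s}=B_{k,s,1}\). The first term accounts for walks in which no child of the root is mapped to \(u_i\); the second term attaches a subwalk of size \(\ell _1\) that starts from \(u_i\). There are \(O(ke)\) triples \((\ell ,u,i)\) and each costs \(O(k)\) arithmetic operations, for a total of \(O(k^2e)\) operations per evaluation.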

### 3.4 The Algorithm

We are now ready to describe the algorithm for Theorem 4. Assume an instance \((H,M,C,c,k)\) of Maximum Graph Motif has been given as input.

Let \(b=\lceil \log _2 6k\rceil \) and consider the finite field \(\mathbb {F}_{2^b}\) of order \(2^b\). Introduce variables \(v_{i,d}\) and \(w_{d,j}\) as in the setup of Lemma 2. Assign a value from \(\mathbb {F}_{2^b}\) uniformly and independently at random to each of these variables. Similarly, as in the setup of Sect. 3.2, introduce two variables \(y_{(r,s)}\) and \(y_{(s,r)}\) to each edge \(\{r,s\}\in E(H)\) and assign a value to each variable uniformly and independently at random from \(\mathbb {F}_{2^b}\). We thus have three vectors of values in \(\mathbb {F}_{2^b}\), namely \(\mathbf {v}\), \(\mathbf {w}\), and \(\mathbf {y}\).
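The overall design can be sketched end to end in Python. The sketch is not the authors' implementation: it fixes the field \(\mathbb {F}_{2^8}\) (so it requires \(6k\le 256\)), assumes the substitution \(u_{i,j}=\sum _{d\in S_{c(i)}}v_{i,d}w_{d,j}\) for Lemma 2, and evaluates \(P_k\) with a dynamic program of its own over properly ordered branching walks. The names `eval_Pk` and `motif_once` are hypothetical.

```python
import random

MOD = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply in GF(2^8); addition is XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return r

def eval_Pk(adj, k, x, y):
    """Evaluate P_k(x, y) in O(k^2 e) field operations.

    B[l][u][i] sums the fingerprints of properly ordered branching walks of
    size l from u whose root's children map into nbrs[u][i:].
    """
    nbrs = {u: sorted(adj[u]) for u in adj}
    B = [None, {u: [1] * (len(nbrs[u]) + 1) for u in adj}]
    for l in range(2, k + 1):
        level = {}
        for u in adj:
            row = [0] * (len(nbrs[u]) + 1)
            for i in range(len(nbrs[u]) - 1, -1, -1):
                c = nbrs[u][i]           # smallest vertex used by a child
                acc = 0
                for l1 in range(1, l):   # size of the subwalk rooted at c
                    acc ^= gf_mul(B[l1][c][0], B[l - l1][u][i + 1])
                row[i] = row[i + 1] ^ gf_mul(gf_mul(y[u, c], x[c]), acc)
            level[u] = row
        B.append(level)
    total = 0
    for s in adj:
        total ^= gf_mul(x[s], B[k][s][0])
    return total

def motif_once(adj, color, mult, k, rng):
    """One run of the sieve; True means YES. mult[q] is m(q)."""
    assert 6 * k <= 256, "GF(2^8) is only large enough for small k"
    m = lambda i: mult.get(color[i], 0)
    v = {(i, t): rng.randrange(256) for i in adj for t in range(m(i))}
    w = {(q, t, j): rng.randrange(256)
         for q in mult for t in range(mult[q]) for j in range(k)}
    y = {(a, b): rng.randrange(256) for a in adj for b in adj[a]}
    u = {(i, j): 0 for i in adj for j in range(k)}
    for i in adj:
        for j in range(k):
            for t in range(m(i)):        # shades of the color of vertex i
                u[i, j] ^= gf_mul(v[i, t], w[color[i], t, j])
    total = 0
    for mask in range(1 << k):           # mask encodes A subseteq L
        x = {i: 0 for i in adj}
        for i in adj:
            for j in range(k):
                if mask >> j & 1:
                    x[i] ^= u[i, j]
        total ^= eval_Pk(adj, k, x, y)
    return total != 0

# Path 1-2-3 with colors r, g, r.
adj = {1: {2}, 2: {1, 3}, 3: {2}}
color = {1: "r", 2: "g", 3: "r"}
```

With \(k=3\), the motif \(\{r,r,g\}\) is matched by the whole path, so repeated runs report YES with high probability, whereas for the motif \(\{r,g,g\}\) every run reports NO because the contributions of the only connected set of size 3 cancel in the sieve.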

### 3.5 Running Time

To analyse the running time of the algorithm, observe that we can assume that \(m(q)\le k\). Thus, computing the values \(\mathbf {u}^A(\mathbf {v},\mathbf {w})\) for a fixed \(A\subseteq L\) takes \(O(k^2n)\) arithmetic operations in \(\mathbb {F}_{2^b}\), and each such operation can be implemented to run in time \(\mu =O(b\log b\log \log b)\) [6]. Furthermore, each evaluation of (11), (12), and (13) for a fixed \(A\) takes \(O(k^2e)\) arithmetic operations in \(\mathbb {F}_{2^b}\). Hence, recalling that \(e\ge n-1\), the total running time of the algorithm is \(O(2^kk^2e\mu )\).

### 3.6 Correctness

To establish the desired properties of the algorithm, observe that from Sect. 3.2 and Lemma 2 it follows that (14), viewed as a polynomial in the variables \(\mathbf {v}\), \(\mathbf {w}\), and \(\mathbf {y}\), is not identically zero if and only if \((H,M,C,c,k)\) is a YES-instance of Maximum Graph Motif. Thus, if \((H,M,C,c,k)\) is a NO-instance, then (14) evaluates to zero and the algorithm gives a NO output. Furthermore, if \((H,M,C,c,k)\) is a YES-instance, then (14) is an evaluation of a nonzero multivariate polynomial of total degree \(3k-1\) at a point \((\mathbf {v},\mathbf {w},\mathbf {y})\) selected uniformly at random. Recalling that \(2^b\ge 6k\), the following lemma thus implies that the value \(Q(\mathbf {v},\mathbf {w},\mathbf {y})\) is nonzero (and hence the algorithm outputs YES) with probability at least 1/2.

**Lemma 8**

([8, 20, 22]) A nonzero polynomial \(P(z_1,z_2,\ldots ,z_\ell )\) of total degree \(d\) with coefficients in the finite field \(\mathbb {F}_q\) has at most \(dq^{\ell -1}\) roots in \(\mathbb {F}_q^\ell \).
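To see how the bound \(1/2\) arises, apply Lemma 8 with \(d=3k-1\) and \(q=2^b\ge 6k\). The fraction of points in \(\mathbb {F}_q^\ell \) that are roots is at most

\[\frac{d\,q^{\ell -1}}{q^{\ell }}=\frac{3k-1}{2^b}\le \frac{3k-1}{6k}<\frac{1}{2},\]

so a uniformly random point is a non-root with probability greater than \(1/2\).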

This completes the proof of Theorem 4. \(\square \)

### 3.7 Minor Variants and Extensions

The basic framework presented above immediately allows for some minor variants and extensions, such as seeking an exact match instead of the maximum match by setting \(|M|=k\). Similarly, one may extend from a fixed coloring \(c:V(H)\rightarrow C\) into a *list coloring* version where each vertex \(i\in V(H)\) gets associated a list \(C(i)\subseteq C\) of valid colors instead of a single color \(c(i)\), and the motif \(M\) may match against any one of the colors in the list. This variant can be implemented by simply changing the inner sum in Lemma 2 to \(u_{i,j}=\sum _{d\in \cup _{q\in C(i)}S_q}v_{i,d}w_{d,j}\). That is, we sum over the shades of all the colors \(q\) in \(C(i)\).

## 4 An Algorithm for the Closest Graph Motif Problem

This section gives a proof of Theorem 5 using Lemma 3 and the generating function developed in Sect. 3.2.

Consider an instance \((H,M,C_0,c,\sigma _{\mathrm {S}},\sigma _{\mathrm {I}},\sigma _{\mathrm {D}},\tau ,k)\) of Closest Graph Motif with \(V(H)=\{1,2,\ldots ,n\}\). Let us again write \(m(q)\) for the number of occurrences of color \(q\in C_0\) in the multiset \(M\). We may assume that \(m(q)\le k\). Furthermore, since \(H\) is connected, the number of vertices \(n\) and the number of edges \(e\) satisfy \(e\ge n-1\).

The key step in arriving at Theorem 5 is to transport weighted edit distance into the setting of Lemma 3.

### 4.1 Optimum Edit Sequences

It will be convenient to have available the following lemma that characterizes the structure of a sequence of operations that realizes the minimum cost to transform a multiset \(M\) to the multiset \(N\), where both multisets are over \(C_0\).

Let \(k=|N|\). Consider an arbitrary sequence of basic operations that transforms \(M\) to \(N\). As the sequence is executed, each original element of \(M\) gets assigned into one of three classes. First, there are \(k_{\mathrm {U}}\) elements in \(M\) that remain untouched (and hence in \(N\)) when the execution terminates. Second, there are \(k_{\mathrm {S}}\) elements in \(M\) that undergo at least one substitution—which we may view as “recoloring” of the element—and remain in \(N\) when the execution terminates. Third, the remaining \(|M|-k_{\mathrm {U}}-k_{\mathrm {S}}\) elements of \(M\) get deleted during execution. Thus, at least \(k-k_{\mathrm {U}}-k_{\mathrm {S}}\) insertions must occur in the sequence. Let us call the values \(k_{\mathrm {U}}\) and \(k_{\mathrm {S}}\) the *parameters* of the sequence.

**Lemma 9**

*Proof*

The inequality is immediate from the preceding analysis; the sequence that meets equality (i) does nothing for the \(k_{\mathrm {U}}\) untouched original elements, (ii) substitutes the correct final color with one substitution for each of the \(k_{\mathrm {S}}\) originals, (iii) deletes each of the \(|M|-k_{\mathrm {U}}-k_{\mathrm {S}}\) remaining originals, and (iv) finally inserts \(k-k_{\mathrm {U}}-k_{\mathrm {S}}\) new elements to match with \(N\).
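The bound discussed in the proof can be checked on small multisets. The sketch below assumes nonnegative integer costs and that a sequence with parameters \(k_{\mathrm {U}}\) and \(k_{\mathrm {S}}\) costs at least \(\sigma _{\mathrm {S}}k_{\mathrm {S}}+\sigma _{\mathrm {D}}(|M|-k_{\mathrm {U}}-k_{\mathrm {S}})+\sigma _{\mathrm {I}}(k-k_{\mathrm {U}}-k_{\mathrm {S}})\), as the analysis preceding Lemma 9 suggests. The function `closed_form` minimizes this expression with \(k_{\mathrm {U}}\) taken as large as possible, and `brute_force` runs Dijkstra's algorithm over multisets as an independent check. Both names are illustrative.

```python
import heapq
from collections import Counter

def closed_form(M, N, sS, sI, sD):
    """Minimum cost via the parameters (k_U, k_S) of an edit sequence."""
    common = sum((Counter(M) & Counter(N)).values())  # largest possible k_U
    a, b = len(M) - common, len(N) - common           # leftovers in M and N
    # s substitutions, a - s deletions, b - s insertions
    return min(sS * s + sD * (a - s) + sI * (b - s)
               for s in range(min(a, b) + 1))

def brute_force(M, N, colors, sS, sI, sD):
    """Shortest path from M to N over multisets (sorted tuples)."""
    start, goal = tuple(sorted(M)), tuple(sorted(N))
    limit = len(M) + len(N)          # larger multisets are never needed
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, cur = heapq.heappop(heap)
        if cur == goal:
            return d
        if d > dist[cur]:
            continue                 # stale heap entry
        moves = []
        for idx in range(len(cur)):
            rest = cur[:idx] + cur[idx + 1:]
            moves.append((sD, rest))                           # (D)
            moves.extend((sS, rest + (q,))                     # (S)
                         for q in colors if q != cur[idx])
        if len(cur) < limit:
            moves.extend((sI, cur + (q,)) for q in colors)     # (I)
        for cost, nxt in moves:
            nxt = tuple(sorted(nxt))
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                heapq.heappush(heap, (d + cost, nxt))
    return None
```

For example, with \(M=\{a,a,b\}\) and \(N=\{a,c,c\}\), cheap substitutions (\(\sigma _{\mathrm {S}}=1\), \(\sigma _{\mathrm {I}}=\sigma _{\mathrm {D}}=2\)) give cost 2 via two substitutions, while expensive substitutions (\(\sigma _{\mathrm {S}}=5\), \(\sigma _{\mathrm {I}}=\sigma _{\mathrm {D}}=1\)) give cost 4 via two deletions and two insertions.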

### 4.2 The Algorithm

Assume an instance \((H,M,C_0,c,\sigma _{\mathrm {S}},\sigma _{\mathrm {I}},\sigma _{\mathrm {D}},\tau ,k)\) of Closest Graph Motif has been given as input.

The algorithm now proceeds as follows. Let \(b=\lceil \log _2 6k\rceil \) and consider the finite field \(\mathbb {F}_{2^b}\) of order \(2^b\). Introduce variables \(v_{i,d}\) and \(w_{d,j}\) as in the setup of Lemma 3. Assign a value from \(\mathbb {F}_{2^b}\) uniformly and independently at random to each of these variables. Similarly, as in the setup of Sect. 3.2, introduce two variables \(y_{(r,s)}\) and \(y_{(s,r)}\) to each edge \(\{r,s\}\in E(H)\) and assign a value to each variable uniformly and independently at random from \(\mathbb {F}_{2^b}\). We thus have three vectors of values in \(\mathbb {F}_{2^b}\), namely \(\mathbf {v}\), \(\mathbf {w}\), and \(\mathbf {y}\).

### 4.3 Running Time

The analysis is similar to that in Sect. 3.5, with two differences. First, the outer loop in the main part introduces a multiplicative factor \(k^2\) compared with Sect. 3.5. Second, the implementation of (20) requires us to sum over all the shades originating from \(M\) and the \(k\) shades of the color “\(*\)”. This can be done efficiently by precomputing the inner sums \(\sum _{d\in S_q}v_{i,d}w_{d,j}\) for each color \(q\in C\), index \(i\in [n]\), and label \(j\in L\), which takes \(O\bigl ((|M|+k)kn\mu \bigr )\) time outside the main loops. In the outer loop of the main part it thus suffices to compute only the outer sum in (20) for each choice of \((\eta _{\mathrm {S}},\eta _{\mathrm {ID}})\), which leads to \(O\bigl (|C_0|kn\mu \bigr )\) time for each iteration of the outer loop. In the inner loop over \(A\subseteq L\), it takes \(O(kn\mu )\) time to prepare the vector \(\mathbf {u}^A(\mathbf {v},\mathbf {w},\eta _{\mathrm {S}},\eta _{\mathrm {ID}})\). Compared with Sect. 3.5, this gives a further contributing factor of \(|C_0|k\) outside the inner loop. (The running time of the final interpolation step and of checking the at most \(k^2\) monomials of the bivariate polynomial \(Q(\mathbf {v},\mathbf {w},\mathbf {y},\eta _{\mathrm {S}},\eta _{\mathrm {ID}})\) with respect to (21) is assumed to be subsumed by the running time bound.)

### 4.4 Correctness

We start by observing that (17) and (18) imply that (19) has total degree at most \(k\) in the variables \(\eta _{\mathrm {S}}\) and \(\eta _{\mathrm {ID}}\), thus implying that Lagrange interpolation will correctly recover the polynomial in \(\eta _{\mathrm {S}}\) and \(\eta _{\mathrm {ID}}\) from the \((k+1)^2\) evaluations computed in the main loop.

Let us say that (19), viewed as a polynomial in all the variables \(\mathbf {v}\), \(\mathbf {w}\), \(\mathbf {y}\), \(\eta _{\mathrm {S}}\), \(\eta _{\mathrm {ID}}\), is *witnessing* if there exists at least one monomial whose degrees \(k_{\mathrm {S}}\) and \(k_{\mathrm {ID}}\) satisfy (21).

**Lemma 10**

The polynomial (19) is witnessing if and only if the given input is a YES-instance of Closest Graph Motif.

*Proof*

In the “only if” direction, consider a monomial of (19) whose degrees \(k_{\mathrm {S}}\) and \(k_{\mathrm {ID}}\) satisfy (21). From Lemma 3 we have that the polynomial \(P_k(\mathbf {x},\mathbf {y})\) has at least one monomial that is both multilinear in \(\mathbf {x}\) and admits a proper coloring with \(\mathrm {S}\)-cost \(k_{\mathrm {S}}\) and \(\mathrm {ID}\)-cost \(k_{\mathrm {ID}}\). From Sect. 3.2 it follows that this monomial of \(P_k(\mathbf {x},\mathbf {y})\) corresponds to a simple branching walk in \(H\) and thus identifies a connected set \(K\subseteq V(H)\) of vertices in \(H\). Furthermore, the existence of a proper coloring of the monomial implies by (17), (18), and Lemma 9 that there exists a sequence of basic operations that transforms the multiset \(M\) to the multiset \(c(K)\) with total cost (16). In particular, since \(k_{\mathrm {S}}\) and \(k_{\mathrm {ID}}\) satisfy (21), we have that \((H,M,C_0,c,\sigma _{\mathrm {S}},\sigma _{\mathrm {I}},\sigma _{\mathrm {D}},\tau ,k)\) is a YES-instance of Closest Graph Motif.

In the “if” direction, let \((H,M,C_0,c,\sigma _{\mathrm {S}},\sigma _{\mathrm {I}},\sigma _{\mathrm {D}},\tau ,k)\) be a YES-instance of Closest Graph Motif. Let \(K\subseteq V(H)\) be a solution set and consider an associated sequence \(\varDelta \) of basic operations that transforms \(M\) to \(c(K)\) with cost at most \(\tau \). We may without loss of generality assume that the cost of the sequence \(\varDelta \) satisfies equality in Lemma 9. In particular, from (16) we observe that the parameters \(k_{\mathrm {S}}\) and \(k_{\mathrm {ID}}\) of the sequence \(\varDelta \) satisfy (21). Consider a simple branching walk of size \(k\) in \(H\) that spans the vertices in \(K\). From Sect. 3.2 we observe that there is a corresponding multilinear monomial in \(P_k(\mathbf {x},\mathbf {y})\). Next observe that we can properly color this monomial in the sense of Lemma 3 by (i) assigning the color \(*\) to each of the \(k_{\mathrm {ID}}\) values \(i\in K\) that correspond to elements inserted in \(\varDelta \), (ii) assigning the substituted color to each of the \(k_{\mathrm {S}}\) values \(i\in K\) that correspond to elements of \(M\) receiving substitutions in \(\varDelta \), and (iii) assigning the color \(c(i)\) to each of the remaining \(k-k_{\mathrm {S}}-k_{\mathrm {ID}}\) values \(i\in K\) that correspond to elements of \(M\) that are not touched by \(\varDelta \). Furthermore, by (17) and (18), this proper coloring has \(\mathrm {S}\)-cost \(k_{\mathrm {S}}\) and \(\mathrm {ID}\)-cost \(k_{\mathrm {ID}}\). From Lemma 3 we thus have that (19), viewed as a polynomial in the variables \(\mathbf {v}\), \(\mathbf {w}\), \(\mathbf {y}\), \(\eta _{\mathrm {S}}\), \(\eta _{\mathrm {ID}}\), has at least one monomial whose degrees \(k_{\mathrm {S}}\) and \(k_{\mathrm {ID}}\) satisfy (21).

Let us now study the operation of the algorithm in more detail. We have that the given input is a NO-instance if and only if (19) is not witnessing. Thus, given a NO-instance as input, the algorithm always gives a NO output.

So suppose that the given input is a YES-instance. Since (19) is witnessing, there exist degrees \(k_{\mathrm {S}}\) and \(k_{\mathrm {ID}}\) that are present in a monomial of (19) such that (21) holds. In particular, the coefficient of the monomial \(\eta _{\mathrm {S}}^{k_{\mathrm {S}}}\eta _{\mathrm {ID}}^{k_{\mathrm {ID}}}\) computed by the algorithm is an evaluation of a nonzero multivariate polynomial of total degree \(3k-1\) at a point \((\mathbf {v},\mathbf {w},\mathbf {y})\) selected uniformly at random. Recalling that \(2^b\ge 6k\), Lemma 8 thus implies that the coefficient is nonzero (and hence the algorithm outputs YES) with probability at least 1/2. This completes the proof of Theorem 5. \(\square \)
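For concreteness, the arithmetic behind the probability bound can be spelled out as follows. This assumes that Lemma 8 is the usual DeMillo-Lipton-Schwartz-Zippel bound [8, 20, 22], namely that a nonzero polynomial of total degree \(d\) vanishes at a uniformly random point over the field of \(2^b\) elements with probability at most \(d/2^b\):

\[
\Pr \bigl [\text {coefficient}=0\bigr ]\;\le \;\frac{3k-1}{2^b}\;\le \;\frac{3k-1}{6k}\;<\;\frac{1}{2}.
\]

Hence the coefficient is nonzero with probability greater than \(1/2\) under this reading of Lemma 8.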

## 5 A Lower Bound Reduction from Set Cover

We base our proof of Theorem 6 on the following theorem, which can be extracted from the proof of Theorem 4.4 in a recent paper by Cygan et al. [7].

**Theorem 11**

([7]) If Set Cover can be solved in \(O((2-\epsilon )^{n+t})\) time for some \(\epsilon >0\) then it can also be solved in \(O((2-\epsilon ')^{n})\) time, for some \(\epsilon '>0\).

### 5.1 Proof of Theorem 6

Let \((\mathcal {S},t)\) be an instance of Set Cover. We are going to show a polynomial-time reduction to Maximum Graph Motif so that in the resulting instance \((H,C,m,c,k)\) we have \(\sum _{q\in C}m(q)=k=n+t+1\). Combined with Theorem 11, this reduction will immediately establish our claim.

The graph \(H\) is defined as follows. The vertex set consists of the universe \(U\), \(t\) copies of the family \(\mathcal {S}\), and a special vertex \(r\), that is, \(V(H) = U \cup \{s^j_i:i=1,2,\ldots ,m,\ j=1,2,\ldots ,t\} \cup \{r\}\). The edge set is \(E(H)=\{\{a,s^j_i\}:a \in S_i\} \cup \{\{r,s^j_i\}:i=1,2,\ldots ,m,\ j=1,2,\ldots ,t\}\). Let \(k=n+t+1\).

To establish part (1), let \(C=\{1,2,\ldots ,n+t+1\}\) with \(m(q)=1\) for each \(q\in C\). Furthermore, assign the colors to vertices so that \(c(s^j_i)=j\) for every \(i=1,2,\ldots ,m,\ j=1,2,\ldots ,t\) and \(c(r)=t+1\). Finally, assign the \(n\) colors \(t+2,t+3,\ldots ,n+t+1\) bijectively to the vertices in \(U\).

We show that \((\mathcal {S},t)\) is a YES-instance if and only if \((H,C,m,c,k)\) is a YES-instance. To establish the “only if” direction, suppose that \(S_{i_1},S_{i_2},\ldots ,S_{i_t}\) is a solution of \((\mathcal {S},t)\). Then let \(K=\{r\}\cup U \cup \{s^{j}_{i_j}:j=1,2,\ldots ,t\}\). It is clear that \(c(K)=C\) and that \(H[\{r\} \cup \{s^{j}_{i_j}:j=1,2,\ldots ,t\}]\) is connected. For every \(a\in U\) there is a \(j=1,2,\ldots ,t\) such that \(a\in S_{i_j}\), and hence \(\{a,s^j_{i_j}\} \in E(H[K])\). It follows that \(H[K]\) is connected, and hence \(K\) is a solution of \((H,C,m,c,k)\). To establish the “if” direction, suppose that \(K\) is a solution of \((H,C,m,c,k)\). Then for every \(j=1,2,\ldots ,t\) there is exactly one \(i_j\in \{1,2,\ldots ,m\}\) such that \(s^j_{i_j} \in K\), since \(c(K)=C\). Moreover, since \(H[K]\) is connected and the vertices in \(U\) are adjacent only to copies of sets, we observe that for every \(a\in U\) there is a \(j=1,2,\ldots ,t\) such that \(\{a,s^j_{i_j}\}\in E(H[K])\). But then \(a\in S_{i_j}\) and it follows that \(S_{i_1},S_{i_2},\ldots ,S_{i_t}\) is a solution of \((\mathcal {S} ,t)\).
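The construction of part (1) can be checked on small instances with the following sketch. The vertex encoding (`"r"`, `("u", a)`, `("s", i, j)`), the function names, and the brute-force search in `motif_exists` are illustrative assumptions and play no role in the reduction itself; the sketch also assumes that every set is a subset of the universe and that universe elements are sortable.

```python
from itertools import product

def build_instance(universe, sets, t):
    """Build H and the coloring c of part (1); returns (adjacency, color, k)."""
    universe = sorted(universe)
    adj = {"r": set()}
    color = {"r": t + 1}
    # The n universe vertices receive the colors t+2, ..., n+t+1 bijectively.
    for offset, a in enumerate(universe):
        adj[("u", a)] = set()
        color[("u", a)] = t + 2 + offset
    # The j-th copy of set S_i is joined to r and to every element of S_i.
    for j in range(1, t + 1):
        for i, members in enumerate(sets, start=1):
            s = ("s", i, j)
            adj[s] = {"r"} | {("u", a) for a in members}
            color[s] = j
            for w in adj[s]:
                adj[w].add(s)
    return adj, color, len(universe) + t + 1

def is_solution(adj, color, k, K):
    """Check that K has size k, uses each color 1..k exactly once, and H[K] is connected."""
    K = set(K)
    if len(K) != k or {color[v] for v in K} != set(range(1, k + 1)):
        return False
    start = next(iter(K))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v] & K:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == K

def motif_exists(universe, sets, t):
    """Brute force: the colors force r, all of U, and one set copy per index j into K."""
    adj, color, k = build_instance(universe, sets, t)
    base = {"r"} | {("u", a) for a in universe}
    for choice in product(range(1, len(sets) + 1), repeat=t):
        K = base | {("s", i, j) for j, i in enumerate(choice, start=1)}
        if is_solution(adj, color, k, K):
            return True
    return False
```

For example, with universe \(\{1,2,3,4\}\) and sets \(\{1,2\},\{3,4\},\{2,3\}\), calling `motif_exists` with \(t=2\) and with \(t=1\) exercises the “only if” and “if” directions respectively on a toy instance.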

To establish part (2), let \(C=\{1,2\}\) with \(m(1)=n+1\) and \(m(2)=t\). Set \(c(r)=1\) and \(c(a)=1\) for every \(a\in U\). All the remaining vertices are colored with 2. The proof of equivalence is similar to part (1) and is left to the reader. \(\square \)

## Notes

### Acknowledgments

A preliminary conference abstract of this work has appeared as [5]. This research was supported in part by the Swedish Research Council, Grant VR 2012-4730 (A. B.), the Academy of Finland, Grants 252083 and 256287 (P. K.), and by the National Science Centre of Poland, Grants N206 567140 and UMO-2013/09/B/ST6/03136 (Ł. K.). The third author thanks Sylwia Antoniuk, Marek Cygan, Michal Debski, and Matthias Mnich for helpful discussions on related topics.

### References

- 1. Betzler, N., Fellows, M.R., Komusiewicz, C., Niedermeier, R.: Parameterized algorithms and hardness results for some graph motif problems. In: Proceedings of CPM’08. LNCS, vol. 5029, pp. 31–43 (2008)
- 2. Björklund, A.: Determinant sums for undirected hamiltonicity. In: Proceedings of the FOCS’10, pp. 173–182 (2010)
- 3. Björklund, A.: Counting perfect matchings as fast as Ryser. In: Proceedings of the SODA’12, pp. 914–921 (2012)
- 4. Björklund, A., Husfeldt, T., Kaski, P., Koivisto, M.: Narrow sieves for parameterized paths and packings. CoRR abs/1007.1161 (2010)
- 5. Björklund, A., Kaski, P., Kowalik, L.: Probably optimal graph motifs. In: Portier, N., Wilke, T. (eds.) STACS. LIPIcs, vol. 20, pp. 20–31. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Wadern (2013)
- 6. Bürgisser, P., Clausen, M., Shokrollahi, M.A.: Algebraic Complexity Theory, Grundlehren der mathematischen Wissenschaften, vol. 315. Springer, New York (1997)
- 7. Cygan, M., Dell, H., Lokshtanov, D., Marx, D., Nederlof, J., Okamoto, Y., Paturi, R., Saurabh, S., Wahlström, M.: On problems as hard as CNF-SAT. In: IEEE Conference on Computational Complexity, pp. 74–84 (2012)
- 8. DeMillo, R.A., Lipton, R.J.: A probabilistic remark on algebraic program testing. Inf. Process. Lett. **7**, 193–195 (1978)
- 9. Dondi, R., Fertin, G., Vialette, S.: Maximum motif problem in vertex-colored graphs. In: Proceedings of the CPM’09. LNCS, vol. 5577, pp. 221–235 (2009)
- 10. Dondi, R., Fertin, G., Vialette, S.: Finding approximate and constrained motifs in graphs. In: Proceedings of the CPM’11. LNCS, vol. 6661, pp. 388–401 (2011)
- 11. Fellows, M.R., Fertin, G., Hermelin, D., Vialette, S.: Sharp tractability borderlines for finding connected motifs in vertex-colored graphs. In: Proceedings of the ICALP’07. LNCS, vol. 4596, pp. 340–351 (2007)
- 12. Fellows, M.R., Fertin, G., Hermelin, D., Vialette, S.: Upper and lower bounds for finding connected motifs in vertex-colored graphs. J. Comput. Syst. Sci. **77**(4), 799–811 (2011)
- 13. Guillemot, S., Sikora, F.: Finding and counting vertex-colored subtrees. In: Proceedings of the MFCS’10. LNCS, vol. 6281, pp. 405–416 (2010)
- 14. Koutis, I.: Faster algebraic algorithms for path and packing problems. In: Proceedings of the ICALP’08. LNCS, vol. 5125, pp. 575–586 (2008)
- 15. Koutis, I.: The power of group algebras for constrained multilinear monomial detection. In: Dagstuhl meeting 10441 (2010)
- 16. Koutis, I.: Constrained multilinear detection for faster functional motif discovery. Inf. Process. Lett. **112**(22), 889–892 (2012)
- 17. Koutis, I., Williams, R.: Limits and applications of group algebras for parameterized problems. In: ICALP (1). LNCS, vol. 5555, pp. 653–664 (2009)
- 18. Lacroix, V., Fernandes, C.G., Sagot, M.F.: Motif search in graphs: application to metabolic networks. IEEE/ACM Trans. Comput. Biol. Bioinform. **3**(4), 360–368 (2006)
- 19. Nederlof, J.: Fast polynomial-space algorithms using Möbius inversion: improving on Steiner tree and related problems. In: Proceedings of the ICALP’09. LNCS, vol. 5555, pp. 713–725 (2009)
- 20. Schwartz, J.T.: Fast probabilistic algorithms for verification of polynomial identities. J. ACM **27**(4), 701–717 (1980)
- 21. Williams, R.: Finding paths of length \(k\) in \(O^*(2^k)\) time. Inf. Process. Lett. **109**(6), 315–318 (2009)
- 22. Zippel, R.: Probabilistic algorithms for sparse polynomials. In: Proceedings of the International Symposium on Symbolic and Algebraic Computation. LNCS, vol. 72, pp. 216–226 (1979)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.