Background

Introduction

Of the many challenging problems related to understanding the biological function of DNA, RNA, proteins, and metabolic and signalling pathways, one of the most important is comparing the structures of different molecules. The underlying hypothesis is that structure determines function, so molecules with similar structures should have similar functions. If the biological structures can be abstracted as graphs, evaluating structural similarity reduces to comparing the resulting graphs.

Using bioinformatic techniques, biological structure matching can be formulated as the problem of finding a maximum common subgraph. The solution to this problem has important practical applications in many areas of bioinformatics as well as in other areas, such as pattern recognition and image processing [13]. For example, protein threading, an effective method to predict protein tertiary structure [4–8], and RNA structural homology searching, a method for annotating and identifying new non-coding RNAs [9–12], both align a target structure against structure templates in a template database.

Song et al. [13] make the following definitions and propose the following graphical models for RNA structural homology searching. A structural unit in a biopolymer sequence is a stretch of contiguous residues (nucleotides or amino acids). A non-structural stretch between two consecutive structural units is called a loop. A structure of the sequence is characterized by interactions among structural units; for example, the structural units in a tertiary protein are α helices and β strands, called cores. Given a biopolymer sequence, a structure graph H = (V, E, A) can be defined such that each vertex in V(H) represents a structural unit, each edge in E(H) represents the interaction between two structural units, and each arc in A(H) represents the loop bounded by two structural units. Similarly, the target sequence can be represented as a mixed graph G, called a sequence graph. Based on these graphical representations, the structure-sequence alignment problem can be formulated as the problem of finding in the sequence graph G a subgraph isomorphic to the structure graph H that optimizes the alignment score.
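Purely as an illustration of such a mixed-graph data structure (this sketch and its names are ours, not taken from [13]), a structure graph H = (V, E, A) might be stored as three separate collections:

from dataclasses import dataclass, field

@dataclass
class StructureGraph:
    """Hypothetical mixed graph H = (V, E, A) for a biopolymer structure."""
    vertices: set = field(default_factory=set)   # structural units (cores, stems, ...)
    edges: set = field(default_factory=set)      # undirected interactions between units
    arcs: set = field(default_factory=set)       # loops between consecutive units

    def add_interaction(self, u, v):
        self.vertices.update((u, v))
        self.edges.add(frozenset((u, v)))

    def add_loop(self, u, v):
        self.vertices.update((u, v))
        self.arcs.add((u, v))

# Toy usage: two stems that interact with each other, joined by a loop.
h = StructureGraph()
h.add_interaction("stem1", "stem2")
h.add_loop("stem1", "stem2")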

Problem Definition

Throughout this paper, we use the basic definitions and terminology from [1]. All graphs are simple, undirected graphs. Two graphs are isomorphic if there is a one-to-one correspondence between their vertices such that there is an edge between two vertices in one graph if and only if there is an edge between the two corresponding vertices in the other graph. An induced subgraph G' of a graph G = (V, E) consists of a vertex subset V' ⊆ V together with all edges (u, v) ∈ E such that u, v ∈ V'. A graph G12 is a common induced subgraph of two given graphs G1 and G2 if G12 is isomorphic to an induced subgraph G'1 of G1 as well as to an induced subgraph G'2 of G2. A maximum common induced subgraph (MCIS) of two given graphs G1 and G2 is a common induced subgraph G12 with the maximum number of vertices. Similarly, a maximum common edge subgraph (MCES) is a subgraph with the maximum number of edges common to the two given graphs. The MCIS (or MCES) problem can be further divided into a connected case and a disconnected case, and the different cases of the problem are useful within different biological contexts.
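To make these definitions concrete, the following illustrative sketch (our own naming, not from [1]) checks induced-subgraph isomorphism by trying all vertex bijections and determines the MCIS size of two small graphs by brute force; it is exponential and intended only to mirror the definitions above.

from itertools import combinations, permutations

def induced_edges(graph, subset):
    """Edge set of the subgraph induced by `subset` (graph: dict vertex -> set of neighbours)."""
    return {frozenset((u, v)) for u in subset for v in graph[u] if v in subset}

def isomorphic_induced(g1, s1, g2, s2):
    """Check by brute force whether s1 in g1 and s2 in g2 induce isomorphic subgraphs."""
    if len(s1) != len(s2):
        return False
    s1 = list(s1)
    e1 = induced_edges(g1, s1)
    for perm in permutations(s2):
        mapping = dict(zip(s1, perm))
        mapped = {frozenset((mapping[u], mapping[v])) for u, v in (tuple(e) for e in e1)}
        if mapped == induced_edges(g2, perm):
            return True
    return False

def mcis_size(g1, g2):
    """Number of vertices in a maximum common induced subgraph (exponential-time check)."""
    for k in range(min(len(g1), len(g2)), 0, -1):
        for s1 in combinations(g1, k):
            for s2 in combinations(g2, k):
                if isomorphic_induced(g1, s1, g2, s2):
                    return k
    return 0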

Figure 1 illustrates the MCIS of two graphs. In this figure, the maximum common induced subgraph of G1 and G2 contains four vertices (2, 3, 4 and 5), while their maximum common edge subgraph involves five vertices (1 through 5).

Figure 1

MCIS of two graphs. For G1 and G2, the maximum common induced subgraph contains four vertices, and the maximum common edge subgraph involves five vertices.

MCES can be transformed into a formulation of MCIS. Interested readers are referred to [1] for details of the transformation. Here we focus on the maximum common induced subgraph (MCIS) problem. For convenience, we call it the maximum common subgraph problem.

The maximum common subgraph problem is NP-complete [14], and therefore polynomial-time algorithms for it do not exist unless P = NP. In fact, the maximum common subgraph problem is APX-hard [15], which means that it admits no polynomial-time approximation scheme unless P = NP. This is a well-known intractable combinatorial problem, and approaches for the maximum common subgraph problem and its different variants have been intensively studied in the literature [1].

In this paper, we derive a strong lower bound for the maximum common subgraph problem in light of recent progress in parameterized computation. We then design approaches for addressing this problem.

Methods

Parameterized Computation and Recent Progress on Parameterized Intractability

Many problems with important real-world applications in the life sciences are NP-hard in the sense of the theory of NP-completeness, which excludes the possibility of solving them in polynomial time unless P = NP. For example, the problems of cleaning up data, aligning multiple sequences, finding the closest string, and identifying the maximum common substructure are all well-known NP-hard problems in bioinformatics [16–18, 1]. A number of approaches have been proposed for dealing with these NP-hard problems. For example, the highly-acclaimed approximation approach [19] tries to come up with a "good enough" solution in polynomial time instead of an optimal solution for an NP-hard optimization problem [20–23].

The theory of parameterized computation [17] is a newly developed approach introduced to address NP-hard problems with small parameters. It tries to give exact algorithms for an NP-hard problem when its natural parameter is small (even if the problem size is big). A parameterized problem Q is a decision problem consisting of instances of the form (x, k), where x is the problem description and the integer k ≥ 0 is called the parameter. The parameterized problem Q is fixed-parameter tractable [17] if it can be solved in time f(k)|x|^{O(1)}, where f is a recursive function. The class FPT contains all the problems that are fixed-parameter tractable. In this paper, we assume that complexity functions are "nice", with both the domain and range being non-negative integers and with the values of the functions and their inverses easily computable. For two functions f and g, we write f(n) = o(g(n)) if there is a nondecreasing and unbounded function λ such that f(n) ≤ g(n)/λ(n). A function f is subexponential if f(n) = 2^{o(n)}.

For a problem in the class FPT, research focuses on identifying more efficient parameterized algorithms. There are many effective techniques for designing parameterized algorithms, including the methods of "bounded search tree" and "reduction to a problem kernel". A classical example of a fixed-parameter tractable problem is the vertex cover problem.

Definition

Vertex cover problem: given a graph G and an integer k, determine if G has a vertex cover C of k vertices, i.e., a subset C of k vertices in G such that every edge in G has at least one endpoint in C. Here, the parameter is k.

Given a graph of n vertices, there is a parameterized algorithm that can solve the vertex cover problem in time O(kn + 1.286^k) [24].
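As a minimal illustration of the bounded search tree technique (a simple O(2^k) branching sketch of our own, not the refined O(kn + 1.286^k) algorithm of [24]), one can branch on an arbitrary uncovered edge, since at least one of its endpoints must be in any vertex cover:

def has_vertex_cover(edges, k):
    """Bounded search tree: is there a vertex cover of size at most k?

    edges: list of (u, v) pairs. Branch on the first remaining edge; one of its
    endpoints must belong to the cover, giving a search tree of size O(2^k).
    """
    if not edges:
        return True            # nothing left to cover
    if k == 0:
        return False           # edges remain but no budget left
    u, v = edges[0]
    for chosen in (u, v):
        remaining = [(a, b) for (a, b) in edges if chosen not in (a, b)]
        if has_vertex_cover(remaining, k - 1):
            return True
    return False

# Toy check: a path on 4 vertices has a vertex cover of size 2 but not of size 1.
assert has_vertex_cover([(1, 2), (2, 3), (3, 4)], 2)
assert not has_vertex_cover([(1, 2), (2, 3), (3, 4)], 1)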

Accompanying the work on designing efficient and practical parameterized algorithms, a theory of parameterized intractability has been developed [17]. In parameterized complexity, to classify fixed-parameter intractable problems, a hierarchy of classes (the W-hierarchy ∪_{t ≥ 0} W[t], where W[t] ⊆ W[t+1] for all t ≥ 0) has been introduced, in which the 0-th level W[0] is the class FPT. Hardness and completeness have been defined for each level W[i] of the W-hierarchy for i ≥ 1, and a large number of W[i]-hard parameterized problems have been identified [17]. For example, the clique problem is W[1]-hard.

Definition

Clique problem: given a graph G and an integer k, determine if G has a clique C of k vertices, i.e., a subset C of k vertices in G such that there is an edge in G between any two of these k vertices, i.e., the k vertices induce a complete subgraph of G. Here the parameter is k.

The clique problem can be solved in time O(n^k), based on the enumeration of all the vertex subsets of size k for a given graph with n vertices.
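This trivial enumeration can be written down directly (an illustrative sketch with our own naming):

from itertools import combinations

def has_clique(graph, k):
    """Test for a clique of size k by enumerating all k-subsets of vertices.

    graph: dict mapping each vertex to a set of its neighbours.
    At most n^k candidate subsets, each checked with O(k^2) adjacency tests.
    """
    for subset in combinations(graph, k):
        if all(v in graph[u] for u, v in combinations(subset, 2)):
            return True
    return False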

It has become commonly accepted that no W[1]-hard (and W[i]-hard, i > 1) problem can be solved in time f(k)n^{O(1)} for any function f (i.e., W[1] ≠ FPT). W[1]-hardness has served as the hypothesis for fixed-parameter intractability. An example is a recent result by Papadimitriou and Yannakakis [25], showing that the database query evaluation problem is W[1]-hard. This provides strong evidence that the problem cannot be solved by an algorithm whose running time is of the form f(k)n^{O(1)}, thus excluding the possibility of a practical algorithm for the problem even if the parameter k (the size of the query) is small, as in most practical cases.

Based on the W[1]-hardness of the clique problem, computational intractability results for problems in bioinformatics have been derived [26–31]. The authors point out that "Unless an unlikely collapse in the parameterized hierarchy occurs, the results proved in [31] that the problems longest common subsequence and shortest common supersequence are W[1]-hard rule out the existence of exact algorithms with running time f(k)n^{O(1)} (i.e., exponential only in k) for those problems. This does not mean that there are no algorithms with much better asymptotic time-complexity than the known O(n^k) algorithms based on dynamic programming, e.g., algorithms with running time n^{√k} are not deemed impossible by our results."

Recent investigation has derived stronger computational lower bounds for well-known NP-hard parameterized problems [32, 33]. For example, for the clique problem – which asks if a given graph of n vertices has a clique of size k – it is proved that unless an unlikely collapse occurs in parameterized complexity theory, the problem is not solvable in time f(k)n^{o(k)} for any function f. Note that this lower bound is asymptotically tight in the sense that the trivial algorithm that enumerates all subsets of k vertices in a given graph to test the existence of a clique of size k runs in time O(n^k).

Based on the hardness of the clique problem, lower bound results for a number of bioinformatics problems have been derived [34]. For example, our results for the longest common subsequence and shortest common supersequence problems significantly strengthen the results in [31] and advance the understanding of the complexity of these problems: we show that it is unlikely that the problems can be solved in time n^{γ(k)} for any sublinear function γ(k), and that the known dynamic programming algorithms of running time O(n^k) for the problems are therefore asymptotically optimal.

In the following section, we derive the lower bound for exact algorithms of the maximum common subgraph problem.

Lower Bound for Maximum Common Subgraph Problem

The formal parameterized version of the maximum common subgraph problem is described below; we choose the number of vertices in the common subgraph as the parameter. Based on a reduction from the parameterized clique problem to the parameterized common subgraph problem, we derive the hardness result for the parameterized common subgraph problem.

An NP optimization problem Q is a four-tuple (IQ, SQ, fQ, optQ) [19], where:

1. IQ is the set of input instances; it is recognizable in polynomial time;

2. For each instance x ∈ IQ, SQ(x) is the set of feasible solutions for x, which is defined by a polynomial p and a polynomial-time computable predicate π (p and π depend only on Q): SQ(x) = {y : |y| ≤ p(|x|) and π(x, y)};

3. fQ(x, y) is the objective function mapping a pair x ∈ IQ and y ∈ SQ(x) to a non-negative integer; the function fQ is computable in polynomial time;

4. optQ ∈ {max, min}. Q is called a maximization problem if optQ = max, and a minimization problem if optQ = min.

An NP optimization problem Q can be parameterized in a natural way as follows [35, 32]:

Definition

Let Q = (IQ, SQ, fQ, optQ) be an NP optimization problem. The parameterized version of Q is defined as:

1. If Q is a maximization problem, then the parameterized version of Q is defined as Q = {(x, k) | x ∈ IQ ∧ optQ(x) ≥ k};

2. If Q is a minimization problem, then the parameterized version of Q is defined as Q = {(x, k) | x ∈ IQ ∧ optQ(x) ≤ k}.

We now provide the definitions of the maximum common subgraph problem and the parameterized common subgraph problem.

Definition

Maximum common subgraph problem:

Input: two graphs G1 = (V1, E1) and G2 = (V2, E2).

Output: the maximum common vertex-induced subgraph of the two graphs G1 and G2.

Definition

Parameterized common subgraph problem:

Input: two graphs G1 = (V1, E1) and G2 = (V2, E2), and a positive integer k;

Parameter: k;

Output: "Yes", if there is a common vertex-induced subgraph of k vertices, i.e., a common subgraph of size k of the two graphs G1 and G2. Otherwise, output "No".

Lemma 1

The parameterized common subgraph problem is W[1]-hard.

Proof: We will give an FPT-reduction from clique to the parameterized common subgraph problem as follows.

Given an instance (G, k) of the clique problem, where the graph G has n vertices and k is a positive integer, we construct an instance of the parameterized common subgraph problem as follows: let G1 be the graph G, and let G2 be a complete graph of k vertices. The question then becomes "Is there a common vertex-induced subgraph of k vertices for the graphs G1 and G2?"

We can verify that the graph G has a clique of size k if and only if the graphs G1 and G2 have a common subgraph of k vertices. Since the reduction can be finished in polynomial time O(nk), it is an FPT-reduction from the clique problem to the parameterized common subgraph problem.
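The construction used in this proof is simple enough to write out (a sketch with our own naming): G1 is the input graph G itself and G2 is the complete graph on k vertices.

def clique_to_common_subgraph(graph, k):
    """FPT-reduction from clique to the parameterized common subgraph problem.

    graph: dict mapping each vertex to a set of its neighbours.
    Returns (G1, G2, k): G has a k-clique iff G1 and G2 share a common
    induced subgraph on k vertices.
    """
    g1 = {u: set(nbrs) for u, nbrs in graph.items()}             # G1 = G
    g2 = {i: {j for j in range(k) if j != i} for i in range(k)}  # G2 = complete graph on k vertices
    return g1, g2, k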

To prove our main result, we use the definitions of linear FPT-reduction and Wl[1]-hardness [36]:

Definition

A parameterized problem Q is linear FPT-reducible, or more precisely FPTl-reducible, to a parameterized problem Q' if there exist a function f and an algorithm A of running time f(k)n^{O(1)} that, on each (k, n)-instance x of Q, produces a (k', n')-instance x' of Q', where k' = O(k), n' = n^{O(1)}, and x is a yes-instance of Q if and only if x' is a yes-instance of Q'.

Linear FPT-reduction has the transitivity property [36, 34]. The transitivity of the FPTl-reduction is proved in the following lemma:

Lemma 2

Let Q1, Q2 and Q3 be three parameterized problems. If Q1 is FPTl-reducible to Q2, and Q2 is FPTl-reducible to Q3, then Q1 is FPTl-reducible to Q3.

Proof: If Q1 is FPTl-reducible to Q2, then there exist a function f1 and an algorithm A1 of running time f1(k1)n1^{o(k1)}m1^{O(1)}, such that for each (k1, n1, m1)-instance x1 of Q1, the algorithm A1 produces a (k2, n2, m2)-instance x2 of Q2, where n2 = n1^{O(1)}, m2 = m1^{O(1)}, and k2 = c1k1 for a constant c1.

If Q2 is FPTl-reducible to Q3, then there exist a function f2 and an algorithm A2 of running time f2(k2)n2^{o(k2)}m2^{O(1)}, such that on each (k2, n2, m2)-instance x2 of Q2, the algorithm A2 produces a (k3, n3, m3)-instance x3 of Q3, where k3 = O(k2), n3 = n2^{O(1)}, m3 = m2^{O(1)}.

We now have an algorithm A that reduces Q1 to Q3, as follows. For a given (k1, n1, m1)-instance x1 of Q1, A first calls the algorithm A1 on x1 to construct a (k2, n2, m2)-instance x2 of Q2, where k2 = c1k1, n2 = n1^{O(1)}, and m2 = m1^{O(1)}. Then A calls the algorithm A2 on x2 to construct a (k3, n3, m3)-instance x3 of Q3. It follows that x3 is a yes-instance of Q3 if and only if x1 is a yes-instance of Q1. Moreover, from k2 = c1k1 and k3 = O(k2), we have k3 = O(k1); and from n2 = n1^{O(1)}, m2 = m1^{O(1)}, n3 = n2^{O(1)}, m3 = m2^{O(1)}, we get n3 = n1^{O(1)} and m3 = m1^{O(1)}. Finally, since the invocation of algorithm A1 on x1 takes time f1(k1)n1^{o(k1)}m1^{O(1)}, the invocation of algorithm A2 on x2 takes time f2(k2)n2^{o(k2)}m2^{O(1)}, and k2 = c1k1, n2 = n1^{O(1)}, m2 = m1^{O(1)}, we conclude that the running time of algorithm A is bounded by f(k1)n1^{o(k1)}m1^{O(1)}, where f(k1) = f1(k1) + f2(c1k1). By definition, A is an FPTl-reduction from Q1 to Q3; i.e., Q1 is FPTl-reducible to Q3.

Definition

A parameterized problem Q is W[1]-hard under the FPTl-reduction, or more precisely Wl[1]-hard, if the Weighted antimonotone CNF 2SAT problem (abbreviated wcnf-2sat−) is FPTl-reducible to Q.

In particular, it has been shown [32, 33] that the clique problem is Wl[1]-hard.

Lemma 3

(From Theorem 5.2 of [33]) Unless all SNP problems are solvable in subexponential time, no Wl[1]-hard problem can be solved in time f(k)n^{o(k)} for any recursive function f.

Note that Papadimitriou and Yannakakis [30] introduced the class SNP, which contains many well-known NP-hard problems. Some of these problems have been major targets in the study of exact algorithms, but have so far resisted all efforts to develop subexponential-time algorithms for them. Thus, it has been commonly agreed that it is unlikely that all SNP problems are solvable in subexponential time. A recent result showed the equivalence between the statement that "all SNP problems are solvable in subexponential time" and the collapse of a parameterized class called Mini[1] [37] to FPT, which is also considered an unlikely collapse in parameterized computation.

Lemma 4

The parameterized common subgraph problem is Wl[1]-hard.

Proof: Referring to the proof of Lemma 1, the reduction from the clique problem to the parameterized common subgraph problem is a linear FPT-reduction.

Based on the transitivity property of the linear FPT-reduction (Lemma 2) and the fact that the clique problem is Wl[1]-hard, the parameterized common subgraph problem is Wl[1]-hard. Consequently, by Lemma 3, it cannot be solved in time f(k)n^{o(k)}, where k is the number of vertices in the common subgraph and f is any recursive function, unless some unlikely collapse (Mini[1] = FPT) occurs in parameterized computation.

From Lemma 4 and Lemma 3, we have the following theorem:

Theorem

Given two graphs G1 and G2, each having n vertices, there is no algorithm of time f(k)n^{o(k)} for the parameterized common subgraph problem, where k is the number of vertices in the common subgraph and f is any recursive function, unless some unlikely collapse (Mini[1] = FPT) occurs in parameterized computation.

Taking the upper-bound result below into consideration, we now show that the lower-bound result for the maximum common subgraph problem presented here is asymptotically tight.

Upper Bound – Clique Based Approaches

The following approach for the maximum common subgraph problem is based on the reduction [15, 1] from a maximum common subgraph problem to the maximum clique problem.

From two graphs G1 = (V1, E1) and G2 = (V2, E2), a new graph G = (V, E) is derived as follows. Let V = V1 × V2 and call V a set of pairs. Call two pairs <u1, u2> and <v1, v2> compatible if u1 ≠ v1 and u2 ≠ v2 and if they preserve the edge relation, that is, there is an edge between u1 and v1 if and only if there is an edge between u2 and v2. Let E be the set of edges between compatible pairs. A k-clique in the new graph G can be interpreted as a matching between two induced k-vertex subgraphs; the two subgraphs are isomorphic since the compatible pairs preserve the edge relations. The new graph G is called the modular product graph of the two graphs G1 and G2.
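A direct sketch of this construction (our own helper names; the input graphs are assumed to be given as adjacency sets):

from itertools import combinations

def modular_product(g1, g2):
    """Modular product of two graphs given as dicts vertex -> set of neighbours.

    Vertices are pairs (u1, u2); two pairs <u1, u2>, <v1, v2> are adjacent iff
    u1 != v1, u2 != v2, and (u1, v1) is an edge of g1 exactly when (u2, v2) is an edge of g2.
    """
    vertices = [(u1, u2) for u1 in g1 for u2 in g2]
    edges = set()
    for (u1, u2), (v1, v2) in combinations(vertices, 2):
        if u1 != v1 and u2 != v2 and ((v1 in g1[u1]) == (v2 in g2[u2])):
            edges.add(frozenset(((u1, u2), (v1, v2))))
    return vertices, edges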

We suppose n = |V1| = |V2| (the analysis for the case |V1| ≠ |V2| is similar and is thus omitted). From the construction of G, we have |V| = n^2. By a close observation of the new graph G, we can see that G is an n-partite graph, where the vertices are partitioned into n disjoint partitions with n vertices in each partition.

We may use a matrix to denote the n^2 vertices of the n-partite graph with n vertices in each partition.

v{1,1}, v{1,2}, ..., v{1,n}

v{2,1}, v{2,2}, ..., v{2,n}

... ...

v{n,1}, v{n,2}, ..., v{n,n}

The n vertices of the first row v{1,i}, 1 ≤ i ≤ n, belong to partition one of the n-partite graph. The n vertices of the second row v{2,i}, 1 ≤ i ≤ n, belong to partition two, and so on.

There is no edge between any two vertices within the same partition; edges only appear between vertices in two different partitions. So at most one vertex from each partition (of its n vertices) can be in a clique of the graph. Therefore, to find a clique of size k, there are n^k possible ways of choosing the clique vertices. For each choice, the algorithm needs O(k^2) time to check whether it forms a clique of size k. This gives an algorithm of time O(n^k k^2) for the maximum common subgraph problem. We call this algorithm ALG-COMMON SUBGRAPH for convenience in the following discussion.
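A sketch of ALG-COMMON SUBGRAPH along these lines (our own formulation): choose k of the n partitions, try one vertex from each, and spend O(k^2) time per candidate checking pairwise adjacency.

from itertools import combinations, product

def find_clique_npartite(partitions, adjacent, k):
    """Search an n-partite graph for a clique of size k.

    partitions: list of lists, one list of vertices per partition.
    adjacent(u, v): returns True iff u and v are joined by an edge.
    Tries one vertex from each of k chosen partitions; the checking cost per
    candidate is O(k^2), in line with the O(n^k k^2) bound discussed above.
    """
    for rows in combinations(partitions, k):
        for candidate in product(*rows):
            if all(adjacent(u, v) for u, v in combinations(candidate, 2)):
                return list(candidate)
    return None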

This problem – when the maximum clique size k is equal to n – has been studied by Sze et al [38]:

Definition

Given an n-partite graph G with n vertices in each partition, the n-CLIQUEnp problem asks to find an n-clique in the graph G.

For this problem, they developed a fast and exact divide-and-conquer approach. The basic idea of this novel approach is to subdivide the given n-partite graph into several n0-partite subgraphs with n0 < n and solve each smaller subproblem independently using a branch-and-bound approach as long as the number of cliques of size n0 in each subproblem is not too high. The reader is referred to [38] for the details of this divide-and-conquer approach. However, their approach in the worst case still has the same upper bound.

Given this O(n^k k^2)-time algorithm for the maximum common subgraph problem, the lower bound result of our Theorem is asymptotically tight.

When the number of vertices k in the common subgraph is not very far from n, we write k = n − c, where c is a constant. We illustrate the basic idea for c = 1 as follows [39]. Suppose the n-partite graph G has a clique C of size n − 1. We add one more vertex to each of the n partitions, and we add edges from each new vertex to all vertices (except the other newly added vertices) that are not in the same partition. We thus obtain a new graph G', an n-partite graph with n + 1 vertices in each partition. The new graph G' has a clique C' of size n if and only if the original n-partite graph G has a clique of size n − 1; the vertices of this clique C' are the vertices of the original clique C together with one newly added vertex.

For the newly constructed graph G', we can apply the algorithm ALG-COMMON SUBGRAPH without any change, which takes time O((n+1)^n n^2). After we find the clique C', we simply remove the newly added vertex and return the other vertices of C'.

Similarly, to find a clique of size k = n − c in the n-partite graph G, where c is a positive integer constant, we can add c new vertices to each partition together with the associated edges as described above, and then apply the algorithm ALG-COMMON SUBGRAPH, which runs in time O((n+c)^n n^2).

This simple way of dealing with cliques of size less than n is useful because it makes the algorithm ALG-COMMON SUBGRAPH work uniformly for finding cliques of different sizes in n-partite graphs. We now give the following algorithm for finding cliques of size k − c.

Algorithm for (K-C)-CLIQUE

INPUT: an n-partite graph G, with n vertices in each partition, and a small constant c, where c is a positive integer;

OUTPUT: a clique of size no less than k – c;

Step 1: For i = 0 to c do

  • Step 1.1: Construct a new graph G1 by adding i new vertices to each partition of the graph G and adding edges from each new vertex to all vertices (except the newly added vertices) that are not in the same partition.

  • Step 1.2: Apply the algorithm ALG-COMMON SUBGRAPH on the graph G1.

  • Step 1.3: If a clique C1 is found, then return "a clique C of size k – i has been found" (C is constructed by removing all the newly-added vertices from the clique C1).

  • Endfor

Step 2: Return "no clique has been found".
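A sketch of this padding idea (our own helper names; for simplicity each padding vertex is here made adjacent to every vertex outside its own partition, including other padding vertices, which coincides with the construction above when c = 1):

from itertools import combinations, product

def is_pad(v):
    return isinstance(v, tuple) and len(v) == 3 and v[0] == "pad"

def find_large_clique(partitions, adjacent, c):
    """Padding sketch: find a clique of size at least n - c in an n-partite graph.

    partitions: list of n lists of vertices (one list per partition of G);
    adjacent(u, v): edge test for the original graph G.
    For i = 0, 1, ..., c: add i padding vertices to every partition, look for a
    clique containing one vertex per partition, then strip the padding vertices.
    """
    n = len(partitions)
    for i in range(c + 1):
        padded = [list(p) + [("pad", r, j) for j in range(i)] for r, p in enumerate(partitions)]

        def adj(u, v):
            # A padding vertex is adjacent to every candidate vertex from another
            # partition; candidates below always come from different partitions.
            return True if is_pad(u) or is_pad(v) else adjacent(u, v)

        for candidate in product(*padded):               # one vertex from each partition
            if all(adj(u, v) for u, v in combinations(candidate, 2)):
                return [v for v in candidate if not is_pad(v)]   # clique of size n - i
    return None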

We now propose two approaches for the maximum common subgraph problem which are based on the relationship between the vertex cover problem and the clique problem:

Algorithm 1: ALG-APPROX-CLIQUE

INPUT: two graphs G1 = (V1, E1) and G2 = (V2, E2);

OUTPUT: a clique of the modular product graph G of G1 and G2.

Step 1. Compute the complement graph G' of the modular product graph G = (V, E) of graph G1 and G2;

Step 2. Apply the approximation algorithm for the vertex cover problem to G' to get a vertex cover C;

Step 3. Return V – C as the clique vertex set.

ALG-APPROX-CLIQUE gives an approximate solution for the maximum common subgraph problem in polynomial time. This approach uses the following approximation algorithm for the vertex cover problem, with approximation ratio 2, from [40]:

ALG-APPROX-VERTEX COVER

INPUT: a graph G = (V, E);

OUTPUT: a vertex cover C of approximation ratio 2 for the graph G.

Step 1. C ← ∅;

Step 2. E' ← E(G);

Step 3. While E' ≠ ∅

  • Step 3.1. Let (u, v) be an arbitrary edge of E';

  • Step 3.2. C ← C ∪ {u, v};

  • Step 3.3. Remove from E' every edge incident on either u or v;

Step 4. Return C as the vertex cover set.

In this algorithm, ALG-APPROX-VERTEX COVER repeatedly selects an arbitrary edge (u, v) of E' and adds both of its endpoints to C. Removing from E' every edge covered by u or v and repeating until E' is empty gives a running time of O(|V| + |E|).
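The two algorithms can be sketched together as follows (our own helper names): the 2-approximation for vertex cover, and ALG-APPROX-CLIQUE using it on the complement of the product graph. Note that the factor-2 guarantee applies to the vertex cover of the complement; the set V − C returned for G is a clique, but it carries no approximation guarantee on its size.

def approx_vertex_cover(edges):
    """ALG-APPROX-VERTEX COVER sketch: repeatedly take both endpoints of an uncovered edge."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # (u, v) is still uncovered
            cover.update((u, v))
    return cover

def approx_clique(vertices, edges):
    """ALG-APPROX-CLIQUE sketch: vertex cover of the complement graph, then return V - C."""
    vertices = list(vertices)
    edge_set = {frozenset(e) for e in edges}
    complement_edges = [(u, v) for i, u in enumerate(vertices) for v in vertices[i + 1:]
                        if frozenset((u, v)) not in edge_set]
    cover = approx_vertex_cover(complement_edges)
    return set(vertices) - cover   # an independent set of the complement, i.e. a clique of G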

Algorithm 2: ALG-EXACT-MAXCLIQUE

INPUT: two graphs G1 = (V1, E1) and G2 = (V2, E2);

OUTPUT: a maximum clique of the modular product graph G of G1 and G2.

Step 1. Compute the complement graph G' of the modular product graph G = (V, E) of graph G1 and G2;

Step 2. Apply the parameterized exact algorithm for the Vertex Cover problem on G' and compute the minimum vertex cover C0.

Step 3. Return the maximum clique with the vertex set V – C0.

Alternatively, in Step 2 ALG-EXACT-MAXCLIQUE can apply the current best parameterized algorithm for vertex cover [24], which runs in time O(kn + 1.286^k). By running this vertex cover algorithm at most n times, we obtain the minimum vertex cover of the complement graph G'.
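A sketch of this exact pipeline (our own naming), with a simple exact minimum vertex cover search standing in for the O(kn + 1.286^k) algorithm of [24], which we do not reproduce here:

from itertools import combinations

def min_vertex_cover(vertices, edges):
    """Exact minimum vertex cover by trying k = 0, 1, 2, ... (a brute-force stand-in for [24])."""
    for k in range(len(vertices) + 1):
        for subset in combinations(vertices, k):
            chosen = set(subset)
            if all(u in chosen or v in chosen for u, v in edges):
                return chosen
    return set(vertices)

def exact_max_clique(vertices, edges):
    """ALG-EXACT-MAXCLIQUE sketch: minimum vertex cover of the complement, then return V - C0."""
    vertices = list(vertices)
    edge_set = {frozenset(e) for e in edges}
    complement_edges = [(u, v) for i, u in enumerate(vertices) for v in vertices[i + 1:]
                        if frozenset((u, v)) not in edge_set]
    cover = min_vertex_cover(vertices, complement_edges)
    return set(vertices) - cover   # a maximum clique of the graph (V, E)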

Results

In this paper we investigated a lower-bound result for the maximum common subgraph problem. We proved that it is unlikely that there is an algorithm of time f(k)n^{o(k)} for the problem, where k is the number of vertices in the common subgraph and f is any recursive function. We then presented an upper bound: an algorithm that solves the problem in O(n^k k^2) time, where k is the number of vertices in the common subgraph. In view of this upper bound, our lower-bound result for the maximum common subgraph problem is asymptotically tight.

Conclusion

Parameterized computation is a viable approach with great potential for investigating many applications within bioinformatics, such as the maximum common subgraph problem studied in this paper. With the improved hardness result and the approaches proposed in this paper, future research can focus on further exploration of efficient approaches for different variants of this problem within the constraints imposed by real applications.