Introduction

Graph traversal edit distance (GTED) [1] is an elegant measure of the similarity between the strings represented by edge-labeled Eulerian graphs. For example, given two de Bruijn assembly graphs [2], computing GTED between them measures the similarity between two genomes without the computationally intensive and possibly error-prone process of assembling the genomes. Using an estimation of GTED between assembly graphs of Hepatitis B viruses, Ebrahimpour Boroojeny et al. [1] group the viruses into clusters consistent with their taxonomy. This can be extended to inferring phylogeny relationships in metagenomic communities or comparing heterogeneous disease samples such as cancer. There are several other methods to compute a similarity measure between strings encoded by two assembly graphs [3,4,5,6]. GTED has the advantage that it does not require prior knowledge on the type of the genome graph or the complete sequence of the input genomes. The input to the GTED problem is two unidirectional, edge-labeled Eulerian graphs, which are defined as:

Definition 1

(Unidirectional, edge-labeled Eulerian Graph). A unidirectional, edge-labeled Eulerian graph is a connected directed graph \(G = (V, E, \ell , \Sigma )\), with node set V, edge multi-set E, constant-size alphabet \(\Sigma\), and single-character edge labels \(\ell : E\rightarrow \Sigma\), such that G contains an Eulerian trail that traverses every edge \(e\in E\) exactly once. The unidirectional condition means that all edges between the same pair of nodes are in the same direction.

Such graphs arise in genome assembly problems (e.g. the de Bruijn subgraphs). Computing GTED is the problem of computing the minimum edit distance between the two most similar strings represented by Eulerian trails in each input graph. A trail in a graph is a walk that contains distinct edges and may contain repeated nodes.

Problem 1

(Graph Traversal Edit Distance (\(\textsc {GTED}\)) [1]). Given two unidirectional, edge-labeled Eulerian graphs \(G_1\) and \(G_2\), compute

$$\begin{aligned} \textsc {GTED} (G_1,G_2) \triangleq \min _{\begin{array}{c} t_1\in \text {trails}(G_1)\\ t_2\in \text {trails}(G_2) \end{array}} \text {edit}(\text {str}(t_1), \text {str}(t_2)). \end{aligned}$$
(1)

Here, \(\text {trails}(G)\) is the collection of all Eulerian trails in graph G, \(\text {str}(t)\) is a string constructed by concatenating labels on the Eulerian trail \(t=(e_0,e_1,\dots ,e_n)\), and \(\text {edit}(s_1, s_2)\) is the edit distance between strings \(s_1\) and \(s_2\).

Ebrahimpour Boroojeny et al. [1] claim that GTED is polynomially solvable by proposing an integer linear programming (ILP) formulation of GTED and arguing that the constraints of the ILP make it polynomially solvable. This result, however, conflicts with several complexity results on string-to-graph matching problems. Kupferman and Vardi [7] show that it is NP-complete to determine if a string exactly matches an Eulerian tour in an edge-labeled Eulerian graph. Additionally, Jain et al. [8] show that it is NP-complete to compute an edit distance between a string and strings represented by a labeled graph if edit operations are allowed on the graph. On the other hand, polynomial-time algorithms exist to solve string-to-string alignment [9] and string-to-graph alignment [8] when edit operations on graphs are not allowed.

We resolve the conflict among the results on complexity of graph comparisons by revisiting the complexity of and the proposed solutions to GTED. We prove that computing GTED is NP-complete by reducing from the hamiltonian path problem, reaching an agreement with other related results on complexity. Further, we point out with a counter-example that the optimal solution of the ILP formulation proposed by Ebrahimpour Boroojeny et al. [1] does not solve GTED.

We give two ILP formulations for GTED. The first ILP has an exponential number of constraints and can be solved by subtour elimination iteratively [10, 11]. The second ILP has a polynomial number of constraints and shares a similar high-level idea of the global ordering approach [11] in solving the traveling salesman problem [12].

In Qiu and Kingsford [13], Flow-GTED (FGTED), a variant of GTED is proposed to compare two sets of strings (instead of two strings) encoded by graphs. \(\textsc {FGTED}\) is equal to the edit distance between the most similar sets of strings spelled by the decomposition of flows between a pair of predetermined source and sink nodes. The similarity between the sets of strings reconstructed from the flow decomposition is measured by the Earth Mover’s Edit Distance [13, 14]. \(\textsc {FGTED}\) is used to compare pan-genomes, where both the frequency and content of strings are essential to represent the population of organisms. Qiu and Kingsford [13] reduce \(\textsc {FGTED}\) to \(\textsc {GTED}\), and via the claimed polynomial-time algorithm of \(\textsc {GTED}\), argued that \(\textsc {FGTED}\) is also polynomially solvable. We show that this claim is false by proving that \(\textsc {FGTED}\) is also NP-complete.

While the optimal solution to the ILP proposed in Ebrahimpour Boroojeny et al. [1] does not solve GTED, it does compute a lower bound to GTED that we call closed-trail cover traversal edit distance (\(\textsc {CCTED}\)). We characterize the cases when GTED is equal to \(\textsc {CCTED}\). In addition, we point out that solving this ILP formulation finds a minimum-cost matching between closed-trail decompositions in the input graphs, which may be used to compute the similarity between repeats in the genomes. Ebrahimpour Boroojeny et al. [1] claim their proposed ILP formulation is solvable in polynomial time by arguing that the constraint matrix of the linear relaxation of the ILP is always totally unimodular. We show that this claim is false by proving that the constraint matrix is not always totally unimodular and showing that there exists optimal fractional solutions to its linear relaxation. We prove that, in fact, \(\textsc {CCTED}\) is also NP-complete.

We evaluate the efficiency of solving ILP formulations for GTED and \(\textsc {CCTED}\) on simulated genomic strings and show that it is impractical to compute GTED or CCTED on larger genomes.

In summary, we revisit two important problems in genome graph comparisons: Graph Traversal Edit Distance (GTED) and its variant FGTED. We show that both GTED and FGTED are NP-complete, and provide the first correct ILP formulations for GTED. We also show that the ILP formulation proposed by Ebrahimpour Boroojeny et al. [1], i.e. \(\textsc {CCTED}\), is a lower bound to GTED and is also NP-complete. We evaluate the efficiency of the ILPs for GTED and \(\textsc {CCTED}\) on genomic sequences. These results provide solid algorithmic foundations for continued algorithmic innovation on the task of comparing genome graphs and point to the direction of heuristics that estimate GTED and CCTED efficiently.

GTED and FGTED are NP-complete

Conflicting results on computational complexity of GTED and string-to-graph matching

The natural decision versions of all of the computational problems described above and below are clearly in NP. Under the assumption that \(\text {P} \ne \text {NP}\), the results on the computational complexity of GTED and string-to-graph matching claimed in Boroojeny et al. [1] and Kupferman and Vardi [7], respectively, cannot be both true.

Kupferman and Vardi [7] show that the problem of determining if an input string can be spelled by concatenating edge labels in an Eulerian trail in an input graph is NP-complete. We call this problem eulerian trail equaling word. We show in Theorem 1 that we can reduce \(\textsc {ETEW}\) to \(\textsc {GTED}\), and therefore if \(\textsc {GTED}\) is polynomially solvable, then \(\textsc {ETEW}\) is polynomially solvable. The complete proof is in Appendix "Reduction from Hamiltonian Path to GTED".

Problem 2

(Eulerian Trail Equaling Word [7]). Given a string \(s\in \Sigma ^*\), an edge-labeled Eulerian graph G, find an Eulerian trail t of G such that \(\text {str}(t) = s\).

Theorem 1

If \(\textsc {GTED} \in \text {P}\) then \(\textsc {ETEW} \in \text {P}\).

Proof sketch

We first convert an input instance \(\langle s,G\rangle\) of \(\textsc {ETEW}\) into an input instance \(\langle G_1, G_2 \rangle\) to \(\textsc {GTED}\) by (a) creating graph \(G_1\) that only contains edges that reconstruct string s and (b) modifying G into \(G_2\) by extending the anti-parallel edges so that \(G_2\) is unidirectional. We show that if \(\textsc {GTED} (G_1, G_2) = 0\), there must be an Eulerian trail in G that spells s, and if \(\textsc {GTED} (G_1, G_2) > 0\), G must not contain an Eulerian trail that spells s. \(\square\)

Hence, an (assumed) polynomial-time algorithm for \(\textsc {GTED}\) solves \(\textsc {ETEW}\) in polynomial time. This contradicts Theorem 6 of Kupferman and Vardi [7] of the NP-completeness of \(\textsc {ETEW}\) (under \(\text {P} \ne \text {NP}\)).

Reduction from Hamiltonian Path to GTED and FGTED

We resolve the contradiction by showing that GTED is NP-complete. The details of the proof are in Appendix Reduction from Hamiltonian Path to GTED.

Theorem 2

\(\textsc {GTED}\) is NP-complete.

Proof sketch

We reduce from the hamiltonian path problem, which asks whether a directed, simple graph G contains a path that visits every vertex exactly once. Here simple means no self-loops or parallel edges. The reduction is almost identical to that presented in Kupferman and Vardi [7], and from here until noted later in the proof the argument is identical except for the technicalities introduced to force unidirectionality (and another minor change described later).

Let \(\langle G=(V,E)\rangle\) be an instance of hamiltonian path, with \(n=|V|\) vertices. We first create the Eulerian closure of G, which is defined as \(G'=(V', E')\) where

$$\begin{aligned} V' = \{v^{in}, v^{out} : v \in V\} \cup \{w\}. \end{aligned}$$
(2)

Here, each vertex in V is split into \(v^{in}\) and \(v^{out}\), and w is a newly added vertex. \(E'\) is the union of the following sets of edges and their labels:

  • \(E_1 = \{(v^{in}, v^{out}): v \in V\}\), labeled a,

  • \(E_2 = \{(u^{out}, v^{in}): (u,v) \in E\}\), labeled b,

  • \(E_3 = \{(v^{out}, v^{in}): v \in V\}\), labeled c,

  • \(E_4 = \{(v^{in}, u^{out}): (u,v) \in E\}\), labeled c,

  • \(E_5 = \{(u^{in},w): u \in V\}\), labeled c,

  • \(E_6 = \{(w,u^{in}): u \in V\}\), labeled b.

\(G'\) is an Eulerian graph by construction but contains anti-parallel edges. We further create \(G''\) from \(G'\) by adding dummy nodes so that each pair of antiparallel edges is split into two parallel, length-2 paths with labels x#, where x is the original label.

We also create a graph C that has the same number of edges as \(G''\) and spells out a string

$$\begin{aligned} q = \texttt {a\#}(\texttt {b\#a\#})^{n-1}(\texttt {c\#})^{2n-1}(\texttt {c\#b\#})^{|E| + 1}. \end{aligned}$$
(3)

We then argue that G has a Hamiltonian path if and only if \(G''\) spells out the string q, which uses the same line of arguments and graph traversals as in Kupferman and Vardi [7]. We then show that \(\textsc {GTED} (G'', C) = 0\) if and only if \(G''\) spells q. \(\square\)

Following a similar argument, we show that FGTED is also NP-complete, and its proof is in Appendix FGTED is NP-complete.

Theorem 3

\(\textsc {FGTED}\) is NP-complete.

Revisiting the correctness of the proposed ILP solutions to GTED

In this section, we revisit two proposed ILP solutions to GTED by Boroojeny et al. [1] and show that the optimal solution to these ILP is not always equal to GTED.

Alignment graph

The previously proposed ILP formulations for GTED are based on the alignment graph constructed from input graphs. The high-level concept of an alignment graph is similar to the dynamic programming matrix for the string-to-string alignment problem [9].

Fig. 1
figure 1

a An example of two edge labeled Eulerian graphs \(G_1\) (top) and \(G_2\) (bottom). b The alignment graph \(\mathcal {A} (G_1,G_2)\). The cycle with red edges is the path corresponding to \(\textsc {GTED} (G_1, G_2)\). Red solid edges are matches with cost 0 and red dashed-line edge is mismatch with cost 1

Definition 2

(Alignment graph). Let \(G_1,\,G_2\) be two unidirectional, edge-labeled Eulerian graphs. The alignment graph \(\mathcal {A} (G_1,G_2) = (V, E, \delta )\) is a directed graph that has vertex set \(V = V_1 \times V_2\) and edge multi-set E that equals the union of the following:

Vertical edges:

\([(u_1,u_2),(v_1,u_2)]\) for \((u_1,v_1) \in E_1\) and \(u_2 \in V_2\),

Horizontal edges:

\([(u_1,u_2),(u_1,v_2)]\) for \(u_1 \in V_1\) and \((u_2,v_2) \in E_2\),

Diagonal edges:

\([(u_1,u_2),(v_1,v_2)]\) for \((u_1,v_1) \in E_1\) and \((u_2,v_2) \in E_2\).

Each edge is associated with a cost by the cost function \(\delta :E\rightarrow \mathbb {R}\).

Each diagonal edge \(e=[(u_1, u_2),(v_1, v_2)]\) in an alignment graph can be projected to \((u_1, v_1)\) and \((u_2, v_2)\) in \(G_1\) and \(G_2\), respectively. Similarly, each vertical edge can be projected to one edge in \(G_1\), and each horizontal edge can be projected to one edge in \(G_2\).

An example of an alignment graph is shown in Fig. 1b. The horizontal edges correspond to gaps in strings represented by \(G_1\), vertical edges correspond to gaps in strings represented by \(G_2\), and diagonal edges correspond to the matching between edge labels from the two graphs. In the rest of this paper, we assume that the costs for horizontal and vertical edges are 1, and the costs for the diagonal edges are 1 if the diagonal edge represents a mismatch and 0 if it is a match. The cost function \(\delta\) can be defined to capture the cost of matching between edge labels or inserting gaps. This definition of alignment graph is also a generalization of the alignment graph used in string-to-graph alignment [8].

We define that the edge projection function, \(\pi _i\), projects an edge from the alignment graph to an edge in the input graph \(G_i\). If the alignment edge is a vertical or horizontal edge, it is projected to one edge in only one input graph. We also define that the path projection function, \(\Pi _i\), projects a trail in the alignment graph to a trail in the input graph \(G_i\). For example, let a trail in the alignment graph be \(p = (e_1, e_2,\dots ,e_m)\), and \(\Pi _i(p) = (\pi _i(e_1), \pi _i(e_2),\dots ,\pi _i(e_m))\) is a trail in \(G_i\).

The first previously proposed ILP for GTED

Lemma 1 in Ebrahimpour Boroojeny et al. [1] provides a model for computing GTED by finding the minimum-cost trail in the alignment graph. We reiterate it here for completeness.

Lemma 1

[1]. For any two edge-labeled Eulerian graphs \(G_1\) and \(G_2\),

$$\begin{aligned} \begin{aligned} \textsc {GTED} (G_1, G_2) = \text {minimize}_c \quad&\delta (c)\\ \text {subject to}\quad&c\text { is a trail in }\mathcal {A} (G_1, G_2),\\&\Pi _i(c)\text { is an Eulerian trail in } G_i \text { for } i=1,2, \end{aligned} \end{aligned}$$
(4)

where \(\delta (c)\) is the total edge cost of c, and \(\Pi _i(c)\) is the projection from c to \(G_i\).

An example of such a minimum-cost trail is shown in Fig. 1b for input graphs in Fig. 1a. Ebrahimpour Boroojeny et al. [1] provide the following ILP formulation and claim that it is a direct translation of Lemma 1:

$$\begin{aligned} \underset{x\in \mathbb {N}^{|E|}}{\text {minimize}}\quad&\sum _{e \in E} x_e\delta (e) \end{aligned}$$
(5)
$$\begin{aligned} \text {subject to}\quad&Ax = 0 \end{aligned}$$
(6)
$$\begin{aligned}{}&\sum _{e\in E} x_eI_i(e, f) = 1\quad \text {for }i=1,2\text { and for all }f\in E_i, \end{aligned}$$
(7)

where

$$\begin{aligned} A_{ue} = {\left\{ \begin{array}{ll} -1\quad \text {if }e=(u,v)\in E\text { for some vertex }v\in V\\ 1\quad \text {if }e=(v,u)\in E\text { for some }u\in V\\ 0 \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(8)

Here, E is the edge set of \(\mathcal {A} (G_1, G_2)\). A is the negative incidence matrix of \(\mathcal {A} (G_1, G_2)\) of size \(|V|\times |E|\). \(I_i(e,f)\) is an indicator function that is 1 if edge e in E projects to edge f in the input graph \(G_i\) (and 0 otherwise). We define the domain of each \(x_e\) to include all non-negative integers. However, due to constraints (7), the values of \(x_e\) are limited to either 0 or 1. We describe this ILP formulation with the assumption that both input graphs have closed Eulerian trails, which means that each node has equal numbers of incoming and outgoing edges. We show that any input graph that contains open Eulerian trails can be converted to a graph with closed Eulerian trails in section "New ILP solutions to GTED".

The edges where \(x_e = 1\) in optimal solutions to the ILP in (5)–(8) form a subgraph of the alignment graph. The ILP in (5)–(8) allows the solutions that form in disjoint cycles in the alignment graph. However, the projection of such disjoint cycles does not correspond to a single string represented by either of the input graphs. Therefore, when solutions to the ILP in (5)–(8) form disjoint cycles in the alignment graph, the optimal objective value of the ILP is not equal to GTED.

Fig. 2
figure 2

a The subgraph in the alignment graph induced by an optimal solution to the ILP in (5)–(8) and the ILP in (11)–(12) with input graphs on the left and top. The red and blue edges in the alignment graph are edges matching labels in red and blue font, respectively, and are part of the optimal solution to the ILP in (5)–(8). The cost of the red and blue edges are zero. b The subgraph induced by \(x^{init}\) with \(s_1 = u_1\) and \(s_2=v_1\) according to the ILP in (11)–(12). The rest of the edges in the alignment graph are omitted for simplicity

We give an example of disjoint cycles in the ILP formulated based on an input alignment graph. Construct two input graphs as shown in Fig. 2a. Specifically, \(G_1\) spells circular permutations of TTTGAA and \(G_2\) spells circular permutations of TTTAGA. It is clear that \(\textsc {GTED} (G_1, G_2) = 2\) under Levenshtein edit distance. On the other hand, as shown in Fig. 2a, an optimal solution in \(\mathcal {A} (G_1, G_2)\) contains two disjoint cycles with nonzero \(x_e\) values that have a total edge cost equal to 0. This solution is a feasible solution to the ILP in (5)–(8). It is also an optimal solution because the objective value is zero, which is the lower bound on the ILP in (5)–(8). This optimal objective value, however, is smaller than \(\textsc {GTED} (G_1, G_2)\). Therefore, the ILP in (5)–(8) does not solve GTED since it allows the solution to be a set of disjoint components.

The second previously proposed ILP formulation of GTED

We describe the second proposed ILP formulation of GTED by Ebrahimpour Boroojeny et al. [1].

Following Ebrahimpour Boroojeny et al. [1], we use simplices, a notion from geometry, to generalize the notion of an edge to higher dimensions. A k-simplex is a k-dimensional polytope which is the convex hull of its \(k+1\) vertices. For example, a 1-simplex is an undirected edge, and a 2-simplex is a triangle. We use the orientation of a simplex, which is given by the ordering of the vertex set of a simplex up to an even permutation, to generalize the notion of the edge direction [15, p. 26]. We use square brackets \([\cdot ]\) to denote an oriented simplex. For example, \([v_0, v_1]\) denotes a 1-simplex with orientation \(v_0\rightarrow v_1\), which is a directed edge from \(v_0\) to \(v_1\), and \([v_0, v_1, v_2]\) denotes a 2-simplex with orientation corresponding to the vertex ordering \(v_0\rightarrow v_1 \rightarrow v_2\rightarrow v_0\). Each k-simplex has two possible unique orientations, and we use the signed coefficient to connect their forms together, e.g. \([v_0, v_1] = -[v_1, v_0]\).

Fig. 3
figure 3

a A graph that contains an unoriented 2-simplex with three unoriented 1-simplices. b, c The same graph with two different ways of orienting the simplices and the corresponding boundary matrices

For each pair of graphs \(G_1\) and \(G_2\) and their alignment graph \(\mathcal {A} (G_1,G_2)\), we define an oriented 2-simplex set \(T(G_1, G_2)\) which is the union of:

  • \([(u_1,u_2),(v_1,u_2), (v_1,v_2)]\) for all \((u_1,v_1) \in E_1\) and \((u_2,v_2) \in E_2\), or

  • \([(u_1,u_2),(u_1,v_2),(v_1,v_2)]\) for all \((u_1,v_1) \in E_1\) and \((u_2,v_2) \in E_2\),

We use the boundary operator [15, p. 28], denoted by \(\partial\), to map an oriented k-simplex to a sum of oriented \((k-1)\)-simplices with signed coefficients.

$$\begin{aligned} \partial [v_0,v_1,\dots , v_k]=\sum _{i=0}^{p}(-1)^{i}[v_0,\dots , \hat{v_i},\dots , v_k], \end{aligned}$$
(9)

where \(\hat{v_i}\) denotes the vertex \(v_i\) is to be deleted. Intuitively, the boundary operator maps the oriented k-simplex to a sum of oriented \((k-1)\)-simplices such that their vertices are in the k-simplex and their orientations are consistent with the orientation of the k-simplex. For example, when \(k=2\), we have:

$$\begin{aligned} \partial [v_0, v_1, v_2] = [v_1, v_2] - [v_0, v_2] + [v_0, v_1] =[v_1, v_2] + [v_2, v_0] + [v_0, v_1]. \end{aligned}$$
(10)

We reiterate the second ILP formulation proposed in Ebrahimpour Boroojeny et al. [1]. Given an alignment graph \(\mathcal {A} (G_1, G_2)=(V, E, \delta )\) and the oriented 2-simplex set \(T(G_1,G_2)\),

$$\begin{aligned} \begin{aligned} \underset{x\in \mathbb {N}^{|E|},y\in \mathbb {Z}^{|T(G_1,G_2)|}}{\text {minimize}}\quad&\sum _{e \in E} x_e\delta (e)\\ \text {subject to}\quad&x = x^{init} + [\partial ] y \end{aligned} \end{aligned}$$
(11)

Entries in x and y correspond to 1-simplices and 2-simplices in E and \(T(G_1, G_2)\), respectively. \([\partial ]\) is a \(|E|\times |T(G_1, G_2)|\) boundary matrix where each entry \([\partial ]_{i,j}\) is the signed coefficient of the oriented 1-simplex (the directed edge) in E corresponding to \(x_i\) in the boundary of the oriented 2-simplex in \(T(G_1, G_2)\) corresponding to \(y_j\). The index ij for each 1-simplex or 2-simplex is assigned based on an arbitrary ordering of the 1-simplices in E or the 2-simplices in \(T(G_1,G_2)\). An example of the boundary matrix is shown in Fig. 3. \(\delta (e)\) is the cost of each edge. \(x^{init} \in \mathbb {R}^{|E|}\) is a vector where each entry corresponds to a 1-simplex in E with \(|E_1|+|E_2|\) nonzero entries that represent one Eulerian trail in each input graph. \(x^{init}\) is a feasible solution to the ILP. Let \(s_1\) be the source of the Eulerian trail in \(G_1\), and \(s_2\) be the sink of the Eulerian trail in \(G_2\). Each entry in \(x^{init}\) is defined by

$$\begin{aligned} x_e^{init} = {\left\{ \begin{array}{ll} 1\quad &{}\text {if}\;e=[(u_1, s_2),(v_1, s_2)]\,\text {or}\,e=[(s_1, u_2),(s_1, v_2)],\\ 0\quad &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$
(12)

If the Eulerian trail is closed in \(G_i\), \(s_i\) can be any vertex in \(V_i\). An example of \(x^{init}\) is shown in Fig. 2b.

We provide a complete proof in Appendix "Equivalence between two ILPs proposed by [1]" that the ILP in (5)–(8) is equivalent to the ILP in (11)–(12). Therefore, the example we provided in section "The first previously proposed ILP for GTED" is also an optimal solution to the ILP in (11)–(12) but not a solution to GTED. Thus, the ILP in (11)–(12) does not always solve GTED.

Fig. 4
figure 4

Modified alignment graphs based on input types. a \(G_1\) has open Eulerian trails while \(G_2\) has closed Eulerian trails. b Both \(G_1\) and \(G_2\) have closed Eulerian trails. c Both \(G_1\) and \(G_2\) have open Eulerian trails. Solid red and blue nodes are the source and sink nodes of the graphs with open Eulerian trails. “s” and “t” are the added source and sink nodes. Colored edges are added alignment edges directing from and to source and sink nodes, respectively

New ILP solutions to GTED

To ensure that our new ILP formulations are applicable to input graphs regardless of whether they contain an open or closed Eulerian trail, we add a source node s and a sink node t to the alignment graph. Figure 4 illustrates three possible cases of input graphs.

  1. 1.

    If only one of the input graphs has closed Eulerian trails, wlog, let \(G_1\) be the input graph with open Eulerian trails. Let \(a_1\) and \(b_1\) be the start and end of the Eulerian trail that have odd degrees. Add edges \([s, (a_1, v_2)]\) and \([(b_1, v_2), t]\) to E for all nodes \(v_2\in V_2\) (Fig. 4a). Let the labels on the newly added edges be the empty character \(\epsilon\).

  2. 2.

    If both input graphs have closed Eulerian trails, let \(a_1\) and \(a_2\) be two arbitrary nodes in \(G_1\) and \(G_2\), respectively. Add edges \([s, (a_1, v_2)]\), \([s, (v_1, a_2)]\), \([(a_1, v_2), t]\) and \([(v_1, a_2), t]\) for all nodes \(v_1\in V_1\) and \(v_2\in V_2\) to E (Fig. 4b).

  3. 3.

    If both input graphs have open Eulerian trails, add edges \([s, (a_1, a_2)]\) and \([t,(b_1, b_2)]\), where \(a_i\) and \(b_i\) are start and end nodes of the Eulerian trails in \(G_i\), respectively (Fig. 4c).

According to Lemma 1, we can solve \(\textsc {GTED} (G_1, G_2)\) by finding a trail in \(\mathcal {A} (G_1, G_2)\) that satisfies the projection requirements. This is equivalent to finding a s-t trail in \(\mathcal {A} (G_1, G_2)\) that satisfies constraints:

$$\begin{aligned} \sum _{(u,v)\in E} x_{uv} I_i((u,v), f) = 1\quad \text {for all }(u,v)\in E,f\in G_i,\,u\ne s,\,v\ne t, \end{aligned}$$
(13)

where \(I_i(e,f) = 1\) if the alignment edge e projects to f in \(G_i\), and \(x_{uv}\) is the ILP variable for edge \((u,v)\in E\). An optimal solution to GTED in the alignment graph must start and end with the source and sink node because they are connected to all possible starts and ends of Eulerian trails in the input graphs.

Since a trail in \(\mathcal {A} (G_1, G_2)\) is a flow network, we use the following flow constraints to enforce the equality between the number of in- and out-edges for each node in the alignment graph except the source and sink nodes.

$$\begin{aligned} \sum _{(s,u)\in E}x_{su}&= 1 \end{aligned}$$
(14)
$$\begin{aligned} \sum _{(v,t)\in E}x_{vt}&= 1\end{aligned}$$
(15)
$$\begin{aligned} \sum _{(u,v)\in E} x_{uv}&= \sum _{(v,w)\in E} x_{vw} \quad \text {for all }v\in V \end{aligned}$$
(16)

Constraints (13) and (16) are equivalent to constraints (7) and (6), respectively. Therefore, we rewrite the ILP in (5)–(8) in terms of the modified alignment graph.

$$\begin{aligned} \underset{x\in \mathbb {N}^{|E|}}{\text {minimize}}\quad & \sum _{e \in E} x_e\delta (e)\\ \text {subject to}\quad & \text {constraints}\,(13){-}(16). \end{aligned}$$
(lower bound ILP)

As we show in section "The first previously proposed ILP for GTED", constraints (13)–(16) do not guarantee that the ILP solution is one trail in \(\mathcal {A} (G_1, G_2)\), thus allowing several disjoint covering trails to be selected in the solution and fails to model GTED correctly. We show in section "Closed-trail cover traversal edit distance" that the solution to this ILP is a lower bound to GTED.

According to Lemma 1 in Dias et al. [11], a subgraph of a directed graph G with source node s and sink node t is a s-t trail if and only if it is a flow network and every strongly connected component (SCC) of the subgraph has at least one edge outgoing from it. Thus, in order to formulate an ILP for the GTED problem, it is necessary to devise constraints that prevent disjoint SCCs from being selected in the alignment graph. In the following, we describe two approaches for achieving this.

Enforcing one trail in the alignment graph via constraint generation

Section 3.2 of Dias et al. [11] proposes a method to design linear constraints for eliminating disjoint SCCs, which can be directly adapted to our problem. Let \(\mathcal {C}\) be the collection of all strongly connected subgraphs of the alignment graph \(\mathcal {A} (G_1,G_2)\). We use the following constraint to enforce that the selected edges form one s-t trail in the alignment graph:

$$\begin{aligned} \text {If }\sum _{(u,v)\in E(C)}x_{uv}=|E(C)|,\text { then }\sum _{(u,v)\in \varepsilon ^{+}(C)}x_{uv}\ge 1 \quad \text {for all }C\in \mathcal {C}, \end{aligned}$$
(17)

where E(C) is the set of edges in the strongly connected subgraph C and \(\varepsilon ^{+}(C)\) is the set of edges (uv) such that u belongs to C and v does not belong to C. \(\sum _{(u,v)\in E(C)}x_{uv}=|E(C)|\) indicates that C is in the subgraph of \(\mathcal {A} (G_1,G_2)\) constructed by all edges (uv) with positive \(x_{uv}\), and \(\sum _{(u,v)\in \varepsilon ^{+}(C)}x_{uv}\ge 1\) guarantees that there exists an out-going edge of C that is in the subgraph.

We use the same technique as Dias et al. [11] to linearize the “if-then” condition in (17) by introducing a new variable \(\beta\) for each strongly connected component:

$$\begin{aligned}{}&\sum _{(u,v)\in E(C)}x_{uv}\ge |E(C)|\beta _{C}\quad \text {for all }C\in \mathcal {C} \end{aligned}$$
(18)
$$\begin{aligned}{}&\sum _{(u,v)\in E(C)}x_{uv}-|E(C)|+1-|E(C)|\beta _{C}\le 0\quad \text {for all }C\in \mathcal {C} \end{aligned}$$
(19)
$$\begin{aligned}{}&\sum _{(u,v)\in \varepsilon ^{+}(C)}x_{uv}\ge \beta _{C}\quad \text {for all }C\in \mathcal {C}\end{aligned}$$
(20)
$$\begin{aligned}{}&\beta _{C}\in \{0,1\}\quad \text {for all }C\in \mathcal {C} \end{aligned}$$
(21)

To summarize, given any pair of unidirectional, edge-labeled Eulerian graphs \(G_1\) and \(G_2\) and their alignment graph \(\mathcal {A} (G_1,G_2)=(V,E,\delta )\), \(\textsc {GTED} (G_1, G_2)\) is equal to the optimal solution of the following ILP formulation:

$$\begin{aligned} \begin{aligned} \underset{x\in \{0,1\}^{|E|}}{\text {minimize}}\quad&\sum _{e \in E} x_e\delta (e)\\ \text {subject to}\quad&\text {constraints}\,(13){-}(16) \text { and } \\&\text {constraints}\,(18){-}(21). \end{aligned} \end{aligned}$$
(exponential ILP)

This ILP has an exponential number of constraints as there is a set of constraints for every strongly connected subgraph in the alignment graph. To solve this ILP more efficiently, we can use the procedure similar to the iterative constraint generation procedure in Dias et al. [11] (Algorithm 1). Initially, solve the ILP with only constraints (13)–(16). Create a subgraph, \(G'\), induced by edges with positive \(x_{uv}\). For each disjoint SCC in \(G'\) that does not contain the sink node, add constraints (18)–(21) for edges in the SCC and solve the new ILP. Iterate until no disjoint SCCs are found in the solution.

Algorithm 1
figure a

Iterative constraint generation algorithm to solve (exponential ILP)

A compact ILP for GTED with polynomial number of constraints

In the worst cases, the number of iterations to solve (exponential ILP) via constraint generation is exponential. As an alternative, we introduce a compact ILP with only a polynomial number of constraints. The intuition behind this ILP is that we can impose a partially increasing ordering on all the edges so that the selected edges forms a s-t trail in the alignment graph. This idea is similar to the Miller-Tucker-Zemlin ILP formulation of the travelling salesman problem (TSP) [12].

We add variables \(d_{uv}\) that are constrained to provide a partial ordering of the edges in the s-t trail and set the variables \(d_{uv}\) to zero for edges that are not selected in the s-t trail. Intuitively, there must exist an ordering of edges in a s-t trail such that for each pair of consecutive edges (uv) and (vw), the difference in their order variable \(d_{uv}\) and \(d_{vw}\) is 1. Therefore, for each node v that is not the source or the sink, if we sum up the order variables for the incoming edges and outgoing edges respectively, the difference between the two sums is equal to the number of selected incoming/outgoing edges. Lastly, the order variable for the edge starting at source is 1, and the order variable for the edge ending at sink is the number of selected edges. This gives the ordering constraints as follows:

$$\begin{aligned} \text {If } x_{uv} = 0,\text { then }d_{uv}&= 0 \quad \text {for all }(u,v)\in E \end{aligned}$$
(22)
$$\begin{aligned} \sum _{(v,w)\in E}d_{vw} - \sum _{(u,v)\in E} d_{uv}&= \sum _{(v,w)\in E} x_{vw}\quad \text {for all }v\in V\setminus \{s, t\} \end{aligned}$$
(23)
$$\begin{aligned} \sum _{(s,u)\in E}d_{su}&= 1\end{aligned}$$
(24)
$$\begin{aligned} \sum _{(v,t)\in E}d_{vt}&= \sum _{(u,v)\in E} x_{uv} \end{aligned}$$
(25)

We enforce that all variables \(x_e\in \{0,1\}\) and \(d_e\in \mathbb {N}\) for all \(e\in E\).

The “if-then” statement in Eq. (22) can be linearized by introducing an additional binary variable \(y_{uv}\) for each edge [11, 16]:

$$\begin{aligned} -x_{uv} - |E|y_{uv}&\le -1\end{aligned}$$
(26)
$$\begin{aligned} d_{uv} - |E|(1-y_{uv})&\le 0 \end{aligned}$$
(27)
$$\begin{aligned} y_{uv}&\in \{0,1\}. \end{aligned}$$
(28)

Here, \(y_{uv}\) is an indicator of whether \(x_{uv} \ge 0\). The coefficient |E| is the number of edges in the alignment graph and also an upper bound on the ordering variables. When \(y_{uv} = 1\), \(d_{uv}\le 0\), and \(y_{uv}\) does not impose constraints on \(x_{uv}\). When \(y_{uv} = 0\), \(x_{uv}\ge 1\), and \(y_{uv}\) does not impose constraints on \(d_{uv}\).

Correctness of (compact ILP) for GTED

To show that the optimal objective value of (compact ILP) is equal to GTED, we show that the optimal solutions to (compact ILP) always form one connected component.

Lemma 2

Let \(x_e\) and \(d_e\) be ILP variables. Let \(G'\) be a subgraph of \(\mathcal {A} (G_1, G_2)\) that is induced by edges with \(x_e = 1\). If \(x_e\) and \(d_e\) satisfy constraints (13)–(25) for all \(e\in E\), \(G'\) is connected with one trail from s to t that traverses each edge in \(G'\) exactly once.

Proof

We prove the lemma in 2 parts: (1) all nodes except s and t in \(G'\) have an equal number of in- and out-edges, (2) \(G'\) contains only one connected component.

The first statement holds because the edges of \(G'\) form a flow from s to t, and is enforced by constraints (16).

We then show that \(G'\) does not contain isolated subgraphs that are not reachable from s or t. Due to constraint (16), the only possible scenario is that the isolated subgraph is strongly connected. Suppose for contradiction that there is a strongly connected component, C, in \(G'\) that is not reachable from s or t.

The sum of the left hand side of constraint (23) over all vertices in C is

$$\begin{aligned} \sum _{v\in C} \left( \sum _{(u,v)\in C} d_{uv} - \sum _{(v,w)\in C} d_{vw}\right)&= \sum _{v\in C} \sum _{(u,v)\in C} d_{uv} - \sum _{v\in C} \sum _{(v,w)\in C} d_{vw} \end{aligned}$$
(29)
$$\begin{aligned}{}&= \sum _{(u,v)\in E(C)} d_{uv} - \sum _{(v,w)\in E(C)} d_{vw} = 0. \end{aligned}$$
(30)

However, the right-hand side of the same constraints is always positive. Hence we have a contradiction. Therefore, \(G'\) has only one connected component. \(\square\)

Due to Lemma 1 and Lemma 2, given input graphs \(G_1\) and \(G_2\) and the alignment graph \(\mathcal {A} (G_1, G_2)\), \(\textsc {GTED} (G_1, G_2)\) is equal to the optimal objective of

$$\begin{aligned} \begin{aligned} \underset{x\in \{0,1\}^{|E|}}{\text {minimize}}\quad&\sum _{e \in E} x_e\delta (e)\\ \text {subject to}\quad&\text {constraints}\,(13){-}(16), \\&\text {constraints}\,(23){-}(25)\\&\text {and constraints}\,(26){-}(28). \end{aligned} \end{aligned}$$
(compact ILP)

Closed-trail cover traversal edit distance

While the (lower bound ILP) and the ILP in (11)–(12) do not solve GTED, the optimal solution to these ILPs is a lower bound of GTED. These ILP formulations also solve an interesting variant of GTED, which is a local similarity measure between two genome graphs. We call this variant Closed-trail cover traversal edit distance (CCTED). In the following, we provide the formal definition of the CCTED problem and then show that the (lower bound ILP) is the correct ILP formulation for solving CCTED.

We first introduce the min-cost item matching problem between two multi-sets. Let two multi-sets of items be \(S_1\) and \(S_2\), and, wlog, let \(|S_1| \le |S_2|\). Let \(c: (S_1 \cup \{\epsilon \}) \times S_2 \rightarrow \mathbb {N}\) be the cost of matching either an empty item \(\epsilon\) or an item in \(S_1\) with an item in \(S_2\). Given \(S_1\), \(S_2\) and the cost function c, min-cost matching problem finds a matching, \(\mathcal {M}_c(S_1, S_2)\), such that each item in \(S_1\cup \{\epsilon \}^{|S_2|-|S_1|}\) is matched with exactly one distinct item in \(S_2\) and the total cost of the matching, \(\sum _{(s_1, s_2)\in \mathcal {M}_c(S_1, S_2)} c(s_1, s_2)\), is minimized.

The min-cost item matching problem is similar to the Earth Mover’s Distance defined in [17], except that only integral units of items can be matched and the cost of matching an empty item with another item is not constant. Similar to the Earth Mover’s Distance, the min-cost item matching problem can be computed using the ILP formulation of the min-cost max-flow problem [13, 14]. When the cost is the edit distance, the cost to match \(\epsilon\) with a string is equal to the length of the string.

Define traversal edit distance, \(edit_t(t_1, t_2)\) as the edit distance between the strings constructed from a pair of trails \(t_1\) and \(t_2\). In other words, \(edit_t(t_1, t_2) = edit(\text {str}(t_1), \text {str}(t_2))\). \(\textsc {CCTED}\) is defined as:

Problem 3

(Closed-trail cover traversal edit distance (\(\textsc {CCTED}\))). Given two unidirectional, edge-labeled Eulerian graphs \(G_1\) and \(G_2\) with closed Eulerian trails, compute

$$\begin{aligned} \textsc {CCTED} (G_1, G_2) \triangleq \min _{\begin{array}{c} C_1\in \text {CC}(G_1),\\ C_2\in \text {CC}(G_2) \end{array}} \sum _{\begin{array}{c} (t_1, t_2)\in \mathcal {M}_{\text {edit}_t}(C_1, C_2) \end{array}}\text {edit}(\text {str}(t_1), \text {str}(t_2)), \end{aligned}$$
(31)

Here, \(\text {CC}(G)\) denotes the collection of all possible sets of edge-disjoint, closed trails in G, such that every edge in G belongs to exactly one of these trails. Each element of \(\text {CC}(G)\) can be interpreted as a cover of G using such trails. \(\mathcal {M}_{\text {edit}_t}(C_1, C_2)\) is a min-cost matching between two covers using the traversal edit distance as the cost.

\(\textsc {CCTED}\) is likely a more suitable metric comparison between genomes that undergo large-scale rearrangements. This analogy is to the relationship between the synteny block comparison [3] and the string edit distance computation, where the former is more often used in interspecies comparisons and in detecting segmental duplications [18, 19] and the latter is more often seen in intraspecies comparisons.

Following similar ideas as Lemma 1, we can compute CCTED by finding a set of closed trails in the alignment graph such that the total cost of alignment edges is minimized, and the projection of all edges in the collection of selected trails is equal to the multi-set of input graph edges.

Lemma 3

For any two edge-labeled Eulerian graphs \(G_1\) and \(G_2\),

$$\begin{aligned} \textsc {CCTED} (G_1, G_2) = \underset{C}{\text {minimize}}\quad&\sum _{c\in C} \delta (c) \end{aligned}$$
(32)
$$\begin{aligned} \text {subject to}\quad&C \text { is a set of closed trails in }\mathcal {A} (G_1,G_2),\nonumber \\&\bigcup _{e\in C}\Pi _i(e) = E_i\quad \text {for }i=1,2, \end{aligned}$$
(33)

where C is a collection of trails and \(\delta (c)\)is the total cost of edges in trail c.

Proof

Given any pair of covers \(C_1\in \text {CC}(G_1)\) and \(C_2\in \text {CC}(G_2)\) and their min-cost matching based on the edit distance \(\mathcal {M}_{\text {edit}_t}(C_1, C_2)\), we can project each pair of matched closed trailed to a closed trail in the alignment graph. For a matching between a trail and the empty item \(\epsilon\), we can project it to a closed trail in the alignment graph with all vertical edges if the trail is from \(G_1\) or horizontal edges if the trail is from \(G_2\). The total cost of the projected edges must be greater than or equal to the objective (32). On the other hand, every collection of trails C that satisfy constraint (33) can be projected to a cover in each of the input graphs, and \(\sum _{c\in C} \delta (c) \ge CCTED(G_1, G_2)\). Hence equality holds. \(\square\)

The ILP formulation for CCTED

We show that the ILP in (5)–(8) proposed by Ebrahimpour Boroojeny et al. [1] solves CCTED.

Theorem 4

Given two input graphs \(G_1\) and \(G_2\), the optimal objective value of the ILP in (5)–(8) based on \(\mathcal {A} (G_1, G_2)\) is equal to \(\textsc {CCTED} (G_1, G_2)\).

Proof

As shown in the proof of Lemma 3, any pair of edge-disjoint, closed-trail covers in the input graph can be projected to a set of closed trails in \(\mathcal {A} (G_1, G_2)\), which satisfied constraints (6)–(8). The objective of this feasible solution, which is the total cost of the projected closed trails, equals \(\textsc {CCTED}\). Therefore, \(\textsc {CCTED} (G_1, G_2)\) is greater than or equal to the objective of the ILP in (5)–(8).

Conversely, we can transform any feasible solutions of the ILP in (5)–(8) to a pair of covers of \(G_1\) and \(G_2\). We can do this by transforming one closed trail at a time from the subgraph of the alignment graph, \(\mathcal {A} '\) induced by edges with ILP variable \(x_{uv} = 1\). Let c be a closed trail in \(\mathcal {A} '\). Let \(c_1 = \Pi _1(c)\) and \(c_2 = \Pi _2(c)\) be two closed trails in \(G_1\) and \(G_2\) that are projected from c. We can construct an alignment between \(\text {str}(c_1)\) and \(\text {str}(c_2)\) from c by adding match or insertion/deletion columns for each match or insertion/deletion edges in c accordingly. The cost of the alignment is equal to the total cost of edges in c by the construction of the alignment graph. We can then remove edges in c from the alignment graph and edges in \(c_1\) and \(c_2\) from the input graphs, respectively. The remaining edges in \(\mathcal {A} '\) and \(G_1\) and \(G_2\) still satisfy the constraints (6)–(8). Repeat this process and we get a total cost of \(\sum _{e\in E} x_{e}\delta (e)\) that aligns pairs of closed trails that form covers of \(G_1\) and \(G_2\). This total cost is greater than or equal to \(\textsc {CCTED} (G_1, G_2)\). \(\square\)

CCTED is a lower bound of GTED

Since the constraints for (lower bound ILP) are a subset of (exponential ILP), a feasible solution to (exponential ILP) is always a feasible solution to (lower bound ILP). Since two ILPs have the same objective function, \(\textsc {CCTED} (G_1, G_2) \le \textsc {GTED} (G_1, G_2)\) for any pair of graphs. Moreover, when the solution to (lower bound ILP) forms only one connected component, the optimal value of (lower bound ILP) is equal to GTED.

Theorem 5

Let \(\mathcal {A} '(G_1, G_2)\) be the subgraph of \(\mathcal {A} (G_1, G_2)\) induced by edges \((u,v)\in E\) with \(x^{opt}_{uv} = 1\) in the optimal solution to (lower bound ILP). There exists \(\mathcal {A} '(G_1, G_2)\) that has exactly one connected component if and only if \(c^{opt} = \textsc {GTED} (G_1, G_2)\).

Proof

We first show that if \(c^{opt}= \textsc {GTED} (G_1, G_2)\), then there exists \(\mathcal {A} '(G_1, G_2)\) that has one connected component. A feasible solution to (exponential ILP) is always a feasible solution to (lower bound ILP), and since \(c^{opt}= \textsc {GTED} (G_1, G_2)\), an optimal solution to (exponential ILP) is also an optimal solution to (lower bound ILP), which can induce a subgraph in the alignment graph that only contains one connected component.

Conversely, if \(x^{opt}\) induces a subgraph in the alignment graph with only one connected component, it satisfies constraints (18)–(21) and therefore is feasible to the ILP for GTED (exponential ILP). Since \(c^{opt} \le \textsc {GTED} (G_1, G_2)\), this solution must also be optimal for \(\textsc {GTED} (G_1, G_2)\). \(\square\)

In practice, we may estimate GTED approximately by the solution to (lower bound ILP). As we show in section Empirical evaluation of the ILP formulations for GTED and its lower bound, the time needed to solve (lower bound ILP) is much less than the time needed to solve GTED. However, in adversarial cases, \(c^{opt}\) could be zero but GTED could be arbitrarily large. We can determine if the \(c^{opt}\) is a lower bound on GTED or exactly equal to GTED by checking if the subgraph induced by the solution to (lower bound ILP) has multiple connected components.

NP-completeness of CCTED

We prove that the CCTED problem (Problem 3) is NP-complete by reducing from the Eulerian Trail Equaling Word problem [7].

Theorem 6

Computing CCTED is NP-complete.

Proof

Let Eulerian graph \(G=(V, E, \ell , \Sigma )\) and s be an instance of the eulerian tour equaling word problem. Construct two graphs, \(G_1\) and \(G_2\). If G contains open Eulerian trails, add an edge directing from the sink of the graph to the source of the graph. Let the label of the added edge be \(\#\) that does not appear in \(\Sigma\). Let the modified graph be \(G_1\). If G contains closed Eulerian trails, let \(G_1\) be the same as G. Let \(G_2\) be a graph that contains one cycle with \(|E_1|\) edges, where \(E_1\) is the edge set of \(G_1\). Assign labels to the edges in \(G_2\) such that the cycle in \(G_2\) spells s if G contains closed Eulerian trails, \(s\#\) otherwise.

If \(\textsc {CCTED} (G_1, G_2) = 0\), \(G_2\) must contain at least one closed Eulerian trail that spells some circular permutation of \(s\#\). If CCTED is not zero, it means that s must not match Eulerian trails in G. \(\square\)

Empirical evaluation of the ILP formulations for GTED and its lower bound

Implementation of the ILP formulations

We implement the algorithms and ILP formulations for  (exponential ILP),  (compact ILP) and  (lower bound ILP). In practice, the multi-set of edges of each input graph may contain many duplicates of edges that have the same start and end vertices due to repeats in the strings. We reduce the number of variables and constraints in the implemented ILPs by merging the edges that share the same start and end nodes and record the multiplicity of each edge. Each x variable is no longer binary but a non-negative integer that satisfies the modified projection constraints (13):

$$\begin{aligned} \sum _{(u,v)\in E} x_{uv}I_{i}((u,v), f) = M_i(f)\quad \text {for all }(u,v)\in E,\,f\in G_i, u\ne s, v\ne t, \end{aligned}$$
(34)

where \(M_i(f)\) is the multiplicity of edge f in \(G_i\). Let C be the strongly connected component in the subgraph induced by positive \(x_{uv}\), now \(\sum _{(u,v)\in E(C)} x_{uv}\) is no longer upper bounded by |E(C)|. Therefore, constraints (19) are changed to

$$\begin{aligned}{}&\sum _{(u,v)\in E(C)}x_{uv}-|E(C)|+1-W(C)\beta _{C}\le 0\quad \text {for all }C\in \mathcal {C}, \nonumber \\&W(C)=\sum _{(u,v)\in E(C)}\max \left( \sum _{f\in G_1}M_1(f)I_{1}((u,v), f),\sum _{f\in G_2}M_2(f)I_{2}((u,v), f)\right) , \end{aligned}$$
(35)

where W(C) is the maximum total multiplicities of edges in the strongly connected subgraph in each input graph that is projected from C.

Likewise, constraints (27) that set the upper bounds on the ordering variables also need to be modified as the upper bound of the ordering variable \(d_{uv}\) for each edge no longer represents the order of one edge but the sum of orders of copies of (uv) that are selected, which is at most \(|E|^2\). Therefore, constraint (27) is changed to

$$\begin{aligned} d_{uv} - |E|^2(1-y_{uv}) \le 0. \end{aligned}$$
(36)

The rest of the constraints remain unchanged.

We ran all our experiments on a server with 48 cores (96 threads) of Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz and 378 GB of memory. The system was running Ubuntu 18.04 with Linux kernel 4.15.0. We solve all the ILP formulations and their linear relaxations using the Gurobi solver [20] using 32 threads.

GTED on simulated TCR sequences

We construct 20 de Bruijn graphs with \(k=4\) using 150-character sequences extracted from the V genes from the IMGT database [21]. We solve the linear relaxation of (compact ILP), (exponential ILP) and (lower bound ILP) and their linear relaxation on all 190 pairs of graphs. We do not show results for solving (compact ILP) for GTED on this set of graphs as the running time exceeds 30 min on most pairs of graphs.

To compare the time to solve the ILP formulations when GTED is equal to the optimal objective of (lower bound ILP), we only include 168 out of 190 pairs where GTED is equal to the lower bound (GTED is slightly higher than the lower bound in the remaining 22 pairs). On average, it takes 26 s wall-clock time to solve (lower bound ILP), and 71 s to solve (exponential ILP) using the iterative algorithm. On average, it takes 9 s to solve the LP relaxation of (compact ILP) and 1 s to solve the LP relaxation of (lower bound ILP). The time to construct the alignment graph for all pairs is less than 0.2 s. The distribution of wall-clock running time is shown in Fig. 5a. The time to solve (exponential ILP) and (lower bound ILP) is generally positively correlated with the GTED values (Fig. 5b). On average, it takes 7 iterations for the iterative algorithm to find the optimal solution that induces one strongly connected subgraph (Fig. 5c).

In summary, it is fastest to compute the lower bound of GTED. Computing GTED exactly by solving the proposed ILPs on genome graphs of size 150 is already time consuming. When the sizes of the genome graphs are fixed, the time to solve for GTED and its lower bound increases as GTED between the two genome graphs increases. In the case where GTED is equal to its lower bound, the subgraph induced by some optimal solutions of (lower bound ILP) contains more than one strongly connected component. Therefore, in order to reconstruct the strings from each input graph that have the smallest edit distance, we generally need to obtain the optimal solution to the ILP for GTED. In all cases, the time to solve the (exponential ILP) is less than the time to solve the (compact ILP).

Fig. 5
figure 5

a The distribution of wall-clock running time for constructing alignment graphs, solving the ILP formulations for GTED and its lower bound, and their linear relaxations on the log scale. b The relationship between the time to solve (lower bound ILP), (exponential ILP) iteratively and GTED. c The distribution of the number of iterations to solve exponential ILP. The box plots in each plot show the median (middle line), the first and third quantiles (upper and lower boundaries of the box), the range of data within 1.5 inter-quantile range between Q1 and Q3 (whiskers), and the outlier data points

GTED on difficult cases

Repeats, such as segmental duplications and translocations [22, 23] in the genomes increase the complexity of genome comparisons. We simulate such structures with a class of graphs that contain n simple cycles of which \(n-1\) peripheral cycles are attached to the n-th central cycle at either a node or a set of edges (Fig. 6a). The input graphs in Fig. 2 belong to this class of graphs that contain 2 cycles. This class of graphs simulates the complex structural variants in disease genomes or the differences between genomes of different species.

We generate pairs of 3-cycle graphs with varying sizes and randomly assign letters from {A,T,C,G} to edges. We compute the CCTED and GTED using (lower bound ILP) and (compact ILP), respectively. We group the generated 3-cycle graph pairs based on the value of \((\textsc {GTED}- \textsc {CCTED})\) and select 20 pairs of graphs randomly for each \((\textsc {GTED}- \textsc {CCTED})\) value ranging from 1 to 5. The maximum number of edges in all selected graphs is 32.

We show the difficulty of computing GTED using the iterative algorithm on the 100 selected pairs of 3-cycle graphs. We terminate the ILP solver after 20 min. As shown in Fig. 6, as the difference between GTED and CCTED increases, the wall-clock time to solve (exponential ILP) for GTED increases faster than the time to solve (compact ILP) for GTED. For pairs of graphs with \((\textsc {GTED}- \textsc {CCTED})= 5\), on average it takes more than 15 min to solve (exponential ILP) with more than 500 iterations. On the other hand, it takes an average of 5 s to solve (compact ILP) for GTED and no more than 1 s to solve for the lower bound. The average time to solve each ILP is shown in Table 1.

In summary, on the class of 3-cycle graphs introduced above, the difficulty of solving GTED via the iterative algorithm increases rapidly as the gap between GTED and \(\textsc {CCTED}\) increases. Although (exponential ILP) is solved more quickly than (compact ILP) for GTED when the sequences are long and the GTED is equal to \(\textsc {CCTED}\) (Section 6.2), (compact ILP) may be more efficient when the graphs contain overlapping cycles such that the gap between GTED and \(\textsc {CCTED}\) is larger.

Fig. 6
figure 6

a An example of a 3-cycle graph. Cycle 1 and 2 are attached to cycle 3. b The distribution of wall-clock time to solve the compact ILP and the iterative exponential ILP on 100 pairs of 3-cycle graphs

Table 1 The average wall-clock time to solve lower bound ILPexponential ILPcompact ILP and the number of iterations for pairs of 3-cycle graphs for each \(\textsc {GTED}-\textsc {CCTED}\)

Conclusion

We point out the contradictions in the result on the complexity of labeled graph comparison problems and resolve the contradictions by showing that GTED, as opposed to the results in Ebrahimpour Boroojeny et al. [1], is NP-complete. On one hand, this makes GTED a less attractive measure for comparing graphs since it is unlikely that there is an efficient algorithm to compute the measure. On the other hand, this result better explains the difficulty of finding a truly efficient algorithm for computing \(\textsc {GTED}\) exactly. In addition, we show that the previously proposed ILP of GTED [1] does not solve GTED and give two new ILP formulations of GTED.

While the previously proposed ILP of GTED does not solve GTED, it solves for a lower bound of GTED, and we show that this lower bound can be interpreted as a more “local” measure, CCTED, of the distance between labeled graphs. Further, we characterize the LP relaxation of the ILP in (11)–(12) and show that, contrary to the results in Ebrahimpour Boroojeny et al. [1], the LP in (11)–(12) does not always yield optimal integer solutions.

As shown previously [1, 13], it takes more than 4 h to solve (lower bound ILP) for graphs that represent viral genomes that contain \(\approx 3000\) bases with a multi-threaded LP solver. Likewise, we show that computing GTED using either (exponential ILP) or (compact ILP) is already slow on small genomes, especially on pairs of simulated genomes that are different due to segmental duplications and translations. The empirical results show that it is currently impossible to solve GTED or CCTED, its lower bound, directly using this approach for bacterial- or eukaryotic-sized genomes on modern hardware. The results here should increase the theoretical interest in GTED along the directions of heuristics or approximation algorithms as justified by the NP-hardness of finding GTED.