Abstract
Background
Two genomes \(\mathbb {A}\) and \(\mathbb {B}\) over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by \(n_*\) the number of common families of \(\mathbb {A}\) and \(\mathbb {B}\). Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let \(c_i\) and \(p_j\) be respectively the numbers of cycles of length i and of paths of length j in the breakpoint graph of genomes \(\mathbb {A}\) and \(\mathbb {B}\). Then, the breakpoint distance of \(\mathbb {A}\) and \(\mathbb {B}\) is equal to \(n_*-\left( c_2+\frac{p_0}{2}\right)\). Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of \(\mathbb {A}\) and \(\mathbb {B}\) is \(n_*-\left( c+\frac{p_e }{2}\right)\), where c is the total number of cycles and \(p_e\) is the total number of paths of even length.
Motivation
The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a \(\sigma _k\) distance, defined to be \(n_*-\left( c_2+c_4+\ldots +c_k+\frac{p_0+p_2+\ldots +p_{k-2}}{2}\right)\), and increasingly investigate the complexities of median and double distance for the \(\sigma _4\) distance, then the \(\sigma _6\) distance, and so on.
Results
For the median, considerable effort in our group and in other research groups has yielded no progress, even for the \(\sigma _4\) distance. For the double distance under the \(\sigma _4\) and \(\sigma _6\) distances, in contrast, we devised linear time algorithms, which we present here.
Introduction
In genome comparison, the most elementary problem is that of computing a distance between two given genomes [1], each one being a set of chromosomes. Usually a high-level view of a chromosome is adopted, in which each chromosome is represented by a sequence of oriented genes and the genes are classified into families. The simplest model in this setting is the breakpoint model, whose distance consists of somehow quantifying the distinct adjacencies between the two genomes, an adjacency in a genome being the oriented neighborhood between two genes in one of its chromosomes [2]. Other models rely on large-scale genome rearrangements, such as inversions, translocations, fusions and fissions, yielding distances that correspond to the minimum number of rearrangements required to transform one genome into another [3,4,5].
Independently of the underlying model, the distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction [2]. The median problem, for example, has three genomes as input and asks for an ancestor genome that minimizes the sum of its distances to the three given genomes. Other models are related to the whole genome duplication (WGD) event [6]. Let the doubling of a genome duplicate each of its chromosomes. The double distance is the problem that has a duplicated genome and a singular genome as input and computes the distance between the former and a doubling of the latter. The halving problem has a duplicated genome as input and asks for a singular genome whose double distance to the given duplicated genome is minimized. Finally, the guided halving problem has a duplicated and a singular genome as input and asks for another singular genome that minimizes the sum of its double distance to the given duplicated genome and its distance to the given singular genome.
Our study relies on the breakpoint graph, a structure that represents the relation between two given genomes [7]. When the two genomes are over the same set of gene families and form a canonical pair, that is, when each of them has exactly one gene from each family, their breakpoint graph is a collection of cycles of even length and paths. Assuming that both genomes have \(n_*\) genes, if we call k-cycle a cycle of length k and k-path a path of length k, the corresponding breakpoint distance is equal to \(n_*-\left( c_2+\frac{p_0}{2}\right)\), where \(c_2\) is the number of 2-cycles and \(p_0\) is the number of 0-paths [2]. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation [5], the rearrangement distance is \(n_*-\left( c+\frac{p_e}{2}\right)\), where c is the total number of cycles and \(p_e\) is the total number of even paths [8].
While the halving problem under both breakpoint and rearrangement distances can be solved in polynomial time [2, 6, 9, 10], the median, double distance and guided halving problems can be solved in polynomial time only under the breakpoint distance, and are NP-hard under the rearrangement distance [2]. One way of exploring the complexity space between these two extremes is to consider a \(\sigma _k\) distance [11], defined to be \(n_*-\left( c_2+c_4+\ldots +c_k+\frac{p_0+p_2+\ldots +p_{k-2}}{2}\right)\), and increasingly investigate the complexities of median, guided halving and double distance under the \(\sigma _4\) distance, then under the \(\sigma _6\) distance, and so on. Note that the \(\sigma _2\) distance is the breakpoint distance and the \(\sigma _\infty\) distance is the DCJ distance. To the best of our knowledge, the guided halving problem has not been studied for this class of distances, while for the median under the \(\sigma _4\) distance much effort has been made in our group and in other research groups (e.g. [11]), but no progress has been obtained so far.
In contrast, for the double distance, while \(\sigma _8\) and higher were not yet studied, we succeeded in devising efficient algorithms for \(\sigma _4\) and \(\sigma _6\). Our results, which we present here, are built on a variation of the breakpoint graph, called ambiguous breakpoint graph [2], and have three main parts. First we show that in any \(\sigma _k\) double distance, including the NP-hard DCJ double distance, all 2-cycles and 0-paths are fulfilled, meaning that the common adjacencies and common telomeres between the compared genomes are always conserved. Then we show that the \(\sigma _4\) double distance can be computed by a greedy linear time algorithm. Finally we present a non-greedy but still linear time algorithm for the \(\sigma _6\) double distance.
This paper is an extended version of our two recent works, one restricted to genomes composed exclusively of circular chromosomes [12], whose solution is straightforward, and the second allowing linear chromosomes [13], which is more intricate but can be reduced to particular linear-time solvable instances of the maximum independent set problem. Besides putting together the results mentioned above, here we give more details and illustrations concerning the special situations that arose from the inclusion of linear chromosomes and the related instances of the maximum independent set problem.
Background
A chromosome is an oriented DNA molecule and can be either linear or circular. We represent a chromosome by its sequence of genes, where each gene is an oriented DNA fragment. We assume that each gene belongs to a family, which is a set of homologous genes. A gene that belongs to a family \(\texttt{X}\) is represented by the symbol \(\texttt{X}\) itself if it is read in forward orientation or by the symbol \(\overline{\texttt{X}}\) if it is read in reverse orientation. For example, the sequences \([\texttt{1}\,\overline{\texttt{3}}\,\texttt{2}]\) and \(({\texttt{4}})\) represent, respectively, a linear (flanked by square brackets) and a circular chromosome (flanked by parentheses), both shown in Fig. 1, the first composed of three genes and the second composed of a single gene. Note that if a sequence s represents a chromosome K, then K can be equally represented by the reverse complement of s, denoted by \(\overline{s}\), obtained by reversing the order and the orientation of the genes in s. Moreover, if K is circular, it can be equally represented by any circular rotation of s or of \(\overline{s}\). Recall that a gene is an occurrence of a family, therefore distinct genes from the same family are represented by the same symbol.
We can also represent a gene from family \({\texttt {X}}\) referring to its extremities \({\texttt {X}}^h\) (head) and \({\texttt {X}}^t\) (tail). The adjacencies in a chromosome are the neighboring extremities of distinct genes. The remaining extremities, that are at the ends of linear chromosomes, are telomeres. In linear chromosome \([{\texttt {1}}\,\,\overline{{\texttt {3}}}\,\,{\texttt {2}}]\), the adjacencies are \(\{{\texttt {1}}^h{\texttt {3}}^h, {\texttt {3}}^t{\texttt {2}}^t\}\) and the telomeres are \(\{{\texttt {1}}^t,{\texttt {2}}^h\}\). Note that an adjacency has no orientation, that is, an adjacency between extremities \({\texttt {1}}^h\) and \({\texttt {3}}^h\) can be equally represented by \({\texttt {1}}^h{\texttt {3}}^h\) and by \({\texttt {3}}^h{\texttt {1}}^h\). In the particular case of a single-gene circular chromosome, e.g. \(({\texttt {4}})\), an adjacency exceptionally occurs between the extremities of the same gene (here \({\texttt {4}}^h{\texttt {4}}^t\)).
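The extraction of adjacencies and telomeres from a gene sequence can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes genes are encoded as signed integers (a negative sign meaning reverse orientation) and extremities as `(family, "h")` or `(family, "t")` pairs; the function names are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene.

    A gene read in forward orientation goes from tail to head;
    a gene read in reverse orientation goes from head to tail.
    """
    family = abs(gene)
    if gene > 0:
        return (family, "t"), (family, "h")
    return (family, "h"), (family, "t")


def adjacencies_and_telomeres(genes, circular):
    """List the adjacencies and telomeres of one chromosome.

    An adjacency has no orientation, so each one is stored as a
    sorted pair of extremities.
    """
    ends = [extremities(g) for g in genes]
    adjacencies = [
        tuple(sorted((ends[i][1], ends[i + 1][0])))
        for i in range(len(ends) - 1)
    ]
    if circular:
        # Close the circle: last gene's right end meets first gene's left end.
        adjacencies.append(tuple(sorted((ends[-1][1], ends[0][0]))))
        return adjacencies, []
    return adjacencies, [ends[0][0], ends[-1][1]]


linear = adjacencies_and_telomeres([1, -3, 2], circular=False)
single = adjacencies_and_telomeres([4], circular=True)
```

Here `linear` reproduces the adjacencies and telomeres of the linear chromosome discussed above, and `single` contains the exceptional adjacency between the two extremities of the same gene.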
A genome is then a multiset of chromosomes and we denote by \(\mathcal {F}(\mathbb {G})\) the set of gene families that occur in genome \(\mathbb {G}\). In addition, we denote by \(\mathcal {A}(\mathbb {G})\) the multiset of adjacencies and by \(\mathcal {T}(\mathbb {G})\) the multiset of telomeres that occur in \(\mathbb {G}\). A genome \(\mathbb {S}\) is called singular if each gene family occurs exactly once in \(\mathbb {S}\). Similarly, a genome \(\mathbb {D}\) is called duplicated if each gene family occurs exactly twice in \(\mathbb {D}\). The two occurrences of a family in a duplicated genome are called paralogs. A doubled genome is a special type of duplicated genome in which each adjacency or telomere occurs exactly twice. These two copies of the same adjacency (respectively same telomere) in a doubled genome are called paralogous adjacencies (respectively paralogous telomeres). Observe that distinct doubled genomes with circular chromosomes can have exactly the same adjacencies and telomeres, as we show in Table 1, where we also give examples of singular and duplicated genomes.
Comparing canonical genomes
Two genomes \(\mathbb {S}_1\) and \(\mathbb {S}_2\) are said to be a canonical pair when they are singular and have the same gene families, that is, \(\mathcal {F}(\mathbb {S}_1)=\mathcal {F}(\mathbb {S}_2)\). Denote by \(\mathcal {F}_*\) the set of families occurring in canonical genomes \(\mathbb {S}_1\) and \(\mathbb {S}_2\), and by \(n_* = |\mathcal {F}_*|\) its cardinality. For example, genomes \(\mathbb {S}_1=\{({\texttt {1}}\,\,\overline{{\texttt {3}}}\,\,{\texttt {2}})\,\,({\texttt {4}})\}\) and \(\mathbb {S}_2=\{({\texttt {1}}\,\,{\texttt {2}})\,\,({\texttt {3}}\,\,\overline{{\texttt {4}}})\}\) are canonical with \(\mathcal {F}_*=\{{\texttt {1}},{\texttt {2}},{\texttt {3}},{\texttt {4}}\}\) and \(n_* = 4\).
Breakpoint graph
The relation between two canonical genomes \(\mathbb {S}_1\) and \(\mathbb {S}_2\) can be represented by their breakpoint graph \(BG(\mathbb {S}_1, \mathbb {S}_2) = (V,E)\), which is a multigraph representing the adjacencies of \(\mathbb {S}_1\) and \(\mathbb {S}_2\) [7]. The vertex set V comprises, for each family \(\texttt{X}\) in \(\mathcal {F}_*\), one vertex for the extremity \(\texttt{X}^h\) and one vertex for the extremity \(\texttt{X}^t\). The edge multiset E represents the adjacencies. For each adjacency in \(\mathbb {S}_1\) there exists one \(\mathbb {S}_1\)-edge in E linking its two extremities. Similarly, for each adjacency in \(\mathbb {S}_2\) there exists one \(\mathbb {S}_2\)-edge in E linking its two extremities. Clearly, \(BG(\mathbb {S}_1, \mathbb {S}_2)\) can easily be constructed in linear \(O(n_*)\) time.
The degree of each vertex can be 0, 1 or 2 and each connected component alternates between \(\mathbb {S}_1\)- and \(\mathbb {S}_2\)-edges. As a consequence, the components of the breakpoint graph of canonical genomes can be cycles of even length or paths. An even path has one endpoint in \(\mathbb {S}_1\) (\(\mathbb {S}_1\)-telomere) and the other in \(\mathbb {S}_2\) (\(\mathbb {S}_2\)-telomere), while an odd path has either both endpoints in \(\mathbb {S}_1\) or both endpoints in \(\mathbb {S}_2\). A vertex that is not a telomere in \(\mathbb {S}_1\) nor in \(\mathbb {S}_2\) is said to be non-telomeric. In the breakpoint graph a non-telomeric vertex has degree 2. We call i-cycle a cycle of length i and j-path a path of length j. We also denote by \(c_i\) the number of i-cycles, by \(p_j\) the number of j-paths, by c the total number of cycles and by \(p_e\) the total number of even paths. Since the number of telomeres in each genome is even (2 telomeres per linear chromosome), the total number of even paths in the breakpoint graph must be even. An example of a breakpoint graph is given in Fig. 2.
Breakpoint distance
For canonical genomes \(\mathbb {S}_1\) and \(\mathbb {S}_2\) the breakpoint distance, denoted by \(d _\textsc {bp}\), is defined as follows [2]:
\[d _\textsc {bp}(\mathbb {S}_1, \mathbb {S}_2) = n_* - |\mathcal {A}(\mathbb {S}_1)\cap \mathcal {A}(\mathbb {S}_2)| - \frac{|\mathcal {T}(\mathbb {S}_1)\cap \mathcal {T}(\mathbb {S}_2)|}{2}\,.\]
For \(\mathbb {S}_1=\{({\texttt {1}}\,\,\overline{{\texttt {3}}}\,\,{\texttt {2}})\,\,[{\texttt {4}}]\}\) and \(\mathbb {S}_2=\{({\texttt {1}}\,\,{\texttt {2}})\,\,[{\texttt {3}}\,\,\overline{{\texttt {4}}}]\}\), we have \(n_*=4\). The set of common adjacencies is \(\mathcal {A}(\mathbb {S}_1)\cap \mathcal {A}(\mathbb {S}_2)=\{{\texttt {1}}^t{\texttt {2}}^h\}\) and the set of common telomeres is \(\mathcal {T}(\mathbb {S}_1)\cap \mathcal {T}(\mathbb {S}_2)=\{{\texttt {4}}^t\}\), giving \(d _\textsc {bp}(\mathbb {S}_1, \mathbb {S}_2)=4-1-\frac{1}{2}=2.5\). Since a common adjacency of \(\mathbb {S}_1\) and \(\mathbb {S}_2\) corresponds to a 2-cycle and a common telomere corresponds to a 0-path in \(BG(\mathbb {S}_1, \mathbb {S}_2)\), the breakpoint distance can be rewritten as
\[d _\textsc {bp}(\mathbb {S}_1, \mathbb {S}_2) = n_* - \left( c_2+\frac{p_0}{2}\right) .\]
DCJ distance
Given a genome, a double cut and join (DCJ) is the operation that breaks two of its adjacencies or telomeres^{Footnote 1} and rejoins the open extremities in a different way [5]. For example, consider the chromosome \(K=[\,\,{\texttt {1}}\,\,{\texttt {2}}\,\,{\texttt {3}}\,\,{\texttt {4}}\,\,]\) and a DCJ that cuts K between genes \({\texttt {1}}\) and \({\texttt {2}}\) and between genes \({\texttt {3}}\) and \({\texttt {4}}\), creating segments \({\texttt {1}}\bullet\), \(\bullet {\texttt {2}}\,\,{\texttt {3}}\bullet\) and \(\bullet {\texttt {4}}\) (where the symbols \(\bullet\) represent the open ends). If we join the first with the third and the second with the fourth open end, we get \(K'=[\,\,{\texttt {1}}\,\,\overline{{\texttt {3}}}\,\,\overline{{\texttt {2}}}\,\,{\texttt {4}}\,\,]\), that is, the described DCJ operation is an inversion transforming K into \(K'\). Besides inversions, DCJ operations can represent several rearrangements, such as translocations, fissions and fusions. The DCJ distance \(d _\textsc {dcj}\) is then the minimum number of DCJs that transform one genome into the other and can be easily computed with the help of their breakpoint graph [8]:
\[d _\textsc {dcj}(\mathbb {S}_1, \mathbb {S}_2) = n_* - \left( c+\frac{p_e}{2}\right) .\]
If \(\mathbb {S}_1=\{({\texttt {1}}\,\,\overline{{\texttt {3}}}\,\,{\texttt {2}})\,\,[{\texttt {4}}]\}\) and \(\mathbb {S}_2=\{({\texttt {1}}\,\,{\texttt {2}})\,\,[{\texttt {3}}\,\,\overline{{\texttt {4}}}]\}\), then \(n_*=4\), \(c=1\) and \(p_e =2\) (see Fig. 2). Consequently, their DCJ distance is \(d _\textsc {dcj}(\mathbb {S}_1, \mathbb {S}_2)=2\).
The class of \(\sigma _k\) distances
Given the breakpoint graph of two canonical genomes \(\mathbb {S}_1\) and \(\mathbb {S}_2\), for \(k \in \{2,4,6,\ldots ,\infty \}\), we denote by \(\sigma _k\) the cumulative sums \(\sigma _k=c_2+c_4+\ldots +c_k+\frac{p_0+p_2+\ldots +p_{k-2}}{2}\). Then the \(\sigma _k\) distance of \(\mathbb {S}_1\) and \(\mathbb {S}_2\) is defined to be [11]:
\[d _{\sigma _k}(\mathbb {S}_1, \mathbb {S}_2) = n_* - \sigma _k\,.\]
It is easy to see that the \(\sigma _2\) distance equals the breakpoint distance and that the \(\sigma _\infty\) distance equals the DCJ distance, and that the distance decreases monotonically (that is, it never increases) as k grows between these two extremes. Moreover, the \(\sigma _k\) distance of two genomes that form a canonical pair can easily be computed in linear time for any \(k \ge 2\).
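As an illustration, the following sketch computes \(\sigma _k\) distances for the example pair \(\mathbb {S}_1=\{({\texttt {1}}\,\overline{{\texttt {3}}}\,{\texttt {2}})\,[{\texttt {4}}]\}\) and \(\mathbb {S}_2=\{({\texttt {1}}\,{\texttt {2}})\,[{\texttt {3}}\,\overline{{\texttt {4}}}]\}\), using the subtraction \(n_* - \sigma _k\) that matches the breakpoint and DCJ examples above. It is not the authors' implementation: the adjacencies are entered by hand, vertex names such as `"1h"` and all function names are assumptions, and components are classified with a union-find by comparing vertex and edge counts (a component with as many edges as vertices is a cycle, otherwise a path).

```python
import math
from collections import Counter


def component_counts(vertices, edges):
    """Count cycles and paths by length in a graph of maximum degree 2."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)
    n_vertices = Counter(find(v) for v in vertices)
    n_edges = Counter(find(u) for u, _ in edges)
    cycles, paths = Counter(), Counter()
    for root, nv in n_vertices.items():
        ne = n_edges[root]
        # A cycle has as many edges as vertices; a path has one edge fewer.
        (cycles if ne == nv else paths)[ne] += 1
    return cycles, paths


def sigma_k_distance(n, cycles, paths, k):
    """n minus (cycles of length <= k, plus half the even paths of length <= k-2)."""
    c = sum(cnt for length, cnt in cycles.items() if length <= k)
    p = sum(cnt for length, cnt in paths.items()
            if length % 2 == 0 and length <= k - 2)
    return n - (c + p / 2)


vertices = [f"{x}{e}" for x in (1, 2, 3, 4) for e in "th"]
s1_edges = [("1h", "3h"), ("3t", "2t"), ("2h", "1t")]  # adjacencies of S1
s2_edges = [("1h", "2t"), ("2h", "1t"), ("3h", "4h")]  # adjacencies of S2
cycles, paths = component_counts(vertices, s1_edges + s2_edges)
distances = {k: sigma_k_distance(4, cycles, paths, k) for k in (2, 4, 6, math.inf)}
```

For this pair the graph consists of one 2-cycle, one 0-path and one 4-path, so `distances` takes the value 2.5 for \(k=2\) and \(k=4\) and the value 2 for \(k=6\) and \(k=\infty\).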
Comparing a singular and a duplicated genome
Let \(\mathbb {S}\) be a singular and \(\mathbb {D}\) be a duplicated genome over the same \(n_*\) gene families, that is, \(\mathcal {F}(\mathbb {S})=\mathcal {F}(\mathbb {D})\) and \(n_* = |\mathcal {F}(\mathbb {S})| = |\mathcal {F}(\mathbb {D})|\). The number of genes in \(\mathbb {D}\) is twice the number of genes in \(\mathbb {S}\) and we need to somehow equalize the contents of these genomes, before searching for common adjacencies and common telomeres of \(\mathbb {S}\) and \(\mathbb {D}\) or transforming one genome into the other with DCJ operations. This can be done by doubling \(\mathbb {S}\), with a rearrangement operation mimicking a whole genome duplication: it simply consists of doubling each adjacency and each telomere of \(\mathbb {S}\). However, when \(\mathbb {S}\) has one or more circular chromosomes, it is not possible to find a unique layout of its chromosomes after the doubling: indeed, each circular chromosome can be doubled into two identical circular chromosomes, or the two copies are concatenated to each other in a single circular chromosome. Therefore, in general the doubling of a genome \(\mathbb {S}\) results in a set of doubled genomes denoted by \(\texttt{2}\mathbb {S}\). Note that \(|\texttt{2}\mathbb {S}|=2^r\), where r is the number of circular chromosomes in \(\mathbb {S}\). For example, if \(\mathbb {S}=\{(\mathtt {1\,2})\,\,[{\texttt {3}}\,{\texttt {4}}]\}\), then \(\texttt{2}\mathbb {S}=\{\mathbb {B}_1,\mathbb {B}_2\}\) with \(\mathbb {B}_1=\{(\mathtt {1\,2})\,\,(\mathtt {1\,2})\,\,[{\texttt {3}}\,{\texttt {4}}]\,\,[{\texttt {3}}\,{\texttt {4}}]\}\) and \(\mathbb {B}_2=\{(\mathtt {1\,2\,1\,2})\,\,[{\texttt {3}}\,{\texttt {4}}]\,\,[{\texttt {3}}\,{\texttt {4}}]\}\) (see Table 1).
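The enumeration of the \(2^r\) doubled genomes can be sketched as below. This is only an illustration under assumed conventions: a genome is a list of `(genes, circular)` pairs with `genes` a tuple of signed integers, and the name `doublings` is hypothetical.

```python
from itertools import product


def doublings(genome):
    """Enumerate all doubled genomes of a singular genome.

    Each linear chromosome is simply copied; each circular chromosome is
    either copied or concatenated with its copy into one circular chromosome.
    """
    options = []
    for genes, circular in genome:
        two_copies = [(genes, circular), (genes, circular)]
        if circular:
            options.append([two_copies, [(genes + genes, True)]])
        else:
            options.append([two_copies])
    return [
        [chrom for part in choice for chrom in part]
        for choice in product(*options)
    ]


S = [((1, 2), True), ((3, 4), False)]
doubled = doublings(S)
```

For the example genome with one circular and one linear chromosome, `doubled` holds the two genomes \(\mathbb {B}_1\) and \(\mathbb {B}_2\) given in the text.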
All genomes in \(\texttt{2}\mathbb {S}\) have exactly the same multisets of adjacencies and of telomeres, therefore we can use a special notation for these multisets: \(\mathcal {A}(\texttt{2}\mathbb {S})=\mathcal {A}(\mathbb {S})\!\cup \!\mathcal {A}(\mathbb {S})\) and \(\mathcal {T}(\texttt{2}\mathbb {S})=\mathcal {T}(\mathbb {S})\!\cup \!\mathcal {T}(\mathbb {S})\).
Each family in a duplicated genome can be \(\left(\rule{0em}{1.5ex}^\texttt{a}_\texttt{b}\right)\)-singularized by adding the index \(\texttt{a}\) to one of its occurrences and the index \(\texttt{b}\) to the other. A duplicated genome can be entirely singularized if each of its families is singularized. Let \(\mathfrak {S}^\texttt{a}_\texttt{b}(\mathbb {D})\) be the set of all possible genomes obtained by all distinct ways of \(\left(\rule{0em}{1.5ex}^\texttt{a}_\texttt{b}\right)\)-singularizing the duplicated genome \(\mathbb {D}\). Similarly, we denote by \(\mathfrak {S}^\texttt{a}_\texttt{b}(\texttt{2}\mathbb {S})\) the set of all possible genomes obtained by all distinct ways of \(\left(\rule{0em}{1.5ex}^\texttt{a}_\texttt{b}\right)\)-singularizing each doubled genome in the set \(\texttt{2}\mathbb {S}\).
The class of \(\sigma _k\) double distances
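The class of \(\sigma _k\) double distances of a singular genome \(\mathbb {S}\) and duplicated genome \(\mathbb {D}\) for \(k=2,4,6,\ldots\) is defined as follows:
\[d _{\sigma _k}^2(\mathbb {S},\mathbb {D}) = d _{\sigma _k}^2(\mathbb {S},\check{\mathbb {D}}) = \min _{\mathbb {B} \in \mathfrak {S}^\texttt{a}_\texttt{b}(\texttt{2}\mathbb {S})} \left\{ d _{\sigma _k}(\mathbb {B},\check{\mathbb {D}})\right\} , \text { where } \check{\mathbb {D}} \text { is any genome in } \mathfrak {S}^\texttt{a}_\texttt{b}(\mathbb {D}).\]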
Observe that \(d _{\sigma _k}^2(\mathbb {S},\check{\mathbb {D}})=d _{\sigma _k}^2(\mathbb {S},\check{\mathbb {D}}')\) for any \(\check{\mathbb {D}}, \check{\mathbb {D}}'\!\!\in \mathfrak {S}^\texttt{a}_\texttt{b}(\mathbb {D})\).
\(\sigma _2\) (breakpoint) double distance
The breakpoint double distance of \(\mathbb {S}\) and \(\mathbb {D}\), denoted by \(d _\textsc {bp}^2(\mathbb {S},\mathbb {D})\), is equivalent to the \(\sigma _2\) double distance. For this case the solution can be found easily with a greedy algorithm [2]: each adjacency or telomere of \(\mathbb {D}\) that occurs in \(\mathbb {S}\) can be fulfilled. If an adjacency or telomere that occurs twice in \(\mathbb {D}\) also occurs in \(\mathbb {S}\), it can be fulfilled twice in any genome from \(\texttt{2}\mathbb {S}\). Then,
\[d _\textsc {bp}^2(\mathbb {S},\mathbb {D}) = 2n_* - |\mathcal {A}(\texttt{2}\mathbb {S})\cap \mathcal {A}(\mathbb {D})| - \frac{|\mathcal {T}(\texttt{2}\mathbb {S})\cap \mathcal {T}(\mathbb {D})|}{2}\,.\]
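The greedy counting described above can be sketched as follows. The sketch assumes that the breakpoint double distance equals \(2n_*\) minus the number of common adjacencies minus half the number of common telomeres, with the intersections taken between the multisets of \(\texttt{2}\mathbb {S}\) and \(\mathbb {D}\); this is the breakpoint distance formula applied to genomes with \(2n_*\) genes, stated here as an assumption. The signed-integer encoding and all function names are illustrative.

```python
from collections import Counter


def extremities(gene):
    """(left, right) extremities of a signed gene; forward reads tail to head."""
    family = abs(gene)
    if gene > 0:
        return (family, "t"), (family, "h")
    return (family, "h"), (family, "t")


def features(genome):
    """Multisets of adjacencies and telomeres of a genome."""
    adjacencies, telomeres = Counter(), Counter()
    for genes, circular in genome:
        ends = [extremities(g) for g in genes]
        pairs = list(zip(ends, ends[1:]))
        if circular:
            pairs.append((ends[-1], ends[0]))
        else:
            telomeres.update([ends[0][0], ends[-1][1]])
        for left, right in pairs:
            adjacencies[tuple(sorted((left[1], right[0])))] += 1
    return adjacencies, telomeres


def breakpoint_double_distance(singular, duplicated):
    adj_s, tel_s = features(singular)
    adj_d, tel_d = features(duplicated)
    n = sum(len(genes) for genes, _ in singular)
    # Doubling S doubles every adjacency and every telomere.
    adj_2s, tel_2s = adj_s + adj_s, tel_s + tel_s
    # Counter '&' is the multiset intersection (minimum of the counts).
    common_adj = sum((adj_2s & adj_d).values())
    common_tel = sum((tel_2s & tel_d).values())
    return 2 * n - common_adj - common_tel / 2


S = [((1, 2), False)]
D = [((1, 2, 1, -2), False)]
distance = breakpoint_double_distance(S, D)
```

In this toy instance one adjacency and one telomere of \(\mathbb {D}\) are fulfilled, so `distance` is 2.5, while comparing \(\mathbb {S}\) with its own doubling gives 0.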
\(\sigma _\infty\) (DCJ) double distance
For the DCJ double distance, which is equivalent to the \(\sigma _\infty\) double distance, the solution space cannot be explored greedily. In fact, computing the DCJ double distance of genomes \(\mathbb {S}\) and \(\mathbb {D}\) was proven to be an NP-hard problem [2].
The complexity of \(\sigma _k\) double distances
The exploration of the complexity space between the greedy linear time \(\sigma _2\) (breakpoint) double distance and the NP-hard \(\sigma _\infty\) (DCJ) double distance is the main motivation of this study. In the remainder of this paper we show that both \(\sigma _4\) and \(\sigma _6\) double distances can be solved in linear time.
Equivalence of \(\sigma _k\) double distance and \(\sigma _k\) disambiguation
A nice way of representing the solution space of the \(\sigma _k\) double distance is by using a modified version of the breakpoint graph [2].
Ambiguous breakpoint graph
Given a singular genome \(\mathbb {S}\) and a duplicated genome \(\mathbb {D}\), their ambiguous breakpoint graph \(ABG(\mathbb {S}, \check{\mathbb {D}}) = (V,E)\) is a multigraph representing the adjacencies of any element in \(\mathfrak {S}^\texttt{a}_\texttt{b}(\texttt{2}\mathbb {S})\) and a genome \(\check{\mathbb {D}} \in \mathfrak {S}^\texttt{a}_\texttt{b}(\mathbb {D})\). The vertex set V comprises, for each family \(\texttt{X}\) in \(\mathcal {F}(\mathbb {S})\), the two pairs of paralogous vertices \(\texttt{X}_\texttt{a}^{h}\), \(\texttt{X}_\texttt{b}^{h}\) and \(\texttt{X}_\texttt{a}^{t}\), \(\texttt{X}_\texttt{b}^{t}\). We can use the notation \(\hat{u}\) to refer to the paralogous counterpart of a vertex u. For example, if \(u=\texttt{X}_\texttt{a}^{h}\), then \(\hat{u}=\texttt{X}_\texttt{b}^{h}\).
The edge set E represents the adjacencies. For each adjacency in \(\check{\mathbb {D}}\) there exists one \(\check{\mathbb {D}}\)-edge in E linking its two extremities. The \(\mathbb {S}\)-edges represent all adjacencies occurring in all genomes from \(\mathfrak {S}^\texttt{a}_\texttt{b}(\texttt{2}\mathbb {S})\): for each adjacency \(\upgamma \upbeta\) of \(\mathbb {S}\), we have the pair of paralogous edges \(\mathcal {E}(\upgamma \upbeta )=\{\upgamma _\texttt{a}\upbeta _\texttt{a}, \upgamma _\texttt{b}\upbeta _\texttt{b}\}\) and the complementary pair of paralogous edges \(\widetilde{\mathcal {E}}(\upgamma \upbeta )=\{\upgamma _\texttt{a}\upbeta _\texttt{b}, \upgamma _\texttt{b}\upbeta _\texttt{a}\}\). Note that \({\mathop {\mathcal {E}}\limits ^{\approx }}(\upgamma \upbeta )=\mathcal {E}(\upgamma \upbeta )\). The square of \(\upgamma \upbeta\) is then \(\mathcal {Q}(\upgamma \upbeta )=\mathcal {E}(\upgamma \upbeta )\cup \widetilde{\mathcal {E}}(\upgamma \upbeta )\). The \(\mathbb {S}\)-edges in the ambiguous breakpoint graph are therefore the squares of all adjacencies in \(\mathbb {S}\). Let \(a_*\) be the number of squares in \(ABG(\mathbb {S}, \check{\mathbb {D}})\). Obviously we have \(a_*=|\mathcal {A}(\mathbb {S})|=n_*-\kappa (\mathbb {S})\), where \(\kappa (\mathbb {S})\) is the number of linear chromosomes in \(\mathbb {S}\). Again, we can use the notation \(\hat{e}\) to refer to the paralogous counterpart of an \(\mathbb {S}\)-edge e. For example, if \(e=\upgamma _\texttt{a}\upbeta _\texttt{a}\), then \(\hat{e}=\upgamma _\texttt{b}\upbeta _\texttt{b}\). An example of an ambiguous breakpoint graph is shown in Fig. 3 (i).
Each linear chromosome in \(\mathbb {S}\) corresponds to four telomeres, called \(\mathbb {S}\)-telomeres, in any element of \({\texttt {2}}\mathbb {S}\). These four vertices are not part of any square. In other words, the number of \(\mathbb {S}\)-telomeres in \(ABG(\mathbb {S}, \check{\mathbb {D}})\) is \(4\kappa (\mathbb {S})\). If \(\kappa (\mathbb {D})\) is the number of linear chromosomes in \(\mathbb {D}\), the number of telomeres in \(\check{\mathbb {D}}\), also called \(\check{\mathbb {D}}\)-telomeres, is \(2\kappa (\mathbb {D})\).
The class of \(\sigma _k\) disambiguations
Resolving a square \(\mathcal {Q}(\cdot )=\mathcal {E}(\cdot )\cup \widetilde{\mathcal {E}}(\cdot )\) corresponds to choosing in the ambiguous breakpoint graph either the edges from \(\mathcal {E}(\cdot )\) or the edges from \(\widetilde{\mathcal {E}}(\cdot )\), while the complementary pair is masked. Resolving all squares is called disambiguating the ambiguous breakpoint graph. If we number the squares of \(ABG(\mathbb {S},\check{\mathbb {D}})\) from 1 to \(a_*\), a solution can be represented by a tuple \(\tau =(\mathcal {L}_1,\mathcal {L}_2, \ldots , \mathcal {L}_{a_*})\), where each \(\mathcal {L}_i\) contains the pair of paralogous edges (either \(\mathcal {E}_i\) or \(\widetilde{\mathcal {E}}_i\)) that are chosen (kept) in the graph for square \(\mathcal {Q}_i\). The graph induced by \(\tau\) is a simple breakpoint graph, which we denote by \(BG(\tau ,\check{\mathbb {D}})\). Figure 3 (ii) shows an example.
Given a solution \(\tau\), let \(c_i\) and \(p_j\) be, respectively, the number of cycles of length i and of paths of length j in \(BG(\tau ,\check{\mathbb {D}})\). The k-score of \(\tau\) is then the sum \(\sigma _k=c_2+c_4+\ldots +c_k+\frac{p_0+p_2+\ldots +p_{k-2}}{2}\). The minimization problem of computing the \(\sigma _k\) double distance of \(\mathbb {S}\) and \(\mathbb {D}\) is equivalent to finding a solution \(\tau\) so that the k-score of \(\tau\) is maximized [2]. We call the latter (maximization) problem \(\sigma _k\) disambiguation. As already mentioned, for \(\sigma _2\) the double distance can be solved in linear time and for \(\sigma _\infty\) the double distance is NP-hard. Therefore the same is true, respectively, for the \(\sigma _2\) and the \(\sigma _\infty\) disambiguations. Conversely, if we determine the complexity of solving the \(\sigma _k\) disambiguation for any \(k \ge 4\), this will automatically determine the complexity of solving the \(\sigma _k\) double distance.
An optimal solution for the \(\sigma _k\) disambiguation of \(ABG(\mathbb {S},\check{\mathbb {D}})\) gives its k-score, denoted by \(\sigma _k(ABG(\mathbb {S},\check{\mathbb {D}}))\). Note that, since an optimal \(\sigma _k\) disambiguation is also a \(\sigma _{k+2}\) disambiguation, although possibly not optimal, the k-score of \(ABG(\mathbb {S},\check{\mathbb {D}})\) cannot decrease as k increases.
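To make the notion of disambiguation concrete, the following sketch enumerates all \(2^{a_*}\) ways of resolving the squares and returns the best score. It is an exponential brute force meant only for tiny instances, not one of the linear time algorithms of this paper. Vertex labels such as `"1ah"` (family, index, extremity), the input encoding and the function names are assumptions made for the illustration.

```python
from collections import Counter
from itertools import product


def k_score(vertices, edges, k):
    """Score of a breakpoint graph: cycles of length <= k, plus half the
    number of even paths of length <= k - 2."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)
    n_vertices = Counter(find(v) for v in vertices)
    n_edges = Counter(find(u) for u, _ in edges)
    score = 0.0
    for root, nv in n_vertices.items():
        ne = n_edges[root]
        if ne == nv:  # cycle of length ne
            score += 1 if ne <= k else 0
        elif ne % 2 == 0 and ne <= k - 2:  # even path of length ne
            score += 0.5
    return score


def best_k_score(families, s_adjacencies, d_edges, k):
    """Try both resolutions of every square and keep the best score."""
    vertices = [f"{x}{i}{e}" for x in families for i in "ab" for e in "th"]
    squares = []
    for (x, ex), (y, ey) in s_adjacencies:
        xa, xb, ya, yb = f"{x}a{ex}", f"{x}b{ex}", f"{y}a{ey}", f"{y}b{ey}"
        # (paralogous pair, complementary paralogous pair)
        squares.append(([(xa, ya), (xb, yb)], [(xa, yb), (xb, ya)]))
    best = 0.0
    for choice in product((0, 1), repeat=len(squares)):
        s_edges = [e for sq, pick in zip(squares, choice) for e in sq[pick]]
        best = max(best, k_score(vertices, s_edges + d_edges, k))
    return best


# S = {[1 2]} has one adjacency 1h2t; D singularized as [1a 2a 1b -2b].
s_adjacencies = [((1, "h"), (2, "t"))]
d_edges = [("1ah", "2at"), ("2ah", "1bt"), ("1bh", "2bh")]
scores = {k: best_k_score([1, 2], s_adjacencies, d_edges, k) for k in (2, 4, 6)}
```

In this instance the best resolution keeps the 2-cycle, giving a 2-score of 1.5 and a 4-score of 2, in line with the observation that the k-score cannot decrease as k increases.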
Approach for solving the \(\sigma _k\) disambiguation
A player of the \(\sigma _k\) disambiguation is either a valid cycle whose length is at most k or a valid even path whose length is at most \(k-2\). In order to solve the \(\sigma _k\) disambiguation, a natural approach is to visit \(ABG(\mathbb {S},\check{\mathbb {D}})\) and search for players. For describing how the graph can be screened, we need to introduce the following concepts. Two \(\mathbb {S}\)-edges in \(ABG(\mathbb {S},\check{\mathbb {D}})\) are incompatible when they belong to the same square and are not paralogous. A component in \(ABG(\mathbb {S},\check{\mathbb {D}})\) is valid when it does not contain any pair of incompatible edges. Note that a valid component necessarily alternates \(\mathbb {S}\)-edges and \(\check{\mathbb {D}}\)-edges. Two valid components \(C\ne C'\) in \(ABG(\mathbb {S},\check{\mathbb {D}})\) are either intersecting, when they share at least one vertex, or disjoint. It is obvious that any solution \(\tau\) of \(ABG(\mathbb {S},\check{\mathbb {D}})\) is composed of disjoint valid components.
Given a solution \(\tau =(\mathcal {L}_1,\mathcal {L}_2, \ldots , \mathcal {L}_i \ldots , \mathcal {L}_{a_*})\), the switching operation of the ith element of \(\tau\) is denoted by \(\widetilde{s}(\tau ,i)\) and replaces value \(\mathcal {L}_i\) by \(\widetilde{\mathcal {L}_i}\) resulting in \(\tau '=(\mathcal {L}_1,\mathcal {L}_2, \ldots , \widetilde{\mathcal {L}_i} \ldots , \mathcal {L}_{a_*})\). A choice of paralogous edges resolving a given square \(\mathcal {Q}_i\) can be fixed for any solution, meaning that \(\mathcal {Q}_i\) can no longer be switched. In this case, \(\mathcal {Q}_i\) is itself said to be fixed.
First steps to solve the \(\sigma _k\) disambiguation
In this section we describe a greedy linear time algorithm for the \(\sigma _4\) disambiguation and give some general results related to any \(\sigma _k\) disambiguation.
Common adjacencies and telomeres are conserved
Let \(\tau\) be an optimal solution for \(\sigma _k\) disambiguation of \(ABG(\mathbb {S},\check{\mathbb {D}})\). If a player \(C\in BG(\tau ,\check{\mathbb {D}})\) is disjoint from any player distinct from C in any other optimal solution, then C must be part of all optimal solutions and is itself said to be optimal.
Lemma 1
For any \(\sigma _k\) disambiguation, all existing 0-paths and 2-cycles in \(ABG(\mathbb {S},\mathbb {D})\) are optimal.
Proof
While any 0-path is an isolated vertex and obviously optimal, the optimality of every 2-cycle is less obvious but still holds, as illustrated in Fig. 4. \(\square\)
This lemma is a generalization of the (breakpoint) \(\sigma _2\) disambiguation and guarantees that all common adjacencies and telomeres are conserved in any \(\sigma _k\) double distance, including the NP-hard (DCJ) \(\sigma _\infty\) case. All 0-paths are isolated vertices that are not part of any square, therefore they are selected independently of the choices for resolving the squares. A 2-cycle, in turn, always includes one \(\mathbb {S}\)-edge from some square (such as square 1 in Fig. 3). From now on we assume that squares that have at least one \(\mathbb {S}\)-edge in a 2-cycle are fixed so that all existing 2-cycles are induced.
Symmetric squares can be fixed arbitrarily
A square in \(ABG(\mathbb {S},\check{\mathbb {D}})\) is called symmetric if it either (i) has a \(\check{\mathbb {D}}\)-edge connecting a pair of paralogous vertices, or (ii) has \(\check{\mathbb {D}}\)-telomeres in one pair of paralogous vertices, or (iii) has \(\check{\mathbb {D}}\)-edges directly connected to \(\mathbb {S}\)-telomeres incident to one pair of paralogous vertices, as illustrated in Fig. 5. Note that, for any \(\sigma _k\) disambiguation, the two ways of resolving each of these squares would lead to solutions with the same score, therefore each of them can be fixed arbitrarily. From now on we assume that \(ABG(\mathbb {S},\check{\mathbb {D}})\) has no symmetric squares.
A linear time greedy algorithm for the \(\sigma _4\) disambiguation
Unlike 2-cycles, two valid 4-cycles can intersect with each other. But, since our graph is free of symmetric squares, two valid 2-paths cannot intersect with each other. Moreover, since a 2-path has no \(\mathbb {\check{D}}\)-edge connecting squares, a 4-cycle and a 2-path cannot intersect with each other. In this setting, it is clear that, for the \(\sigma _4\) disambiguation, any valid 2-path is always optimal. Furthermore, a 4-cycle that does not intersect with another one is always optimal and two intersecting 4-cycles are always part of two co-optimal solutions:
Lemma 2
Any valid 4-cycle that is disjoint from a 2-cycle in \(ABG(\mathbb {S},\mathbb {D})\) is induced by an optimal solution of \(\sigma _4\) disambiguation.
Proof
All possible patterns are represented in Fig. 6: A valid 4-cycle C (in the center) connecting two squares and the three distinct possibilities of linking the four open ends. In all cases the valid 4-cycle C is either optimal or co-optimal. \(\square\)
An optimal solution of \(\sigma _4\) disambiguation can then be obtained greedily: after fixing squares containing edges that are part of 2-cycles, traverse the remainder of the graph and, for each valid 2-path or 4-cycle C that is found, fix the square(s) containing \(\mathbb {S}\)-edges that are part of C, so that C is induced. When this part is accomplished the remaining squares can be fixed arbitrarily.
Pruning \(ABG(\mathbb {S},\check{\mathbb {D}})\) for the \(\sigma _6\) disambiguation
A player in the \(\sigma _6\) disambiguation can be either a \(\{2,\!4\}\)-path, that is, a valid 2- or 4-path, or a \(\{4,\!6\}\)-cycle, that is, a valid 4- or 6-cycle. It is easy to see that players can intersect with each other. Moreover, for the \(\sigma _6\) disambiguation, not every player is induced by at least one optimal solution. For that reason, a greedy algorithm does not work here and a more elaborate procedure is required. The first step is a linear time preprocessing in which we first remove from \(ABG(\mathbb {S},\check{\mathbb {D}})\) all edges that are incompatible with the existing 2-cycles, and then all remaining edges that cannot be part of a player. This results in a \(\{6\}\)-pruned ambiguous breakpoint graph \(PG(\mathbb {S},\check{\mathbb {D}})\).
The first part of this preprocessing is easily achieved by a simple graph traversal in which, for each \(\check{\mathbb {D}}\)-edge uv, it is tested whether both ends connect to the same \(\mathbb {S}\)-edge uv. If this is the case, the two incident \(\mathbb {S}\)-edges \(u\hat{v}\) and \(v\hat{u}\) are removed from the graph, separating the 2-cycle (uv). Then, in the second part, for any remaining edge e, its 6-neighborhood (which has constant size in a graph of degree at most three) is exhaustively explored for the existence of a player involving e. If no such player is found, e is deleted. Each of these two parts clearly takes linear time \(O(|ABG(\mathbb {S},\check{\mathbb {D}})|)\), and what remains is exactly the desired graph \(PG(\mathbb {S},\check{\mathbb {D}})\).
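The separation of 2-cycles can be sketched as follows. This is only a minimal illustration under assumptions that are not part of the original text: edges are encoded as `frozenset` pairs of vertex labels, and the hypothetical dictionary `paralog` maps each vertex to its paralogous counterpart.

```python
def separate_two_cycles(d_edges, s_edges, paralog):
    """Return the S-edges that survive once every 2-cycle is separated.

    d_edges, s_edges: sets of frozenset({u, v}) (hypothetical encoding).
    paralog: dict mapping each vertex to its paralogous vertex.
    """
    removed = set()
    for edge in d_edges:
        u, v = tuple(edge)
        if edge in s_edges:  # D-edge uv and S-edge uv together form a 2-cycle
            # Drop the two S-edges of the square that are incompatible with it.
            removed.add(frozenset((u, paralog[v])))
            removed.add(frozenset((v, paralog[u])))
    return s_edges - removed


# One square over the paralogous pairs (a1, a2) and (b1, b2).
paralog = {"a1": "a2", "a2": "a1", "b1": "b2", "b2": "b1"}
s_edges = {
    frozenset(pair)
    for pair in [("a1", "b1"), ("a2", "b2"), ("a1", "b2"), ("a2", "b1")]
}
d_edges = {frozenset(("a1", "b1"))}
kept = separate_two_cycles(d_edges, s_edges, paralog)
```

In this toy square the 2-cycle on a1 and b1 is kept together with the paralogous edge on a2 and b2, while the other pair of paralogous edges is removed. The sketch performs one pass over the \(\check{\mathbb {D}}\)-edges with constant work per edge, in line with the linear time bound stated above.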
The edges that are not pruned and are therefore present in \(PG(\mathbb {S},\check{\mathbb {D}})\) are said to be preserved. As shown in Fig. 7, for any given square the pruned graph might preserve either (a1–a2) all edges, or (b1–b4) only three edges, or (c1–c3) only two edges, each one from a distinct pair of paralogous edges, or (d1–d3) only two edges from the same pair of paralogous edges, or (e1–e2) a single edge. While the squares are still ambiguous in cases (a1–a2), (b1–b4) and (c1–c3), in cases (d1–d3) and (e1–e2) they are already resolved and can be fixed according to the preserved paralogous edges. Additionally, if none of its edges is part of a player, a square is completely pruned out and is arbitrarily fixed in \(ABG(\mathbb {S},\check{\mathbb {D}})\).
The smaller pruned graph \(PG(\mathbb {S},\check{\mathbb {D}})\) has all relevant parts required for finding an optimal solution of \(\sigma _6\) disambiguation, therefore the 6-scores of both graphs are the same: \(\sigma _6(ABG(\mathbb {S},\check{\mathbb {D}}))=\sigma _6(PG(\mathbb {S},\check{\mathbb {D}}))\). A clear advantage here is that the pruned graph might be split into smaller connected components, and it is obvious that the disambiguation problem can be solved independently for each one of them. Any square that is still ambiguous in \(PG(\mathbb {S},\check{\mathbb {D}})\) is called a \(\{6\}\)-square. Each connected component G of \(PG(\mathbb {S},\check{\mathbb {D}})\) is of one of two types:
1. Ambiguous: G includes at least one \(\{6\}\)-square;
2. Resolved (trivial): G is either a simple valid 0-, 2- or 4-path or a simple valid 2-, 4- or 6-cycle.
Let \(\mathcal {C}\) and \(\mathcal {P}\) be the sets of resolved components, so that \(\mathcal {C}\) has all resolved cycles and \(\mathcal {P}\) has all resolved paths. Furthermore, let \(\mathcal {M}\) be the set of ambiguous components of \(PG(\mathbb {S},\check{\mathbb {D}})\). If we denote by \(\sigma _6(M)\) the 6-score of an ambiguous component \(M \in \mathcal {M}\), the 6-score of \(PG(\mathbb {S},\check{\mathbb {D}})\) can be computed with the formula:
\[\sigma _6(PG(\mathbb {S},\check{\mathbb {D}})) = |\mathcal {C}| + \frac{|\mathcal {P}|}{2} + \sum _{M\in \mathcal {M}} \sigma _6(M).\]
Solving the \(\sigma _6\) disambiguation corresponds then to finding, for each ambiguous component \(M\in \mathcal {M}\), an optimal solution including only the \(\{6\}\)-squares of M. From now on, by \(\mathbb {S}\)-edge, \(\mathbb {S}\)-telomere, \(\check{\mathbb {D}}\)-edge and \(\check{\mathbb {D}}\)-telomere, we are referring only to the elements that are preserved in \(PG(\mathbb {S},\check{\mathbb {D}})\).
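As a small worked illustration of how the score of the pruned graph decomposes, the sketch below adds the contributions of resolved cycles, resolved paths and ambiguous components. It assumes, in agreement with the definition of the \(\sigma _k\) score, that each resolved cycle contributes 1 and each resolved path contributes \(\frac{1}{2}\); the function name and its arguments are hypothetical.

```python
from fractions import Fraction


def sigma6_of_pruned_graph(resolved_cycles, resolved_paths, ambiguous_scores):
    """Sum the 6-score contributions of all components of the pruned graph.

    resolved_cycles: number of resolved cycle components (weight 1 each).
    resolved_paths: number of resolved path components (weight 1/2 each).
    ambiguous_scores: 6-scores of the ambiguous components, computed elsewhere.
    """
    total = Fraction(resolved_cycles) + Fraction(resolved_paths, 2)
    for score in ambiguous_scores:
        total += Fraction(score)
    return total


# Three resolved cycles, two resolved paths and two ambiguous components.
example = sigma6_of_pruned_graph(3, 2, [Fraction(5, 2), 1])
```

Exact fractions are used so that the half-integer contributions of paths do not suffer from floating point rounding.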
Intersection between players of the \({\varvec{\sigma }}_6\) disambiguation
Let a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path be a subpath of three edges, starting and ending with a \(\check{\mathbb {D}}\)edge. This is the largest segment that can be shared by two players: although there is no room to allow distinct \(\{2,4\}\)paths and/or valid 4cycles to share a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path in a graph free of symmetric squares, a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path can be shared by at most two valid 6cycles. Furthermore, if distinct \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)paths intersect at the same \(\check{\mathbb {D}}\)edge e and each of them occurs in two distinct 6cycles, then the \(\check{\mathbb {D}}\)edge e occurs in four distinct valid 6cycles.
In Fig. 8 we characterize this exceptional situation, which consists of the occurrence of a triplet, defined to be an ambiguous component composed of exactly three connected ambiguous squares in which at most two vertices, necessarily in distinct squares, are pruned out. In a saturated triplet, the squares in each pair are connected to each other by two \(\check{\mathbb {D}}\)edges connecting paralogous vertices in both squares; if a single \(\check{\mathbb {D}}\)edge is missing, that is, the corresponding vertices have outer connections, we have an unsaturated triplet. This structure and its score can be easily identified, therefore we will assume that our graph is free from triplets. With this condition, \(\check{\mathbb {D}}\)edges can be shared by at most two players:
Proposition 1
Any \(\check{\mathbb {D}}\)-edge is part of either one or two (intersecting) players in a graph free of symmetric squares and triplets.
Proof
Recall that a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path is a subpath of three edges, starting and ending with a \(\check{\mathbb {D}}\)edge. It is easy to see that, without symmetric squares, there is no “room” to allow distinct 4paths and/or 4cycles to share a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path. In contrast, at most two valid 6cycles can share a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path as illustrated in Fig. 8. And if the \(\mathbb {S}\)edge in the middle of the shared \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path is in an ambiguous square, we have the exceptional case of a triplet, where a \(\check{\mathbb {D}}\)edge occurs in more than two players. This case can be treated separately in a preprocessing step, so that we can assume that our graph is free of triplets.
Let an \(\mathbb {S}\check{\mathbb {D}}\mathbb {S}\)path be a subpath of three edges, starting and ending with an \(\mathbb {S}\)edge. Obviously there is no “room” to allow two players to share an \(\mathbb {S}\check{\mathbb {D}}\mathbb {S}\)path: (i) there are two ways of adding a \(\check{\mathbb {D}}\)edge to a \(\mathbb {S}\check{\mathbb {D}}\mathbb {S}\)path for obtaining a valid 4path but they are incompatible therefore at most one can exist; or (ii) the two ends of the \(\mathbb {S}\check{\mathbb {D}}\mathbb {S}\)path must incide in the same \(\check{\mathbb {D}}\)edge, giving a single way of obtaining a 4cycle; or (iii) any valid 6cycle including the given \(\mathbb {S}\check{\mathbb {D}}\mathbb {S}\)path needs to have both extra \(\check{\mathbb {D}}\)edges inciding at both ends, then there can be only one way of filling the “gap” with a last \(\mathbb {S}\)edge.
Now let an open 2path be an \(\mathbb {S}\)edge adjacent to a \(\check{\mathbb {D}}\)edge such that at most one of the two includes a telomere. Considering the case of paths, in the absence of symmetric squares there is no possibility of having two 4paths sharing an open 2path. And considering the case of cycles, it is obvious that two \(\{4,6\}\)cycles sharing the same open 2path must share the same \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)path, which falls in the same particular case of a triplet mentioned before.
Finally, it is easy to see that a \(\check{\mathbb {D}}\)edge can occur in more than one player (general cases for cycles are illustrated in Fig. 9). However, it can only occur in more than two players if it is part of distinct \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)paths such that each of them occurs in distinct players. By construction we can see that this can only happen in a triplet (Fig. 8) or if the graph has symmetric squares. It follows that, without symmetric squares and triplets, each \(\check{\mathbb {D}}\)edge occurs in at most two distinct players. \(\square\)
Proposition 2
Any \(\mathbb {S}\)-edge of a \(\{6\}\)-square is part of exactly one player in a graph free of symmetric squares and triplets.
Proof
If an \(\mathbb {S}\)-edge e is in a \(\{6\}\)-square \(\mathcal {Q}\), it “shares” either the same \(\check{\mathbb {D}}\)-edge or the same \(\check{\mathbb {D}}\)-telomere d with another \(\mathbb {S}\)-edge \(e'\) from the same square \(\mathcal {Q}\). In this case the \(\check{\mathbb {D}}\)-edge/telomere d is part of exactly two players and each of the \(\mathbb {S}\)-edges e and \(e'\) must be part of exactly one player. \(\square\)
In the next sections we present the most relevant contribution of this work: an algorithm to solve the \(\sigma _6\) disambiguation in linear time.
Solving the \({\varvec{\sigma }}_6\) disambiguation for circular genomes
For the case of circular genomes, which are those exclusively including circular chromosomes, the ambiguous breakpoint graph has no telomeres, therefore all players are cycles. In this case, we call each ambiguous component a cycle-bubble.
Two \(\{6\}\)-squares \(\mathcal {Q}\) and \(\mathcal {Q}'\) are neighbors when a vertex of \(\mathcal {Q}\) is connected to a vertex of \(\mathcal {Q}'\) by a \(\check{\mathbb {D}}\)-edge. Any \(\mathbb {S}\)-edge e of a \(\{6\}\)-square \(\mathcal {Q}\) in a cycle-bubble M is part of exactly one \(\{4,\!6\}\)-cycle (Proposition 2) and both \(\check{\mathbb {D}}\)-edges incident to the endpoints of e would clearly induce the same \(\{4,\!6\}\)-cycle. For that reason, the choice of e (and its paralogous edge \(\hat{e}\)) implies a unique way of resolving all neighbors of \(\mathcal {Q}\), and, by propagating this to the neighbors of the neighbors and so on, all squares of M are resolved, resulting in what we call the straight solution \(\tau _M\) (see Algorithms 1 and 2). Then we can immediately obtain the complementary alternative solution \(\widetilde{\tau }_M\), by switching all ambiguous squares of \(\tau _M\). A cycle-bubble is said to be unbalanced if \(\sigma_6(\tau _M) \ne \sigma_6(\widetilde{\tau }_M)\) or balanced if \(\sigma_6(\tau _M) = \sigma_6(\widetilde{\tau }_M)\). If M is unbalanced, its score is given either by \(\tau _M\) or by \(\widetilde{\tau }_M\) (the maximum among the two). If M is balanced, its score is given by both \(\tau _M\) and \(\widetilde{\tau }_M\) (co-optimality). Examples are given in Fig. 10.
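The final choice between the straight solution and its alternative can be sketched as below. This is only an illustration of the balanced/unbalanced distinction: the two scores are assumed to have been computed beforehand (the propagation itself, given by Algorithms 1 and 2, is not reproduced here), and the function name is hypothetical.

```python
def classify_cycle_bubble(score_straight, score_alternative):
    """Classify a bubble and report which solution gives its score.

    Returns a triple (kind, choice, score), where choice is "straight",
    "alternative", or "either" when both solutions are co-optimal.
    """
    if score_straight == score_alternative:
        return ("balanced", "either", score_straight)
    if score_straight > score_alternative:
        return ("unbalanced", "straight", score_straight)
    return ("unbalanced", "alternative", score_alternative)


first = classify_cycle_bubble(3, 2)
second = classify_cycle_bubble(2, 2)
```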
Solving the \({\varvec{\sigma }}_6\) disambiguation with linear chromosomes
For genomes with linear chromosomes, the ambiguous components might include paths besides cycle-bubbles. In the presence of paths, the straight algorithm unfortunately does not work (see Fig. 11). We must then proceed with an additional characterization of each ambiguous component M of \(PG(\mathbb {S},\check{\mathbb {D}})\), splitting the disambiguation of M into smaller subproblems.
As we will present in the following, the solution for arbitrarily large components can be split into two types of problems, which are analogous to solving the maximum independent set of auxiliary subgraphs that are either simple paths or double paths. In both cases, the solutions can be obtained in linear time.
Intersection graph of an ambiguous component
The auxiliary intersection graph \(\mathcal {I}(M)\) of an ambiguous component M has a vertex with weight \(\frac{1}{2}\) for each \(\{2,\!4\}\)-path and a vertex with weight 1 for each \(\{4,\!6\}\)-cycle of M. Furthermore, if two distinct players intersect, we have an edge between the respective vertices. The intersection graphs of all ambiguous components can be built during the pruning procedure without increasing its linear time complexity.
Note that an independent set of maximum weight in \(\mathcal {I}(M)\) corresponds to an optimal solution of M. Although in general this problem is NP-hard, in our case the underlying ambiguous component M imposes a regular structure to its intersection graph, allowing us to find such an independent set in linear time.
If two \(\{2,\!4\}\)-paths intersect in their \(\mathbb {S}\)-telomere, this intersection must include the incident \(\check{\mathbb {D}}\)-edge. Therefore, when we say that an intersection occurs at an \(\mathbb {S}\)-telomere, this automatically means that the intersection is the \(\check{\mathbb {D}}\)-edge incident to an \(\mathbb {S}\)-telomere. A valid 4-cycle has two \(\check{\mathbb {D}}\)-edges and a valid 6-cycle has three \(\check{\mathbb {D}}\)-edges. Besides the one at the \(\mathbb {S}\)-telomere, a valid 4-path has one \(\check{\mathbb {D}}\)-edge while a valid 2-path has none; therefore the latter cannot intersect with a \(\{4,\!6\}\)-cycle. When we say that 4-paths and/or \(\{4,\!6\}\)-cycles intersect with each other in a \(\check{\mathbb {D}}\)-edge, we refer to an inner \(\check{\mathbb {D}}\)-edge and not one incident to an \(\mathbb {S}\)-telomere.
Since the contribution of each cycle to the score is twice that of a path, we make a distinction between two types of subgraphs of an intersection graph \(\mathcal {I}(M)\), which can correspond to cycle-bubbles or path-flows.
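To make the correspondence with weighted independent sets concrete, the following sketch solves a tiny intersection graph by exhaustive search. It is deliberately not the linear time procedure of this paper: it is an exponential brute force meant only for toy inputs, and the vertex names, weights and edges of the example are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def max_weight_independent_set(weights, edges):
    """Exhaustively search for an independent set of maximum total weight.

    weights: dict mapping each player to 1 (cycle) or 1/2 (path).
    edges: pairs of intersecting players.
    """
    vertices = list(weights)
    best_set, best_weight = frozenset(), Fraction(0)
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if any(u in chosen and v in chosen for u, v in edges):
                continue  # two intersecting players cannot both be induced
            weight = sum((weights[x] for x in subset), Fraction(0))
            if weight > best_weight:
                best_set, best_weight = frozenset(subset), weight
    return best_set, best_weight


# Toy graph: cycle C intersects paths P1 and P2, and P2 intersects P3.
half = Fraction(1, 2)
weights = {"C": Fraction(1), "P1": half, "P2": half, "P3": half}
edges = [("C", "P1"), ("C", "P2"), ("P2", "P3")]
best_set, best_weight = max_weight_independent_set(weights, edges)
```

In this toy graph the best selection keeps the cycle together with the path that does not intersect it, for a total weight of \(\frac{3}{2}\).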
Path-flows in the intersection graph
A path-flow in \(\mathcal {I}(M)\) is a maximal connected subgraph whose vertices correspond to \(\{2,\!4\}\)-paths. A path-line of length \(\ell\) in a path-flow is a series of \(\ell\) paths, such that each pair of consecutive paths intersect at a telomere. Assume that the vertices in a path-line are numbered from left to right with integers \(1,2,\ldots ,\ell\). A double-line consists of two parallel path-lines of the same length \(\ell\), such that vertices with the same number in both lines intersect in a \(\check{\mathbb {D}}\)-edge and are therefore connected by an edge. A 2-path has no free \(\check{\mathbb {D}}\)-edge, therefore a double-line is exclusively composed of 4-paths. If a path-line is part of a double-line, it is saturated, otherwise it is unsaturated. Since each 4-path of a double-line has a \(\check{\mathbb {D}}\)-edge intersection with another and each 4-path can have only one \(\check{\mathbb {D}}\)-edge intersection, no vertex of a double-line can be connected to a cycle in \(\mathcal {I}(M)\). Examples of an unsaturated path-line and a double-line are given in Fig. 12.
Let us assume that a double-line is always represented with one upper path-line and one lower path-line. A double-line of length \(\ell\) has \(2\ell\) vertices and exactly two independent sets of maximum weight, each one with \(\ell\) vertices and weight \(\frac{\ell }{2}\): one includes the paths with odd numbers in the upper line and the paths with even numbers in the lower line, while the other includes the paths with even numbers in the upper line and the paths with odd numbers in the lower line. Since a double-line cannot intersect with cycles, it is clear that at least one of these independent sets will be part of a global optimal solution for \(\mathcal {I}(M)\). In other words, not only the two possible local optimal solutions and their (common) weight are known, but it is guaranteed that at least one of them will be part of a global optimal solution. A maximal double-line can be of three different types:
1. Isolated: corresponds to the whole graph \(\mathcal {I}(M)\). Here the double-line can be cyclic. If \(\ell\) is even, in both upper and lower lines of a cyclic double-line, the last vertex intersects at a telomere with the first vertex. If \(\ell\) is odd, this connection of a cyclic double-line is “twisted”: the last vertex of the upper line intersects at a telomere with the first vertex of the lower line, and the first vertex of the upper line intersects at a telomere with the last vertex of the lower line. Being cyclic or not, any of the two optimal local solutions can be fixed.
2. Terminal: intersects with one unsaturated path-line, and, without loss of generality, the intersection involves the vertex v located at the rightmost end of the lower line. Here at least one of the two optimal local solutions would leave v unselected; we can safely fix this option. (See Fig. 13i and ii.)
3. Link: intersects with unsaturated lines at both ends. The intersections can be:
   (a) Single-sided: both occur at the ends of the same saturated line, or
   (b) Alternate: the left intersection occurs at the end of one saturated line and the right intersection occurs at the end of the other.
   Let \(v'\) be the outer vertex connected to a vertex v belonging to the link at the right and \(u'\) be the outer vertex connected to a vertex u belonging to the link at the left. Let a balanced link be alternate of odd length, or single-sided of even length. In contrast, an unbalanced link is alternate of even length, or single-sided of odd length. If the link is unbalanced, one of the two local optimal solutions leaves both u and v unselected; we can safely fix this option. (See Fig. 13iii and iv.) If the link is balanced, we cannot fix the solution beforehand, but we can reduce the problem, by removing the connections \(uu'\) and \(vv'\) and adding the connection \(u'v'\). Since both \(u'\) and \(v'\) must be the ends of unsaturated lines, this procedure simply concatenates these two lines into a single unsaturated path-line. (See Fig. 13v and vi.) Finding a maximum independent set of the remaining unsaturated path-lines is a trivial problem that will be solved last; depending on whether one of the vertices \(u'\) and \(v'\) is selected in the end, we can fix the solution of the original balanced link.
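The two local optima of a non-cyclic double-line and the balanced/unbalanced distinction for links can be checked on small instances with the sketch below. The encoding is an assumption made for illustration only: a vertex is a pair `(line, number)` with `line` equal to `"U"` (upper) or `"L"` (lower), and all function names are hypothetical.

```python
from itertools import combinations


def double_line_optima(length):
    """Return the two alternating selections of a double-line."""
    numbers = range(1, length + 1)
    first = {("U", i) if i % 2 == 1 else ("L", i) for i in numbers}
    second = {("L", i) if i % 2 == 1 else ("U", i) for i in numbers}
    return frozenset(first), frozenset(second)


def brute_force_optima(length):
    """Enumerate all maximum independent sets of a non-cyclic double-line."""
    vertices = [(line, i) for line in "UL" for i in range(1, length + 1)]

    def adjacent(a, b):
        same_number = a[1] == b[1]  # intersection in a D-edge
        consecutive = a[0] == b[0] and abs(a[1] - b[1]) == 1  # at a telomere
        return same_number or consecutive

    for size in range(len(vertices), 0, -1):
        found = {
            frozenset(subset)
            for subset in combinations(vertices, size)
            if not any(adjacent(a, b) for a, b in combinations(subset, 2))
        }
        if found:
            return found
    return set()


def is_balanced_link(length, single_sided):
    """Balanced: single-sided of even length, or alternate of odd length."""
    return length % 2 == 0 if single_sided else length % 2 == 1


def selection_avoiding_ends(length, left_line, right_line):
    """Return a local optimum leaving u and v unselected, or None."""
    u, v = (left_line, 1), (right_line, length)
    for selection in double_line_optima(length):
        if u not in selection and v not in selection:
            return selection
    return None
```

For example, `selection_avoiding_ends(3, "L", "L")` describes a single-sided link of odd length and returns the selection with odd numbers in the upper line, whereas `selection_avoiding_ends(2, "L", "L")` returns `None`, reflecting that a balanced link cannot be fixed beforehand.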
Intersection between path-flows and cycle-bubbles
If an ambiguous component has only cycles, its solution can be easily obtained with the straight algorithm presented in the previous section. The situation is more intricate when an ambiguous component M includes cycles and paths. In this case we redefine a cycle-bubble as corresponding to a maximal connected subgraph of \(\mathcal {I}(M)\) whose vertices correspond to \(\{4,\!6\}\)-cycles. Let H be the subgraph of M including all edges that compose the cycles of a cycle-bubble. An optimal solution for H is either the straight solution \(\tau _H\), given by Algorithm 1, or its alternative \(\widetilde{\tau }_H\). Recall that if both \(\tau _H\) and \(\widetilde{\tau }_H\) have the same score, then H is said to be balanced, otherwise it is said to be unbalanced.
Proposition 3
Let an ambiguous component M have cycle-bubbles \(H_1,\ldots ,H_q\). There is an optimal solution for M including, for each \(i=1,\ldots ,q\): (1) the optimal solution for \(H_i\), if \(H_i\) is unbalanced; or (2) either \(\tau _{H_i}\) or \(\widetilde{\tau }_{H_i}\), if \(H_i\) is balanced.
Proof
We will analyze the cases by increasing the size of the maximal subgraph containing intersecting cycles:
1. A \(\{4,6\}\)-cycle C that does not intersect with any other \(\{4,6\}\)-cycle: (a) if C is a 4-cycle, it can intersect with at most two valid 4-paths; therefore there is an optimal solution including C; (b) if C is a 6-cycle, it can intersect with at most three valid 4-paths, but if it intersects with three valid 4-paths there will be at least one valid 2-path P compatible with C; therefore there is an optimal solution including C and P (see Fig. 14ii).
2. Two \(\{4,6\}\)-cycles C and \(C'\) intersecting with each other but not with any other \(\{4,6\}\)-cycle: since valid 4-cycles have fewer edges available for intersection, let us assume without loss of generality that both C and \(C'\) are 6-cycles. Their intersection (illustrated in Additional file 1: Figs. A7 and A8) can be:
   (a) a \(\check{\mathbb {D}}\mathbb {S}\check{\mathbb {D}}\)-path, and in this case each cycle can intersect with at most one valid 4-path, therefore there is an optimal solution including either C or \(C'\);
   (b) a single \(\check{\mathbb {D}}\)-edge, and in this case each cycle can intersect with two valid 4-paths, therefore there is an optimal solution including either C or \(C'\).
As the size of the bubble grows, there is less space for intersecting paths, and each cycle intersects with at most one path. In general, the best we can get by replacing cycles by paths are co-optimal solutions. \(\square\)
As a consequence of Proposition 3, if a cycle-bubble is unbalanced, its optimal solution can be fixed so that the unsaturated path-lines around it can be treated separately. Similarly, if a balanced cycle-bubble H has a single intersection involving a cycle C and a path P (that can be the first vertex of an unsaturated path-line), then we can immediately fix the solution of H that does not contain C.
Balanced cycle-bubbles intersecting with at least two paths
If a cycle-bubble H is balanced and intersects with at least two paths, then it requires a special treatment. However, as we will see, here the only case that can be arbitrarily large is easy to handle. Let a cycle-bubble be a cycle-line when it consists of a series of valid 6-cycles, such that each pair of consecutive cycles intersect at a \(\check{\mathbb {D}}\)-edge (see Fig. 15).
Proposition 4
Any cycle-bubble involving 9 or more cycles must be a cycle-line.
Proof
In Fig. 16 (whose steps are detailed in Additional file 1: Figs. S1–S6) we show that, if a bubble is not a line, it reaches its “capacity” with at most 8 cycles. \(\square\)
Besides having its size limited to 8 cycles, the more complex a non-linear cycle-bubble becomes, the less space it has for paths around it. The solutions for these few exceptional bounded cases are described at the end of this section.
Our focus now is the remaining situation of a balanced cycle-line with intersections involving at least two cycles. Recall that cycles can only intersect with unsaturated path-lines. An intersection between a cycle and a path-line is a plug connection when it occurs between vertices that are at the ends of both lines.
Proposition 5
Cycle-lines of length at least 4 can only have plug connections.
Proof
If a cycle-line has length at least four, its underlying graph has only “room” for intersections with 4-paths next to its leftmost or rightmost cycles. See the illustration in Fig. 17.
For arbitrarily large instances, the last missing case is that of a balanced cycle-line with plug connections at both sides, called a balanced link. The procedure here is the same as that for double-lines that are balanced links, where the local solution can only be fixed after fixing those of the outer connections (see Fig. 17ii).
Exceptional bounded cases.
Balanced cycle-lines with two cycles can have connections to path-lines that are not plugs, but the number of cases is again limited. In most of them (shown in Additional file 1: Fig. S7) the bubble is saturated and the paths around cannot be connected to extendable path-lines. For these bubbles all paths are over the same squares of the cycles, therefore the straight algorithm would give the two overall alternatives including the paths around each of these bubbles, and the best solution can be immediately fixed.
In another case (shown in Additional file 1: Fig. S8i) there is one extendable path-line, but the local solution (including the bubble and the paths that are over the same squares) is unbalanced, therefore also here we can fix the best among the two overall alternatives given by the straight algorithm.
In the last two cases (shown in Additional file 1: Fig. S8ii, iii) there are extendable path-lines, and the local solutions (including the bubble and the paths that are over the same squares) are balanced. In the first case, there is only one extendable path-line and we can fix the solution that does not include the last “visible” path of the path-line. The second case is analogous to cycle-lines of type balanced link, with the difference that here the lines are already concatenated; the local solution can then only be fixed after fixing those of the outer connections.
Concerning non-linear cycle-bubbles, there are only four distinct cases that need to be considered: one case of a non-linear bubble with two 6-cycles (Additional file 1: Fig. S7iii) and three cases of non-linear bubbles with four 6-cycles (Additional file 1: Fig. S9). In all of these four cases, the bubble is saturated and the paths around cannot be connected to extendable path-lines. Indeed, also for these bubbles all paths are over the same squares of the cycles, therefore the straight algorithm would give the two overall alternatives including the paths around each of these bubbles, and the best among these solutions can be immediately fixed.
What remains is a set of independent unsaturated path-lines
For each remaining unsaturated path-line, an optimal solution can be trivially found as follows. First assume that in an unsaturated path-line of length \(\ell\) the paths are numbered from one end to the other with \(1,2,\ldots ,\ell\). The solution that selects all paths with odd numbers must be optimal, with some particularities if the path-line is cyclic: in this case the initial vertex for the sequential numbering is arbitrarily chosen; furthermore, if \(\ell\) is odd, the path whose number is \(\ell\) must be excluded from the solution. Then, depending on the connections between the selected vertices of the unsaturated path-line and vertices from balanced links that are double-lines or cycle-lines, the compatible solutions for the latter ones are also fixed. See examples in Fig. 18.
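The selection rule for an isolated unsaturated line can be sketched and compared against exhaustive search on small lengths. The function names are hypothetical, the cyclic case is assumed to have length at least 3, and the subsequent fixing of balanced links is not covered by this sketch.

```python
from itertools import combinations


def select_path_line(length, cyclic=False):
    """Select the odd-numbered paths, dropping the last one on odd cycles."""
    chosen = list(range(1, length + 1, 2))
    if cyclic and length % 2 == 1:
        chosen.remove(length)  # path number `length` intersects path 1
    return chosen


def brute_force_size(length, cyclic=False):
    """Size of a largest set of pairwise non-intersecting paths."""

    def adjacent(a, b):
        if abs(a - b) == 1:
            return True
        return cyclic and {a, b} == {1, length}

    for size in range(length, 0, -1):
        for subset in combinations(range(1, length + 1), size):
            if not any(adjacent(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0
```

Since every path contributes \(\frac{1}{2}\) to the score, the weight of the selected set is half the number of selected paths.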
Final remarks and discussion
Given a singular genome \(\mathbb {S}\) and a duplicated genome \(\mathbb {D}\) over the same set of gene families, the double distance of \(\mathbb {S}\) and \(\mathbb {D}\) aims to find the smallest distance between \(\mathbb {D}\) and any element from the set \({\texttt {2}}\mathbb {S}\), that contains all possible genome configurations obtained by doubling the chromosomes of \(\mathbb {S}\). Different underlying genomic distance measures give rise to different double distances: the breakpoint double distance of \(\mathbb {S}\) and \(\mathbb {D}\) is an easy problem that can be greedily solved in linear time, while computing the DCJ double distance of \(\mathbb {S}\) and \(\mathbb {D}\) is NP-hard. Our study is an exploration of the complexity space between these two extremes.
We considered a class of genomic distance measures called \(\sigma _k\) distances, for \(k=2,4,6,\ldots ,\infty\), which are between the breakpoint (\(\sigma _2\)) and the DCJ (\(\sigma _\infty\)) distance. In this work we presented linear time algorithms for computing the double distance under the \(\sigma _4\), and under the \(\sigma _6\) distance. Our solution relies on a variation of the breakpoint graph called ambiguous breakpoint graph.
The solutions we found so far are greedy with all players being optimal in \(\sigma _2\), greedy with all players being co-optimal in \(\sigma _4\) and non-greedy with non-optimal players in \(\sigma _6\), all of them running in linear time. More specifically for the \(\sigma _6\) case, after a preprocessing that fixes symmetric squares and triplets, at most two players share an edge. However, we can already observe that, as k grows, the number of players sharing the same edge also grows. For that reason, we believe that, if for some \(k\ge 8\) the complexity of the \(\sigma _k\) double distance is found to be NP-hard, the complexity is also NP-hard for any \(k'>k\). We expect that when we find the smallest k for which the \(\sigma _k\) double distance is NP-hard we will be able to confirm this conjecture. In any case, the natural next step in our research is to study the \(\sigma _8\) double distance.
Besides the double distance, other combinatorial problems related to genome evolution and ancestral reconstruction, including median and guided halving, have the distance problem as a basic unit. And, analogously to the double distance, these problems can be solved in polynomial time (but differently from the double distance, not greedy and linear) when they are built upon the breakpoint distance, while they are NP-hard when they are built upon the DCJ distance [2]. Therefore, a challenging avenue of research is doing the same exploration for both median and guided halving problems under the class of \(\sigma _k\) distances. In both cases it seems possible to adopt variations of the breakpoint graph. To the best of our knowledge, the guided halving problem has not yet been studied for any \(\sigma _k\) distance except \(k=2\) and \(k=\infty\), while for the median much effort has been devoted to the \(\sigma _4\) distance but no progress has been obtained so far. A reason for this difference of progress between double distance and median is probably related to the underlying approaches. While the double distance can be solved by removing paralogous edges from the ambiguous breakpoint graph, solving the median requires adding new edges (representing the adjacencies of the median genome) to an extended (multiple) breakpoint graph, and the combinatorial space of the distinct possibilities of doing that could not yet be described.
Availability of data and materials
Not applicable.
Notes
A broken adjacency has two open ends and a broken telomere has a single one.
References
Sankoff D. Edit distance for genome comparison based on non-local operations. In: Manber U, editor. Proceedings of CPM 1992, LNCS, vol. 644. Berlin: Springer; 1992. p. 121–35. https://doi.org/10.1007/3-540-56024-6_10.
Tannier E, Zheng C, Sankoff D. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009;10:120. https://doi.org/10.1186/1471-2105-10-120.
Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Hannenhalli S, editor. Proceedings of FOCS 1995. Milwaukee: IEEE; 1995. p. 581–92. https://doi.org/10.1109/SFCS.1995.492588.
Hannenhalli S, Pevzner PA. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999;46(1):1–27. https://doi.org/10.1145/300515.300516.
Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21(16):3340–6. https://doi.org/10.1093/bioinformatics/bti535.
El-Mabrouk N, Sankoff D. The reconstruction of doubled genomes. SIAM J Comput. 2003;32(3):754–92. https://doi.org/10.1137/S0097539700377177.
Bafna V, Pevzner PA. Genome rearrangements and sorting by reversals. In: Bafna V, editor. Proceedings of FOCS 1993. Palo Alto: IEEE; 1993. p. 148–57. https://doi.org/10.1109/SFCS.1993.366872.
Bergeron A, Mixtacki J, Stoye J. A unifying view of genome rearrangements. In: Bucher P, Moret BM, editors. Proceedings of WABI 2006, LNBI, vol. 4175. Zurich: Springer; 2006. p. 163–73. https://doi.org/10.1007/11851561_16.
Alekseyev M, Pevzner PA. Colored de Bruijn graphs and the genome halving problem. IEEE/ACM Trans Comput Biol Bioinform. 2008;4(1):98–107. https://doi.org/10.1109/TCBB.2007.1002.
Mixtacki J. Genome halving under DCJ revisited. In: Hu X, Wang J, editors. Proceedings of COCOON 2008, LNCS, vol. 5092. Dalian: Springer; 2008. p. 276–86. https://doi.org/10.1007/978-3-540-69733-6_28.
Chauve C. Personal communication in Dagstuhl Seminar no. 18451—genomics, pattern avoidance, and statistical mechanics. 2018.
Braga MDV, Brockmann LR, Klerx K, Stoye J. A linear time algorithm for an extended version of the breakpoint double distance. Proceedings of WABI 2022, LIPIcs 242(13). Dagstuhl Publishing; 2022. https://doi.org/10.4230/LIPIcs.WABI.2022.13.
Braga MDV, Brockmann LR, Klerx K, Stoye J. On the class of double distance problems. In: Jahn K, Vinai T, editors. Proceedings of RecombCG 2023, LNBI, vol. 13883. Istanbul: Springer; 2023. p. 35–50. https://doi.org/10.1007/9783031369117_3
Acknowledgements
We would like to thank Cedric Chauve for drawing our attention to the class of \(\sigma _k\) distances as a means of studying the hardness bound between the breakpoint distance and the DCJ distance in combinatorial problems related to genome evolution. Thanks also to Eloi Araujo, Daniel Doerr and Fábio H. V. Martinez for helping us study the median problem under this class.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
MDVB and JS devised the study. MDVB, LRB and KK worked out the technical parts. All authors wrote and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary Information
Additional file 1.
This file contains additional figures supporting the proofs of statements in this manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Braga, M.D.V., Brockmann, L.R., Klerx, K. et al. Investigating the complexity of the double distance problems. Algorithms Mol Biol 19, 1 (2024). https://doi.org/10.1186/s13015-023-00246-y