Keywords

1 Introduction

The analysis and understanding of genetic variation encoded in the genome of an organism lies at the center of computational biology and medicine. Variation is usually identified through matching sequences obtained from DNA/RNA-sequencing back to a reference (genome) sequence in the process of variant calling, making the alignment task a core problem in sequence bioinformatics.

Historically, a single linear reference sequence has been used to represent the most common variants in a population. While providing a working abstraction for most cases, rare or sub-population specific variation is especially hard to model in this setting, creating a reference allele bias [4, 35]. Consequently, in the last few years, the field has shifted first towards using sets of reference sequences, and more recently to graph data structures (so-called genome graphs), to represent many genomes or haplotypes simultaneously [7, 9, 25].

Both for sequence-to-sequence alignment and sequence-to-graph alignment, heuristics are employed to keep alignment tractable [2, 9, 21], especially for large populations of human-sized genomes. While such heuristics find the correct alignment for simple references, they often perform poorly in regions of very high complexity, such as in the human major histocompatibility complex (MHC) [7], in complex but rare genotypes arising from somatic-subclones in tumor sequencing data [10], or in the presence of frequent sequencing errors [29]. Importantly, these cases can be of specific clinical or biological interest, and incorrect alignment can cause severe biases for downstream analyses. For instance, the combination of high variability of MHC sequences in humans and small differences between alleles [5] leads to a risk of misclassifications due to suboptimal alignment. Guaranteeing optimal alignment against all variations represented in a graph is a major step towards alleviating those biases.

Formally, we consider the optimal sequence-to-graph alignment problem, the task of finding an optimal base-to-base correspondence between a query sequence and a (possibly cyclic) walk in the graph. Related alignment problems have already been formulated as graph shortest path problems [3, 16].

1.1 Related Work

Seed-and-Extend. Since optimal alignment is often intractable, many aligners use heuristics, most commonly the seed-and-extend paradigm [2, 21, 22]. In this approach, alignment initiation sites (seeds) are determined, which are then extended to form the alignments of the query sequence. The fundamental issue with this approach, however, is that the seeding and extension phases are mostly decoupled during alignment. Thus, an algorithm with a provably optimal extension phase may not result in optimal alignments due to the selection of a suboptimal seed in the first phase. In cases of high sequence variability, the seeding phase may even fail to find an appropriate seed from which to extend.

Accounting for Variation. First attempts to include variation into the reference data structure were made by augmenting the local alignment method to consider alternative walks during the extend step [17, 30]. This approach has since been extended from the linear reference case to graph references. To represent non-reference variation of multiple references during the seeding stage, HISAT2 uses generalized compressed suffix arrays [33] to index walks in an augmented reference sequence, forming a local genome graph [19]. VG [9] uses a similar technique [32] to index variation graphs representing a population of references.

BrownieAligner, another recent work developed for local alignment of sequences to de Bruijn graph representations of genomic variation, features an optimal extension phase using a branch-and-bound-based early cutoff, while employing a heuristic maximal-exact-match approach for seeding [11].

Optimal Alignment. Current optimal sequence-to-graph alignment algorithms reach their worst-case \(\mathcal {O}(nm)\) runtime [16]. In this light, approaches for improving the efficiency of optimal alignment have taken advantage of specialized features of modern CPUs to improve the practical runtime of the Smith-Waterman dynamic programming (DP) algorithm [34] considering all possible starting nodes. These use modern SIMD instructions (e.g. VG  [9] and PaSGAL  [15]) or reformulations of edit distance computation to allow for bit-parallel computations in GraphAlignerFootnote 1 [27]. Many of these, however, are designed only for specific types of genome graphs, such as de Bruijn graphs [11, 23, 24] and variation graphs [9]. A compromise often made when aligning sequences to cyclic graphs using algorithms reliant on directed acyclic graphs involves the computationally expensive “DAG-ification” of graph regions [9, 18].

\(\mathbf {A}^\star \) Algorithm. We aim to guarantee optimal alignment while optimizing the average runtime to not reach its worst case complexity. While Dijkstra is an algorithm that explores graph nodes in the order of their distance from the start, A\(^\star \) is a generalization of Dijkstra that also accounts for their distance from the target. A\(^\star \) prioritizes the exploration of nodes that seem to be closer to the target nodes. This way, A\(^\star \) can sometimes dramatically improve on the performance of Dijkstra while remaining optimal.

There has been one attempt to apply A\(^\star \) for optimal alignment [8] which uses a heuristic function that accounts only for the length of the remaining query sequence to be aligned. However, it does not significantly outperform Dijkstra (in fact, it is equivalent for a zero matching cost). In contrast, the heuristic function we introduce is more informative and consistently outperforms Dijkstra.

1.2 Main Contributions

We introduce a novel approach, called AStarix, for optimal sequence-to-graph alignment based on A\(^\star \). As with any A\(^\star \) instantiation, the core difficulty lies in developing an accurate domain-specific heuristic which is fast to compute. We design a heuristic that accounts for the content of the upcoming query letters to be aligned, which more effectively guides the search. Our proposed heuristic has two advantages: (i) it is correctness-preserving, that is, it preserves the fact that AStarix finds the best alignment, yet (ii) it is practically effective in that the algorithm performs a near-optimal number of steps. Overall, this heuristic enables AStarix to compute the best alignment while also scaling to larger reference graph sizes when compared to existing state-of-the-art optimal aligners.

Our main contributionsFootnote 2 include:

  1. 1.

    AStarix . An algorithm for optimal sequence-to-graph alignment based on a novel instantiation of A\(^\star \) with an accurate domain-specific heuristic that accounts for the upcoming query letters to be aligned (Sect. 3).

  2. 2.

    Algorithmic optimizations. To ensure that AStarix is practical, we introduce a number of algorithmic optimizations which increase performance and decrease memory footprint (Sect. 4). We also prove that all optimizations are correctness-preserving.

  3. 3.

    Thorough experimental evaluation of AStarix . We demonstrate that AStarix is up to 2 orders of magnitude faster than other optimal aligners on various reference graphs (Sect. 5).

2 Task Description: Alignment to Reference Graphs

We now describe the task of aligning a query to a reference graph. To this end, we (i) introduce the task of optimal alignment on a reference graph, (ii) formalize this task in terms of an edit graph, and (iii) introduce an alternative formulation in terms of an alignment graph, which is the basis for shortest path formulations of the optimal alignment. Figure 1 summarizes these different graph types.

Reference Graph. We encode the collection of references to which we want to align in a reference graph, which captures genomic variation that a linear reference cannot express [9, 25]. We formalize a reference graph as a tuple \(G_\texttt {r}=(V_\texttt {r},E_\texttt {r})\) of nodes \(V_\texttt {r}\) and directed, labeled edges \(E_\texttt {r}\subseteq V_\texttt {r}\times V_\texttt {r}\times \varSigma \), where the alphabet \(\varSigma =\{\texttt {A},\texttt {C},\texttt {G},\texttt {T}\}\) represents the four different nucleotides. Note that in contrast to sequence graphs [28], we label edges instead of nodes.

Path, Spelling. Any path \(\pi =(e_1,\dots ,e_k) \text{ in } G_\texttt {r}\) induces a spelling \(\sigma (\pi ) \in \varSigma ^*\) defined by \(\sigma (e_1)\cdots \sigma (e_k)\), where \(\sigma (e_i)\) is the label of edge \(e_i\) and \(\varSigma ^* := \bigcup _{k \in \mathbb {N}} \varSigma ^k\). We note that our approach naturally handles cyclic walks and does not require cycle unrolling, a feature shared with BitParallel  [27] and BrownieAligner  [11] but missing from VG  [9], PaSGAL  [15] and V-ALIGN  [18].

Alignment on Reference Graph. An alignment of query \(q \in \varSigma ^*\) to a reference graph \(G_\texttt {r}=(V_\texttt {r},E_\texttt {r})\) consists of (i) a path \(\pi \text { in } G_\texttt {r}\) and (ii) a sequence of edit operations (matches, substitutions, insertions, deletions) transforming \(\sigma (\pi )\) to q.

Fig. 1.
figure 1

Starting from the reference graph (left), we can construct the edit graph (middle) and the alignment graph \(G_\texttt {a}^{q}\) for query \(q=``{} \texttt {A}\)” (right). Edges are annotated with labels and/or costs, where sets of labels represent multiple edges, one for each letter in the set (indicated by \(``\text {x}3\)” and \(``\text {x}4\)”).

Optimal Alignment, Edit Distance. Each edit operation is associated with a real-valued cost (\(\varDelta _\text {match}\), \(\varDelta _\text {subst}\), \(\varDelta _\text {ins}\), and \(\varDelta _\text {del}\), respectively). An optimal alignment minimizes the total cost of the edit operations converting \(\sigma (\pi )\) to q. For optimal alignments, this total cost is equal to the edit distance between \(\sigma (\pi )\) and q, i.e., the cheapest sequence of edit operations transforming \(\sigma (\pi )\) into q.

We make the (standard) assumption that \(0 \le \varDelta _\text {match}\le \varDelta _\text {subst}, \varDelta _\text {ins}, \varDelta _\text {del}\), which will be a prerequisite for the correctness of our approach.

Edit Graph. Instead of representing alignments as pairs of (i) paths in the reference graph and (ii) sequences of edit operations on these paths, we introduce edit graphs whose paths intrinsically capture both. This way, we can formally define an alignment more conveniently as a path in an edit graph.

Formally, an edit graph \(G_\texttt {e}:=(V_\texttt {e},E_\texttt {e})\) has directed, labeled edges \(E_\texttt {e}\subseteq V_\texttt {e}\times V_\texttt {e}\times \varSigma _\epsilon \times \mathbb {R}_{\ge 0}\) with associated costs that account for edits. Here, \(\varSigma _\epsilon := \varSigma \cup \{\epsilon \}\) extends the alphabet \(\varSigma \) by \(\epsilon \) to account for deleted characters (see Fig. 1). The edit and reference graphs consist of the same vertices, i.e., \(V_\texttt {e}=V_\texttt {r}\). However, \(E_\texttt {e}\) contains more edges than \(E_\texttt {r}\) to account for edits. Concretely, for each edge \((u,v,\ell ) \in E_\texttt {r}\), \(E_\texttt {e}\) contains edges to account for (i) matches, by an edge \((u,v,\ell ,\varDelta _\text {match})\), (ii) substitutions, by edges \((u,v,\ell ',\varDelta _\text {subst})\) for each \(\ell ' \in \varSigma \backslash \ell \), (iii) deletions, by an edge \((u,v,\epsilon ,\varDelta _\text {del})\), and (iv) insertions, by edges \((u,u,\ell ',\varDelta _\text {ins})\) for each \(\ell ' \in \varSigma \). The spelling \(\sigma (\pi ) \in \varSigma ^*\) of a path \(\pi \in G_\texttt {e}\) is defined analogously to reference graphs, except that deleted letters (represented by \(\epsilon \)) are ignored. The cost \(\text {cost}(\pi )\) of a path \(\pi \in G_\texttt {e}\) is the sum of all its edge costs.

Alignment on Edit Graph. An alignment of query q to \(G_\texttt {r}\) is a path \(\pi \text { in } G_\texttt {e}\) spelling q, i.e., \(q=\sigma (\pi )\). An optimal alignment is an alignment of minimal cost.

Alignment Graph. To find an optimal alignment of q to the edit graph \(G_\texttt {e}\) using shortest path finding algorithms, we must ensure that only paths spelling q are considered. To this end, we introduce an alternative but equivalent formulation of alignments in terms of an alignment graph \(G_\texttt {a}^{q}=(V_\texttt {a}^{q},E_\texttt {a}^{q})\).

Here, each state \(\langle v,i \rangle \in V_\texttt {a}^{q}\) consists of a vertex \(v \in V_\texttt {e}\) and a query position \(i \in \{0,\dots ,|q|\}\) (equivalent to [28]). Traversing a state \(\langle v,i \rangle \in V_\texttt {a}^{q}\) represents the alignment of the first i query characters ending at node v. In particular, query position \(i=0\) indicates that we have not yet matched any letters from the query. We note that the alignment graph explicitly depends on the query q. In particular, the example alignment graph \(G_\texttt {a}^{``{} \texttt {A}\text {''}}\) in Fig. 1 lacks substitution edges from \(G_\texttt {e}\), as their labels (\(\texttt {C}\), \(\texttt {G}\), \(\texttt {T}\)) do not match the query \(q=``A\)”.

We construct the alignment graph \(G_\texttt {a}^{q}\) to guarantee that any walk from a source \(\langle u,0 \rangle \) to a state \(\langle v,i \rangle \) corresponds to an alignment of the first i letters of query q to \(G_\texttt {r}\). As a consequence, there is a one-to-one correspondence between alignments \(\pi _\texttt {e}\) of q to \(G_\texttt {e}\) and paths \(\pi _\texttt {a}^{q} \in G_\texttt {a}^{q}\) from sources \(S:=V_\texttt {r}\times \{0\}\) to targets \(T:=V_\texttt {r}\times \{|q|\}\), with \(\text {cost}(\pi _\texttt {r})=\text {cost}(\pi _\texttt {a}^{q})\). To find the best alignment in \(G_\texttt {e}\), only paths in \(G_\texttt {a}^{q}\) (walks without repeating nodes) can be considered, since repeating a node in \(G_\texttt {a}^{q}\) cannot lead to a lower cost (\(\varDelta _\text {del}\ge 0\)) for the same state.

The edges \(E_\texttt {a}^{q}\subseteq V_\texttt {a}^{q}\times V_\texttt {a}^{q}\times \varSigma _\epsilon \times \mathbb {R}_{\ge 0}\) are built based on the edges in \(E_\texttt {e}\), except that the former (i) keep track of the position in the query i, and (ii) only contain empty edges or edges whose label matches the next query letter:

$$\begin{aligned}&(u,v,\ell ,w) \in E_\texttt {e}\implies (\langle u, i \rangle , \langle v, i+1 \rangle ,\ell ,w) \in E_\texttt {a}^{q}\quad \text { for } 0 \le i < |q| \text { with } q[i]=\ell \end{aligned}$$
(1)
$$\begin{aligned}&(u,v,\epsilon ,w) \in E_\texttt {e}\implies (\langle u, i \rangle , \langle v, i \quad \,\,\,\,\,\rangle ,\epsilon ,w) \in E_\texttt {a}^{q}\quad \text { for } 0 \le i < |q| \end{aligned}$$
(2)

Here, assuming 0-indexing, q[i] is the next letter to be matched after matching i letters. Then, Eq. (1) represents matches, substitutions, and insertions (which advance the position in the query by 1), while Eq. (2) represents deletions (which do not advance the position in the query).

Dynamic Construction. As the size of the alignment graph is \(\mathcal {O}(|G_\texttt {r}|{\cdot }|q |)\), it is expensive to build it fully for every new query. Therefore, our implementation constructs the alignment graph \(G_\texttt {a}^{q}\) on-the-fly: the outgoing edges of a node are only generated on demand and are freed from memory after alignment.

3 AStarix: Finding Optimal Alignments Using A\(^\star \)

In this section, we first introduce the general A\(^\star \) algorithm for finding shortest paths, and the notion of an optimistic heuristic, a sufficient condition for instantiations of A\(^\star \) to be correct (i.e., to indeed find shortest paths). Then we instantiate A\(^\star \) with our domain-specific heuristic that accounts for upcoming subsequences to be aligned, and prove that this heuristic is optimistic.

figure a

3.1 Background: General A\(^\star \) Algorithm

Given a weighted graph \(G=(V,E)\) with \(E \subseteq V \times V \times \mathbb {R}_{\ge 0}\), the A\(^\star \) algorithm (abbreviated as A\(^\star \)) searches for the shortest path from sources \(S \subseteq V\) to targets \(T \subseteq V\). It is an extension of Dijkstra’s algorithm that additionally leverages a heuristic function \(h :V \rightarrow \mathbb {R}_{\ge 0}\) to decide which paths to explore first. If \(h(u) \equiv 0\), A\(^\star \) is equivalent to Dijkstra’s algorithm. We provide an implementation of A\(^\star \) and Dijkstra in App. A.1, but do not assume knowledge of either algorithm in the following. At a high level, A\(^\star \) maintains the set of all explored states, initialized with the set of sources S. Then, A\(^\star \) iteratively expands the explored state with lowest estimated cost by exploring all its neighbors, until it finds a target. Here, the cost for node u is estimated by the distance from source, called g(u), plus the estimate from the heuristic h(u).

Heuristic Function. The heuristic function h(u) estimates the cost \(h^*(u)\) of a shortest path in G from u to a target \(t \in T\). Intuitively, a good heuristic correlates well with the distance from u to t.

To ensure that A\(^\star \) indeed finds the shortest path, h should be optimistic:

Definition 1 (Optimistic heuristic)

A heuristic h is optimistic if it provides a lower bound on the distance to the closest target: \(\forall u{.}h(u) \le h^*(u)\).

While any optimistic h ensures that A\(^\star \) finds optimal alignments [6, Res. 3], the specific choice of h is critical for performance. In particular, decreasing the error \(\delta (u) = h^*(u)-h(u)\) can only improve the performance of A\(^\star \)  [6, Res. 6]. Thus, a key contribution of ours is a domain-specific heuristic h.

3.2 AStarix: Instantiating A\(^\star \)

Algorithm 1 shows an unoptimized version of AStarix and its heuristic function. AStarix expects a reference graph (Line 1) and a query (Line 3) as input, and returns an optimal alignment (Line 7) by searching for a shortest path from S to T in the alignment graph \(G_\texttt {a}^{q}\). It is parameterized by hyper-parameters (d in Line 2, more in Sect. 4) and edit costs (implicitly provided).

The function Heuristic (Lines 8–11) computes a lower bound on the remaining cost of a best alignment: the minimum cost h(us) of aligning the upcoming sequence s (where \(|s |\le d\)) starting from node u. Importantly, s is limited to the next \(d' \le d\) letters of q, starting from query position i. Thus, computing h(us) is substantially cheaper than aligning all remaining letters of q.

To compute h(us) we leverage a simple branch-and-bound algorithm, provided in App. A.2. In the following, for convenience, we refer to the heuristic as h (which is parameterized by (us)) instead of Heuristic (which is parameterized by \(\langle u, i \rangle \)). Further, we say that h is optimistic if h(us) is a lower bound on the cost for aligning all remaining letters (i.e., q[i : |q|]) starting from node u (note that s is a prefix of q[i : |q|]).

Theorem 1

h is optimistic.

Proof

h only considers the next \(d'\) letters of q instead of all remaining letters. Since all costs are non-negative, the theorem follows.    \(\square \)

Fig. 2.
figure 2

The benefit of using our heuristic over Dijkstra. Alignment graph \(G_\texttt {a}^{``{} \texttt {ATAA}\text {''}}\) (right) is based on reference graph \(G_\texttt {r}\) (left), but omits insertion and deletion edges for simplicity. The pink boxes \(g+h\) indicate the distance from the sources \(S=\{\langle u,0 \rangle , \langle v,0 \rangle \}\) (in g) and the cost of aligning the next \(d=2\) letters (in h). Dijkstra (resp. A\(^\star \)) expands states circled in (resp. ).

Benefit of \(\mathbf {A}^\star \) Heuristic over Dijkstra. Figure 2 shows the benefit of using our heuristic function compared to Dijkstra. Here, Dijkstra expands states based on their distance g from the origin nodes \(\langle u, 0 \rangle \) and \(\langle v, 0 \rangle \). Hence, depending on tie-breaking, Dijkstra may expand all states with \(h \le 1\), as shown in Fig. 2. By contrast, A\(^\star \) chooses the next state to expand by the sum of the distance from the origin g and the heuristic h, expanding only states with \(g+h \le 1\).

Memoization. Recall that the return value of h in Line 8 only depends on u and the upcoming sequence s (which in turn depends on i and d). Thus, h(us) can be reused for different positions across different queries in \(\mathcal {O}(1)\) time, if it was computed for a previous query.

4 AStarix Algorithm: Optimizations

We now discuss several optimizations we developed to speed up AStarix while preserving its optimality. These optimizations reduce preprocessing and alignment runtime as well as memory footprint (in particular for memoization).

4.1 Reducing Semi-global to Local Alignment Using a Trie

To find an optimal alignment, we generally need to consider all reference graph nodes \(u \in G_\texttt {r}\) as possible starting nodes. Thus, optimal aligners PaSGAL  [15] and BitParallel  [27] brute-force through all possible starting nodes \(u \in G_\texttt {r}\).

To more efficiently handle arbitrary starting positions for alignments, we extend the reference graph with a trie (referred to as suffix tree in [8]) to effectively align from all possible starting nodes simultaneously.

Single Starting State. In the trie approach, abstraction nodes are added to the graph, each of which corresponds to a set of nodes in \(G_\texttt {r}\) that correspond to the same prefix. In the following, we formalize this approach.

Concretely, we extend \(G_\texttt {r}\) by a trie of depth D, resulting in graph \(G_\texttt {r}^\texttt {+}=(V_\texttt {r}^\texttt {+},E_\texttt {r}^\texttt {+})\). Our goal is that all paths in \(G_\texttt {r}\) that have length D and end in \(v \in V_\texttt {r}\) correspond to paths in \(G_\texttt {r}^\texttt {+}\) starting from a single source \(\epsilon \) to \(v \in V_\texttt {r}^\texttt {+}\), where \(\epsilon \) represents the empty string. This correspondence ensures that it suffices to consider only paths in \(G_\texttt {r}^\texttt {+}\) starting from the source \(\epsilon \). In particular, each alignment on \(G_\texttt {r}^\texttt {+}\) can be translated into an alignment on \(G_\texttt {r}\) (we omit this translation here).

Fig. 3.
figure 3

\(G_\texttt {r}^\texttt {+}\) enables semi-global alignment by extending \(G_\texttt {r}\) with a trie.

Figure 3 shows an example trie. To construct it, we first associate with every node \(v \in V_\texttt {r}\) the set \(\mathcal {S}_v\) of its D-mers (orange boxes in Fig. 3): spells of paths ending in v and of length D. Our goal is then to use paths in the trie to spell these D-mers.

Second, we construct the trie nodes from all prefixes of these D-mers:

Third, we add edges within the trie, which ensure that paths from \(\epsilon \) to any trie node s spell s. Formally, whenever \(s {\cdot }\ell \in V_\texttt {r}^\texttt {+}\), we add an edge \((s,s {\cdot }\ell , \ell )\) to \(E_\texttt {r}^\texttt {+}\), where “\({\cdot }\)” denotes string concatenation. Finally, we add edges between the trie and the reference graph, which ensure that any D-mer of any node \(v \in V_\texttt {r}\) can be spelled by a walk from \(\epsilon \) to v. Formally, if \(s {\cdot }\ell \in \mathcal {S}_v\), then \((s,v,\ell ) \in E_\texttt {r}^\texttt {+}\).

Importantly, extending \(G_\texttt {r}\) to \(G_\texttt {r}^\texttt {+}\) is compatible with the construction of the edit graph \(G_\texttt {e}\), the construction of the alignment graph and all other optimizations. In particular, when searching for a shortest path in the alignment graph constructed from \(G_\texttt {r}^\texttt {+}\), it suffices to only consider starting node \(\langle \epsilon , 0 \rangle \).

Reducing Size of Trie. We can reduce the size of the trie by removing specific trie nodes. In particular, we iteratively remove each trie leaf node \(s \cdot \ell \in V_\texttt {r}^\texttt {+}\) with a unique outgoing edge \((s \cdot \ell , v, \ell ')\) to a reference graph node \(v \in V_\texttt {r}\). To compensate for removing node \(s \cdot \ell \), we introduce a new edge \((s, u, \ell )\) to a node \(u \in V_\texttt {r}\) with an edge \((u,v,\ell ')\) (such a node must exist according to the construction of \(G_\texttt {r}^\texttt {+}\)). For example, in Fig. 3, we (i) remove node AT including its edges \((\texttt {A},\texttt {AT},\texttt {T})\) and \((\texttt {AT},u_3,\texttt {C})\), but (ii) introduce an edge \((\texttt {A},u_2,T)\).

This optimization is lossless, as the D-mer \(s \cdot \ell \cdot \ell ' \in \mathcal {S}_v\) can still be spelled by the path from \(\epsilon \) to s, extended by \((s, u, \ell )\) and \((u, v, \ell ')\).

4.2 Greedy Match Optimization

We also employ an optimization originally developed for computing the edit distance between two strings [1, 31], but which has also been used in the context of string to graph alignment [8]. We omit the correctness proof of this optimization, which is already covered in [31], and only explain the intuition behind it.

Suppose there is only one outgoing edge \(e = (u, v, \ell ) \in E_\texttt {r}\) from a node \(u \in V_\texttt {r}\). Suppose also that while aligning a query q, we explore state \(\langle u, i \rangle \) for which the next query letter q[i] matches the label \(\ell \). In this case, we do not need to consider the edit outgoing edges, because any edit at this point can be postponed without additional cost, as \(\varDelta _\text {match}\le \min (\varDelta _\text {subst}, \varDelta _\text {ins}, \varDelta _\text {del})\). Thus, we can greedily explore state \(\langle v, i+1 \rangle \), aligning \(q[i+1]\) to e by using the edge \((\langle u, i \rangle , \langle v, i+1 \rangle , \ell , \varDelta _\text {match})\) before continuing with the A\(^\star \) search. We note that this optimization is only applicable when aligning in non-branching regions of the reference graph. In particular, it is not applicable for most trie nodes (Sect. 4.1).

4.3 Speeding up Evaluation of Heuristic

In the following, we show how to reduce the runtime of evaluating the heuristic h(us), by introducing two separate optimizations that compose naturally.

Capping Cost. We cap h(us) at \(c\), replacing it by \(h_{c}(u,s):=\min (h(u,s),c)\). To achieve this, we allow RecursiveAlign to ignore paths costing more than \(c\). For large enough \(c\), this speeds up computation without significantly decreasing the benefit of the heuristic, since nodes associated with a high heuristic value are typically not explored anyways. We investigate the effect of \(c\) in App. A.3.

Theorem 2

\(h_{c}\) is optimistic.

Proof

We have \(h_c(u,s) \le h(u,s)\) and that h(us) is optimistic (Theorem 1).    \(\square \)

Capping Depth. We reduce the number of nodes that need to be considered by h(us). To this end, we define a modified heuristic \(h_d(u,s)\) that only considers nodes \(R_u \subseteq V_\texttt {e}\) at distance at most d from u (cp. Line 2 in Algorithm 1): \( R_u := \{ v \in V_\texttt {r}\mid \exists \; \text {path } \pi \in G_\texttt {e}\text { from } u \text { to } v \text { with } |\pi |\le d \} \).

If an alignment of s reaches the boundary of \(R_u\), defined as

$$B(R_u) := \{v \in R_u \mid \exists (v,v',\ell ) \in E_\texttt {e}\text { with } v' \notin R_u \},$$

it is allowed to only spell a prefix of s, and the remaining unaligned letters of s are considered aligned with zero cost:

Theorem 3

\(h_d\) is optimistic.

Proof

It suffices to show \(h_d(u,s) \le h(u, s)\) since h(us) is optimistic. In the case where all of s is aligned, \(h_d(u,s) = h(u, s)\). Otherwise, the unaligned letters of s are not penalized, so \(h_d(u,s) \le h(u, s)\).    \(\square \)

4.4 Partitioning Nodes into Equivalence Classes

We have shown in Sect. 3.2 how to reuse an already computed h(us) for repeating s across different queries and query positions. In the following, we additionally aim to reuse h(us) across different nodes u, so that h(us) does not need to be computed for all nodes u. Intuitively, we want to assign two nodes u and v to the same equivalence class when the graph region considered by h(us) is equivalent to the graph region considered by h(vs), up to renaming of nodes.

Thus, \(h(u,s)=h(v,s)\) if u and v are from the same equivalence class. Therefore, we can (arbitrarily) choose a representative node \(r \in V_\texttt {r}\) for every equivalence class, and evaluate h(rs) instead of h(us), where r is the representative of the equivalence class of u. To look up representative nodes in \(\mathcal {O}(1)\), we define a helper array \( repr \) with \( repr[u] = r\).

Identifying Equivalence Classes. To identify the nodes belonging to the same equivalence class, we assume the optimization from Sect. 4.3, i.e., that our heuristic only considers nodes up to a distance d from u. Moreover, for performance reasons, our implementation detects only the equivalence classes of nodes u with a single outgoing path of length at least d. In this case, u and \(u'\) are in the same equivalence class if their outgoing paths spell the same sequence. In contrast, we leave nodes with forking paths in separate equivalence classes.

Note that for smaller d, the number of equivalence classes gets smaller, the reuse of the heuristic gets higher, and the memoization table has a lower memory footprint. At the same time, however, the heuristic \(h_d(u,s)\) is less informative.

5 Evaluation

In this section we present a thorough experimental evaluationFootnote 3 of AStarix on simulated Illumina reads. Our evaluation demonstrates that:

  1. 1.

    AStarix is faster than Dijkstra because the heuristic reduces the number of explored states by an order of magnitude.

  2. 2.

    The runtime of AStarix scales better than state-of-the-art optimal aligners with increasing graph size, on a variety of reference graphs.

5.1 Implementation of AStarix and Dijkstra

Our AStarix implementation uses an adjacency list graph data structure to represent the reference and the trie in a unified way, representing each letter by a separate edge object. To represent the reverse complementary walks in \(G_\texttt {r}\), the vertices are doubled, connected in the opposite direction, and labeled with complementary nucleotides (\(\texttt {A} \leftrightarrow \texttt {T}\), \(\texttt {C} \leftrightarrow \texttt {G}\)). We do not limit the number of memoized heuristic function values (Sect. 3.2), but note we could do so by resetting the memoization table periodically. Our implementation of Dijkstra reuses the same AStarix codebase except the use of a heuristic function (i.e., with \(h \equiv 0\)).

We apply all described optimizations to AStarix and Dijkstra, except Sects. 4.3 and 4.4 which are applicable only to AStarix.

While the optimality of AStarix is not affected by its parameters, its performance is (see App. A.3 for analysis). To compare with other aligners, we use values \(d=5\), \(c=5\), \(D = \lfloor \log _\varSigma |G_\texttt {r}|\rfloor \).

5.2 Compared Aligners: PaSGAL and BitParallel

We compare the performance of AStarix to that of two state-of-the-art optimal aligners: PaSGAL and BitParallel, with their default parameters. We do not compare to the exact aligner of VG as (i) its optimal alignment is intended for testing purposes only, (ii) it does not provide an interface for aligning a set of reads, and (iii) it has been consistently outperformed by PaSGAL  [15].

PaSGAL is compiled with AVX2 SIMD support. The resulting alignments are not expected to match exactly between the local aligner PaSGAL and the semi-global aligners (AStarix and BitParallel) as they solve different tasks with different edit costs. Nevertheless, in analogy with the evaluations of PaSGAL  [15], it is still meaningful to compare performance, assuming that the dynamic programming approach of PaSGAL can be adapted to semi-global alignment with similar performance.

Both BitParallel and PaSGAL reach their worst-case runtime complexity independent of the edit costs \(\varDelta =(\varDelta _\text {match},\varDelta _\text {subst},\varDelta _\text {ins},\varDelta _\text {del})\). PaSGAL is evaluated using its default costs  \(\varDelta =(-1,1,1,1)\) and BitParallel is evaluated using the only supported costs \(\varDelta =(0,1,1,1)\).

5.3 Setting

All evaluations were executed singled-threaded on an Intel Core i7-6700 CPU running at 3.40 GHz.

Reference Graphs and Reads. We designed three experiments utilizing three different reference graphs (in Table 1). The first is a linear graph without variation based on the E. coli reference genome (strain: K-12 substr. MG1655, ASM584v2 [13]). The other two are variation graphs taken from the PaSGAL evaluations [15]: they are based on the Leukocyte Receptor Complex (LRC, with 1 099 856 nodes and 1 144 498 edges), and the Major Histocompatibility Complex (MHC1, with 5 138 362 nodes and 5 318 019 edges). We note that we do not evaluate on de Brujin graphs, since PaSGAL does not support cyclic graphs.

For the E. coli dataset we used the ART tool [14] to simulate an Illumina single-end read set with 10 000 reads of length 100. For the LCR and MHC1 datasets, we sampled 20 000 single-end reads of length 100 from the already generated sets in [15] using the Mason2 [12] simulator.

For Dijkstra and AStarix, the runtime complexity depends not only on the data size, but also on the data content, including edit costs. More accurate heuristics lead to better A\(^\star \) performance [26], which is why we evaluate AStarix with costs corresponding more closely to Illumina error profiles: \(\varDelta =(0,1,5,5)\).

Table 1. Performance of optimal aligners for different reference graphs.

Metrics. As all aligners evaluated in this work are provably optimal, we are mostly interested in their performance. To study the end-to-end performance of the optimal aligners, we use the Snakemake [20] pipeline framework to measure the execution time of every aligner (including the time spent on reading and indexing the reference graph input and outputting the resulting alignments). We note that the alignment phase dominates for all tools and experiments.

To judge the potential of heuristic functions, we measure not only the runtime but also the number of states explored by AStarix and Dijkstra. This number reflects the quality of the heuristic function rather than the speed of computation of the heuristic, the implementation and the system parameters.

5.4 Comparison of Optimal Aligners

Different Reference Graphs. Table 1 shows the performance of optimal aligners across various references. On all references, AStarix is consistently faster than Dijkstra, which is consistently faster than PaSGAL and BitParallel. The memory usage of Dijkstra is within a factor of 3 compared to PaSGAL and BitParallel. Due to the heuristic memoization, the memory usage of AStarix can grow several times compared to Dijkstra.

Scaling with Reference Graph Size. Figure 4 compares the performance of existing optimal aligners. BitParallel and PaSGAL always explore all states, thus their average-case reaches the worst-case complexity of \(\mathcal {O}(|G_\texttt {a}^{q}|) = \mathcal {O}(m {\cdot }G_\texttt {r})\). Due to the trie indexing, the runtime of AStarix and Dijkstra scales in the reference size with a polynomial of power around 0.2 versus the expected linear dependency of BitParallel and PaSGAL.

The heuristic function of AStarix demonstrates a 2-fold speed-up over Dijkstra. This is possible due to the highly branching trie structure, which allows skipping the explicit exploration for the majority of starting nodes.

Fig. 4.
figure 4

Comparison of overall runtime and memory usage of optimal aligners with increasing prefixes of E. coli as references.

Fig. 5.
figure 5

Comparison of A\(^\star \) and Dijkstra in terms of mean alignment runtime per read and mean explored states depending on the best alignment cost on MHC1.

5.5 A\(^\star \) Speedup

To measure the speedup caused by the heuristic function, we compare the number of not only the expanded, but also of explored states (the latter number is never smaller, see Sect. 3.1 and the example in Fig. 2) between AStarix and Dijkstra on the MHC1 dataset.

Figure 5 demonstrates the benefit of the heuristic function in terms of both alignment time and number of explored states. Most importantly, AStarix scales much better with increasing number of errors in the read, compared to Dijkstra. More specifically, the number of states explored by Dijkstra, as a function of alignment cost, grows exponentially with a base of around 10, whereas the base for AStarix is around 3 (the empirical complexity is estimated as a best exponential fit \( exploredStates \sim a \cdot score ^b\)).

The horizontal black line in Fig. 5 denotes the total number of states \(|G_\texttt {r}|\cdot |q |\), which is always explored by BitParallel and PaSGAL. On the other hand, any aligner must explore at least \(m = |q |\) states, which we show as a horizontal dashed line. This lower bound is determined by the fact that at least the states on a best alignment need to be explored.

6 Conclusion

We presented AStarix, an A\(^\star \) algorithm to find optimal alignments, based on a domain-specific heuristic and enhanced by multiple algorithmic optimizations. Importantly, our approach allows for both cyclic and acyclic graphs including variation and de Bruijn graphs.

We demonstrated that AStarix scales exponentially better than Dijkstra with increasing (but small) number of errors in the reads. Moreover, for short reads, both AStarix and Dijkstra scale better and outperform current state-of-the-art optimal aligners with increasing genome graph size. Nevertheless, scaling optimal alignment of long reads on big graphs remains an open problem.

We expect that AStarix can be scaled further, to both (i) bigger graphs and (ii) longer and noisier reads. Scaling AStarix may require a combination of (i) the development of more clever heuristic functions (by leveraging existing work on A\(^\star \) and edit distance) and (ii) algorithmic optimizations. We note that if desired, a (sub-optimal) seeding step could speed up AStarix by pre-filtering the starting positions, analogously to other practical aligners.