
1 Reduction of Input Data in Genome Assembly

Sequencing is a chemical and physical process in which DNA is ‘crushed’ into very small parts (‘fragments’), which are ‘read’ into strings called reads that encode the sequence of nucleotides. Reads are of limited length and contain errors (Fig. 1).

Sequencing of large genomes and other samples has recently become computationally challenging for two main reasons:

  • sequencing became much cheaper (the price has decreased by more than \(100{,}000\times \) since the year 2000), so researchers can afford to create much bigger data sets than ever before (Fig. 2)

  • it was discovered that most bacteria (90%–99%) can’t be cultivated, so metagenomic sequencing is (nearly) the only way to assess them.

1.1 Reads, Coverage and Assembly

A read is a string over the alphabet \(\varSigma =\{\textsf{A}, \textsf{C}, \textsf{G}, \textsf{T}, \textsf{N}\}\), where ACGT are the four nucleobases and N is a placeholder for an unknown nucleotide. The maximal read length depends on the sequencing technology used. For the Illumina sequencing technology the read length initially was 34 and now can be as high as 300. Other sequencers allow for longer reads at the price of a higher error rate (and higher costs). Illumina produces substitution errors with an error rate of roughly 1%. In most cases, so-called paired reads are generated: in a first step, pieces of DNA of a known length are produced, which are then sequenced from both sides (e.g., a paired read with a read length of 150 contains one string over \(\varSigma \) for the first 150 nucleotides and a second string over \(\varSigma \) for the last 150 nucleotides).

Fig. 1. A shortened paired (Illumina) read

The sequencer also outputs so-called phred scores quantifying the error probability of each nucleotide read (Quality \(Q=-10\log _{10}P\) where P is the error probability).
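For illustration, the conversion between phred scores and error probabilities can be written down in a few lines (our own sketch; the ASCII offset of 33 assumed here is the common Sanger/Illumina 1.8+ FASTQ encoding):

```cpp
#include <cmath>
#include <cstdio>

// Phred score from an error probability: Q = -10 * log10(P).
double phred_from_prob(double p) { return -10.0 * std::log10(p); }

// Error probability from a phred score: P = 10^(-Q/10).
double prob_from_phred(double q) { return std::pow(10.0, -q / 10.0); }

int main() {
    // The FASTQ quality character 'I' with offset 33 encodes Q = 40,
    // i.e. an error probability of 1 in 10,000.
    const char qual_char = 'I';
    const double q = qual_char - 33;
    std::printf("Q = %.0f, P = %g\n", q, prob_from_phred(q));
    std::printf("P = 0.01 (roughly the Illumina substitution rate mentioned above) corresponds to Q = %.0f\n",
                phred_from_prob(0.01));
    return 0;
}
```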

Genome assembly (or just assembly) is the task of reconstructing the complete genome of the sequenced species using the reads only (de novo assembly) or the reads and a reference genome (mapping or reference-based assembly). It is like a puzzle with millions of small parts, unknown overlaps, and many parts containing errors.

In our work we focus on de novo assembly, which we will refer to simply as assembly for the rest of this chapter.

For a sequencing data set, the coverage of a genomic position A is the number of reads in the data set which contain A. The coverage of the whole data set is the average over the coverages of all genomic positions. The empirical ‘optimal’ coverage for a de novo assembly is about 20 at every position; a coverage higher than 20 means redundant data. Some sequencing protocols, especially single-cell MDA (multiple displacement amplification), produce read sets with an extremely uneven coverage distribution. Metagenomic data sets may have an uneven coverage distribution, too, when both abundant and rare species are sequenced. Given a string \(\sigma \) over the nucleotide alphabet, a k–mer is a substring of \(\sigma \) of length k.
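To make the k–mer notion concrete, the following minimal sketch (our own illustration, not code from any tool discussed in this chapter) enumerates the k–mers of a read and skips windows containing the placeholder \(\textsf{N}\):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Return all k-mers (substrings of length k) of a read; windows containing
// the placeholder 'N' are skipped because their sequence is unknown.
std::vector<std::string> kmers(const std::string& read, std::size_t k) {
    std::vector<std::string> result;
    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        std::string window = read.substr(i, k);
        if (window.find('N') == std::string::npos)
            result.push_back(window);
    }
    return result;
}

int main() {
    // Prints ACGT, ACGT, CGTT; all windows spanning the N are skipped.
    for (const auto& m : kmers("ACGTNACGTT", 4))
        std::cout << m << '\n';
    return 0;
}
```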

figure a
Fig. 2. The cost of sequencing a human genome, source: NIH

Most bacteria can’t be cultivated in the lab. Therefore, it is not possible to create a homogeneous sample of thousands or millions of identical cells as in a ‘normal’ sequencing setting.

As a consequence, single-cell sequencing protocols like multiple displacement amplification (MDA) have been developed, which are able to amplify the genome of a single bacterial cell. A drawback of these methods is a strong amplification bias (called ‘preferential amplification’ and ‘allelic dropout’) between different regions of the genome, meaning that the coverage of some regions of the genome might exceed 100,000×, while other regions are not covered at all.

A metagenome, a term introduced by [14], is, according to Wiktionary, ‘All the genetic material present in an environmental sample, consisting of the genomes of many individual organisms’. In other words, in a metagenomic experiment, you are interested in

  • all the genes/DNA

  • of everything living

  • at a specific location

The experiment is conducted by collecting a sample from the desired environment, isolating the DNA from it and sequencing it with a Next Generation Sequencing (NGS) system. There are three different types of metagenomic experiments with different goals:

  • phylogenetic profiling: based upon the 16S ribosomal RNA found in the sample, reconstruct which families of bacteria live in the probed environment (and how abundant they are). Basis: each (bacterial) cell has ribosomes. The coding genes for these essential proteins are widely conserved (which makes it possible to identify these genes), but they also include less conserved regions which differ between different families or even species.

  • directed/guided assembly of specific genes: based upon some known variants of a gene (or even whole genomes), all existing variants in a specific environment are to be assembled.

  • de novo assembly of all species in the sample: all genomes of all species in the probed environment are to be assembled using the output of the sequencer only.

In our work, we focus on de novo assembly.

The main problem of metagenome assembly is non-uniform coverage: some species in the sample are much more abundant than others. The goal is to assemble all their genomes. The following issues may arise:

  • to be able to assemble the less abundant species in the sample, a high number of reads has to be generated (\(\rightarrow \) high-coverage sequencing).

  • the huge input files force the assembler programs to use huge amounts of RAM and running time. For bigger projects, even 1 TB of RAM might not be enough.

  • for the assembler, it is often hard to tell whether a rare sequence belongs to a rare species or whether it is a sequencing error.

1.2 The Bignorm Algorithm

The basic idea of read filtering is to remove reads from a single-cell or metagenome data set without losing information, and in this way to reduce the size of the problem, possibly escaping the ‘big data curse’. This is possible if only those reads are removed whose genomic regions are already covered with high coverage by other reads. A good read filter should remove as many reads as possible without lowering the coverage of the sequenced genome below the desired threshold at any position and without increasing the error rate of the data set.

Highly memory-efficient algorithms are sought to solve this problem. Brown et al. invented an algorithm named Diginorm [1] for read filtering that rejects or accepts reads based on the abundance of their k–mers. The name Diginorm is short for digital normalization: the goal is to normalize the coverage over all loci, using a computer algorithm after sequencing. The idea is to remove those reads from the input which mainly consist of k–mers that have already been observed many times in other reads. Diginorm processes reads one by one, splits them into k–mers, and counts these k–mers. In order to save RAM, Diginorm does not keep track of these numbers exactly, but instead keeps appropriate estimates using the count-min sketch (CMS) [4]. A read is accepted if the median of its k–mer counts is below a fixed threshold, usually 20. It was demonstrated that successful assemblies are still possible after Diginorm has removed a large amount of the data.
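The following sketch shows the core of this idea in simplified form (our own illustration, not the Diginorm code; the sketch dimensions, the per-row salting of the hash function and the fixed threshold of 20 are illustrative choices):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// A tiny count-min sketch: d rows of w counters, one (salted) hash per row.
struct CountMinSketch {
    std::size_t w, d;
    std::vector<std::vector<uint32_t>> table;
    CountMinSketch(std::size_t w_, std::size_t d_)
        : w(w_), d(d_), table(d_, std::vector<uint32_t>(w_, 0)) {}
    std::size_t cell(std::size_t row, const std::string& key) const {
        return std::hash<std::string>{}(key + char('A' + row)) % w;   // per-row salt
    }
    void add(const std::string& key) {
        for (std::size_t r = 0; r < d; ++r) ++table[r][cell(r, key)];
    }
    uint32_t estimate(const std::string& key) const {
        uint32_t est = UINT32_MAX;
        for (std::size_t r = 0; r < d; ++r) est = std::min(est, table[r][cell(r, key)]);
        return est;
    }
};

// Diginorm-style decision: accept a read iff the median of its k-mer count
// estimates is below the coverage threshold; only accepted reads update the sketch.
bool diginorm_accept(const std::string& read, std::size_t k,
                     uint32_t threshold, CountMinSketch& cms) {
    std::vector<uint32_t> counts;
    for (std::size_t i = 0; i + k <= read.size(); ++i)
        counts.push_back(cms.estimate(read.substr(i, k)));
    if (counts.empty()) return false;
    std::nth_element(counts.begin(), counts.begin() + counts.size() / 2, counts.end());
    if (counts[counts.size() / 2] >= threshold) return false;   // read is redundant
    for (std::size_t i = 0; i + k <= read.size(); ++i)
        cms.add(read.substr(i, k));
    return true;
}

int main() {
    CountMinSketch cms(1u << 20, 4);
    // The first copies of a read are accepted; once the median k-mer count
    // reaches the threshold, further identical reads are rejected as redundant.
    std::string read = "ACGTACGTTGCAACGTACGTTGCA";
    for (int i = 0; i < 25; ++i) diginorm_accept(read, 20, 20, cms);
    return 0;
}
```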

Diginorm is a pioneering work. However, the following points, which are important from the biological or computational point of view, are not covered by Diginorm. We have included them in our algorithm called Bignorm [29 SPP]:

  1. (i)

    we incorporate the important phred quality score into the decision whether to accept or to reject a read, using a quality threshold. This allows a tuning of the filtering process towards high-quality assemblies by using different thresholds.

  2. (ii)

    when deciding whether to accept or to reject a read, we do a detailed analysis of the numbers in the count vectors. Diginorm merely considers their medians.

  3. (iii)

    we offer a better handling of the \(\textsf{N}\) case, that is, when the sequencing machine could not decide for a particular nucleotide. Diginorm simply converts all \(\textsf{N}\) to \(\textsf{A}\), which can lead to false k–mer counts.

  4. (iv)

    we provide a substantially faster implementation. For example, we include fast hashing functions (see [10, 30]) for counting k–mers through the count-min sketch data structure (CMS), and we use the C programming language and OpenMP.

Let us fix the following parameters:

  • \(\textsf{N}\)-count threshold \(N_0 \in \mathbb {N}\), which is 10 by default;

  • quality threshold \(Q_0 \in \mathbb {Z}\), which is 20 by default;

  • rarity threshold \(c_0 \in \mathbb {N}\), which is 3 by default;

  • abundance threshold \(c_1 \in \mathbb {N}\), which is 20 by default;

  • contribution threshold \(B \in \mathbb {N}\), which is 3 by default.

When our algorithm has to decide whether to accept or reject a read \(i \in \mathbb {N}\), it performs the following steps: If the number of \(\textsf{N}\) symbols counted over all read positions is larger than \(N_0\), the read is rejected. Otherwise, those parts of the read having phred scores at or above \(Q_0\) are converted into a vector H of high-quality k–mers.

Using the CMS, it is then checked how many times these k–mers have been seen in the accepted reads so far (function \(\widehat{c}(\mu )\)) and two counters hold the results:

$$\begin{aligned} b_0&:= |\{ \mu \in H \,;\,\, \widehat{c}(\mu )< c_0 \}|, \\ b_1&:= |\{ \mu \in H \,;\,\, c_0 \le \widehat{c}(\mu ) < c_1 \}|\end{aligned}$$
figure b

Note that the frequencies are determined via CMS counters and do not consider the position p at which the k–mer is found in the read string. The read is accepted if and only if at least one of the following conditions is met:

$$\begin{aligned} b_0&> k,\end{aligned}$$
(1)
$$\begin{aligned} \sum _{s=1}^{m(i)} b_1&\ge B. \end{aligned}$$
(2)

The motivation for condition (1) is as follows. According to [15], most errors of the Illumina sequencing platform are single substitution errors, and the probability of appearance of an erroneous k–mer in the genome, caused by an incorrect reading of a nucleotide, is quite low. Thus, k–mers produced by single substitution errors are likely to have very small counter values in the CMS (less than \(c_0\)) and can be considered as rare k–mers. One such error can affect at most k k–mers. So if we count more than k rare k–mers, they are most likely not the result of one single substitution error. If we assume that the probability of multiple single substitution errors in a read is smaller than the probability of error-free rare k–mers, we should accept this read.

Condition (2) says that in the read, there are enough (namely at least B) k–mers where each of them appears too frequently to be a read error (CMS counters at least \(c_0\)), but not that abundant that it should be considered redundant (CMS counters less than \(c_1\)).
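Putting the pieces together, the decision rule can be condensed as follows (our own reformulation with the default thresholds from above; an exact hash map stands in for the CMS estimates \(\widehat{c}\), and condition (2) is simplified to \(b_1 \ge B\)):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct BignormParams {              // default thresholds from the text
    int N0 = 10;                    // N-count threshold
    int Q0 = 20;                    // phred quality threshold
    unsigned c0 = 3, c1 = 20;       // rarity and abundance thresholds
    unsigned B = 3;                 // contribution threshold
};

// Exact k-mer counts stand in here for the CMS estimates c^(mu).
using Counts = std::unordered_map<std::string, unsigned>;

bool bignorm_accept(const std::string& read, const std::vector<int>& phred,
                    std::size_t k, const BignormParams& p, Counts& counts) {
    // 1. Reject reads with too many N symbols.
    int n_count = 0;
    for (char c : read) if (c == 'N') ++n_count;
    if (n_count > p.N0) return false;

    // 2. Collect high-quality k-mers: every position of a window must reach
    //    the quality threshold Q0 and must not be an N.
    std::vector<std::string> H;
    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        bool good = true;
        for (std::size_t j = i; j < i + k; ++j)
            if (phred[j] < p.Q0 || read[j] == 'N') { good = false; break; }
        if (good) H.push_back(read.substr(i, k));
    }

    // 3. Count rare (b0) and medium-abundant (b1) k-mers among H.
    unsigned b0 = 0, b1 = 0;
    for (const auto& mu : H) {
        unsigned c = counts.count(mu) ? counts[mu] : 0u;
        if (c < p.c0) ++b0;
        else if (c < p.c1) ++b1;
    }

    // 4. Accept iff condition (1) or (simplified) condition (2) holds;
    //    the k-mers of accepted reads are added to the counting structure.
    bool accept = (b0 > k) || (b1 >= p.B);
    if (accept)
        for (const auto& mu : H) ++counts[mu];
    return accept;
}
```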

Results for Single-Cell Assemblies. We tested Bignorm on 13 bacterial single-cell data sets and were able to remove up to \(90\%\) of the reads without significant loss of assembly quality. Some results (medians over all samples) are given in the following table; see also Fig. 3:

Measurement                  Filtered/Unfiltered (%)
Read count                   2.85
Run time SPAdes Assembler    3.57
Largest Contig               97.56
N50                          90.84
Mean Phred Score             103.00

Fig. 3. Reads kept

Bignorm heavily cuts away redundant reads (mean, Fig. 4, left-hand side) but is careful in critical regions (P10, Fig. 4, right-hand side).

Results for Metagenomic Assemblies. We tested Bignorm on metagenomic data sets. For data sets with reads of length about 250 base pairs, the results are quite promising and stable. Compared to the single-cell case, the results are not that impressive. However, compared to the state-of-the-art approach of sub-sampling data sets which are too big to be assembled on the given hardware (i.e., selecting a certain proportion of the reads at random), we could show that read filtering yields results which are nearly as good as those of assembling the complete data set, while using about the same amount of RAM and run time as the sub-sampling approach. The following table gives an impression of the results:

 

                       Raw       Filtered   Sub-sampled (3x)
Largest Contig         2183      1731       1356 ± 143
Total length           385282    358036     136552 ± 8406
Genome fraction (%)    15.0      14.0       5.4 ± 0.3
Predicted genes        689       648        262 ± 12
RAM needed (GB)        212       100        96 ± 0.6
Run time (h)           151       52         52 ± 2

Fig. 4. Coverage: mean and critical region

2 Counting k–mers in External Memory (EM)

Many bioinformatics algorithms (e.g., assemblers, error correctors, read normalization) are based on k–mers, which requires counting them (mostly for \(21\le k\le 127\)). As bioinformatics data sets are growing much faster than RAM sizes, new computational models are needed. (We could show that hash-based counting, which is state of the art in current software, will produce \(\mathscr {O}\,(n^{2})\) hash table dumps when the number of different k–mers is much bigger than the number of slots in the hash table.)

Some examples of recent k–mer counting algorithms are:

  • jellyfish (2) [19]: the standard, hash-based k–mer counter

  • dsk [26]: the first EM-based counter

  • kmc (2/3) [8, 9, 17]: the state–of–the–art EM-based counter.

  • bloomfish [12]: MPI-based Map–Reduce framework for counting

  • squeakr [21]: based on counting quotient filter (a probabilistic data structure)

  • turtle [27]: using a Bloomfilter and sort–and–compact algorithm

We need some notation:

  • All strings are based on the biological alphabet \(\varSigma =\{\textsf{A}, \textsf{C}, \textsf{G}, \textsf{T}\}\).

  • So the base set for a k–mer is \(\mathscr {M}:=\varSigma ^k\) and \(m:=|\mathscr {M}|=4^k\).

  • The input of a k–mer counter is \(\eta \in \underbrace{\mathscr {M}\times \mathscr {M}\times \cdots \times \mathscr {M}}_{n\ \text{times}}=\mathscr {M}^n\).

  • Denote by \(\mathscr {C}:=\{p\in \mathscr {M}\mid \exists _{i\le n}:p=\eta _i\}\) the set of k–mers occurring in the input at least once, \(c:=|\mathscr {C}|\).

  • Let R be the size (in bytes) of the RAM available.

  • Let B be the number of bytes needed to count each element of \(\mathscr {C}\).

2.1 Counting in RAM

For small values of k, counting can be done in RAM with a straightforward algorithm. If \(m\le R/B\), the following \(\mathscr {O}\,({n+m})\) algorithm can be used:

figure c

For \(k=19\) and one byte per counter, \(4^{19}\) bytes \(\approx 275\) GB of RAM are needed.
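For such small k, the direct-addressing scheme can be sketched as follows (our own illustration with saturating one-byte counters; k–mers containing \(\textsf{N}\) are skipped):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Map a nucleotide to its 2-bit code; -1 for N or any other character.
int code(char c) {
    switch (c) { case 'A': return 0; case 'C': return 1; case 'G': return 2; case 'T': return 3; }
    return -1;
}

int main() {
    const std::size_t k = 8;                        // 4^8 = 65536 counters
    std::vector<uint8_t> counter(1ull << (2 * k), 0);

    std::vector<std::string> reads = {"ACGTACGTACGT", "TTTTACGTACGT"};
    for (const auto& r : reads) {
        for (std::size_t i = 0; i + k <= r.size(); ++i) {
            uint64_t idx = 0;
            bool ok = true;
            for (std::size_t j = i; j < i + k; ++j) {
                int c = code(r[j]);
                if (c < 0) { ok = false; break; }
                idx = (idx << 2) | static_cast<uint64_t>(c);
            }
            if (ok && counter[idx] < 255) ++counter[idx];   // saturating counter
        }
    }

    // Report every k-mer seen at least twice, decoding the index back to a string.
    for (uint64_t idx = 0; idx < counter.size(); ++idx) {
        if (counter[idx] < 2) continue;
        std::string kmer(k, 'A');
        for (std::size_t j = 0; j < k; ++j)
            kmer[k - 1 - j] = "ACGT"[(idx >> (2 * j)) & 3];
        std::printf("%s %u\n", kmer.c_str(), (unsigned)counter[idx]);
    }
    return 0;
}
```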

Hash–Based Counting. Most state–of–the–art k–mer counting programs are based on hash algorithms using open addressing:

  • Hash table with h entries of size \(B+\lceil {\frac{\log _2m}{8}}\rceil =B+\lceil {\frac{\log _2 4^k}{8}}\rceil =B+\lceil {\frac{k}{4}}\rceil \rightarrow h(B+\lceil {\frac{k}{4}}\rceil )\le R\)

  • For hash size \(h\gg c\) the time complexity is \(\mathscr {O}\,({n+h})\). But if \(h \approx c\) the run time may increase to \(\mathscr {O}\,({nh})\) and if \(c>h\), the program will fail (or dump to external memory)

  • Most existing programs will dump the full hash tables and merge them afterwards — for bigger data sets, this merging phase may need days and terabytes of external memory. The runtime depends linearly on the expected value \(\mathbb {E}[d(h,n)]\) of the number of hash table dumps. We can show the following formula for the expected value.

Theorem 1

(Gallus, Srivastav, Wedemeyer 2021). When counting a set of n elements of a population with K different, normally distributed types using a hash table of size h, the expected value of the number of hash table dumps \(d(h,n)\) is

$$\begin{aligned} \mathbb {E}[d(h,n)]=n \text { }\log _{(1-\frac{h}{c_n})}\left( 1-\frac{1}{c_n}\right) , \end{aligned}$$
(3)

where \(c_n=K(1-(1-\frac{1}{K})^n)\) is the expected number of different (normally distributed) types in a set of size n.

This formula is the basis for further quantifying the log-term in (3). If one can show that this log-term behaves linearly or sublinearly in n when singletons are included in the set of k–mers, it would match experimental observations. In fact, a constant fraction of the k–mers can be assumed to be sequencing errors, each of which occurs exactly once.
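For intuition, formula (3) can be evaluated directly; the numbers below are purely illustrative and not taken from the paper (note that the formula requires \(h < c_n\)):

```cpp
#include <cmath>
#include <cstdio>

// Expected number of hash table dumps according to Theorem 1:
// E[d(h,n)] = n * log_{1 - h/c_n}(1 - 1/c_n),  with  c_n = K * (1 - (1 - 1/K)^n).
double expected_dumps(double h, double n, double K) {
    double c_n = K * (1.0 - std::pow(1.0 - 1.0 / K, n));
    return n * std::log(1.0 - 1.0 / c_n) / std::log(1.0 - h / c_n);
}

int main() {
    // Illustrative scenario: 1e9 k-mer occurrences, 1e8 distinct k-mers,
    // hash tables with 1e7 and 5e7 slots.
    std::printf("h = 1e7: %.1f dumps\n", expected_dumps(1e7, 1e9, 1e8));
    std::printf("h = 5e7: %.1f dumps\n", expected_dumps(5e7, 1e9, 1e8));
    return 0;
}
```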

2.2 Counting in External Memory

kmc3 is the presently leading program using the external memory model. It works as follows:

  • the input is parsed into \(k-x\)–mers (a combination of up to 3 k–mers)

  • they are split into a prefix and a suffix, the suffixes are written to one temporary file per prefix

  • each temporary file is loaded one by one into RAM, sorted (radix sort, library radul)

  • the sorted \(k-x\)–mers are unified and counted

  • written to a pair of special binary files (one index-file for the prefixes and one with the suffixes and the counts)
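The prefix/suffix split of the second and third step can be illustrated by the following toy sketch (our own code, far from kmc3's actual implementation; the prefix length of three bases and the fixed-width suffix encoding are arbitrary choices):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int code(char c) {                   // 2-bit nucleotide code, -1 for N
    switch (c) { case 'A': return 0; case 'C': return 1; case 'G': return 2; case 'T': return 3; }
    return -1;
}

int main() {
    const std::size_t k = 12, p = 3;                 // 4^3 = 64 prefix buckets
    std::vector<std::FILE*> bucket(1u << (2 * p));
    for (auto& f : bucket) f = std::tmpfile();       // one temporary file per prefix

    std::vector<std::string> reads = {"ACGTACGTACGTACGT", "TTTTACGTACGTACGT"};
    for (const auto& r : reads) {
        for (std::size_t i = 0; i + k <= r.size(); ++i) {
            uint64_t packed = 0;
            bool ok = true;
            for (std::size_t j = i; j < i + k; ++j) {
                int c = code(r[j]);
                if (c < 0) { ok = false; break; }
                packed = (packed << 2) | (uint64_t)c;
            }
            if (!ok) continue;
            uint64_t prefix = packed >> (2 * (k - p));              // first p bases
            uint64_t suffix = packed & ((1ull << (2 * (k - p))) - 1);
            std::fwrite(&suffix, sizeof suffix, 1, bucket[prefix]); // suffix only
        }
    }
    // A real counter would now load each bucket, radix-sort it and count equal
    // suffixes; here we only report how many suffixes each bucket received.
    for (std::size_t b = 0; b < bucket.size(); ++b) {
        long bytes = std::ftell(bucket[b]);
        if (bytes > 0)
            std::printf("bucket %zu: %ld suffixes\n", b, bytes / (long)sizeof(uint64_t));
        std::fclose(bucket[b]);
    }
    return 0;
}
```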

Drawback of kmc3: The output files of kmc3 are not completely sorted (due to the introduction of \(k-x\)–mers in kmc2). Therefore,

  • they need to be loaded into RAM completely for read out

  • exporting to other formats takes more time than counting

  • no compression is in place (although the suffix–files are highly compressible)

As a result, even though kmc3 is the fastest EM k–mer counter available (and the fastest k–mer counter overall under RAM restriction), it is not the perfect choice to be used as a counting module for an EM assembler.

Based on STXXL 1.4.1 [5], in 2018 Christopher Nehls [20] from Kiel University developed a k–mer counter called xsc which uses a sorting based approach:

  • generate k–mers from input

  • sort the k–mers (using the STXXL EM sorter)

  • count the k–mers

For \(k\le 32\), xsc outperformed jellyfish and was at least competitive with dsk, but kmc3 was always faster. For \(k>32\) (using uint128 and uint256 classes), xsc was not competitive with the existing counters. The main bottleneck of xsc is the overloaded relational operator (operator<).
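The effect can be sketched as follows (our own illustration; std::sort in RAM stands in for the STXXL external-memory sorter): for \(k\le 32\), a k–mer fits into a single 2-bit-packed uint64, so comparisons are plain integer comparisons and counting reduces to one linear scan over the sorted sequence.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int code(char c) {                   // 2-bit nucleotide code, -1 for N
    switch (c) { case 'A': return 0; case 'C': return 1; case 'G': return 2; case 'T': return 3; }
    return -1;
}

int main() {
    const std::size_t k = 21;                        // fits into 42 bits of a uint64
    std::vector<std::string> reads = {"ACGTACGTACGTACGTACGTACGTACGT",
                                      "ACGTACGTACGTACGTACGTACGTACGA"};
    // 1. Generate 2-bit-packed k-mers.
    std::vector<uint64_t> kmers;
    for (const auto& r : reads)
        for (std::size_t i = 0; i + k <= r.size(); ++i) {
            uint64_t packed = 0;
            bool ok = true;
            for (std::size_t j = i; j < i + k; ++j) {
                int c = code(r[j]);
                if (c < 0) { ok = false; break; }
                packed = (packed << 2) | (uint64_t)c;
            }
            if (ok) kmers.push_back(packed);
        }
    // 2. Sort (xsc delegates this step to the STXXL external-memory sorter).
    std::sort(kmers.begin(), kmers.end());
    // 3. Count runs of equal k-mers in one linear scan.
    for (std::size_t i = 0; i < kmers.size();) {
        std::size_t j = i;
        while (j < kmers.size() && kmers[j] == kmers[i]) ++j;
        std::printf("%016llx %zu\n", (unsigned long long)kmers[i], j - i);
        i = j;
    }
    return 0;
}
```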

2.3 Counting Using a Bloom Filter

Roy et al. [27] stated that more than 50% of all k–mers in a sequencing data set may be singletons, which are not of interest as they were probably introduced by errors. To utilize this, their k–mer counter Turtle uses an upstream Bloom filter to save space and time in a sorting-based approach named ‘sort-and-compact’.
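A minimal sketch of this idea (our own illustration with two toy hash functions, not Turtle's implementation): a k–mer only enters the expensive counting structure from its second occurrence on, so singletons never reach it.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// A tiny Bloom filter with two hash functions derived from std::hash.
struct Bloom {
    std::vector<bool> bits;
    explicit Bloom(std::size_t m) : bits(m, false) {}
    std::size_t h1(const std::string& s) const { return std::hash<std::string>{}(s) % bits.size(); }
    std::size_t h2(const std::string& s) const { return std::hash<std::string>{}(s + "#") % bits.size(); }
    bool possibly_seen(const std::string& s) const { return bits[h1(s)] && bits[h2(s)]; }
    void insert(const std::string& s) { bits[h1(s)] = bits[h2(s)] = true; }
};

int main() {
    Bloom bloom(1u << 20);
    std::unordered_map<std::string, unsigned> counts;    // the 'expensive' structure

    std::vector<std::string> stream = {"ACGT", "ACGT", "TTTT", "ACGT", "GGGG"};
    for (const auto& kmer : stream) {
        if (bloom.possibly_seen(kmer)) ++counts[kmer];   // counted from the 2nd occurrence on
        else bloom.insert(kmer);                         // first occurrence only sets bits
    }
    // The singletons TTTT and GGGG never enter the hash map; for the others we
    // add back the one occurrence that was swallowed by the Bloom filter.
    for (const auto& kv : counts)
        std::printf("%s %u\n", kv.first.c_str(), kv.second + 1);
    return 0;
}
```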

Fig. 5. Comparison of run times with and without a Bloom filter

We developed a program which combines the ideas of kmc with the usage of a Bloom filter. However, experiments show that the cost of running the Bloom filter is higher than the savings (Fig. 5). What went wrong? Say we have 100 distinct k–mers:

  • 50 singletons (occurring once)

  • 50 ‘good’ k–mers occurring \(100\times \) on average

Our input then contains 5050 k–mer occurrences; the Bloom filter removes only the 100 first occurrences, i.e., about \(2\%\) of the input, which is not enough to compensate for the running time of the Bloom filter.

Our Current Approach. We have developed the following algorithm, which combines sorting and kmc; experiments are ongoing:

figure d

3 A Streaming Algorithm for the Longest Path Problem

In de novo genome assembly, finding long genome sequences, called contigs, is the fundamental problem. It can be understood as computing a very long path in the associated graph, for example the de Bruijn graph ([3]). Unfortunately, computing the longest path in a graph is an NP-hard problem, and the situation is even worse if the graph is very large. In this chapter, we present a new algorithm for computing a long path, which is surprisingly competitive with RAM-based algorithms.

Graph streaming is a very efficient concept to handle big graphs, where the number of edges is far too large for computations in the main memory. The semi-streaming model was introduced by Feigenbaum et al. [11], and can be briefly described as follows:

In the semi-streaming model, the algorithm is allowed to use at most \(\mathscr {O}\,({n\cdot \textrm{polylog}(n)})\) bits of RAM, where \(n\) is the number of vertices of the input graph. Because of this restriction, dense graphs, where the number of edges is of the order \(\omega (n\cdot \textrm{polylog}(n))\), cannot be processed entirely in RAM. Instead, the edges are presented in a stream where the edges are in no particular order. Typically, it is desired to make only a small number of passes (over the input stream).

3.1 Our Tree-Based Algorithm

We give a streaming algorithm for the longest path problem in undirected graphs with a proven per-edge processing time of \(\mathscr {O}\,({n})\), published in the proceedings of the European Symposium on Algorithms 2016 [16 SPP]. Our algorithm works in two phases, which we outline here briefly; details can be found in [16 SPP]. In the first phase, global information on the graph is gathered in the form of a constant number of spanning trees \(T_1,\ldots ,T_\tau \). This is possible in the streaming model since, roughly speaking, for a spanning tree we can “take edges as they come”. A spanning tree can be constructed in just one pass; we however use multiple passes and limit the maximum degree during the first passes in order to favor path-like structures and avoid clusters of edges. Experiments clearly indicate that this degree limiting is essential for solution quality. The spanning trees fit into RAM, since we consider \(\tau \) as constant (we will in fact have \(\tau =1\) or \(\tau =2\) in the experiments). After construction of the \(\tau \) trees, they are merged into one graph U by taking the union of their edges. Then we use standard algorithms to determine a long path P in U, isolate P, and finally add enough edges around P to obtain a tree T.

Then, in the second phase, we conduct further passes during which we test if the exchange of single edges of T can improve the longest path in it. (A longest path in a tree can be found by conducting DFS two times [2]; the length of a longest path in a tree is its diameter.) The main challenge in the second phase is to quickly determine which edges should be exchanged. We show that this decision can be made in linear time, hence yielding a per-edge processing time of \(\mathscr {O}\,({n})\).
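For reference, the double-search routine for the longest path in a tree is a standard textbook algorithm; the following is our own sketch (using BFS, which is equivalent here), not code from [16 SPP]:

```cpp
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

// BFS over an adjacency list; returns a farthest vertex from s and all distances.
std::pair<int, std::vector<int>> bfs(const std::vector<std::vector<int>>& adj, int s) {
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> q;
    dist[s] = 0;
    q.push(s);
    int far = s;
    while (!q.empty()) {
        int v = q.front(); q.pop();
        if (dist[v] > dist[far]) far = v;
        for (int w : adj[v])
            if (dist[w] == -1) { dist[w] = dist[v] + 1; q.push(w); }
    }
    return {far, dist};
}

int main() {
    // A small tree: a path 0-1-2-3 with extra leaves 4 (at node 1) and 5 (at node 2).
    std::vector<std::vector<int>> adj(6);
    auto edge = [&](int a, int b) { adj[a].push_back(b); adj[b].push_back(a); };
    edge(0, 1); edge(1, 2); edge(2, 3); edge(1, 4); edge(2, 5);

    int u = bfs(adj, 0).first;      // 1st search: farthest vertex from an arbitrary start
    auto [v, dist] = bfs(adj, u);   // 2nd search: farthest vertex from u
    std::printf("longest path length (diameter) = %d, endpoints %d and %d\n", dist[v], u, v);
    return 0;
}
```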

For a set X, we write \(x\ \textrm{unif}\ X\) to express that x is drawn uniformly at random from X.

figure e
figure f
figure g

An example run of the algorithm is shown in Fig. 6.

Fig. 6. Example run of the algorithm’s steps.

3.2 Linear Complexity of the Streaming Algorithm

If the fundamental cycle C (arising when a streamed edge e is added to the current tree) is of length \(\varOmega \,(n)\), then a naive implementation requires \(\varOmega \,(n^2)\) time to find an edge \(e'\) to remove (temporarily remove each edge on the cycle and invoke the Dijkstra algorithm). However, we have:

Theorem 2

(Kliemann, Schielke, Srivastav 2016). Phase 2 can be implemented with per-edge processing time \(\mathscr {O}\,({n})\).

Proof

An \(\mathscr {O}\,({n})\) bound is clear for all lines of Algorithm 4, except Line 9 and Line 11. Denote

$$\begin{aligned} \ell ' := \max _{f \in E(C) \setminus \{e\}} \max \{|P| : P\text { is path in }T'-f\text { and }e \in E(P)\} \end{aligned}$$

and let \(R' \subseteq E(C) \setminus \{e\}\) be the set of edges where this maximum is attained. Then the following implications hold: \(\ell ' \le |P| \implies \ell ^* \le |P|\) and \(\ell ' > |P| \implies \ell ' = \ell ^*\). This is because if a longest path in \(T'-f\) is supposed to be longer than P, it must use e (since otherwise it would be a path in T). Hence it suffices to determine \(\ell '\), and if \(\ell ' > |P|\), to find an element of \(R'\).

Denote by \(C=(v_1,\ldots ,v_k)\) the fundamental cycle, for some \(k \in \mathbb {N}\), written so that \(e = v_1v_{k}\). When computing \(\ell '\), we can restrict to paths in \(T'\) of the form

$$\begin{aligned} (\ldots ,v_s,v_{s-1},\ldots ,v_1,v_k,v_{k-1},\ldots ,v_t,\ldots ) \end{aligned}$$
(4)

for \(1 \le s < t \le k\), where \(v_s\) is the first and \(v_t\) is the last common vertex, respectively, of the path and C. For each i, let \(T_i\) be the connected component of \(v_i\) in \(T-E(C)\), i.e., \(T_i\) is the part of T that is reachable from \(v_i\) without using the edges of C. Denote \(\ell (T_i)\) the length of a longest path in \(T_i\) that starts at \(v_i\) and denote \(c_i := \ell (T_i) + i - 1\) and \(a_i := \ell (T_i) + k-i\). Then a longest path entering C at \(v_s\) and leaving it at \(v_t\), as in (4), has length exactly \(c_s + a_t\). Hence we have to determine a pair (st) such that \(c_s + a_t\) is maximum (this maximum value is \(\ell '\)); we call such a pair an optimal pair. If the so determined value \(\ell '\) is not greater than |P|, then nothing further has to be done (the edge e cannot give an improvement). Otherwise, having constructed our optimal pair (st), we pick an arbitrary edge (e.g., uniformly at random) from \(\{v_i v_{i+1} : s \le i < t\}\), which are the edges between \(v_s\) and \(v_t\) on C. We show that the following algorithm computes the value \(\ell '\) and an optimal pair in \(\mathscr {O}\,({n})\).

figure h

The total of the computations in Line 1 can be done by DFS in \(\mathscr {O}\,({n})\), and the loop in \(\mathscr {O}\,({k}) \le \mathscr {O}\,({n})\). We prove that the final \((s,t)\) is optimal. For fixed t, the best possible length \(c_s + a_t\) is obtained if t is combined with an \(s < t\) where \(c_s \ge c_j\) for all \(j < t\). In the algorithm, for each t (when \(t=i+1\) in the loop) we combine \(a_t\) with the maximum \(\max _{j<t} c_j\) (stored in the variable M). Thus, when the algorithm terminates, \(L=\ell '\) and \(c_s + a_t = \ell '\).
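A compact sketch of this scan (our own code mirroring the prefix-maximum argument; the values \(c_i\) and \(a_i\) are assumed to be precomputed from the subtree path lengths \(\ell (T_i)\), and the toy numbers are arbitrary):

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Given c[0..k-1] and a[0..k-1] (0-indexed), find a pair (s, t) with s < t
// maximizing c[s] + a[t] in one linear scan, keeping a running prefix maximum of c.
std::pair<std::size_t, std::size_t> optimal_pair(const std::vector<long>& c,
                                                 const std::vector<long>& a) {
    std::size_t best_s = 0, s = 0, t = 1;
    long best = c[0] + a[1];                       // the pair (0, 1) to start with
    for (std::size_t i = 1; i + 1 < c.size(); ++i) {
        if (c[i] > c[best_s]) best_s = i;          // best_s = argmax_{j <= i} c[j] (the variable M)
        if (c[best_s] + a[i + 1] > best) {         // combine a_t (t = i + 1) with M
            best = c[best_s] + a[i + 1];
            s = best_s;
            t = i + 1;
        }
    }
    return {s, t};
}

int main() {
    // Toy values for a fundamental cycle with k = 6 vertices.
    std::vector<long> c = {2, 7, 1, 4, 3, 0};
    std::vector<long> a = {0, 3, 6, 2, 8, 1};
    auto [s, t] = optimal_pair(c, a);
    std::printf("optimal pair (s, t) = (%zu, %zu), value %ld\n", s, t, c[s] + a[t]);
    return 0;
}
```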

Corollary 1

Our streaming algorithm (with the two phases as in Algorithm 3 and Algorithm 4) can be implemented with a per-edge processing time of \(\mathscr {O}\,({n})\).

We turn to the memory requirement. Denote by b the amount of RAM required to store one vertex or one pointer (e.g., \(b=32\) bit or \(b=64\) bit) and call \(n \cdot b\) one unit.

Theorem 3

Our streaming algorithm (with the two phases as in Algorithm 3 and Algorithm 4) conducts at most \(2 q_1 + q_2\) passes. Moreover, the algorithm can be implemented such that the RAM requirement is at most \(( \max \{4 \tau , \, 2\tau + 4\} \cdot n + c) \cdot b\) with a constant c.

The proof can be found in [16 SPP].

An experimental study was conducted on randomly generated instances with different structure, including ones created with the generator for hyperbolic geometric random graphs [18 SPP]. Different variants of our streaming algorithm are compared with four RAM algorithms: Warnsdorf and Pohl-Warnsdorf (two related classical heuristics [23, 24]), Pongrácz (a recently published heuristic [25]), and a simple randomized DFS. Experiments show that although we never do more than 11 passes, results delivered by our algorithm are competitive. We deliver at least \(71\%\) of the best result delivered by any of the tested RAM algorithms, with the exception of preferential attachment graphs. By considering low percentiles, we observe a similar quality without any restriction on the graph class. This is a good result also in absolute terms, since we observe that for each graph class and set of parameters, there is one algorithm that on average gives a path of length \(0.84 \cdot n\), i.e., \(84\%\) of a Hamilton path. On some graph classes, we outperform any of the tested RAM algorithms, which makes our algorithm interesting even outside of the streaming setting.

4 A One-Pass Streaming Algorithm for Computing an Euler Tour in Graphs

Large genome sequences (contigs) can be computed in de novo genome assembly with so-called de Bruijn graphs on k–mers ([3, 22]). Such graphs are directed. For very large graphs, the computation of an Euler tour cannot be done with known RAM-based algorithms, and techniques like semi-streaming or external-memory algorithms are sought. In this chapter, we present a survey of our optimal one-pass streaming algorithm for computing an Euler tour in an undirected graph. Our algorithm might be helpful for designing a semi-streaming algorithm to compute Euler tours in a directed graph, which is an open problem.

Let G be a graph on n nodes and m edges given in the form of a data stream. We study the problem of finding an Euler tour in G. We present a survey on the first one-pass streaming algorithm computing an Euler tour of G in the form of an edge successor function with only \(\mathscr {O}(n\log (n))\) RAM based on our paper [13 SPP]. The memory requirement is optimal for this setting according to Sun and Woodruff [28].

4.1 The W-Streaming Model and a Lower Bound

The W-streaming model was introduced by Demetrescu et al. [7]. It is a relaxation of the classical streaming model. At each pass, an output stream is written, which becomes the input stream of the next pass. For an Euler tour the successor of each edge in the tour is uniquely defined by its successor function, say \(\delta \). Then the output stream has the following form, where the edges are unordered.

\(\ldots ,\; e,\; \delta (e),\; \delta (\delta (e)),\; \ldots \)

Finding an Euler tour in trees in W-streaming has been studied in multiple papers (e.g.,  [6]), but the general Euler tour problem has hardly been considered in a streaming model. There are some general results for transferring PRAM algorithms to the W-streaming model. In general, lower bounds for the complexity of streaming algorithms are hard to prove. Interestingly, Sun and Woodruff [28] showed that even a one-pass streaming algorithm for verifying whether a graph is Eulerian needs \(\varOmega (n \log (n))\) RAM, and this amount of RAM is also required for a one pass streaming algorithm for finding an Euler tour.

4.2 The Problem of Cycle Merging

The Euler tour problem in the RAM model can easily be solved by computing edge-disjoint cycles and merging them. We will see why this becomes a problem with limited RAM. A cycle is a closed walk on the edges of G in which no node is visited more than once (apart from the common start and end node). The following result is well-known in graph theory.

Theorem 4

If a graph with m edges contains an Euler Tour, it can be decomposed into at most \(\frac{m}{3}\) pairwise edge-disjoint cycles.

figure i

In fact, this can be accomplished in one pass.

Theorem 5

During the pass, the edges from the input stream can be arranged in the form of a sequence of edge-disjoint cycles.

Proof

  1. 1.

    Start with \(T:=\emptyset \)

  2. 2.

    While T is cycle-free, add edges from the input stream to T

  3. 3.

    When a cycle occurs in T, store it and delete all its edges from T. Go to Step 2.

At any time, T contains at most n edges.

If \(T\ne \emptyset \) at the end, some nodes have odd degree; thus G does not contain an Euler tour.
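The following sketch illustrates this one-pass cycle extraction on a toy stream (our own code, not the chapter's implementation; a recursive DFS finds the unique tree path that the closing edge turns into a cycle):

```cpp
#include <cstdio>
#include <set>
#include <vector>

// Maintain a cycle-free edge set T; whenever a streamed edge closes a cycle,
// report the cycle and delete its edges from T (the procedure of Theorem 5).
struct CycleExtractor {
    std::vector<std::set<int>> adj;
    explicit CycleExtractor(int n) : adj(n) {}

    // DFS for the (unique) path from v to t in the forest T.
    bool path(int v, int t, int parent, std::vector<int>& p) {
        p.push_back(v);
        if (v == t) return true;
        for (int w : adj[v])
            if (w != parent && path(w, t, v, p)) return true;
        p.pop_back();
        return false;
    }

    void add_edge(int u, int v) {
        std::vector<int> p;
        if (path(u, v, -1, p)) {                    // u and v already connected: cycle found
            std::printf("cycle:");
            for (int x : p) std::printf(" %d", x);
            std::printf(" %d\n", u);
            for (std::size_t i = 0; i + 1 < p.size(); ++i) {   // delete the cycle's tree edges
                adj[p[i]].erase(p[i + 1]);
                adj[p[i + 1]].erase(p[i]);
            }
        } else {
            adj[u].insert(v);
            adj[v].insert(u);
        }
    }
};

int main() {
    CycleExtractor ex(4);                           // an Eulerian multigraph on 4 nodes
    int stream[][2] = {{0, 1}, {1, 2}, {2, 0}, {2, 3}, {3, 0}, {0, 2}};
    for (auto& e : stream) ex.add_edge(e[0], e[1]); // prints two edge-disjoint cycles
    return 0;
}
```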

Obviously and unfortunately, we cannot store all the cycles in the semi-streaming model. The challenge is to merge the cycles as they appear while respecting the memory limitation of \(\mathscr {O}(n\log (n))\). We will use the notion of tours or subtours for cycles, too.

The merging of two tours at one node is easy. We just flip edges in a canonical way and get the new tour:

figure j

Similarly, one can merge several tours at one common node.
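A minimal sketch of this edge flip on the successor function (our own illustration): if one tour enters the common node via an edge in1 and the other via in2, swapping \(\delta (\textit{in1})\) and \(\delta (\textit{in2})\) concatenates the two tours into one.

```cpp
#include <cstdio>
#include <map>
#include <utility>

using Edge = std::pair<int, int>;            // directed edge (from, to)
using Successor = std::map<Edge, Edge>;      // delta: edge -> next edge of its tour

// Merge the two tours passing through a common node by swapping the successors
// of one incoming edge of each tour at that node.
void merge_at(Successor& delta, const Edge& in1, const Edge& in2) {
    std::swap(delta[in1], delta[in2]);
}

int main() {
    // Tour A: 1 -> 2 -> 3 -> 1 and tour B: 2 -> 4 -> 5 -> 2 share node 2.
    Successor delta = {
        {{1, 2}, {2, 3}}, {{2, 3}, {3, 1}}, {{3, 1}, {1, 2}},
        {{2, 4}, {4, 5}}, {{4, 5}, {5, 2}}, {{5, 2}, {2, 4}},
    };
    merge_at(delta, {1, 2}, {5, 2});         // both edges enter the shared node 2

    // Following the successor function from (1,2) now yields one tour of length 6:
    // (1,2) (2,4) (4,5) (5,2) (2,3) (3,1)
    Edge e = {1, 2};
    for (int i = 0; i < 6; ++i) {
        std::printf("(%d,%d) ", e.first, e.second);
        e = delta[e];
    }
    std::printf("\n");
    return 0;
}
```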

The problematic case is the simultaneous merging at two nodes. Here is an example.

figure k

Unfortunately, the result of this merging is two tours, and the merging failed. A problem only occurs if the cycle shares more than one node with an already existing tour. In this case, we have to make sure that edge-swapping is performed at exactly one of these nodes. Every node belongs to at most one tour at a time, thus all nodes of a tour can get the same label.

4.3 The W-Streaming Algorithm and Its Analysis

We proceed to the pseudo-code statement of our streaming algorithm.

figure l
figure m

The output stream is a successor function, i.e. \(e_1\), \(\delta (e_1)\), \(e_2\), \(\delta (e_2)\), \(\ldots \) For \(a,b,c \in V\) with \((a,b), (b,c) \in \vec {E}\), the triple \((a,b,c)\) represents the successor relation \((a,b) \rightarrow \delta \left( (a,b) \right) = (b,c)\). So edge \((b,c)\) is the successor of edge \((a,b)\). The output stream is not necessarily an ordered trail!

The main result is the following theorem [13 SPP].

Theorem 6

(Glazik, Schiemann, Srivastav, 2017). There exists a one-pass W-streaming algorithm with \(\mathscr {O}(n \log n)\) RAM that outputs an Euler tour of the input graph G (if G contains an Euler tour).

We sketch the proof. Let \(\delta \) be a successor function. For \(e \in \vec {E}\), define the equivalence class \([e]_{\delta } = \{f \in \vec {E} \,;\, e \equiv _\delta f\}\), where \(e \equiv _\delta f\) means that \(\delta ^k(e) = f\) for some \(k\in \mathbb {N}\). We identify the successor function with the equivalence classes it induces on \(\vec {E}\).

Lemma 1

(Algebraic Representation [13 SPP], Lemma 1). Let \(\delta \) be a bijective successor function on a directed graph \(\vec {G}=(V,\vec {E})\). Then \(\equiv _\delta \) is an equivalence relation on \(\vec {E}\).

Lemma 2

Let \(\vec {G}=(V,\vec {E})\) be a directed graph with bijective successor function \(\delta \) and the related equivalence relation \(\equiv _\delta \). Then we have:

  1. (i)

    Let \(e\in \vec {E}\) and \(k_1,k_2\in \mathbb {N}\) with \(k_1\ne k_2\) and \(\delta ^{k_1}(e)=\delta ^{k_2}(e)\). Then \(|k_1-k_2|\ge |{[e]}_\delta |\).

  2. (ii)

    For any \(e\in \vec {E}\) we have \(\delta ^{|{[e]}_\delta |}(e)=e\).

Proof

  1. (i):

    \(F_s: \vec {E} \rightarrow \vec {E}\), \(s \in \mathbb {N}\), \(F_s(e') := \delta ^{s(k_1-k_2)}(e')\), where w.l.o.g. \(k_1 > k_2\).

    • \(\delta ^{k_2}(e)\) is a fixed point of \(F_s\).

    • \(M:=\{\delta ^\ell (e); k_2 \le \ell < k_1\}\), \(|M| \le k_1 - k_2\).

    • \([e]_{\delta } \subseteq M\) by the fixed-point property of \(F_s\).

    The assumption \(k_1 - k_2 < |[e]_{\delta }|\) implies \(|M| < |[e]_{\delta }| \le |M|\), a contradiction.

  2. (ii):

    Assume for a moment that there is some \(e_0\in \vec {E}\) with \(\delta ^r(e_0) \ne e_0\), where \(r:= |[e_0]_{\delta }|\).

    • \(M:= \{\delta ^\ell (e_0); 1 \le \ell \le r\} \subseteq [e_0]_{\delta }\)

Case 1: \(e_0 \in M\). Then

$$\begin{aligned} \delta ^0(e_0) = e_0 = \delta ^\ell (e_0) \text { for some }\ell < r. \end{aligned}$$

By (i), \(\ell - 0 \ge r\), a contradiction.

Case 2: \(e_0\notin M\). Then \(|M|<|{[e_0]}_\delta |\). By the pigeonhole principle, there exist \(1\le k_1<k_2\le |{[e_0]}_\delta |\) with \(\delta ^{k_1}(e_0)=\delta ^{k_2}(e_0)\), in contradiction to (i).

Further, a structure theorem is needed. For an edge \(e=(v,w)\) let \(e_{(1)} := v\), \(e_{(2)} := w\).

Theorem 7

(Successor function generates Euler tour [13 SPP], Theorem 3). Let \(\vec {G}=(V,\vec {E})\) be a directed graph with bijective successor function \(\delta \) such that \(e\equiv _\delta e'\) for all \(e,e'\in \vec {E}\). Then \(\delta \) is the successor function of an Euler tour for G.

Let \(\delta ^0\) be the successor function of an edge-disjoint cycle decomposition of G. The algorithm computes a sequence of successor functions \(\delta _0^* = \delta ^0, \delta _1^*, \ldots , \delta _N^* =: \delta ^*\).

Theorem 8

If G is Eulerian, \(\delta ^*\) determines an Euler tour on G.

The following lemma is the backbone of the proof and requires substantial work.

Lemma 3

([13 SPP], Lemma 9). Let \(k\in \{0,\ldots ,N\}\). Then, \(\delta ^*_k\) is bijective and for any \((u,v),(u',v') \in R^*(E)\), we have

  1. (i)

    If \((u,v),(u',v')\) are processed edges, then \((u,v)\equiv _{\delta ^*_k}(u',v')\Leftrightarrow t_k(u)=t_k(u')\).

  2. (ii)

    If (uv) is a processed edge, then \(t_k(u)=t_k(v)\).

  3. (iii)

    If \(t_k(u)=0\), then \((u,v)\equiv _{\delta ^*_k}(u',v')\Leftrightarrow (u,v)\equiv _{\delta }(u',v')\).

Proof

(Proof of Theorem 8). We show that \(\delta ^*\) is bijective and that \(e \equiv _{\delta ^*} e'\) for all \(e, e'\in \vec {E}\); then \(\delta ^*\) is the successor function of an Euler tour by Theorem 7. By Lemma 3, \(\delta ^* = \delta _N^*\) is bijective. For the second property let \(e, e' \in \vec {E}\) with \(e=(u,v)\) and \(e'=(u',v')\). We show \(e \equiv _{\delta ^*} e'\). There exists a u–\(u'\)–path P in G because G is Eulerian. Let \(P=u~ x_1~ x_2 \ldots x_k~ u'\) be such a path.

figure n

By Lemma 3 (ii), label \(t_N\) propagates through P:

$$\begin{aligned}&t_N(u) = t_N(x_1) = t_N(x_2) = \cdots = t_N(x_k) = t_N(u')\\ \Rightarrow ~~~~&t_N(u) = t_N(u')\\ \underset{\text {Lemma }3 \text {(i)}}{\Rightarrow }&e \equiv _{\delta _N^*} e' \end{aligned}$$

In future work we may investigate other routing problems and applications for streaming algorithms using Euler tours.