From Indexing Data Structures to de Bruijn Graphs
New technologies have tremendously increased sequencing throughput compared to traditional techniques, thereby complicating DNA assembly. Hence, assembly programs resort to de Bruijn graphs (dBG) built on the k-mers of short reads to compute a set of long contigs, each being a putative segment of the sequenced molecule. Other types of DNA sequence analysis, as well as the preprocessing of reads for assembly, use classical data structures to index all substrings of the reads. It is thus interesting to exhibit algorithms that build a dBG of order k directly from a pre-existing index, and especially a contracted version of the dBG, in which non-branching paths are condensed into single nodes. Here, we formalise the relationship between suffix trees/arrays and dBGs, and exhibit linear time algorithms for constructing the full or contracted dBGs. Finally, we provide hints explaining why this bridge between indexes and dBGs makes it possible to update the order k of the graph dynamically.
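To make the two objects concrete, here is a minimal Python sketch of a dBG of order k and of its contracted version. It is a naive, dictionary-based construction and not the index-based linear time algorithm described above: it does not use a suffix tree or suffix array. It also assumes one common definition of the arcs, namely that there is an arc from u to v when the (k+1)-mer spelling u followed by the last letter of v occurs in some read. The function names `de_bruijn_graph` and `contracted_node_labels` are hypothetical and chosen for illustration only.

```python
def de_bruijn_graph(reads, k):
    """Map each k-mer occurring in the reads to the set of its successor k-mers.

    Assumption: an arc u -> v exists when the (k+1)-mer u + v[-1] occurs in a read.
    """
    graph = {}
    for read in reads:
        # Every k-mer of the read is a node, even if it has no outgoing arc.
        for i in range(len(read) - k + 1):
            graph.setdefault(read[i:i + k], set())
        # Every (k+1)-mer of the read yields one arc between consecutive k-mers.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def contracted_node_labels(graph):
    """Return the sorted strings spelled by the nodes of the contracted dBG.

    An arc u -> v is merged when v is the only successor of u and u is the
    only predecessor of v, i.e. when it lies inside a non-branching path.
    """
    preds = {u: set() for u in graph}
    for u, succs in graph.items():
        for v in succs:
            preds[v].add(u)

    def extends(u):
        # Successor of u if the arc u -> v can be merged, otherwise None.
        if len(graph[u]) != 1:
            return None
        (v,) = graph[u]
        return v if len(preds[v]) == 1 and v != u else None

    # Nodes that get absorbed into their unique predecessor.
    inner = set()
    for u in graph:
        v = extends(u)
        if v is not None:
            inner.add(v)

    labels, seen = [], set()
    # The first pass starts at path heads; the second picks up isolated cycles.
    for starts in ([u for u in graph if u not in inner], list(graph)):
        for u in starts:
            if u in seen:
                continue
            label, cur = u, u
            seen.add(u)
            nxt = extends(cur)
            while nxt is not None and nxt not in seen:
                label += nxt[-1]  # each merged node adds one letter
                seen.add(nxt)
                cur = nxt
                nxt = extends(cur)
            labels.append(label)
    return sorted(labels)


reads = ["ACGTT", "ACGTA"]
graph = de_bruijn_graph(reads, 3)
print(contracted_node_labels(graph))  # ['ACGT', 'GTA', 'GTT']
```

In this toy input the nodes ACG and CGT form a non-branching path and collapse into the single node ACGT, while CGT branches to GTT and GTA, which remain separate nodes. Rebuilding the graph from scratch for another value of k is what an index-based construction would avoid.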
Keywords: Linear Time Algorithm, Initial Node, Grey Node, String Graph, Eulerian Path