Construction of a de Bruijn Graph for Assembly from a Truncated Suffix Tree

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8977)

Abstract

In the life sciences, determining the sequence of bio-molecules is an essential step towards understanding their functions and interactions inside an organism. Powerful technologies produce huge quantities of short sequencing reads that need to be assembled to infer the complete target sequence. These constraints favour the use of a version of the de Bruijn Graph (DBG) dedicated to assembly. The de Bruijn Graph is usually built directly from the reads, which is time and space consuming. Given a set \(R\) of input words, well-known data structures, like the generalised suffix tree, can index all the substrings of words in \(R\). In the context of DBG assembly, only substrings of length \(k+1\) and some of length \(k\) are useful. A truncated version of the suffix tree can index those efficiently. As indexes are exploited for numerous purposes in bioinformatics, such as read cleaning, filtering, or even analysis, it is important to enable the community to reuse an existing index and build the DBG directly from it. In an earlier work we provided the first algorithms for building the DBG from a suffix tree or a suffix array. Here, we exhibit an algorithm that exploits a reduced version of the truncated suffix tree and computes the DBG from it. Importantly, a variation of this algorithm is also shown to compute the contracted DBG, which offers great benefits in practice. Both algorithms are linear in time and space in the size of the output.
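
To make the objects in the abstract concrete, the following minimal Python sketch builds a de Bruijn graph directly from the reads, which is the usual construction that the abstract describes as time and space consuming. It is not the truncated suffix tree algorithm of the paper. It assumes the common assembly-oriented definition in which nodes are the substrings of length \(k\) of the words in \(R\) and each substring of length \(k+1\) yields an arc from its length-\(k\) prefix to its length-\(k\) suffix. The function name `build_dbg` and the adjacency-dictionary layout are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Naive de Bruijn graph built straight from the reads (illustrative only).

    Assumed definition: nodes are the k-mers occurring in the reads, and every
    (k+1)-mer occurring in a read gives an arc from its prefix to its suffix.
    """
    nodes = set()
    arcs = defaultdict(set)
    for read in reads:
        # Every substring of length k is a node.
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        # Every substring of length k+1 is an arc: prefix -> suffix.
        for i in range(len(read) - k):
            kp1_mer = read[i:i + k + 1]
            arcs[kp1_mer[:-1]].add(kp1_mer[1:])
    # Sorted successor lists make the result deterministic.
    return nodes, {u: sorted(vs) for u, vs in arcs.items()}


if __name__ == "__main__":
    nodes, arcs = build_dbg(["ACGT", "CGTA"], 2)
    print(sorted(nodes))
    print(arcs)
```

With the reads `ACGT` and `CGTA` and \(k = 2\), the graph is the single path AC, CG, GT, TA. A contracted DBG would merge such a non-branching path into one node.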

Keywords

Stringology, Text Algorithms, Indexing Data Structures, De Bruijn Graph, Assembly, Space Complexity, Dynamic Update

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. L.I.R.M.M. and Institut Biologie Computationnelle, Université de Montpellier II, CNRS U.M.R. 5506, Montpellier, France
  2. LITIS EA 4108, NormaStic CNRS FR 3638, Université de Rouen, Rouen, France
