Superstring Graph: A New Approach for Genome Assembly

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9778)

Abstract

With the increasing impact of genomics in the life sciences, the inference of high-quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: high-throughput sequencers yield millions of short sequencing reads that need to be merged based on their overlaps. Overlap graph based algorithms were used with the first generation of sequencers, while de Bruijn graph (DBG) based methods were preferred for the second generation. Because the sequencing coverage varies locally along the molecule, state-of-the-art assembly programs now follow an iterative process that requires the construction of de Bruijn graphs of distinct orders (i.e., sizes of the overlaps). The resulting set of sequences, termed unitigs, provides an important improvement over single DBG approaches. Here, we present a novel approach based on a digraph, the Superstring Graph, that captures all desired sizes of overlaps at once and allows unreliable overlaps to be discarded. With a simple algorithm, the Superstring Graph delivers sequences that include all the unitigs obtained from multiple DBGs as substrings. In linear time and space, it combines the efficiency of a greedy approach with the advantages of using a single graph. In summary, we present a first formal comparison of the outputs of state-of-the-art genome assemblers.
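To make the idea of merging reads "based on their overlaps" concrete, here is a minimal sketch of the classic greedy superstring heuristic, which repeatedly merges the two strings with the longest suffix-prefix overlap. It is not the Superstring Graph algorithm of the paper: the function names `overlap` and `greedy_merge`, the `min_len` threshold (a stand-in for discarding short, unreliable overlaps), and the toy reads are all assumptions made for illustration, and this naive version runs in quadratic time or worse rather than in linear time.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_merge(reads: list[str], min_len: int = 1) -> list[str]:
    """Greedily merge the pair of strings with the largest overlap until
    no pair overlaps by at least `min_len` characters (illustrative only)."""
    # Drop duplicates and reads fully contained in another read.
    strings = [r for r in set(reads)
               if not any(r != s and r in s for s in reads)]
    strings.sort()  # fixed order so ties are broken deterministically
    while True:
        best_len, best_pair = 0, None
        for a, b in permutations(strings, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_pair = k, (a, b)
        if best_pair is None:
            return strings
        a, b = best_pair
        strings.remove(a)
        strings.remove(b)
        # Glue b onto a, writing the shared region only once.
        strings.append(a + b[best_len:])


if __name__ == "__main__":
    toy_reads = ["ATGGC", "GGCAT", "CATTA"]  # hypothetical reads
    print(greedy_merge(toy_reads))
    print(greedy_merge(toy_reads, min_len=4))
```

With the toy reads, every accepted merge uses an overlap of three characters, so raising `min_len` to 4 leaves the reads unmerged. This mirrors, in a very simplified way, the trade-off between overlap size and contiguity that motivates using several de Bruijn graph orders.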

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Bastien Cazaux (1, 2)
  • Gustavo Sacomoto (3)
  • Eric Rivals (1, 2)

  1. LIRMM, Université de Montpellier, CNRS UMR 5506, Montpellier, France
  2. Institut Biologie Computationnelle, Montpellier, France
  3. INRIA Rhône-Alpes and Université Lyon 1, CNRS, UMR 5558, Lyon, France
