Superstring Graph: A New Approach for Genome Assembly

  • Conference paper
Algorithmic Aspects in Information and Management (AAIM 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9778)

Abstract

With the increasing impact of genomics in the life sciences, inferring high-quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: high-throughput sequencers yield millions of short sequencing reads that need to be merged based on their overlaps. Overlap graph based algorithms were used with the first generation of sequencers, while de Bruijn graph (DBG) based methods were preferred for the second generation. Because the sequencing coverage varies locally along the molecule, state-of-the-art assembly programs now follow an iterative process that requires the construction of de Bruijn graphs of distinct orders (i.e., sizes of the overlaps). The set of resulting sequences, termed unitigs, provides an important improvement compared to single-DBG approaches. Here, we present a novel approach based on a digraph, the Superstring Graph, that captures all desired sizes of overlaps at once and allows unreliable overlaps to be discarded. With a simple algorithm, the Superstring Graph delivers sequences that include all the unitigs obtained from multiple DBGs as substrings. In linear time and space, it combines the efficiency of a greedy approach with the advantages of using a single graph. In summary, we present a first formal comparison of the outputs of state-of-the-art genome assemblers.
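To make the notion of merging reads "based on their overlaps" concrete, here is a minimal sketch of the classical greedy superstring heuristic (in the spirit of reference-list entries on greedy superstring approximation). It is not the Superstring Graph construction of the paper, and it runs in quadratic time rather than linear time. The function names `overlap` and `greedy_merge` and the `min_overlap` threshold are hypothetical, chosen only for illustration; the threshold merely mimics the idea of discarding overlaps that are too short to be trusted.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_merge(reads: List[str], min_overlap: int = 1) -> List[str]:
    """Repeatedly merge the ordered pair of sequences with the largest overlap.

    Overlaps shorter than `min_overlap` (an illustrative assumption) are
    ignored, so the result may contain several sequences instead of one.
    """
    # Remove duplicate reads and reads fully contained in another read.
    seqs = sorted(
        r for r in set(reads) if not any(r != s and r in s for s in reads)
    )
    while True:
        best_size, best_i, best_j = 0, -1, -1
        # Naive all-pairs scan: fine for a toy example, not for real data.
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i == j:
                    continue
                size = overlap(a, b)
                if size >= min_overlap and size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return seqs  # no acceptable overlap is left
        merged = seqs[best_i] + seqs[best_j][best_size:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


if __name__ == "__main__":
    toy_reads = ["ATGCG", "GCGTA", "GTACC"]
    print(greedy_merge(toy_reads, min_overlap=3))
    print(greedy_merge(toy_reads, min_overlap=4))
```

With the toy reads above, a threshold of 3 lets both length-3 overlaps be used and yields a single sequence, whereas a threshold of 4 rejects every overlap and leaves the three reads unmerged. This illustrates why the choice of overlap size matters and why assemblers that fix a single order can miss or over-trust overlaps.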

This work is supported by ANR Colib’read (ANR-12-BS02-0008) and the Institut de Biologie Computationnelle (ANR-11-BINF-0002).

Acknowledgements

We thank the reviewers for their comments and suggestions.

Author information

Corresponding author

Correspondence to Eric Rivals.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Cazaux, B., Sacomoto, G., Rivals, E. (2016). Superstring Graph: A New Approach for Genome Assembly. In: Dondi, R., Fertin, G., Mauri, G. (eds) Algorithmic Aspects in Information and Management. AAIM 2016. Lecture Notes in Computer Science, vol 9778. Springer, Cham. https://doi.org/10.1007/978-3-319-41168-2_4

  • DOI: https://doi.org/10.1007/978-3-319-41168-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41167-5

  • Online ISBN: 978-3-319-41168-2

  • eBook Packages: Computer Science, Computer Science (R0)
