Constructing String Graphs in External Memory

  • Paola Bonizzoni
  • Gianluca Della Vedova
  • Yuri Pirola
  • Marco Previtali
  • Raffaella Rizzi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8701)

Abstract

In this paper we present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly.

Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest.

We have implemented our algorithm and compared its efficiency against SGA—the most advanced assembly string graph construction program.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)CrossRefMathSciNetGoogle Scholar
  2. 2.
    Bauer, M., Cox, A., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci. 483, 134–148 (2013)CrossRefMATHMathSciNetGoogle Scholar
  3. 3.
    Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 326–337. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  4. 4.
    Beretta, S., Bonizzoni, P., Della Vedova, G., Pirola, Y., Rizzi, R.: Modeling alternative splicing variants from RNA-Seq data with isoform graphs. J. Comput. Biol. 16(1), 16–40 (2014)CrossRefGoogle Scholar
  5. 5.
    Cox, A.J., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.B.: Comparing DNA sequence collections by direct comparison of compressed text indexes. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 214–224. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  6. 6.
    Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. Algorithmica 63(3), 707–730 (2012)CrossRefMATHMathSciNetGoogle Scholar
  7. 7.
    Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)CrossRefMathSciNetGoogle Scholar
  8. 8.
    Lam, T., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.: High throughput short read alignment via bi-directional BWT. In: BIBM 2009, pp. 31–36 (2009)Google Scholar
  9. 9.
    Myers, E.: The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005)Google Scholar
  10. 10.
    Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA – A practical iterative de bruijn graph de novo assembler. In: Berger, B. (ed.) RECOMB 2010. LNCS, vol. 6044, pp. 426–440. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  11. 11.
    Salzberg, S.L., et al.: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557–567 (2012)CrossRefGoogle Scholar
  12. 12.
    Shi, F.: Suffix arrays for multiple strings: A method for on-line multiple string searches. In: Jaffar, J., Yap, R.H.C. (eds.) ASIAN 1996. LNCS, vol. 1179, pp. 11–22. Springer, Heidelberg (1996)CrossRefGoogle Scholar
  13. 13.
    Simpson, J., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)Google Scholar
  14. 14.
    Simpson, J., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012)CrossRefGoogle Scholar
  15. 15.
    Simpson, J., Wong, K., Jackman, S., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Paola Bonizzoni
    • 1
  • Gianluca Della Vedova
    • 1
  • Yuri Pirola
    • 1
  • Marco Previtali
    • 1
  • Raffaella Rizzi
    • 1
  1. 1.DISCoUniv. Milano-BicoccaMilanItaly

Personalised recommendations